DataCareer Blog

Accessing datasets in a structured form through an API can often simplify the life of a data analyst, especially if the same data series are used repeatedly. Unfortunately, many public data sources such as the Federal Statistical Office (BFS) do not provide data access through an API (STAT-TAB makes life a bit easier, but is not fully automated). While opendata.swiss offers a great way to explore available public datasets, the high number of different topics, datasets and data contributors makes standardization very difficult. Hence, when pulling the data, it comes in its original structure and often requires in-depth processing (see our blog post on fetching data from opendata.swiss in Python).

For people interested in data capturing regional development, Novalytica offers an open data API covering several hundred data series at the local level with a regional dimension (e.g. municipality, MS region, canton, etc.). This has the advantage that the data comes in a structured format and is always up to date - once an automated analysis or report is prepared, it automatically pulls the most recent data. The free subscription includes more than 200 series covering socio-demographics, economic development, real estate and politics. The Premium subscription offers more detailed local-level datasets that are not from public sources, for example real estate prices and rents, job openings, Airbnb listings, investor returns, bankruptcies and company foundings, as well as several machine-learning-based industry indicators.

Let's try it out: after signing up on the website, I received my account for the free subscription within 8 hours, together with a full list of available data series and their respective API keys. The website provides example scripts for R and Python, which makes it easy to get started. I am interested in plotting the development of the share of singles over the last five years for the 5 largest Swiss cities.

Let's first load the required Python libraries:

In [1]:
import requests
import pandas as pd
from pandas.io.json import json_normalize
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

username = 'USER'
password = 'PASSWORD'

I first send a token request for authentication. With the obtained token, I can then send all further requests.

In [2]:
url_tokens = 'http://nova-db.com/api/v1/tokens'
url_series = 'http://nova-db.com/api/v1/series/'

# TOKEN REQUEST
r = requests.post(url_tokens, auth=(username, password))
r.status_code  # 200 for successful response
response = r.json()

# Prepare token for bearer authentication
token = response['token']
token_bearer = 'Bearer ' + token
print(token_bearer)

headers = {
    'Authorization': token_bearer,
}

Bearer 1tA7S4XIq1HIe4OuFAeOR28AOSBhvH4h

Now we can define the series of interest, in our case the share of singles with the key 'civilstatus_single'. We pull all municipalities and then exclude everything but the 5 cities.

In [3]:
# Series of interest
series = 'civilstatus_single'

# Optional search parameters. Set to "None" to get all.
params = {}

# DATA REQUEST
s = requests.get(url_series + series, params=params, headers=headers)
s.status_code  # 200 for successful response
result = s.json()

# Reformat to dataframe
df = pd.DataFrame.from_dict(json_normalize(result), orient='columns')
df['regionnr'] = pd.to_numeric(df['regionnr'], downcast='float')
df['value'] = pd.to_numeric(df['value'], downcast='float')
df.head()

Out[3]:
   freq  period  regionname       regionnr  topic      value     variable
0  a     2010    Aeugst am Albis  1.0       sociodemo  0.419408  civilstatus_single
1  a     2011    Aeugst am Albis  1.0       sociodemo  0.425654  civilstatus_single
2  a     2012    Aeugst am Albis  1.0       sociodemo  0.429668  civilstatus_single
3  a     2013    Aeugst am Albis  1.0       sociodemo  0.424949  civilstatus_single
4  a     2014    Aeugst am Albis  1.0       sociodemo  0.424196  civilstatus_single

As we are only interested in the 5 main cities, we select a subset of the data frame. Keep in mind that the dataset does not use the special characters common in German (ä, ö, ü) or French (é, è, â, etc.).

In [4]:
cities = ['Bern', 'Basel', 'Geneve', 'Lausanne', 'Zuerich']
df = df.loc[df['regionname'].isin(cities)]

In [5]:
df['value'] = df['value'] * 100

fig = plt.figure(figsize=(8, 5))
p = sns.lineplot(x="period", y="value", hue="regionname", data=df)
p.set(xlabel='Year', ylabel='Share singles')
p.set_title("Share single in Swiss cities")
p.yaxis.set_major_formatter(PercentFormatter())
plt.show()

[Line plot of the share of singles per year for the five cities]

We observe that the share of singles is rising, which is not entirely unexpected. What is interesting are the differences in level between the cities, and that the change is almost constant over time for all of them. Thus, the API lets us pull, analyse and plot many data series at the municipality level within minutes. As soon as new data is available, we can simply rerun the script and obtain the updated analysis within seconds.

If you often work in Excel, also check out the Nova Excel Add-In, which allows you to access the same database from within Excel: novalytica.com/excel

Get more information on the API at: novalytica.com/api
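As a closing aside, the calls above can be bundled into a small reusable helper. This is only a rough sketch: the function name fetch_series and the error handling are my own additions and are not part of the provider's example scripts.

import requests
import pandas as pd
from pandas.io.json import json_normalize  # newer pandas versions expose this as pd.json_normalize

URL_TOKENS = 'http://nova-db.com/api/v1/tokens'
URL_SERIES = 'http://nova-db.com/api/v1/series/'

def fetch_series(series, username, password, params=None):
    """Hypothetical convenience wrapper: request a token, pull one series, return a DataFrame."""
    # Token request with basic auth, then reuse the bearer token for the data request
    r = requests.post(URL_TOKENS, auth=(username, password))
    r.raise_for_status()
    headers = {'Authorization': 'Bearer ' + r.json()['token']}

    s = requests.get(URL_SERIES + series, params=params or {}, headers=headers)
    s.raise_for_status()

    # Flatten the JSON response and convert the value column to numeric
    df = pd.DataFrame.from_dict(json_normalize(s.json()), orient='columns')
    df['value'] = pd.to_numeric(df['value'], downcast='float')
    return df

# Example usage with the same series key as above:
# df = fetch_series('civilstatus_single', 'USER', 'PASSWORD')
# df = df.loc[df['regionname'] == 'Basel']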
For the past few years, tasks involving text and speech processing have attracted a great deal of attention. Among the many research areas within Natural Language Processing and Machine Learning, sentiment analysis ranks very high. Sentiment analysis identifies and extracts subjective information from source data using data analysis and visualization, ML models for classification, and text mining. It helps to understand public opinion on a subject, which is why sentiment analysis is widely used in business and politics and is usually applied to social network data.

Social networks as the main resource for sentiment analysis

Nowadays, social networks and forums are the main stage for people sharing opinions. That is why they are so interesting for researchers trying to figure out the attitude towards one object or another. Sentiment analysis tackles the problem of analyzing textual data created by users on social networks, microblogging platforms, forums and business platforms regarding the opinions those users have about a product, service, person, idea, and so on. In the most common approach, text is classified into two classes (binary sentiment classification), positive and negative, but sentiment analysis can involve many more classes, turning it into a multi-class problem. Sentiment analysis can process hundreds and thousands of texts in a short period of time. This is another reason for its popularity: while people need many hours to do the same work, sentiment analysis can finish it in a few seconds.

Common approaches for classifying sentiments

Sentiment analysis of text data can be done via three commonly used methods: machine learning, dictionaries, and hybrid approaches.

Learning-based approach

The machine learning approach is one of the most popular today. Using ML techniques and various methods, users can build a classifier that identifies different sentiment expressions in text.

Dictionary-based approach

The main concept of this approach is using a bag of words with polarity scores, which help establish whether a word has a positive, negative, or neutral connotation. Such an approach does not require a training set, which makes it possible to classify even small amounts of data. However, many words and expressions are still not included in sentiment dictionaries.

Hybrid approach

As the name suggests, this approach combines machine learning and lexicon-based techniques. Although it is not widely used, the hybrid approach often shows more promising and valuable results than the two approaches used separately.

In this article, we will implement a dictionary-based approach, so let's dig into its basics. Dictionary (or lexicon)-based sentiment analysis uses special dictionaries and lexicons, many of which are available for calculating sentiment in text. The main ones are:

  • afinn
  • bing
  • nrc

All three are sentiment lexicons that help evaluate the valence of textual data by searching for words that describe emotion or opinion.

Things to do before sentiment analysis

Before we start building the sentiment analyzer, a few steps must be taken. First of all, we need to state the problem we are going to explore and understand its objective. Since we will use data from Donald Trump's Twitter account, our objective is to analyze the connotation of his latest tweets. With the problem outlined, we need to prepare our data for examination.
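Before turning to the actual R walkthrough below, here is a minimal, language-agnostic illustration of the dictionary-based idea described above (written in Python purely for illustration, with a tiny made-up lexicon rather than afinn, bing or nrc): each word is looked up in a polarity dictionary and the scores are summed.

# Minimal sketch of dictionary-based sentiment scoring.
# The three-word lexicon below is invented for demonstration purposes;
# real analyses use full lexicons such as afinn, bing or nrc.
lexicon = {'great': 1, 'congratulations': 1, 'mistake': -1}

def score(text):
    # Sum the polarity of every word found in the lexicon; unknown words count as 0.
    return sum(lexicon.get(w, 0) for w in text.lower().split())

print(score("Congratulations on a great result"))  # 2  -> positive
print(score("Iran made a very big mistake"))       # -1 -> negative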
Data preprocessing is basically the initial step in text and sentiment classification. Depending on the input data, various techniques can be applied to make the data more comprehensible and improve the effectiveness of the analysis. The most common steps in data preprocessing are:

  • removing numbers
  • removing stopwords
  • removing punctuation

and so on.

Building the sentiment classifier

The first step in building our classifier is installing the needed packages. We will need the following packages, which can be installed via the command install.packages("name of the package") directly from the development environment:

  • twitteR
  • dplyr
  • splitstackshape
  • tidytext
  • purrr

As soon as all the packages are installed, let's load them.

In [6]:
library(twitteR)
library(dplyr)
library(splitstackshape)
library(tidytext)
library(purrr)

We're going to get tweets from Donald Trump's account directly from Twitter, which is why we need to provide Twitter API credentials.

In [26]:
api_key <- "----"
api_secret <- "----"
access_token <- "----"
access_token_secret <- "----"

In [27]:
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

[1] "Using direct authentication"

And now it's time to get the tweets from Donald Trump's account and convert the data into a dataframe.

In [114]:
TrumpTweets <- userTimeline("realDonaldTrump", n = 3200)

In [115]:
TrumpTweets <- tbl_df(map_df(TrumpTweets, as.data.frame))

Here is how our initial dataframe looks:

In [116]:
head(TrumpTweets)

  text                                                                          favoriteCount  created              retweetCount  isRetweet
  It was my great honor to host Canadian Prime Minister @JustinTrudeau
  at the @WhiteHouse today!🇺🇸🇨🇦 https://t.co/orlejZ9FFs                                17424  2019-06-20 17:49:52          3540  FALSE
  Iran made a very big mistake!                                                        127069  2019-06-20 14:15:04         38351  FALSE
  “The President has a really good story to tell. We have unemployment lower
  than we’ve seen in decades. We have peop… https://t.co/Pl2HsZbiRK                     36218  2019-06-20 14:14:13          8753  FALSE
  S&amp;P opens at Record High!                                                          43995  2019-06-20 13:58:53          9037  FALSE
  Since Election Day 2016, Stocks up almost 50%, Stocks gained 9.2 Trillion
  Dollars in value, and more than 5,000,000… https://t.co/nOj2hCnU11                    62468  2019-06-20 00:12:31         16296  FALSE
  Congratulations to President Lopez Obrador — Mexico voted to ratify the
  USMCA today by a huge margin. Time for Congress to do the same here!                 85219  2019-06-19 23:01:59         20039  FALSE

(The remaining columns, among them favorited, replyToSN, truncated, replyToSID, id, replyToUID, statusSource, screenName, retweeted, longitude and latitude, are omitted here for readability.)

To prepare our data for classification, let's get rid of links and reshape the dataframe so that each row contains a single word.
In [117]:
TrumpTweets <- TrumpTweets[-(grep('t.co', TrumpTweets$'text')),]

In [118]:
TrumpTweets$'tweet' <- 'tweet'
TrumpTweets <- TrumpTweets[ , c('text', 'tweet')]
TrumpTweets <- unnest_tokens(TrumpTweets, words, text)

And this is how our dataframe looks now:

In [119]:
tail(TrumpTweets)

  tweet  words
  tweet  most
  tweet  of
  tweet  their
  tweet  people
  tweet  from
  tweet  venezuela

In [120]:
head(TrumpTweets)

  tweet  words
  tweet  iran
  tweet  made
  tweet  a
  tweet  very
  tweet  big
  tweet  mistake

It's obvious that the dataframe also contains various words without useful content, so it's a good idea to get rid of them.

In [121]:
TrumpTweets <- anti_join(TrumpTweets, stop_words, by = c('words' = 'word'))

And here's the result:

In [122]:
tail(TrumpTweets)

  tweet  words
  tweet  harassment
  tweet  russia
  tweet  informed
  tweet  removed
  tweet  people
  tweet  venezuela

In [123]:
head(TrumpTweets)

  tweet  words
  tweet  iran
  tweet  mistake
  tweet  amp
  tweet  record
  tweet  congratulations
  tweet  president

Much better, isn't it? Let's see how many times each word appears in Donald Trump's tweets.

In [124]:
word_count <- dplyr::count(TrumpTweets, words, sort = TRUE)

In [125]:
head(word_count)

  words      n
  day        2
  democrats  2
  enjoy      2
  florida    2
  iran       2
  live       2

Now it's time to create a dataframe with sentiments that will be used for classifying the tweets. We will use the bing dictionary, although you can easily use any other source.

In [126]:
sentiments <- get_sentiments("bing")
sentiments <- dplyr::select(sentiments, word, sentiment)

In [127]:
TrumpTweets_sentiments <- merge(word_count, sentiments, by.x = c('words'), by.y = c('word'))

Above we did a simple classification of the words in Trump's tweets using our sentiment bag of words. And this is how the result looks:

In [128]:
TrumpTweets_sentiments

  words            n  sentiment
  beautiful        1  positive
  burning          1  negative
  congratulations  1  positive
  defy             1  negative
  enjoy            2  positive
  harassment       1  negative
  hell             1  negative
  limits           1  negative
  mistake          1  negative
  scandal          1  negative
  strong           1  positive
  trump            1  positive

Let's look at the number of occurrences per sentiment in the tweets.

In [131]:
sentiments_count <- dplyr::count(TrumpTweets_sentiments, sentiment, sort = TRUE)

In [132]:
sentiments_count

  sentiment  n
  negative   7
  positive   5

We may also want to know the total count and percentage of all the sentiments.

In [133]:
sentiments_sum <- sum(sentiments_count$'n')

In [134]:
sentiments_count$'percentage' <- sentiments_count$'n' / sentiments_sum

Let's now create an ordered dataframe for plotting the sentiment counts.

In [135]:
sentiment_count <- rbind(sentiments_count)

In [136]:
sentiment_count <- sentiment_count[order(sentiment_count$sentiment), ]

In [137]:
sentiment_count

  sentiment  n  percentage
  negative   7  0.5833333
  positive   5  0.4166667

And now it's time for the visualization. We will plot the results of our classifier.

In [144]:
sentiment_count$'colour' <- as.integer(4)

In [145]:
barplot(sentiment_count$'n', names.arg = sentiment_count$'sentiment', col = sentiment_count$'colour', cex.names = .5)

In [146]:
barplot(sentiment_count$'percentage', names.arg = sentiment_count$'sentiment', col = sentiment_count$'colour', cex.names = .5)

[Bar plots of the counts and percentages per sentiment]

Conclusion

Sentiment analysis is a great way to explore emotions and opinions among people. Today we explored the most common and easiest approach to sentiment analysis, which is still great in its simplicity and gives quite an informative result. However, it should be noted that different sentiment analysis methods and lexicons work better or worse depending on the problem and text corpora.
The result of the dictionary-based approach also depends strongly on how well the dictionary used matches the textual data to be classified. Still, users can create their own dictionaries, which can be a good solution. Despite this, dictionary-based methods usually show much better results than more complex techniques.
Introduction

Nowadays PostgreSQL is probably one of the most powerful relational databases among the open-source solutions. Its capabilities are no worse than Oracle's and definitely well ahead of MySQL's. So if you are working on apps using Python, someday you will face the need to work with databases. Luckily, Python has quite a wide range of packages that provide an easy way of connecting to and using databases. In this article, we will go through the most common packages for dealing with PostgreSQL and show how to use one of them.

Python modules for working with PostgreSQL

The way you access the database from the Python environment usually depends on personal preference and the specifics of the project. One of the most common and easiest ways is to establish connections using dedicated Python drivers, among which the most popular are psycopg2, PyGreSQL, py-postgresql, and pg8000.

psycopg2

psycopg2 is probably one of the most popular packages for interacting with PostgreSQL from the Python environment. Written in C as a wrapper around libpq, it provides a wide range of operations for database manipulation. psycopg2 provides both client-side and server-side cursors as well as asynchronous communication, notifications and COPY TO / COPY FROM support. It should be noted that the package supports most Python data types and adapts them to matching PostgreSQL data types. Among the key features of this package are support for multiple connections and connection objects, various methods of transaction management, connection pools, auto filtering, asynchronous queries and cursor objects. In addition, rows can be returned as Python dictionaries keyed by column name (see the sketch at the end of this section). Among the possible weaknesses of the package are a lack of documentation and mostly outdated code examples.

PyGreSQL

PyGreSQL is the first PostgreSQL adapter and one of the open-source Python modules for working with PostgreSQL databases that is still actively developed. This package embeds the PostgreSQL query library to make all of the PostgreSQL database manipulation features easy to use from the Python environment. Although this module provides great opportunities for working with databases, some users have reported problems when working with cursor objects.

py-postgresql

py-postgresql is a Python 3 module that provides extensive abilities to interact with PostgreSQL, including a high-level driver for working with databases. Among the main disadvantages of this package are the facts that it works only in a Python 3 environment and does not have direct support for high-level async interfaces.

pg8000

pg8000 is a pure-Python module for dealing with PostgreSQL that has extensive documentation and is actively developed. It should be noted that this module complies with the Python Database API 2.0, which basically gives the user broader DB connectivity from the Python environment. Being pure Python, the module can be used in AWS Lambda functions without any extra work. The module is still under active review and development, so some bugs may be encountered while working with it.

Although many more drivers for interacting with PostgreSQL are available, some of them are outdated or have limitations. Still, the ones listed above give users great opportunities for various database manipulations.
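To illustrate the point about dictionary-style rows, here is a minimal sketch using psycopg2's extras module, which provides a RealDictCursor that returns each row keyed by column name. The connection parameters are placeholders, and the query is a trivial example of my own choosing.

import psycopg2
import psycopg2.extras

# Placeholder connection parameters -- adjust to your own database.
conn = psycopg2.connect(dbname='yourdbname', user='yourname',
                        password='yourpwd', host='localhost')

# RealDictCursor returns each row as a dict-like object keyed by column name.
cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
cur.execute("SELECT 1 AS one, 2 AS two")
row = cur.fetchone()
print(row['one'], row['two'])  # 1 2

cur.close()
conn.close()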
Working with queries

We have covered some of the most common packages for connecting to PostgreSQL from a Python app, and now it's time to see one in use. We chose the psycopg2 module and will use it to review connecting to and working with databases. It is assumed that PostgreSQL is already installed and running on your machine.

1. Installing the package. Run conda install psycopg2 if you are using Anaconda, or pip install psycopg2 via the pip package manager.

2. Once the package is installed, let's check that the database is accessible. To check the connection, we write a simple script (a minimal sketch of such a script is shown at the end of this section). Be sure to provide your own database name, username, and password. Save the script and run it with the python test.py command. As a result, we get an empty array ([]), showing that the connection works properly.

Now it's time to explore some basic operations with databases. First, let's add some tables to our database to work with. Data is usually stored in tables in .csv format, so why not start exploring PostgreSQL interactions by adding data to the database from a .csv file? For the next examples, we will use parts of datasets that can be found on Kaggle: Chicago census data and Chicago public schools.

First, we establish a connection and create the tables with the corresponding table names.

connect_str = "dbname=yourdbname user=yourname host='localhost' " + "password='yourpwd'"
conn = psycopg2.connect(connect_str)
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE census(
        id integer PRIMARY KEY,
        COMMUNITY_AREA_NUMBER float,
        COMMUNITY_AREA_NAME text,
        PERCENT_OF_HOUSING_CROWDED float,
        PERCENT_HOUSEHOLDS_BELOW_POVERTY float,
        "PERCENT_AGED_16+_UNEMPLOYED" float,
        "PERCENT_AGED_25+_WITHOUT_HIGH_SCHOOL_DIPLOMA" float,
        PERCENT_AGED_UNDER_18_OR_OVER_64 float,
        PER_CAPITA_INCOME integer,
        HARDSHIP_INDEX float)
""")

Note that the two column names containing a "+" must be quoted, since "+" is not allowed in unquoted PostgreSQL identifiers. The other table is created in the same way. After this, we can copy data from the corresponding .csv files into our tables.

with open('CENSUS.csv', 'r') as f:
    next(f)  # skip the header row
    cursor.copy_from(f, 'census', sep=',')
conn.commit()

The commit() method is needed to make the changes to our database persistent. While copying data from the .csv files we do not need the header, so we skip it. The same operation is done for the other table. And now we can write some simple queries.

Let's find out which Community Areas in the census table start with the letter 'B'.

sql = 'SELECT community_area_name FROM census WHERE community_area_name LIKE %s'
community_area_name = 'B%'
cursor.execute(sql, (community_area_name,))
cursor.fetchall()

Let's list the top 5 Community Areas by average College Enrollment.

sql = 'SELECT community_area_name, AVG(college_enrollment) AS avg_enrollment FROM schools GROUP BY community_area_name ORDER BY avg_enrollment DESC LIMIT 5'
cursor.execute(sql)
cursor.fetchall()

Now, let's find the Per Capita Income of the Community Areas which have a school Safety Score of 1.

sql = 'SELECT community_area_number, community_area_name, per_capita_income FROM census WHERE community_area_number IN (SELECT community_area_number FROM schools WHERE safety_score = 1)'
cursor.execute(sql)
cursor.fetchall()

As you have already noticed, we used the fetchall() method. Among its many other useful methods, the package provides options to retrieve a single row from a table (fetchone()) or all the rows returned by a query (fetchall()). As soon as all the manipulations with the database are finished, the connection must be closed. This is done with two methods: cursor.close() and conn.close().
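The connection-test script mentioned in step 2 is not reproduced in the original post. Below is a minimal sketch of what such a test.py could look like; the temporary table is my own addition, used only so that the final fetchall() returns the empty list ([]) described above.

# test.py -- hypothetical connection check (not the post's original script).
# Replace the connection parameters with your own database credentials.
import psycopg2

connect_str = "dbname=yourdbname user=yourname host='localhost' password='yourpwd'"
conn = psycopg2.connect(connect_str)
cursor = conn.cursor()

# Query an empty temporary table: if the connection works, this prints [].
cursor.execute("CREATE TEMP TABLE connection_test (id integer)")
cursor.execute("SELECT * FROM connection_test")
print(cursor.fetchall())  # []

cursor.close()
conn.close()

As a side note, psycopg2 connections and cursors can also be used as context managers (with conn, conn.cursor() as cursor: ...), which commits the transaction automatically on success and rolls it back on error.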
Conclusion

Establishing a connection to and interacting with PostgreSQL from the Python environment becomes easy and quick with the help of the various drivers. Every module has its own pros and cons, which allows each user to choose the most suitable one. The psycopg2 module used in the example provides an easy and clear way to establish a connection and run queries, including flexible parameterized queries.

Data Academy

Programming with R for Beginners

R is one of the leading solutions for data science. The course introduces the programming language and its open-source software ecosystem.

  • Duration: 2 days
  • Course materials as PDF and code
  • 2h of project consulting
  • Incl. lunch and coffee
  • Price: CHF 990

Data Analysis in Python for Beginners

Python enjoys great popularity. This course familiarizes you with the basics of Python.

  • Duration: 2 days
  • Course materials as PDF and code
  • Incl. lunch and coffee
  • Price: CHF 990

Exploratory Data Analysis & Visualizations

Cleaning, transforming and visualizing data are the basic elements of every analysis. This course provides a comprehensive introduction to exploratory analysis with applications in R.

  • Duration: 1 day
  • Course materials as PDF and code
  • 2h of project consulting
  • Incl. lunch and coffee
  • Price: CHF 690
