Blog

Blog Categories

In this Jupyter Notebook we will retrieve data from the European Central Bank (ECB). The ECB publishes through the European Open Data Portal, which we discussed in the previous tutorial . Before diving into the code, please take a quick look at the following websites, to get a feel for what we will be dealing with. EU portal: https://data.europa.eu/euodp/en/data/publisher/ecb ECB SDMX 2.1 RESTful web...
The random forest algorithm is the combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. It can be applied to different machine learning tasks, in particular, classification and regression. Random Forest uses an ensemble of decision trees as a basis and therefore has all advantages of decision trees, such as high accuracy,...
  For many machine learning problems with a large number of features or a low number of observations, a linear model tends to overfit and variable selection is tricky. Models that use shrinkage such as Lasso and Ridge can improve the prediction accuracy as they reduce the estimation variance while providing an interpretable final model. In this tutorial, we will examine Ridge and Lasso...
  The EU Open Data Portal gives access to open data published by EU institutions, agencies  and  other bodies. Around 70 EU institutions, bodies or departments use the platform to make over 12,500 datasets available. In this Jupyter Notebook we will retrieve data from open data portal " http://data.europa.eu/euodp/en/home ". The portal is based on the open source project CKAN. CKAN stands for...
  opendata.swiss is the Swiss authorities’ portal for open data. Currently, 65 governmental organizations (many federal agencies, but also cantonal agencies, SBB and Post) provide access to 6,278 datasets. The portal offers an easily searchable catalogue of available datasets. Manually downloading the datasets can be cumbersome and the retrieval of data through the API can save time.   In this Jupyter...
Are you looking for real world data science problems to sharpen your skills? In this post, we introduce you to four platforms hosting data science competitions. Data science competitions can be a great way for gaining practical experience with real world data, and for boosting your motivation through the competitive environment they provide. Check them out, competitions are a lot of fun! Kaggle Kaggle is the best known platform...
GBM is a highly popular prediction model among data scientists or as top Kaggler Owen Zhang describes it: "My confession: I (over)use GBM. When in doubt, use GBM." GradientBoostingClassifier from sklearn is a popular and user-friendly application of Gradient Boosting in Python (another nice and even faster tool is xgboost). Apart from setting up the feature space and fitting the model, parameter tuning is a crucial task...
  Currently, Python and R are the dominating data science tools and Python will probably even be taking the lead (at least based on the latest KDNuggets survey ). When did the two open source players manage to become the leading platforms for analytics, data science, and machine learning, leaving behind established players such as Matlab or SAS? Here are some insights from Google Trends. Looking at the years 2009 - 2013 in the...
Image recognition has been a major challenge in machine learning, and working with large labelled datasets to train your algorithms can be time-consuming. One efficient approach for getting such data is to outsource the work to a large crowd of users. Google uses this approach with the game “Quick, Draw!” to create the world’s largest doodling dataset, which has recently been made publicly available . In this...
Much has been written on the most popular software and programming languages for Data Science (recall, for instance, the infamous “Python vs R battle”). We approached this question by scraping job ads from Indeed and counting the frequency at which each software is mentioned as a measure of current employer demand. In a recent blog post , we analyzed the Data Science software Swiss employers want job applicants to know (showing...
Social science researchers collect much of their data through online surveys. In many cases, they offer incentives to the participants. These incentives can take the form of lotteries for more valuable prices or individual gift card codes. We are doing the latter in our studies here at CEPA Labs at Stanford. Specifically, our survey participants receive a gift card code from Amazon.     However, sending these gift card...