Blog

The more carefully you process data and dig into the details, the more valuable information you can extract from it. Data visualization is an efficient and handy tool for gaining insights from data. Moreover, visualization tools make the data far more understandable, colorful and pleasant to read. Because data changes every second, it is urgent to investigate it carefully and get the insights as fast as...
Random Forest is a powerful ensemble learning method that can be applied to various prediction tasks, in particular classification and regression. The method is built on an ensemble of decision trees and therefore shares the advantages of decision trees, such as high accuracy, ease of use, and no need to scale the data. It also has a very important additional benefit, namely resistance to overfitting (unlike...
A common and very challenging problem in machine learning is overfitting, and it takes many different forms. It is one of the major concerns when training a model. Overfitting occurs when the model captures too much noise in the training data set, which leads to poor prediction accuracy when the model is applied to new data. One way to reduce overfitting is regularization. In this tutorial, we...
Google has become the main starting point for our online activities. The search engine processes about 40,000 searches every second, or roughly 3.5 billion searches per day. It records what people are interested in, what they worry about, and where they want to travel. In a unique manner, the search engine captures trends in interests and behavior. Hidden racism, sexual orientation or ad returns - check out the work by Seth...
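As a quick sanity check on the two figures quoted in the teaser above, the per-second rate can be converted into a daily total. This is only a rough sketch: it assumes a constant rate over a full 24-hour day, which is a simplification, and the variable names are illustrative.

```python
# Rough check: convert the quoted per-second rate into a daily total.
# Assumes a constant rate over a 24-hour day (a simplification).
SEARCHES_PER_SECOND = 40_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

searches_per_day = SEARCHES_PER_SECOND * SECONDS_PER_DAY
print(f"{searches_per_day:,} searches per day")  # 3,456,000,000, about 3.5 billion
```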
Are you looking for real-world data science problems to sharpen your skills? In this post, we introduce you to four platforms hosting data science competitions. Data science competitions can be a great way to gain practical experience with real-world data and to boost your motivation through the competitive environment they provide. Check them out; competitions are a lot of fun! Kaggle: Kaggle is the best-known platform...
The open-source project R is among the leading tools for data science and machine learning tasks. Because it is open source, contributions are continuous and new packages with new features appear frequently. Currently, the CRAN package repository features 12,525 available packages. This post takes a look at the most popular and useful packages that have set the standards for solving data manipulation, visualization, and...
Currently, Python and R are the dominant data science tools, and Python will probably even take the lead (at least based on the latest KDNuggets survey). When did the two open-source players manage to become the leading platforms for analytics, data science, and machine learning, leaving behind established players such as Matlab or SAS? Here are some insights from Google Trends. Looking at the years 2009 - 2013 in the...
Mobile phone data has a vast scope. Our phones track our location, record social activities by listing who we call or message, and know what we like or what we’re looking for by collecting data on our online behavior and use of apps. The recent Mobile User Demographics Challenge on Kaggle (by the Chinese platform TalkingData) offers some insight into the volume and precision of the information available on mobile...
Much has been written on the most popular software and programming languages for Data Science (recall, for instance, the infamous “Python vs R battle”). We approached this question by scraping job ads from Indeed and counting how often each tool is mentioned, as a measure of current employer demand. In a recent blog post, we analyzed the Data Science software Swiss employers want job applicants to know (showing...
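The counting step described in the teaser above could look roughly like the sketch below. It is not the code from the post: the ad texts, the tool list, and the `count_mentions` helper are all hypothetical, and the scraping itself is left out. The sketch counts in how many ads each tool appears at least once.

```python
import re
from collections import Counter

# Hypothetical ad texts and tool list, for illustration only.
ADS = [
    "We are looking for a data scientist with Python and SQL skills.",
    "Experience with R, Python or SAS is required. Python preferred.",
    "Analyst position: strong SQL and Excel knowledge.",
]
TOOLS = ["Python", "R", "SQL", "SAS", "Excel"]


def count_mentions(ads, tools):
    """Count in how many ads each tool is mentioned at least once."""
    counts = Counter()
    for ad in ads:
        for tool in tools:
            # Match the tool as a whole word so "R" does not match "Required".
            pattern = rf"(?<!\w){re.escape(tool)}(?!\w)"
            if re.search(pattern, ad):
                counts[tool] += 1
    return counts


mentions = count_mentions(ADS, TOOLS)
for tool, n in mentions.most_common():
    print(f"{tool}: {n}")
```

The match is case-sensitive on purpose, since a single-letter name like R would otherwise match far too often; a real analysis would likely need per-tool patterns and aliases.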