Blog > R

Data Science Software: We screened 28,000 data job ads. Here is the software you need to know.

Data Science Software: We screened 28,000 data job ads. Here is the software you need to know.

Much has been written on the most popular software and programming languages for Data Science (recall, for instance, the infamous “Python vs R battle”). We approached this question by scraping job ads from Indeed and counting the frequency at which each software is mentioned as a measure of current employer demand. In a recent blog post, we analyzed the Data Science software Swiss employers want job applicants to know (showing that Python has a slight edge in popularity over R). In this post, we look at worldwide job ads and analyze them separately for different job titles. We included 6,000 positions for Data Scientists, 7,500 for Data Analysts, 2,200 for Data Engineers and 1,700 for Machine Learning specialists, as well as a population of tens of thousands of data professionals on LinkedIn. Our leading questions: Which programming languages are most popular in general, as database technologies, as machine learning libraries, and for business analysis purposes? How well does the employers’ demand for software skills match the supply, i.e. the skills data professionals currently possess? And are there software skills in high demand that are rare among data professionals? (Yes!) Check out the results below. A detailed description of our methods can be found in this post.

The programming language most in demand: Python

Being mentioned in over 60% of current job ads, Python is the top programming language for Data Scientists, Machine Learning specialists and Data Engineers. R is highly popular in Data Science (62%) job openings, but less so in the areas of Machine Learning and Data Engineering. Java and Scala are important in Data Engineering, where they rank second and third with 55% and 35% frequencies. In Machine Learning, Java and C/C++/C# are the second most popular languages, at about 40% each. Less programming skills are required for Data Analysts, with Python and R reaching 15%.

 

Figure 1: Most popular programming languages for data-related job descriptions

The leading database technology: SQL

SQL is the most in demand database skill for Data Scientists and Analysts, mentioned in 50% of current job ads. For Data Engineering, SQL, Hadoop and Apache Spark are about equally popular, with roughly 60% each. The same is true for Machine Learning, but at a much lower level of about 20%.

 

Figure 2: Most popular database technologies

 

Most popular machine learning library: Tensorflow

For Machine Learning jobs, Tensorflow is leading the list with 20%, followed by Caffe, Scikit-learn and Theano at about 9%. In Data Science job ads, Scikit-learn is at 7% and slightly more frequent than Tensorflow.

 

Figure 3: Most popular Machine Learning libraries

Business-related software in high demand

Other frequent requirements are commercial software tools for statistics, business intelligence, visualization and enterprise resource planning (ERP). In this space, SAS and Tableau are the top mentions, followed by SPSS.

 

Figure 4:  Software for statistics, business intelligence and enterprise resource planning

 

Scala programmers are a rare breed among Data Scientists

How well do these employer requirements match the available skills of Data Scientists? To obtain a rough estimate of the number of professionals knowledgeable in top-ranked programming languages, we searched LinkedIn for the different languages and filtered the results for people with the job title of Data Scientist. Figure 5 shows this Data Scientist population in relation to the number of Data Science job ads demanding the skill. By a large margin, Python and R turn out to be the most widespread programming skills in the Data Scientist population. Interestingly, MATLAB, C/C#/C++ and Scala are roughly equally frequent in job ads, but the number of professionals possessing these skills varies greatly. Scala is the rarest skill by far, while MATLAB expertise is fairly common, being five times as frequent in the Data Scientist population as Scala. Also, knowledge of the Apache Ecosystem (Hadoop, Spark and Hive) is still relatively rare among Data Scientists.

Figure 5: Software skills – current supply (proxy: LinkedIn) vs. demand (proxy: Indeed)  

 

In conclusion, which programming languages should you focus on? Python and R remain the top choices, and mastering at least one of them is a must. If you decide to learn or refine additional languages, pick Scala, which many Data Scientists expect to be the language of the future (Jean-François Puget analyzed the impressive rise of Scala in search trends on Indeed)

A further language holding significant potential is Julia. Julia is mentioned in 2% of current Data Science job ads on Indeed, but its trendline is highly promising.

 

 Post by David Kradolfer

 

 Back