
How to measure employer demand for data science software

01/06/2017

One way to estimate and track employer demand for data science software is to analyze which skills are asked for in job ads. We did this using job ads on Indeed and showed which data science software skills are most in demand in Switzerland and worldwide. In this post, we describe the methods behind these analyses. We worked in R, which offers convenient packages for the task.

Searching for jobs on Indeed

There are two ways to search for jobs on Indeed: directly on the website or through the API (for which registration is required). We used the API together with the helpful jobbR package. First, we searched the job ads for data science keywords such as Data Scientist, Data Analyst, Big Data or Machine Learning. For each search result, jobbR let us store the relevant information, such as the company, job title, location and a link to the job ad, in a data frame.

The search for these terms also returned jobs only marginally related to the ones we intended to analyze (e.g., programmer, business analyst and scientist positions). To discard them, we used the stringr package to filter the data frame so that it only included jobs containing “data” or “machine learning” in the job title.

The filtered data frame left us with 28’732 jobs from 59 countries. Most jobs are located in the US (13’095), followed by Great Britain (2’921), Germany (1’840), France (1’425), and India (1’240). To analyze the requirements for different jobs, we further filtered for specific job titles, which gave us 5’988 “Data Scientist”, 7’497 “Data Analyst”, 2’197 “Data Engineer” and 1’716 “Machine Learning” positions.
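The title filter described above can be sketched in a few lines of R. This is a minimal illustration, not our original script: the data frame `jobs` and its columns `title` and `country` are made-up stand-ins for the jobbR output, and we assume case-insensitive matching.

```r
library(stringr)

# Hypothetical stand-in for the data frame of search results
jobs <- data.frame(
  title = c("Senior Data Scientist", "Java Programmer",
            "Machine Learning Engineer", "Business Analyst",
            "Data Analyst (m/f)"),
  country = c("US", "GB", "DE", "FR", "CH"),
  stringsAsFactors = FALSE
)

# Keep only jobs with "data" or "machine learning" in the title
# (case-insensitive matching is an assumption of this sketch)
is_relevant <- str_detect(
  jobs$title,
  regex("data|machine learning", ignore_case = TRUE)
)
ds_jobs <- jobs[is_relevant, ]

# Narrow down further to one specific job title
data_scientists <- ds_jobs[
  str_detect(ds_jobs$title, regex("data scientist", ignore_case = TRUE)),
]
```

Note that a pattern as broad as “data” also keeps titles such as “Database Administrator”, so the result is a coarse filter rather than an exact classification.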

Analyzing the job descriptions for software skills

The job descriptions needed for the analysis of in-demand software skills are not accessible through the Indeed API. We therefore scraped the descriptions using the rvest package and appended them to the data frame.

Relying on stringr again, we searched for 80 software products common in data-related job ads. For names such as Python or Hadoop this task is straightforward, as they don’t appear in normal language. For names such as Scala, which is contained in words like “scalable” or “escalate,” we also searched for the false positives and subtracted their count.

Single-letter names such as R and C pose another difficulty. We solved this by searching for the letter surrounded by spaces or punctuation marks such as commas, periods or parentheses. For R, we thus used the following terms: ' R ', ' R,', ' R)', '(R,', ' R.', '\nR ', ' R\n', '(R/', 'RStudio', 'R-studio', 'R/Python', 'Python/R'. There was no need to distinguish between C, C# and C++, as job ads usually ask for any of these languages.

The search for SQL and Java had a similar issue, as it also picked up occurrences of NoSQL and JavaScript. We subtracted the counts of NoSQL and JavaScript from those of SQL and Java, respectively (but didn’t subtract products that are based on SQL, such as MySQL and PostgreSQL). SAP also poses some difficulty, as many job ads just require general SAP knowledge while others are more specific and ask for SAP HANA or Business Objects, say. We did not exclude these terms from the SAP results, which thus contain all SAP software products.
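The matching rules above can be illustrated with a small stringr sketch. It is a simplified illustration under our own assumptions, not the original script: the three descriptions are invented, the helper `mentions_minus()` is hypothetical, and the subtraction is done per ad and case-insensitively. The false-positive patterns for Scala are only examples and do not cover every case.

```r
library(stringr)

# Invented job descriptions for illustration
descriptions <- c(
  "We need Python, R, and SQL skills for a scalable platform.",
  "Experience with Scala and NoSQL databases; JavaScript a plus.",
  "Java, MySQL and RStudio required. Escalate issues quickly."
)

# Hypothetical helper: count hits for a term, subtract hits for known
# false positives, and flag the ad if anything remains
mentions_minus <- function(text, term, false_pos) {
  hits  <- str_count(text, regex(term, ignore_case = TRUE))
  noise <- str_count(text, regex(false_pos, ignore_case = TRUE))
  (hits - noise) > 0
}

# The search terms for R listed in the post, matched as fixed strings
r_terms <- c(" R ", " R,", " R)", "(R,", " R.", "\nR ", " R\n", "(R/",
             "RStudio", "R-studio", "R/Python", "Python/R")
mentions_r <- vapply(
  descriptions,
  function(d) any(str_detect(d, fixed(r_terms))),
  logical(1),
  USE.NAMES = FALSE
)

skills <- data.frame(
  python = str_detect(descriptions, fixed("Python")),
  r      = mentions_r,
  scala  = mentions_minus(descriptions, "scala", "scalab|escalat"),
  sql    = mentions_minus(descriptions, "sql", "nosql"),
  java   = mentions_minus(descriptions, "java", "javascript")
)
```

In this sketch, “MySQL” still counts as a mention of SQL, while an ad that only mentions NoSQL or JavaScript is not counted for SQL or Java.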

Finally, we counted the number of job ads in which each software product is mentioned. To make the job titles easier to compare, we converted these counts to percentages for the worldwide data. We then plotted the results with ggplot2.
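As a rough sketch of this last step, the following R code turns per-ad mention flags into percentages per job title and builds a grouped bar chart. The data frame `ads`, its columns and its values are invented for illustration, and the chart layout is our assumption rather than a reproduction of the published plots.

```r
library(ggplot2)

# Invented per-ad flags: TRUE if the ad mentions the software
ads <- data.frame(
  job_title = c(rep("Data Scientist", 4), rep("Data Analyst", 2)),
  python    = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE),
  sql       = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Share of ads per job title that mention each software, in percent
pct <- aggregate(
  cbind(python, sql) ~ job_title,
  data = ads,
  FUN = function(x) 100 * mean(x)
)

# Reshape to long format, one row per job title and software
pct_long <- data.frame(
  job_title = rep(pct$job_title, times = 2),
  software  = rep(c("python", "sql"), each = nrow(pct)),
  percent   = c(pct$python, pct$sql)
)

# Grouped bar chart; call print(p) to draw it
p <- ggplot(pct_long, aes(x = software, y = percent, fill = job_title)) +
  geom_col(position = "dodge") +
  labs(x = "Software", y = "Share of job ads (%)", fill = "Job title")
```

Percentages make the groups comparable even though the number of ads per job title differs considerably.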
