[This article was first published on R | r4stats.com, and kindly contributed to R-bloggers].
I recently updated my comprehensive analysis of the popularity of data science software. This update covers perhaps the most important section, which measures popularity based on the number of job ads. I repeat it here as a blog post, so you don’t have to read the whole article.
One of the best ways to measure the popularity or market share of software for data science is to count the number of job advertisements that list knowledge of each as a requirement. Job ads are rich in information and backed by money, so they are perhaps the best measure of how popular each software is now. Plots of change in job demand give us a good idea of what is likely to become more popular in the future.
Indeed.com is the largest job site in the U.S., which makes its collection of job ads the best available. As their co-founder and former CEO Paul Forster put it, Indeed.com includes “all jobs from over 1,000 unique sources, including major job boards (Monster, CareerBuilder, HotJobs, Craigslist) as well as hundreds of newspapers, associations, and company websites.” Indeed.com also has great search capabilities.
Searching for jobs using Indeed.com is easy, but searching for software in a way that ensures fair comparisons across packages is challenging. Some software is used only for data science (e.g., scikit-learn, Apache Spark), while other software is used in data science jobs and, more broadly, in report-writing jobs (e.g., SAS, Tableau). General-purpose languages (e.g., Python, C, Java) are used heavily in data science jobs, but most of the jobs that require them have nothing to do with data science. To level the playing field, I developed a protocol that focuses the search for each software on jobs for data scientists only. The details of this protocol are described in a separate article, How to Search for Data Science Jobs. All of the results in this section use those procedures to make the required queries.
I collected the job counts discussed in this section on October 5, 2022. To measure percentage change, I compare them with data collected on May 27, 2019. One might think that a sample taken on a single day would not be very stable, but it is. Data collected in 2017 and 2014 using the same protocol correlated at r=.94, p=.002. I occasionally double-check some of the counts a month or two later and always get the same figures.
The number of jobs covers a very wide range, from zero to 164,996, with a mean of 11,653.9 and a median of 845.0. The distribution is so skewed that placing all the values on the same graph makes them difficult to read. So, I split the graph into three, each with a different scale. A single plot with a logarithmic scale would be an alternative, but when I asked some mathematically astute people how the packages compared on such a plot, they were so far off that I abandoned that approach.
Figure 1a shows the most popular tools, those with at least 10,000 jobs. SQL leads with 164,996 jobs, followed by Python with 150,992 and Java with 113,944. Next comes a set that starts with C++/C# at 48,555 and declines gradually to Microsoft’s Power BI at 38,125. One of Power BI’s major competitors, Tableau, is in that set. Next come R and SAS, both at around 24K jobs, with R slightly ahead. Finally, we see a set that declines slowly from MATLAB at 17,736 to Scala at 11,473.
Figure 1a. Number of data science jobs for the more popular software (>= 10,000 jobs).
Figure 1b includes tools for which there are between 250 and 10,000 jobs. Alteryx and Apache Hive are at the top with about 8,400 jobs. There is a significant drop to Databricks at 6,117, then much smaller drops from there to Minitab at 3,874. Then we see another big drop to JMP at 2,693, after which things decline gradually to MLlib at 274.
Figure 1b. Number of jobs for less popular data science software tools with 250 to 10,000 jobs.
The least popular set of software, those with fewer than 250 jobs, is displayed in Figure 1c. It starts with DataRobot and SAS’ Enterprise Miner, both at around 182. They are followed by Apache Mahout with 160, WEKA with 131, and Theano with 110. From RapidMiner on down, there is a slow decline until we finally reach zero at WPS Analytics. The latter is a variant of the SAS language, so advertisements are likely to always list SAS as the required skill.
Figure 1c. Number of jobs for software with fewer than 250 ads.
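The three-way split behind Figures 1a to 1c can be sketched in a few lines of Python. This is only an illustration, not the code used to draw the figures: the counts are a subset of those quoted in this post, the function name `tier` is hypothetical, and assigning a count of exactly 10,000 to Figure 1a is an assumption, since the captions overlap at that boundary.

```python
# Job counts quoted in this post (Indeed.com, October 5, 2022).
# Only a subset of the tools is listed here.
job_counts = {
    "SQL": 164_996,
    "Python": 150_992,
    "Java": 113_944,
    "C++/C#": 48_555,
    "Power BI": 38_125,
    "MATLAB": 17_736,
    "Scala": 11_473,
    "Databricks": 6_117,
    "Minitab": 3_874,
    "JMP": 2_693,
    "MLlib": 274,
    "Apache Mahout": 160,
    "WEKA": 131,
    "Theano": 110,
    "WPS Analytics": 0,
}


def tier(count: int) -> str:
    """Return the figure a job count belongs to (hypothetical helper).

    Thresholds follow the figure captions; a count of exactly 10,000
    is assumed to belong to Figure 1a.
    """
    if count >= 10_000:
        return "1a"
    if count >= 250:
        return "1b"
    return "1c"


# Group the tools by figure, preserving the order in which they are listed.
tiers = {"1a": [], "1b": [], "1c": []}
for name, count in job_counts.items():
    tiers[tier(count)].append(name)

for fig, names in tiers.items():
    print(f"Figure {fig}: {', '.join(names)}")
```

Giving each group its own axis scale is what keeps the smaller counts readable next to SQL's 164,996.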
Several tools use a powerful yet easy-to-use workflow interface: Alteryx, KNIME, Enterprise Miner, RapidMiner, and SPSS Modeler. The scale of their counts is too wide to make a decent graph, so I have compiled those values into Table 1. There we see that Alteryx is extremely dominant, with 30 times as many jobs as its closest competitor, KNIME. The latter is about 50% ahead of Enterprise Miner, while RapidMiner and SPSS Modeler are tiny by comparison.
| Software | Jobs |
|---|---|
| Alteryx | 8,566 |
| KNIME | 281 |
| Enterprise Miner | 181 |
| RapidMiner | 69 |
| SPSS Modeler | 17 |

Table 1. Job counts for workflow tools.
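The comparisons drawn from Table 1 are simple ratios, and they can be checked with a short Python sketch. The variable names are illustrative; the counts are those in Table 1.

```python
# Job counts for workflow tools, copied from Table 1.
workflow_jobs = {
    "Alteryx": 8_566,
    "KNIME": 281,
    "Enterprise Miner": 181,
    "RapidMiner": 69,
    "SPSS Modeler": 17,
}

# How many times more jobs Alteryx has than its closest competitor, KNIME.
alteryx_vs_knime = workflow_jobs["Alteryx"] / workflow_jobs["KNIME"]

# How far KNIME is ahead of Enterprise Miner, as a percentage.
knime_over_em_pct = (
    workflow_jobs["KNIME"] / workflow_jobs["Enterprise Miner"] - 1
) * 100

print(f"Alteryx has {alteryx_vs_knime:.1f} times as many jobs as KNIME")
print(f"KNIME is {knime_over_em_pct:.0f}% ahead of Enterprise Miner")
```

The results come out at roughly 30 times and a little over 50%, in line with the figures quoted above.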
Let’s take a similar look at packages whose traditional focus was on statistical analysis. They all now include machine learning and artificial intelligence methods, but their reputation still lies mainly in statistics. We saw earlier that when we consider the full range of data science jobs, R was slightly ahead of SAS. Table 2 shows jobs with only the term “statistician” in their description. There we see that SAS comes out on top, though by such a small margin over R that you might see the reverse depending on the day you gather new data. Both are more than five times as popular as Stata or SPSS, and over ten times as popular as JMP. Minitab seems to be the only other contender in this arena.
| Software | Jobs only for “Statistician” |
|---|---|
| SAS | 1040 |
| R | 1012 |
| Stata | 176 |
| SPSS | 146 |
| JMP | 93 |
| Minitab | 55 |
| Statistica | 2 |
| BMDP | 3 |
| Systat | 0 |
| NCSS | 0 |

Table 2. Number of jobs for each software for the search term “statistician”.
Next, let’s look at the change in jobs from the 2019 data to now (October 2022), focusing on software that had at least 50 job listings in 2019. Without such a limit, software that grew from 1 job in 2019 to 5 jobs in 2022 would show a 400% increase but still be of little interest. The percentage change ranged from -64.0% to 2,479.9%, with a mean of 306.3 and a median of 213.6. IBM Watson, with apparent job growth of 2,479.9%, and Databricks, at 1,323%, were two extreme outliers. Both were so large compared to the rest that I left them out of Figure 1d to avoid compressing the remaining values beyond legibility. The rapid growth of Databricks has been noted elsewhere. However, I would take IBM Watson’s figure with a grain of salt, because its growth in revenue is nowhere near what Indeed.com’s job figures indicate.
The rest of the software is shown in Figure 1d, where packages whose job market is “heating up” or growing are shown in red, while those that are cooling down are shown in blue. The main conclusion from this figure is that almost the entire data science software market has grown over the past 3.5 years. At the top, we see Alteryx, up 850.7%. Splunk (702.6%) and Julia (686.2%) follow. To my surprise, FORTRAN has grown from 195 jobs to 1,318, an increase of 575.9%! My supercomputing colleagues assure me that FORTRAN is still important in their field, but HPC is certainly not growing at that rate. If any readers have ideas about why this might be the case, please share them in the comments section below.
Figure 1d. Percentage change in job listings from March 2019 to October 2022. Only software with at least 50 jobs in 2019 is shown. IBM Watson (2,480%) and Databricks (1,323%) are excluded to maintain legibility of the remaining values.
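The percentage-change calculation and the 50-job minimum can be sketched as follows. This is a minimal illustration and not the code behind Figure 1d: the FORTRAN counts come from this post, the second entry is a hypothetical tool mirroring the 1-job-to-5-jobs example, and the names `percent_change` and `MIN_JOBS_2019` are made up for the sketch.

```python
MIN_JOBS_2019 = 50  # minimum number of 2019 job listings for a tool to be included


def percent_change(old: int, new: int) -> float:
    """Percentage change from the old count to the new count."""
    return (new - old) / old * 100


# (2019 count, 2022 count). FORTRAN's counts are quoted in this post;
# the second entry is hypothetical and exists only to show the filter.
counts = {
    "FORTRAN": (195, 1_318),
    "Tiny tool (hypothetical)": (1, 5),
}

# Keep only tools that met the 2019 minimum, then compute their change.
changes = {
    name: percent_change(old, new)
    for name, (old, new) in counts.items()
    if old >= MIN_JOBS_2019
}

for name, change in changes.items():
    print(f"{name}: {change:.1f}%")
```

The filter drops the hypothetical tool before its large but uninteresting percentage can distort the plot, while FORTRAN's change matches the 575.9% reported above.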
SQL and Java are both growing at around 537%. From Dataiku on down, the rate of growth slows steadily until we reach MLlib, which saw almost no change. Only two packages declined in job advertisements: WEKA at -29.9% and Theano at -64.1%.
This concludes my analysis of software popularity by jobs. You can read my ten other approaches to this task at https://r4stats.com/articles/popularity/. Many of them are based on older data, but I plan to update them in the first quarter of 2023, when much of the needed data becomes available. To be notified of updates like this, subscribe to this blog or follow me on Twitter: https://twitter.com/BobMuenchen.
The post Data Science Software Popularity Update first appeared on r4stats.com.