Amazing as it may sound, the term Data Science, as it is understood today, does not have a very long history. In fact, “data scientist” became a bona fide job title only in 2008. The credit for coining the term goes to Jeff Hammerbacher at Facebook and DJ Patil at LinkedIn, who used it to capture the emerging need for interdisciplinary skills across analytics, engineering, and product. Today, even as the demand for data scientists has blossomed and more and more people have taken up the data science profession, the term still raises a lot of questions in the minds of aspirants and employers alike.
In this article, I’ve summarized 10 of the most frequently asked questions about data science from the web and social media, and have tried to find the most accurate possible answers to them:
1. What exactly do Data Scientists do?
A data scientist is one who combines knowledge of coding and applications, modelling, statistics, analytics and math to uncover insights in data. In simpler terms, the work involves building hypotheses, defining data requirements and extracting data, running it through an analytics platform, and then creating visualizations of the data with recommendations for existing business issues. A data scientist, thus, knows enough programming to pull relevant data from a multitude of sources, and enough statistics and advanced mathematical modelling to derive insights, discover patterns and communicate findings. The following diagram explains how Data Science is placed at the intersection of these inter-related disciplines:
A data scientist has been described as “part analyst, part artist.” According to Anjul Bhambhri, vice president of big data products at IBM, a data scientist “is somebody who is inquisitive, who can stare at data and spot trends.”
A data scientist is different from a data analyst, who uses languages like SQL to summarize data and communicate those findings to upper management. A data scientist is expected to build or deploy models, which is what makes data science a science.
The following infographic, originally published in the WSJ, explains the typical activities of a data scientist.
2. What are the key skills of a Data Scientist?
Some of the important skills that one needs to become a successful data scientist are listed below (source: Udacity). While some of these skills are prerequisites that you must have to take up Data Science as a career, others can be acquired along the way.
Data Science Tools: You need good hands-on familiarity with a statistical programming language, like R or Python, and a database querying language like SQL.
Basic Statistics: Data Scientists are expected to have at least a basic understanding of statistics, such as statistical tests, distributions, maximum likelihood estimators, etc.
Machine Learning: You should have or acquire some level of familiarity with machine learning methods like k-nearest neighbors, random forests, ensemble methods, etc.
Multivariable Calculus and Linear Algebra: Multivariable calculus and linear algebra form the basis of a lot of data science techniques, so a basic comfort with them is a good skill to have, even though there are plenty of out-of-the-box implementations in sklearn or R that can do the job for you.
Data Munging: Oftentimes the data one is working with is going to be messy, so it’s really important to know how to deal with imperfections in data.
Data Visualization & Communication: Visualizing and communicating data is a key part of the data science job, and hence one needs to be familiar with data visualization tools like ggplot and d3.js.
Software Engineering: This is not a deal breaker in a large organisation, where the team may include software engineers who can do the job for you. However, it is nevertheless an advantage to know software engineering, as it will allow you to move fast and save money for the company in the case of smaller organisations.
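To make the data munging and basic statistics skills a little more concrete, here is a minimal sketch in Python using only the standard library. The readings in raw_values and the helper name clean are made up for illustration; a real project would more likely lean on a library such as Pandas.

```python
import statistics

# Hypothetical raw readings with typical imperfections:
# stray whitespace, empty strings, placeholder text and missing values.
raw_values = [" 12.5", "13.1", "", "n/a", "11.9 ", None, "12.7"]


def clean(values):
    """Return the entries that parse as floats, skipping everything else."""
    cleaned = []
    for value in values:
        if value is None:
            continue  # missing value
        try:
            cleaned.append(float(value.strip()))
        except ValueError:
            # Not a number (empty string, "n/a", ...): drop it.
            continue
    return cleaned


readings = clean(raw_values)

# Basic descriptive statistics on the cleaned data.
summary = {
    "count": len(readings),
    "mean": statistics.mean(readings),
    "stdev": statistics.stdev(readings),
}
print(summary)
```

Dropping unparseable entries is only one possible policy; depending on the problem, imputing a value or flagging the record may be a better choice.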
3. What is the difference between Data Analysis and Data Science?
A data scientist is different from a data analyst, who uses languages like SQL to summarize data and communicate those findings to upper management. Data Analysis answers questions like: What happened? When? Who? How many? It includes reporting (KPIs, metrics), automated monitoring/alerting (thresholds), dashboards, scorecards, OLAP (cubes, slice & dice, drilling), ad hoc queries, etc.
Data Science, on the other hand, answers questions like: Why did it happen? Will it happen again? What will happen if we change x? What else does the data tell us that we never thought to ask? Data Science includes statistical/quantitative analysis, data mining, predictive modeling and multivariate testing.
As we can see, the big difference is between using data to create charts and graphs and actually combining and transforming data. Data science is predictive, while business intelligence oftentimes employs backward-looking data. A data scientist is expected to build or deploy models, which is what makes data science a science.
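The contrast between the two kinds of questions can be sketched in a few lines of Python. The sales figures are invented for illustration, and the straight-line fit is just one simple stand-in for a predictive model, not a recommendation of a particular technique.

```python
# Hypothetical monthly sales figures: (month number, units sold).
sales = [(1, 100.0), (2, 110.0), (3, 120.0), (4, 130.0)]

# Data analysis: summarize what already happened.
total_units = sum(units for _, units in sales)
best_month = max(sales, key=lambda row: row[1])[0]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Data science: build a model and use it to estimate what will happen next.
slope, intercept = fit_line(sales)
forecast_month_5 = slope * 5 + intercept

print("Total units so far:", total_units)
print("Best month so far:", best_month)
print("Forecast for month 5:", forecast_month_5)
```

The first two results only describe the past, while the fitted line is a (very small) model that can be applied to a month that has not happened yet.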
4. What tools do Data Scientists use?
Data Science activities are performed by a team consisting of different individuals who are capable of using different tools. No one person is capable of performing every part of the work.
For Data Analysis, one needs a statistical programming language like R or a more general-purpose programming language like Python, which has libraries like NumPy, SciPy, Pandas and Scikit-learn. For database querying, one needs a language like SQL. There are also up-and-coming languages like Julia and Scala, which are alternatives to R and Python.
For Data Warehousing, MySQL works for reasonable data sizes, and tools such as Pig, Hive, Impala, Shark and Redshift are used for data of humongous sizes. Apache Spark is the latest big data technology, and it is being adopted by an increasing number of organizations for its ease of use and speed compared to Hadoop.
For Data Visualization, the popular tools that are used include rCharts, D3.js, Matplotlib for ad-hoc Python plotting and ggplot2 for R.
For Machine Learning, people are using scikit-learn (Python), Weka, Spark (MLlib), KNIME, RapidMiner, etc.
Apart from these, proprietary tools like SAS, SPSS and Statistica are still leaders in the enterprise analytics software segment.
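As a small illustration of how a general-purpose language and SQL work together, the sketch below runs a summary query from Python. It uses the in-memory SQLite database that ships with Python as a stand-in for the databases named above, and the orders table and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for a real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0), ("south", 20.0)],
)

# Summarize the data with SQL: order count and total amount per region.
rows = conn.execute(
    "SELECT region, COUNT(*), SUM(amount) "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()
conn.close()

for region, order_count, total_amount in rows:
    print(region, order_count, total_amount)
```

The same pattern of pulling aggregated rows into Python and continuing the analysis there carries over to larger databases, although the connection library and SQL dialect will differ.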
5. How does data science differ from traditional statistical analysis?
Many people hold the view that ‘Data Science’ is nothing but statistics repackaged. However, this is not correct. Statistics is the discipline of collecting and interpreting quantitative data intelligently and rationally. Traditional statistics definitely forms a critical element of data science. However, statistics is just one of several fields that have to be leveraged as tools in order to perform Data Science. Thus, while Data Science uses statistical analysis, it is not just that; it is also algorithms, machine learning, databases, data engineering and visualization.
6. How is Data Science different from Big Data?
Data Science and Big Data have a lot in common. Big Data primarily deals with data processing and looks to collect and manage large amounts of varied data to serve large-scale web applications and vast sensor networks. Data Science, on the other hand, looks to create models that capture the underlying patterns of complex systems, and to codify those models into working applications. As the following figure (Figure 1.1) shows, Big Data is integral to the Data Science process.
So one can say that while Big Data is about Data Engineers, Data Science is about Data Scientists.
7. Which is better for data analysis: R or Python?
It’s difficult to say which one is better, as both R and Python have strong followings and are robust languages from the data science point of view. While Python finds favour with programmers, R is preferred by researchers and academics. Python could be the easier of the two to learn, while R has an unconventional syntax that can be tricky to understand. Python is a general-purpose language, while R is a specialized environment optimized for data analysis. Nevertheless, there are very cool options for Python such as Pandas, a data analysis library built just for it.
In expert opinion, one can begin with either of the two, but over time, learning both tools and using them for their respective strengths can make you a better data scientist. Versatility and flexibility are traits of any data scientist at the top of their field. 23 per cent of data scientists surveyed by DataCamp used both R and Python.
8. What kind of salaries do Data Scientists get?
Data Science is one of the better-paid jobs out there. In fact, for job postings nationwide, data scientist salaries are 113% more than average salaries for all job postings, according to Indeed.com. According to US data sourced from Indeed.com, the average data scientist today earns $123,000 a year. The Glassdoor numbers put the average salary at $118,000. According to Salar.ly, which publishes salary data of workers with H1B visas, the mean salary for people with the title “Data Scientist” listed on the site is $120,000 and the median is $112,000.
According to an estimate for India, the Associate-level salary for a Business Analytics Professional is in the range of Rs 5-7 Lacs per annum, while the Associate Consultant salary ranges between Rs 8-14 Lacs (source: Glassdoor).
9. What is the Career Outlook for Data Scientists?
According to a recent Wall Street Journal report, companies want employees who can both sift through the information and help solve business problems. As the use of analytics grows quickly, companies will need employees who understand the data. Analytics, as an industry, is set for exponential growth. With more and more data being available in digital form, the need for smarter, faster, data-based decisions is only going to increase.
According to the Harvard Business Review (October 2012 edition), the job of a data scientist is the sexiest job of the 21st century.
According to the McKinsey Global Institute (in a May 2011 report): “By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”
However, despite the hype surrounding ‘Data Science’, it has not translated into a large supply of data scientists: there is still a large gap between supply and demand. Data is exploding, and companies are gradually waking up to the fact that they need to get serious about how they use data. Demand for good data scientists will be there for at least a few years to come. According to Google Trends, data scientist is more popular than stock statistician, and is becoming competitive with software engineer.
10. Which are the world’s top companies in the Data Science space?
According to the CIO Review, the following are the top 20 companies in the data science field: Actian, Birst, BloomReach, CBIG Consulting, Cirro, Digital Reasoning, Flutura Solutions, Fractal Analytics, Hadapt, Link Analytics, Loqate, MarkLogic, Miner3D, Moser Consulting, Mozenda, NextGen Invent, OpTier, Saama Technologies, WGSigma Systems and Zementis.