Many of us want to be data scientists. A recent Harvard Business Review article by Thomas Davenport and D.J. Patil calls ‘data scientist’ the sexiest job of the 21st century. At the same time, many of us are unclear about what exactly a data scientist does and which skills make one a data scientist.
Let’s first get an idea of what exactly data scientists do. While there are many definitions of a data scientist, the one I like the most comes from a tweet by Josh Wills (@josh_wills, May 3, 2012): “Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.” A data scientist knows enough programming to pull relevant data from a multitude of sources, and enough statistics and advanced mathematical modeling to derive insights, discover patterns and communicate findings.
Data science, as the name suggests, is a science: it uses data generated by day-to-day activities to build models that explain and predict events and behaviors. Those models are usually expressed in the language of mathematics and built with computers. Being a data scientist is therefore about using the tools of mathematics, scientific inquiry and computing to solve today’s biggest challenges by finding and interpreting patterns hidden in huge piles of data.
Data Scientist vs Data Analyst
Although these two titles are often used interchangeably nowadays for more or less similar roles, there are a few differences in the kind of work each does. A data analyst is not expected to build or deploy models, and building and deploying models is what makes data science a science. A data analyst typically uses a language like SQL to summarize data and communicates those findings to upper management (or to data scientists). The work is more about summarizing existing data.
Data science is about bringing the tools of science, whether from statistics or advanced machine learning, to that data so that models can be built and deployed in real-world applications. This requires an understanding of machine learning, big data architecture, advanced statistics and software development, and above all the ability to test hypotheses as a scientist does.
Coming to the specific skills needed to be a data scientist, here are a few:
Tools of Data Science: Some command of a statistical programming language like R, or of a more general-purpose language like Python (which has data analysis libraries such as NumPy, SciPy, Pandas and Scikit-learn), together with a database query language like SQL, is the first prerequisite for landing a data scientist interview. Although considered more platforms than languages, Pig and Hive in the Hadoop stack are still used quite frequently to lay out data pipelines and set up data flows. You should also know enough basic SQL to wrangle data out of the legacy RDBMS tools that many organizations still use. There are some interesting languages like Julia and Scala, but their package ecosystems are much less mature than those of R and Python.
Apache Spark is a newer big data technology that an increasing number of organizations are adopting for its ease of use and its speed compared to Hadoop.
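To make the SQL-plus-Python combination concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The in-memory `orders` table and its rows are made up for illustration and simply stand in for whatever RDBMS your organization uses.

```python
import sqlite3

# Hypothetical in-memory table standing in for a legacy RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("bob", 7.5), ("carol", 40.0)],
)

# Summarize spend per customer in SQL, then pull the result into Python.
rows = conn.execute(
    "SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()

for customer, n_orders, total in rows:
    print(f"{customer}: {n_orders} orders, total {total:.2f}")
```

The aggregation happens in the database and only the small summary comes back to Python, which is usually the pattern you want when the underlying table is large.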
Comfort with Statistics: Comfort with statistics is another basic requirement. Statistical techniques are used to explore the distributions that generate the data feeding your models, and to validate those models. You must understand the assumptions your models make about the underlying data and how far the results can be extrapolated to more general scenarios. When learning an algorithm, break the process down into steps: understanding the assumptions, building the model, interpreting the parameters, analyzing the model’s performance and finally, most importantly, validating your assumptions. All of the above should sound exciting to you, not daunting.
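As a rough illustration of those steps, the sketch below fits a straight line by ordinary least squares, reads off the parameters, and then looks at the residuals and R-squared as a first check on how well the linear assumption holds. The data points and the `fit_line` helper are invented for this example.

```python
from statistics import mean

# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Model building and parameter interpretation.
slope, intercept = fit_line(x, y)

# Performance and assumption checks: residuals should scatter around zero
# with no obvious pattern, and R-squared shows how much variance is explained.
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((yi - mean(y)) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.4f}")
print("residuals:", [round(r, 3) for r in residuals])
```

A high R-squared alone does not validate the model; if the residuals showed a curve or a funnel shape, the linear assumption would be in doubt even with a good fit.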
Skills with Machine Learning: In many cases a simple linear regression from statistics gives satisfactory results, but for better performance and for dealing with huge data sets, knowledge of more advanced machine learning techniques is required. Data science often means dealing with huge amounts of data, and this requires familiarity with machine learning methods like k-nearest neighbors, SVMs, and random forests and other ensemble methods. The good news is that you do not need expert-level command; understanding when each technique is appropriate may be enough to help you sail through.
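To give a feel for one of these methods, here is a bare-bones k-nearest neighbors classifier written from scratch. It is a teaching sketch, not how you would do it in practice (a library such as Scikit-learn provides a tuned implementation); the `knn_predict` function and the toy points are assumptions made for this example.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort labeled points by Euclidean distance to the query and keep the k closest.
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy training set: two well-separated clusters labeled "a" and "b".
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

Knowing when this is appropriate matters more than the code: k-nearest neighbors is simple and needs no training step, but it scans every stored point at prediction time and is sensitive to feature scaling.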
Calculus and Algebra: Multivariable calculus and linear algebra form the basis of a lot of machine learning techniques. Understanding these concepts is most important at companies where the product is defined by the data and small improvements in predictive performance or algorithm optimization can lead to huge wins for the company.
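One place the calculus shows up directly is gradient descent, where the partial derivatives of a loss function tell you which way to adjust the parameters. The sketch below minimizes mean squared error for a line y = w*x + b; the function name, learning rate, step count and data are all choices made for this illustration.

```python
def gradient_descent(x, y, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by following the gradient of the mean squared error."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        errors = [w * xi + b - yi for xi, yi in zip(x, y)]
        # Partial derivatives of MSE = (1/n) * sum((w*x + b - y)^2).
        grad_w = 2 / n * sum(e * xi for e, xi in zip(errors, x))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient to reduce the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1, so the fit should recover w=2, b=1.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2 * xi + 1 for xi in x]

w, b = gradient_descent(x, y)
print(f"w={w:.4f}, b={b:.4f}")
```

The learning rate is where the linear algebra comes in: if it is too large relative to the curvature of the loss, the updates overshoot and diverge instead of converging.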
Data Munging: Often the data is messy and difficult to work with. A data scientist must be good at dealing with imperfections in data. Some examples of data imperfections include missing values, inconsistent string formatting (e.g., ‘New York’ versus ‘new york’ versus ‘ny’), and inconsistent date formatting (‘2014-01-01’ vs. ‘01/01/2014’, unix time vs. timestamps, etc.). It’s an important skill for everyone who works with data to have.
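A small sketch of what this cleanup can look like in plain Python follows. The alias table, the helper names and the assumption that slash-separated dates are month/day/year are all illustrative; real data will need its own rules.

```python
from datetime import datetime

# Hypothetical lookup of known spellings; extend it for your own data.
CITY_ALIASES = {"ny": "New York", "new york": "New York"}
# Assumes slash-separated dates are US-style month/day/year.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def clean_city(raw):
    """Normalize a city string, returning None for missing values."""
    if raw is None or not raw.strip():
        return None
    key = raw.strip().lower()
    return CITY_ALIASES.get(key, raw.strip().title())


def clean_date(raw):
    """Parse a date in any known format into ISO form, or None if unparseable."""
    if not raw:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    return None


for city in ["New York", "new york", "ny", ""]:
    print(repr(city), "->", clean_city(city))
for date in ["2014-01-01", "01/01/2014", "n/a"]:
    print(repr(date), "->", clean_date(date))
```

Returning None for values that cannot be cleaned keeps the missing data visible, so you can decide later whether to drop or impute it instead of silently guessing.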
Data Visualization & Communication: Visualization is used both to explore trends in the data and to communicate findings to decision makers. It can be immensely helpful to be familiar with data visualization tools like ggplot, d3.js and Tableau. It is important to be familiar not just with the tools for visualizing data, but also with the principles behind visually encoding data and communicating information.
Software Engineering: A software engineering background is important, because you need to develop software in order to send models into the real world, where they can collect more data and take automated actions.
Thinking Like A Data Scientist: Finally, companies want to see that you’re a (data-driven) problem solver. This will show in all your activities: how you interact with engineers and product managers, which methods you use, when you opt for approximations, and so on.
To be a data scientist means using the above skills to do great science on interesting data. This is the responsibility of anyone looking to get into the field.
So are you now ready for a career as a data scientist?
My advice is to look beyond the hype surrounding data science. Data science is a demanding career where you are responsible for doing good science, managing client expectations, understanding technical and theoretical aspects deeply, and knowing how to be productive and work with many other disciplines. If you love going deep and being technical, and are also good at communicating those technical aspects in layman’s terms to real-world businesses, then you may give it a shot.
You can get started by solving real problems. In data science this means downloading a public dataset and trying to find something in it or build a data product from it. Go build an app that shows crime predictions in real time and publish it online. When you do, you will have to work with R or Python. You will have to write some SQL. You will have to develop software. You will have to learn how to integrate a model with that software. You will have to learn which models to use and which ones not to use. You will have to go to online communities like Stack Overflow and Cross Validated and ask questions to better understand what to use or why something doesn’t work. You will have to read articles on models, math and stats just to get your app to do something worthwhile. You will have to validate and improve your models. You will fail miserably again and again, but that experience is worth its weight in gold. You will come out with a great deal of new knowledge about how to make data science work in the real world.
Forget the courses and countless books. Use the internet and go solve problems.
If you want to impress a company and get hired, enter public competitions like those on Kaggle and compete in data mining and machine learning contests. That will show you know what you’re doing. And don’t worry about first place; just place among the top 10% to show you can compete with the best of them.