Friday, March 25, 2016

The most important skills for a Data Scientist

This post was written as homework for the course
"Statistical Thinking for Data Science and Analytics",
offered by Columbia University,
which I'm attending through the edX platform.

Question:
What are the most important skills for a Data Scientist?

In a short list:

· Statistics & Probability
· IT
    o Databases
    o Big Data
    o Cloud Computing
    o Graph Design
    o Data Mining
    o Machine Learning
· Insight
· Ability to team with people specialized in different kinds of activities

Statistics and Probability

A Data Scientist should know enough Statistics to tell what is meaningful in the collected data points and their correlations from what is just “noise”.
To start with, a Data Scientist needs to be able to assess, in a collected dataset, the “margin of error” (“sample error”), a possible bias (“sample bias”) and, last but not least, what is missing from the sample, since “N = All” is almost certainly not true.
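As a small illustration of the “margin of error” idea, here is a minimal sketch that assumes a simple random sample and a yes/no question, using the usual normal approximation for a proportion. The function name `margin_of_error` and the example numbers (52% of 1,000 respondents) are my own assumptions for illustration and do not come from the course.

```python
import math


def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
    """Approximate margin of error for a sample proportion.

    Uses the normal approximation; z = 1.96 corresponds to roughly
    95% confidence. Assumes a simple random sample of size n.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be between 0 and 1")
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


# Hypothetical survey: 52% of 1,000 respondents answered "yes".
moe = margin_of_error(0.52, 1000)
print(f"52% +/- {moe * 100:.1f} percentage points")
```

Note that this number only covers random sampling error. It says nothing about sample bias or about who is missing from the sample, and those can matter more than the margin of error itself.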

IT

“Big data analysis” will require advances in statistical methods, and advances in IT as well.  But just as new statistical methods will be built on existing statistical knowledge, the (relatively) new tools for dealing with (mostly Big) Data from many sources, such as the “Web Exhaust” data stream, genomic projects, astrophysics, and so on, will be better understood and used when we take into account the existing database management tools.
A Data Scientist should master relational databases and NoSQL databases as well as data warehouses.  They also need to be able to present data in forms that let them understand the data, find patterns and form hypotheses, and communicate their findings to non-specialized audiences.
And, to be able to use these tools on massive datasets, they will need to be familiar with Cloud Computing, including IaaS and SaaS offerings such as those available on AWS.

Insight

A Data Scientist needs to provide the insight, the inference that the “black-box algorithms” of Machine Learning sometimes fail to achieve: to separate the “signal” from the “noise” among a huge number of possible correlations, and to point out which of these “correlations” reflect “causality”.
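To see why a huge number of possible correlations is dangerous, here is a small simulation sketch. It is only an illustration under my own assumptions: the data is pure random noise, and the names `pearson` and `count_spurious`, the sample sizes and the 0.36 threshold (roughly the 5% significance cutoff for 30 points) are chosen for the example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_points=30, n_features=1000, threshold=0.36, seed=0):
    """Count random features that look correlated with a random target.

    Every series is independent noise, so each "hit" is a spurious
    correlation by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    hits = 0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_points)]
        if abs(pearson(target, feature)) > threshold:
            hits += 1
    return hits


print(count_spurious(), "of 1000 pure-noise features pass the threshold")
```

With these settings, roughly 5% of the noise features should be expected to pass the threshold by chance alone. None of them carries any signal, which is exactly where insight and domain knowledge have to step in.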

Ability to Team

Since Data Science is essentially multidisciplinary, a Data Scientist needs to be able to work in a team, learn quickly and independently, understand the domain of the project they are working on, and communicate efficiently.


