This post was written as homework for the course
"Statistical Thinking for Data Science and Analytics",
which I'm taking from Columbia University
through the edX platform.
Question:
What are the most important skills for a Data Scientist?
In a short list:
· Statistics & Probability
· IT
    o Databases
    o Big Data
    o Cloud Computing
    o Graph Design
    o Data Mining
    o Machine Learning
· Insight
· Ability to team with people specialized in different kinds of activities
Statistics and Probability
A Data Scientist must know enough Statistics to tell
what is meaningful in the collected data points and their correlations from
what is just “noise”.
To start with, a Data Scientist needs to be able to assess,
for a collected dataset, the “margin of error” (“sampling error”), any possible bias
(“sample bias”) and, last but not least, what is missing from the sample, since “N
= All” is almost certainly not true.
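As a rough illustration of the “margin of error” idea, here is a minimal Python sketch for a sample proportion. It assumes a simple random sample and the usual normal approximation; the function name `margin_of_error` and the default `z = 1.96` (roughly a 95% confidence level) are my own choices for this example, not something taken from the course.

```python
from math import sqrt


def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample and the normal approximation;
    z = 1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be between 0 and 1")
    # Standard error of a proportion, scaled by the z value.
    return z * sqrt(p_hat * (1.0 - p_hat) / n)


if __name__ == "__main__":
    # Hypothetical survey: 1000 respondents, 50% answered "yes".
    print(f"{margin_of_error(0.5, 1000):.3f}")
```

Note that this number only describes sampling error. It says nothing about sample bias or about what is missing from the sample, which have to be assessed separately.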
IT
“Big data analysis” will require advances in statistical
methods, and it will require advances in IT as well. But just as new statistical methods
must be built on existing statistical knowledge, the (relatively) new tools
for dealing with (mostly Big) Data from many sources, such as the “Web Exhaust”
data stream, genomic projects, astrophysics, and so on, will be better
understood and used when we take into account the existing database management
tools.
A Data Scientist must master relational databases and NoSQL
databases, as well as Data warehouses. He also needs to be able to present data
in forms that allow him to understand it, find patterns and form
hypotheses, and to communicate his findings to non-specialist audiences.
And, to be able to use these tools on massive datasets, he will
need to be familiar with Cloud Computing, including IaaS and SaaS offerings such as
those available on AWS.
Insight
A Data Scientist needs to provide the insight, the inference,
which the “black-box algorithms” of Machine Learning sometimes fail to
achieve. He must be able to separate the “signal”
from the “noise” among a huge number of possible correlations, and point out
which of these “correlations” actually reflect “causality”.
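To make the “huge number of possible correlations” point concrete, here is a small Python sketch, using only the standard library, that generates columns of pure random noise and reports the strongest pairwise correlation among them. The function names, the sizes (50 variables, 20 observations) and the seed are arbitrary assumptions chosen for illustration.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def strongest_spurious_correlation(n_vars=50, n_obs=20, seed=0):
    """Largest |r| among all pairs of independent random columns."""
    rng = random.Random(seed)
    # Every column is independent Gaussian noise: there is no real signal.
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    return max(abs(pearson(a, b)) for a, b in combinations(columns, 2))


if __name__ == "__main__":
    print(f"strongest |r| among noise columns: "
          f"{strongest_spurious_correlation():.2f}")
```

With 50 columns there are 1,225 pairs to compare, so some pair will usually look strongly correlated by chance alone. No algorithm can tell from the number itself that the relationship is meaningless; that judgment is where insight and domain knowledge come in.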
Ability to Team
Since Data Science is essentially multidisciplinary, a Data
Scientist needs to be able to team up with others, learn quickly and on his own, understand
the domain of the project he is working on, and communicate efficiently.