How do I learn more about data science? What can I do to start analyzing data? Where do I get started with machine learning?
I constantly hear these questions, and other variations. The people who ask them, and their backgrounds, are just as varied as the questions themselves. From recent college grads who want to branch out, to marketers looking to be more quantitative, to startup founders wanting to develop algorithms, it seems like everyone these days is interested in data. And why shouldn't they be? Drawing inferences from data can be semi-magical, even before you get to machine learning.
I've struggled a bit over the years to answer these questions. After learning a little more about the person's background and interests, I used to say "start with these videos on Khan Academy, then read this book, then watch these videos, then try solving these problems on your own." Eyes would quickly glaze over -- I'd lost the opportunity to help someone out, and had potentially pushed them a little further from getting into the field. I evolved this to "well, Kaggle has some great competitions -- just try them out." This worked really well -- for about 5% of people. Even something as seemingly simple as installing Python and scikit-learn can become a huge barrier to entry (although Anaconda is helping here). Not to mention that the baseline for starting even the simplest Kaggle competition is "understands programming pretty well, knows some basic stats, and can munge data reasonably well." After working at edX, I also tried other variations, like "take this MOOC, then this MOOC" -- this gave people a clear starting point and structure, but MOOCs are often very theory-driven, and many people can't muster the motivation to stick with them when they aren't directly applicable to their goals.
Barriers to learning data science
Over time, it became clear to me that the barriers were:
- "Data science" is becoming an aspiration unto itself for a wide range of people. Many people who don't know how to code want to learn data science. The barrier here is that many data science courses assume a very good knowledge of programming, and often a good knowledge of math such as linear algebra.
- A wide range of topics are covered under the "data science" umbrella, like natural language processing, map/reduce, machine learning, time series analysis, and many others. A lot of these topics are interdependent -- for example, to learn machine learning (enough to tune parameters reasonably and know what's going on), you have to learn statistics, linear algebra, and programming in addition to the machine learning algorithms themselves. Yet each of these topics is often presented independently, and the links between them aren't surfaced well. People who want to learn machine learning are either daunted when confronted with 10 different courses to take, or confused when they try to skip all of the building blocks and directly apply algorithms.
- "Bite-sized learning" -- not everyone who wants to learn about data science wants to become a data scientist. Maybe they want to do some text analysis at work. Or maybe they want to build a side project with a data component. Or maybe they just want to learn the basics. There isn't really a resource for this group of people (which is much larger than the group that does want to become data scientists). Pointing someone in this camp to a data science course is tough, because data science courses require several hours a week, and are often geared towards people who want to become professional data scientists.
- Terminology! Sometimes I feel that 75% of the battle with learning about topics like machine learning or statistics is the density of terminology. Often, when you "get" the concept, you find that it's simple and elegant. Learning terminology is important, because it lets you communicate with people in the field, but it's not necessary on the first pass, and it turns a lot of people off.
- Fixed schedules. There's been a lot of progress among MOOC providers, particularly Udacity, in focusing more on application than theory. But we're still stuck with the concept of a "course" that runs on a fixed schedule and doesn't allow exploration outside the fixed track. This can become a huge barrier in and of itself, as people drop out when they can't rearrange their lives to meet the schedule.
- And, last but not least, cost. A data science degree, such as the master's offered by UC Berkeley, costs 60,000 dollars. And that's relatively cheap, as far as degrees go. Even the newer "bootcamps" are often 10k+ for a few months. And online courses can run in the neighborhood of 2,000 dollars.
For a couple of years, I've thought about making a tool that would enable people to learn data science online more effectively. In November of 2014, I found myself with free time, and I started hacking up a site. I called it "LearnDS" to start with, and showed it to a few friends after a week. I got some good feedback (the site wasn't so good back then), and spent some time iterating on it and making it better. Eventually, Dataquest came to be. It doesn't have all of the content we'd like to have up yet, but it's coming along.
Dataquest addresses the issues with learning data science in these ways:
- Every "mission" (lesson) centers around a dataset, so you're always directly analyzing data.
- Learning is broken into topic areas, with links between each.
- You're welcome to jump around lessons however you want, although there is an ordering.
- Very few videos. 95% of the time, you'll be writing code to solve problems.
- Start from the basics -- no programming knowledge is assumed (although you can skip the intro sections if you know Python).
- Focus on the whole picture, not just algorithms -- for example, you need to know about Unicode and character encodings to do text analysis, so Unicode is taught.
- We constantly improve the site based on learner feedback.
Tens of thousands of people have taken the first steps to learning data science on Dataquest so far. Why not be the next one?