At Dataquest, we strive to help our users get a better sense of how data science works in industry as part of the data science educational process. We've started a series where we interview experienced data scientists. We highlight their stories, advice they have for budding data scientists, and the kinds of problems they've worked on. The first post in this series is our interview with Trey Causey.
Trey Causey is a data scientist who has worked in fields ranging from ecommerce to sports. After a brief stint in academia, he moved to industry as a data scientist at Zulily, an ecommerce startup in Seattle. He then worked on the data science teams at Facebook and Dato, a maker of tools for data scientists. In his ample free time, he blogs about data science on his personal website and has turned his hobby of football analytics into a consulting position for an NFL team.
What is your definition of data science?
There’s been a lot of media coverage around certain aspects like the size of data or around specific magical algorithms, but we’d love to know your thoughts.
Trey: Great question, and certainly one without an easy answer. Here’s one attempt: data science is the use of statistics, machine learning, software engineering, and domain knowledge to a) extract information from data, b) make predictions using that information, and c) put those predictions to use in some applied setting. That’s a wordy answer! And we can quibble about how to assign the weights to each of those components. Or, we could just link to Drew Conway’s famous Data Science Venn Diagram. I don’t think data science only involves big data or fancy algorithms like deep neural networks. But I do think that statistics and machine learning are necessary components.
You have a strong quantitative background from your time in academia. What mistakes do you see people who lack your rigorous quantitative training often making?
Trey: Full disclosure, I have an uncompleted PhD from UW. I am what is known as ABD. I think people without solid training in some kind of quantitative discipline often get caught up in tooling and programming language debates like, should you learn R or Python? What package do I need to use random forests? And so on. Lacking some solid understanding of the algorithms you’re applying can be dangerous, because you don’t know when things aren’t working the way they’re supposed to. I also see a lot of people making basic correlation/causation errors, whether they intend to or not, because they don’t have the healthy skepticism of one’s own models that often comes through years of education.
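To make the correlation/causation point concrete, here is a minimal sketch of how a hidden common cause can produce a strong correlation between two variables that do not affect each other at all. Everything in it is an illustrative assumption rather than something from the interview: the variable names (`temperature`, `ice_cream`, `sunburns`), the noise levels, and the simulated data are all made up for the example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a least-squares linear fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


# Hypothetical simulated data: temperature drives both outcomes,
# but neither outcome has any effect on the other.
rng = random.Random(0)
temperature = [rng.gauss(0, 1) for _ in range(5000)]
ice_cream = [t + rng.gauss(0, 0.5) for t in temperature]
sunburns = [t + rng.gauss(0, 0.5) for t in temperature]

# Naive view: the two outcomes look strongly related.
raw = pearson(ice_cream, sunburns)

# After accounting for the common cause, the relationship largely disappears.
adjusted = pearson(
    residuals(ice_cream, temperature),
    residuals(sunburns, temperature),
)

print(f"raw correlation:      {raw:.2f}")
print(f"adjusted correlation: {adjusted:.2f}")
```

The raw correlation comes out strongly positive, while the correlation after adjusting for temperature sits close to zero. In real data the confounder is rarely measured this cleanly, which is where the skepticism Trey describes comes in.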
What’s the best way to fill in knowledge gaps on the quantitative side?
Trey: We’re obviously in an unprecedented period with respect to access to information, and motivated autodidacts have a wealth of options for filling in these gaps. That being said, I don’t think there’s a good substitute for taking a class that has a lot of focused homework and requires careful attendance and attention. I recommend everyone take at least one class in statistical inference – and I mean take a class, not just enroll in a MOOC. I’m skeptical that the evaluation methods that enable thousands of people to take MOOCs are conducive to the kinds of work needed to really learn and retain complicated material. I’m happy to be proven wrong, though!
What was your experience in academia like and what made you transition into industry?
Trey: I transitioned from academia to industry partially out of necessity – I was in grad school when the job market for assistant professors imploded along with the rest of the economy – and partially because data science was starting to be a viable option for people like me who wanted to continue to do statistics and machine learning but didn’t necessarily want to write papers that no one will read and that take years to get published. I just didn’t have the patience for it.
What was the transition from academia to Zulily like?
Trey: I found the transition to be pretty easy – I was fortunate enough to have a fantastic manager at Zulily, Mike Errecart, who really understood the potential for data science to improve the company, understood what was and wasn’t possible, and was really interested in helping me grow as a data scientist.
You’ve worked in ecommerce, consumer internet, and sports. What did you like and dislike about each industry’s attitude towards data?
Trey: Obviously ecommerce and social media companies like Zulily and Facebook have fully embraced data, not only as a way to make better decisions but also as a way to improve existing products and create new ones. As for sports, I think that there’s a lot of interesting new data becoming available that no one’s using. Many in sports, especially the NFL, are conservative, favor “gut-based” decision-making, and haven’t been convinced that data can help them win games without turning the game into a rote, robotic exercise.
What other kinds of roles have you seen in industry that you feel very few people have the skills for?
At Dataquest, we’re pretty big skeptics of the data scientist unicorn motif and think that specialization in data professional roles will happen and is important. We already have Data Analyst and Data Scientist tracks but want to add more.
Trey: Good question – data engineers are extremely valuable to any organization using data, and I don’t see a lot of training for those roles. I agree with many others that data engineers should often be hired before data scientists at startups in order to build scalable and robust infrastructure that lends itself to the kinds of work that data scientists are better at.
I don’t know if there’s much demand for it, or if online training is the answer, but I also see a real dearth of material on leading data science teams or working at the director or executive level as a data scientist. We’re just now getting to the point where data scientists have risen to senior leadership positions in larger companies, but I don’t know yet that we have a good idea what a VP of Data or a Director of Data Science looks like as a role.
Besides filling in skills gaps and doing lots of side projects, what are other ways junior data professionals can land their first job?
Trey: I always recommend networking as much as possible. Go to meetups, attend talks, and try to meet people who work in the kinds of roles you want to see yourself in. Cold-emailing people for coffee is a strategy that a lot of people recommend, but I get a lot of these emails and can tell you that it gets a little onerous. I end up feeling like a jerk because I have to ignore them or say no most of the time.
Being visible is important – if that means side projects, that’s great. It could also mean presenting your side project at a meetup or conference. Don’t expect to just throw some code up on GitHub and to start watching the recruiter emails roll in.
How has your experience consulting for the NFL affected your experience watching football?
Trey: Probably less than you might imagine. I’ve always been a football fan and am just like a normal person – I get excited at big plays, yell about stupid mistakes and decisions, and generally get caught up in the drama of a good game. I still get very excited about football season. One small irony is that working on multiple football-related projects (both with an NFL team and helping to build the New York Times’ Fourth Down Bot) often means that I am working during games and attending to how various models that I’ve built are performing or if anything needs an emergency bug fix.
What new data science technologies and techniques are you the most excited about?
Trey: I try not to get too wrapped up in hype about specific technologies or techniques. Spark has been interesting to watch, as there is a tremendous amount of excitement surrounding it, but I think that many people find it is somewhat difficult to wrangle when applied to real projects.
I experimented some with transfer learning while I was at Dato, using pre-trained deep neural networks on new data unrelated to the training data for things like image similarity. I was very surprised about how well this worked across multiple domains.
I think I’m most excited about boring things like the continuing improvement of tools like Scikit-learn and Pandas, which let me do my work while “getting out of the way.” I’ll take a well-designed and stable API over a flashy new tool nearly any day of the week.
If you enjoyed this interview or want to learn more about data science, head over to Dataquest and join our community. We'll be interviewing data scientists on a regular basis and will be announcing new interviews through our newsletter. You can also request an invite to our community chat.