April 23, 2018

What Does it Mean to Be a Senior Data Scientist?

This post is partly for myself and is based on conversations with various people. It is also inspired by On Being a Senior Engineer. I’m trying to answer questions like ‘what do we expect from a Senior Data Scientist?’.

My job title is ‘Senior Data Scientist’ and I often joke I’ve no idea what that means. 🙂

I’m going to adopt rather similar views to the above post on Engineers, since I feel there is a lot of overlap. I’m going to focus on the ‘maturity’ of the Analyst/Engineer/Researcher, since I feel titles can be very misleading, and I’ve frankly seen ‘Senior’ Data Scientists behave rather immaturely.

I present some ideas below; they aren’t in any particular order.

1. Senior Data Scientists understand that Software/Machine Learning has a lifecycle, and so spend a lot of time thinking about it.

Technical debt, maintainability, systems design, design docs, etc. These all come down to asking questions like:

“What could I be missing?”

“How will this not work?”

“Will you please shoot as many holes as possible into my thinking on this?”

“Even if it’s technically sound, is it understandable enough for the rest of the organization to operate, troubleshoot, and extend it?”

2. Senior Data Scientists understand that ‘data’ always has flaws. These flaws can come from the data-generating process or from biases in the data.

I once did a technical interview with a Senior Data Scientist as a candidate – and I was a bit flummoxed at the question at the end which was ‘what if the data is wrong?’. It’s a valid question and one we should think about.

Often the populations we end up observing aren’t randomly sampled, and we need to think a bit about how to manage this. I find anecdotally that Junior Data Scientists often assume their data is a random sample.
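One common way to manage a non-random sample is to reweight it. The following is only a minimal sketch of post-stratification, not something from my own projects: the group names, the numbers and both function names are made up for illustration, and it assumes you know each group’s true share of the population and that the sample is representative within each group.

```python
def naive_mean(sample):
    """Plain average over (group, value) pairs, ignoring how the sample was drawn."""
    return sum(value for _, value in sample) / len(sample)


def poststratified_mean(sample, population_shares):
    """Average each group separately, then weight by its known population share.

    `population_shares` maps group -> share of the real population and is an
    assumption you have to source from outside the sample itself.
    """
    by_group = {}
    for group, value in sample:
        by_group.setdefault(group, []).append(value)

    # A group we never observed cannot be reweighted, so fail loudly.
    missing = set(population_shares) - set(by_group)
    if missing:
        raise ValueError(f"no observations for groups: {sorted(missing)}")

    return sum(
        population_shares[group] * (sum(values) / len(values))
        for group, values in by_group.items()
    )


# Hypothetical data: 'app' customers are over-represented (8 of 10 rows),
# although they are assumed to be only half of the real customer base.
sample = [("app", 100.0)] * 8 + [("branch", 200.0)] * 2
shares = {"app": 0.5, "branch": 0.5}

print(naive_mean(sample))                   # 120.0
print(poststratified_mean(sample, shares))  # 150.0
```

The gap between the two numbers is the point: the naive average quietly answers a question about the sample, not about the population.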

3. Senior Data Scientists understand the ‘soft’ side of technical decision making.

Increasingly I see tool choices being made and wonder about the ‘feeling’ aspect of them. Someone might say, for example, that ‘static languages are best’ or ‘we should use pytest not unittest’. Increasingly, this comes down to ‘taste’, ‘feelings’ or ‘philosophy’, and those are perfectly reasonable things. For example, I love the pytest functional syntax, but I know other engineers like other tools, and that’s ok.

The other thing is that sometimes people have bad experiences with tools from particular vendors, or in particular ecosystems. If you, for example, worked at a company that wrote software in Zorg, found it incredibly difficult to deploy, and the project was a complete failure, then you’d have an emotional response to Zorg if it’s brought up in a company meeting. Engineers and Data Scientists are often obsessed with the rational, but our feelings about architectures and software matter. If we ignore them, we’ll never get the buy-in we need. I’ve not finished reading it, but a book that’s been recommended by a few senior Technologists who I respect is Words that Work.

A corollary to this is that we can produce Machine Learning models that don’t get used.

4. Senior Data Scientists focus on impact and value.

If a deep learning model doesn’t get into production because of lack of trust – you’ve failed. It’s not about satisfying your intellectual curiosity, or your need for ‘Resume Driven Development’. It’s important to think about buy-in and your time to value. As Erik Bernhardsson tweeted:

“Think most of my value of knowing machine learning these days is gained from telling people why ML won’t solve their problem”

This is terribly important. Sometimes a simple rules engine will do, and sometimes just a SQL query. Using the right tool for the job matters. This is complicated though: there’s often no single ‘best’ solution, and all solutions have trade-offs.

Often you can make things simple with data. A question I like asking Data Scientists is ‘when did you decide not to use ML?’ For example, a few years ago I saved thousands of dollars at $OLD_EMPLOYER by building a data pipeline for some analytics. Part of the analysis pipeline involved matching text for inventory management. Inventory names would be similar, so it seemed natural to use fuzzy-matching or something similar. It turned out this algorithm was too slow and impractical. It also turned out that, by monetary value, there were only 100 inventory items that needed matching, so I simply encoded the most common misspellings/abbreviations in a dictionary.

This was tremendously valuable, and a much more robust solution than using Machine Learning. Sometimes automation is what you need to do 🙂 Sometimes counting and sometimes Machine Learning.
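To make the dictionary idea concrete, here is a minimal sketch of what such a lookup might look like. It is not the original code: the item names, the misspellings and both function names are invented for illustration.

```python
# Hypothetical lookup table: known misspellings/abbreviations -> canonical name.
# In practice this would hold the handful of high-value items worth matching.
CANONICAL_NAMES = {
    "widgit": "widget",
    "wdgt": "widget",
    "sprokcet": "sprocket",
    "sprkt": "sprocket",
}


def normalise_item_name(raw_name: str) -> str:
    """Lower-case, collapse whitespace, then map known variants to one name."""
    key = " ".join(raw_name.lower().split())
    return CANONICAL_NAMES.get(key, key)


def unmatched_names(raw_names, known_items):
    """Return normalised names that still match no known item, for manual review."""
    normalised = {normalise_item_name(name) for name in raw_names}
    return sorted(normalised - set(known_items))


print(normalise_item_name("  WDGT "))  # widget
print(unmatched_names(["Widgit", "gizmo", "sprkt"], {"widget", "sprocket"}))  # ['gizmo']
```

A plain dictionary lookup like this is constant-time per name and easy for a colleague to read and extend, which is a large part of why it can beat a cleverer approach.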

5. Senior Data Scientists care about ethics.

Recently in the Data Science and Tech communities we’ve seen the need for discussion of ethics. There’s been some interesting literature on this from the Academic communities that is worth reading, and I’ll not wade too much into these debates.

However, as a Senior Data Scientist working in the regulated world of Financial Services, I’ve grown to appreciate that it’s my job to have a working knowledge of GDPR. It’s something we regularly bring up when we discuss the viability of projects, and it’s a ‘risk factor’. It would be immature to just ignore this, and frankly unethical and unprofessional.

At the very least Senior Data Scientists should read some of the codes of ethics in Data Science and have views on them. Ideally you should have your own code of ethics, and maybe enforce it on yourself. Certainly you should take it into account in your risk planning, in what data you get access to, and in how you integrate security. This can unfortunately add to time frames, but doing things ‘right’, both in terms of customer trust and in terms of good compliance, often takes longer. As we’ve seen with the Theranos affair, ‘move fast and break things’ isn’t always the best motto.

Acknowledgements: Thanks to Eoin Hurrell and Bertil Hatt who helped with fleshing out these ideas. I’m grateful also to conversations with friends and colleagues such as Eddie Bell, Mick Cooney, Mick Crawford, Ian Ozsvald, Dat Nguyen and Vlasios Vasileiou. I learn from most people I speak to, so sorry if I’ve forgotten anyone. Finally also thanks to Audrey Somnard, who has constantly reminded me that ‘algorithms do what they want’ isn’t a sufficient ethical explanation and I should think more about these issues.

Editor's note: This was originally posted on Models are illuminating and wrong, and has been reposted with permission. Author Peadar Coyle is a Senior Data Scientist at Zopa.

About the author

Peadar Coyle

Peadar is driven by challenging problems at the intersection of machine learning, data engineering, statistics and software engineering and building data-driven agile teams to solve them.