All too often, organizations struggle to integrate data science into digital products because of a persistent gap between the skill set most data scientists have and what it takes to build software in 2020.
In this blog post, I'll discuss why this gap exists and a couple of ways to overcome it.
To better understand the gap between the skills data scientists hold and the skills they need to deploy software, let's start by answering a simple question: who are data scientists in 2020?
Data scientists most often come out of
All these sources create data scientists with at least a basic understanding of statistics and, oftentimes, an awareness of which methods suit which common problems. However, what none of them does consistently is teach either software skills or business understanding. When it comes to software skills, there might be a class on object-oriented programming, but usually nothing on software engineering or cloud platforms. And almost none of these practitioners will have taken coursework in industry business models, marketing, or operations.
Now, what does it take to deploy software in 2020?
The mismatch should be obvious: most data scientists are applied statisticians with better branding. Their modeling capabilities and business understanding vary. But there's almost universally a gulf between what data scientists know about software and what it takes to build it and deploy it properly.
Before we go on to how to solve for this mismatch, we should talk about its repercussions. What happens when data scientists don't know how to build software, and organizations don't take the often large amounts of time and energy necessary to train them on how to do it?
Usually, they'll be thrown into a software engineering or data engineering team, where they'll struggle for a long, long time. They'll have trouble writing code outside of notebooks, writing code that meets code coverage requirements and passes the other tests set up in CI/CD, and getting code that works locally to work reliably on a cloud platform.
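To make that first hurdle concrete, here's a minimal sketch (my own illustration, not code from any particular team) of what moving logic out of a notebook cell can look like: a small function with no hidden notebook state, plus tests that a CI job could run. The function name `fill_missing_with_mean` and the test names are hypothetical, and I'm assuming a pytest-style runner that collects `test_*` functions.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values.

    Hypothetical helper: in a notebook this would typically be a few
    inline lines operating on a global variable.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        # Fail loudly instead of silently returning garbage.
        raise ValueError("need at least one observed value")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def test_fill_missing_with_mean():
    # The mean of the observed values 1.0 and 3.0 is 2.0.
    assert fill_missing_with_mean([1.0, None, 3.0]) == [1.0, 2.0, 3.0]


def test_all_missing_raises():
    # The edge case is covered too, which is what coverage checks reward.
    try:
        fill_missing_with_mean([None, None])
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

None of this is hard, but it's a different habit from notebook work: inputs and outputs are explicit, edge cases are handled on purpose, and the tests travel with the code.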
A few things can happen then:
In short, you either don't put the model into production, wait a very long time to put the model into production, or put the model into production (with risks of getting it wrong) at double the price.
Now let's talk about how to deal with this mismatch and mitigate everything I was just warning about. Because almost no data scientists come in with the full skill set needed to deploy software, you realistically have two options for acquiring data scientists who can contribute to digital products:
We'll talk about each of these, and then about how to make a decision between them. In short, there are a lot of tradeoffs.
Let's start with hire and train. In order to hire and train data scientists you need to have
On the hiring side, you can include software engineering and cloud platform skills in the requirements section of the job description, and ask questions about software engineering and your company's cloud platform during candidate interviews. Because every engineering organization's systems and tools are different, you shouldn't expect exact matches, but general knowledge is often good enough to cut the time to pushing to prod by maybe half, compared to a similar data scientist without knowledge of software engineering or cloud platforms.
Then there's training. You'll need more documentation on your version control, your unit testing tool, your CI/CD tool, and your cloud platform's major services. All this in addition to specific documentation on your data systems and data model. Furthermore, data engineers and senior data scientists will need to spend time with new hires in order to get them up to speed, which is more difficult in a distributed organization (all of us right now) than when the whole team sits together.
The good news is that careful hiring and training can cut the time it takes a new hire to push reliably to prod to a month or two.
The other option is to contract with outside talent, either by supplementing your team with contractors (usually on an hourly basis) or by bringing in an experienced team from an external service provider (usually on a time-and-materials or fixed-cost basis).
This option usually requires either having an existing relationship with a firm that you trust, or taking the time to vet the consultants, agree on specs and a work plan, and settle on pricing. It takes a lot less practitioner time, but also a lot more executive time.
So how do you pick between hiring and training or contracting out? There are a few considerations, including the maturity of your data engineering and data science organizations, your tolerance for variability in time or cost, and your enterprise's resource availability.
Let's go through each of these decision points one by one:
In the end, while hire and train vs. contract for a data science project or initiative is an important and often difficult decision, either one is nearly guaranteed to produce a better outcome than leaving your new, expensive data scientists to go it alone. Starting to think about the best way to build up your data science capability is the first step to getting data science built into your digital products.
Tags: strategy, data science