Source – techtarget.com
Enterprises are adopting data science pipelines for artificial intelligence, machine learning and plain old statistics. A data science pipeline — a sequence of actions for processing data — will help companies be more competitive in a digital, fast-moving economy.
Before CIOs take this approach, however, it’s important to consider some of the key differences between data science development workflows and traditional application development workflows.
Data science development pipelines used for building predictive and data science models are inherently experimental and don’t always pan out in the same way as other software development processes, such as Agile and DevOps. Because data science models break and lose accuracy in different ways than traditional IT apps do, a data science pipeline needs to be scrutinized to ensure the model reflects what the business is hoping to achieve.
At the recent Rev Data Science Leaders Summit in San Francisco, leading experts explored some of these important distinctions, and elaborated on ways that IT leaders can responsibly implement a data science pipeline. Most significantly, data science development pipelines need accountability, transparency and auditability. In addition, CIOs need to implement mechanisms for addressing the degradation of a model over time, or “model drift.” Having the right teams in place in the data science pipeline is also critical: Data science generalists work best in the early stages, while specialists add value to more mature data science processes.
Data science at Moody’s
“As soon as a new model is built, it is at its peak performance, and over time, they get worse,” Grotta said. Declining model performance can have significant impacts. For example, in the finance industry, a model that doesn’t accurately predict mortgage default rates puts a bank in jeopardy.
Watch out for assumptions
Grotta said it is important to keep in mind that data science models are created by and represent the assumptions of the data scientists behind them. Before the 2008 financial crisis, a firm approached Grotta with a new model for predicting the value of mortgage-backed derivatives, he said. When he asked what would happen if the prices of houses went down, the firm responded that the model predicted the market would be fine. But it didn’t have any data to support this. Mistakes like these cost the economy almost $14 trillion by some estimates.
The first line of defense is to encourage the data modelers to be honest about what they do and don’t know and to be clear on the questions they are being asked to solve. “It is not an easy thing for people to do,” Grotta said.
A second line of defense is verification and validation. Model verification involves checking to see that someone implemented the model correctly, and whether mistakes were made while coding it. Model validation, in contrast, is an independent challenge process to help a person developing a model to identify what assumptions went into the data. Ultimately, Grotta said, the only way to know if the modeler’s assumptions are accurate or not is to wait for the future.
A third line of defense is an internal audit or governance process. This involves making the results of these models explainable to front-line business managers. Grotta said he recently worked with a bank that protested that its managers would not use a model if they didn’t understand what was driving its results. But the managers were right to take that position, he said. Having a governance process and ensuring information flows up and down the organization is extremely important, Grotta said.
Baking in accountability
Models degrade or “drift” over time, which is part of the reason organizations need to streamline their model development processes. It can take years to craft a new model. “By that time, you might have to go back and rebuild it,” Grotta said. Critical models must be revalidated every year.
To address this challenge, CIOs should think about creating a data science pipeline with an auditable, repeatable and transparent process. This promises to allow organizations to bring the same kind of iterative agility to model development that Agile and DevOps have brought to software development.
Transparent means that people upstream and downstream of the model understand what drives its results. Repeatable means that someone else can repeat the process used to create it. Auditable means that there is a program in place to manage the process, take in new information, and get the model through monitoring. There are varying levels of this kind of agility today, but Grotta believes it is important for organizations to make it easy to update data science models in order to stay competitive.
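As a rough illustration of what a repeatable and auditable record could look like in practice, the sketch below stores a fingerprint of the training data, the parameters, the random seed and the named model drivers for each training run. It is not drawn from Grotta's remarks or from any particular platform; the names ModelRunRecord, fingerprint, make_record and to_audit_line are hypothetical, and it assumes training rows can be serialized as JSON.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ModelRunRecord:
    """Hypothetical audit entry describing one training run."""
    model_name: str
    data_sha256: str   # fingerprint of the exact training data used
    params: dict       # settings needed to repeat the run
    random_seed: int   # seed needed to repeat the run
    drivers: list      # inputs the model relies on, for transparency


def fingerprint(rows):
    """Return a SHA-256 digest of the training rows, in the order given."""
    digest = hashlib.sha256()
    for row in rows:
        # sort_keys makes the digest independent of dict key order
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def make_record(model_name, rows, params, random_seed, drivers):
    """Build an audit entry for a run over the given rows."""
    return ModelRunRecord(
        model_name=model_name,
        data_sha256=fingerprint(rows),
        params=dict(params),
        random_seed=random_seed,
        drivers=list(drivers),
    )


def to_audit_line(record):
    """Serialize the entry as one JSON line suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

With a record like this, a reviewer can check whether two runs used the same data and settings by comparing their fingerprints, and a business manager can see which drivers the model depends on without reading the code.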
How to keep up with model drift
Nick Elprin, CEO and co-founder of Domino Data Lab, a data science platform vendor, agreed that model drift is a problem that must be addressed head on when building a data science development pipeline. In some cases, the drift might be due to changes in the environment, like changing customer preferences or behavior. In other cases, drift could be caused by more adversarial factors. For example, criminals might adopt new strategies for defeating a new fraud detection model.
With traditional software monitoring, IT service management needs to track metrics related to CPU, network and memory usage. With data science, CIOs also need to capture metrics related to the accuracy of model results. “Software for [data science] production models needs to look at the output they are getting from those models, and if drift has occurred, that should raise an alarm to retrain it,” Elprin said.
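The kind of monitoring Elprin describes could be sketched as follows: compare a rolling accuracy figure against the accuracy measured at validation time, and raise a flag when the gap exceeds a tolerance. This is a minimal illustration rather than Domino's implementation. The class name AccuracyDriftMonitor and the baseline_accuracy, tolerance and window parameters are assumptions, and the sketch presumes that the true outcome for each prediction eventually becomes available.

```python
from collections import deque


class AccuracyDriftMonitor:
    """Hypothetical monitor that flags drift when rolling accuracy drops."""

    def __init__(self, baseline_accuracy, tolerance=0.05, window=100):
        self.baseline_accuracy = baseline_accuracy  # accuracy at validation time
        self.tolerance = tolerance                  # allowed drop before alarming
        self.window = window                        # number of recent outcomes kept
        self.outcomes = deque(maxlen=window)        # True where prediction was right

    def record(self, predicted, actual):
        """Store whether one production prediction matched the real outcome."""
        self.outcomes.append(predicted == actual)

    def rolling_accuracy(self):
        """Accuracy over the stored outcomes, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drift_detected(self):
        """True once a full window falls below baseline minus tolerance."""
        if len(self.outcomes) < self.window:
            return False  # not enough evidence yet
        return self.rolling_accuracy() < self.baseline_accuracy - self.tolerance


if __name__ == "__main__":
    monitor = AccuracyDriftMonitor(baseline_accuracy=0.90, tolerance=0.05, window=10)
    # Simulated outcomes: the model predicts 1 every time, reality alternates.
    for actual in [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]:
        monitor.record(predicted=1, actual=actual)
    if monitor.drift_detected():
        print("Drift alarm: schedule the model for retraining")
```

Waiting for a full window before alarming trades speed for fewer false alarms. A real deployment would likely tune the window and tolerance to how quickly true outcomes arrive and how costly a missed alarm is.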
Fashion-forward data science
At Stitch Fix, a personal shopping service, the company’s data science pipeline allows it to sell clothes online at full price. Using data science in various ways lets the company find new ways to add value against deep-discount giants like Amazon, said Eric Colson, chief algorithms officer at Stitch Fix.
This kind of digital innovation, however, was only possible, he said, because the company created an efficient data science pipeline. He added that it was also critical that the data science team is considered a top-level department at Stitch Fix and reports directly to the CEO.
Specialists or generalists?
One important consideration for CIOs in constructing the data science development pipeline is whether to recruit data science specialists or generalists. Specialists are good at optimizing one step in a complex data science pipeline. Generalists can execute all the different tasks in a data science pipeline. In the early stages of a data science initiative, generalists can adapt to changes in the workflow more easily, Colson said.
Some of these different tasks include feature engineering, model training, extract, transform and load (ETL) processing, API integration, and application development. It is tempting to staff each of these tasks with specialists to improve individual performance. “This may be true of assembly lines, but with data science, you don’t know what you are building, and you need to iterate,” Colson said. The process of iteration requires fluidity, and if the different roles are staffed with different people, there will be longer wait times when a change is made.
In the beginning at least, companies will benefit more from generalists. But once data science processes have been established for a few years, specialists may be more efficient.
Align data science with business
Today a lot of data science models are built in silos that are disconnected from normal business operations, Domino’s Elprin said. To make data science effective, it must be integrated into existing business processes. This comes from aligning data science projects with business initiatives. This might involve things like reducing the cost of fraudulent claims or improving customer engagement.
In less effective organizations, management tends to start with the data the company has collected and wonder what a data science team can do with it. In more effective organizations, data science is driven by business objectives.
“Getting to digital transformation requires top down buy-in to say this is important,” Elprin said. “The most successful organizations find ways to get quick wins to get political capital. Instead of twelve-month projects, quick wins will demonstrate value, and get more concrete engagement.”