This post summarizes the article by Signifyd COO and co-founder Michael Liberty in insideBIGDATA.
One of the unfortunate, unanticipated side effects of the growing popularity of the title Data Scientist is that we’ve developed a misguided idea of how machine learning should be applied to the problems we’re trying to solve. Consequently, we’ve built teams heavy on scientists, churning out insights on an artisan-like scale, when what the next phase of applied machine learning requires is engineers building learning pipelines that produce insights at industrial scale.
The “Productivity Paradox”
Robert Solow quipped in 1987, “You can see the computer age everywhere but in the productivity statistics.” Since then, much has been written about the Productivity Paradox in information technology. The field is too broad to say conclusively whether the Paradox is real, but within the sub-field of machine learning, it is real for most firms.
Consider that the output of machine learning is technically a model, but functionally that model is an inexpensive, productive, perfectly rational, and singularly focused knowledge worker. This worker’s job is a very specialized task: make a product recommendation, tell you which web page will answer your question, or identify fraud. You don’t need to tell the worker precisely how to perform the task; if the objective is clear and the relevant data is provided, it will make the optimal decision.
Given our tremendous quantities of data, access to cheap computing power and advanced algorithms for training these workers, we should increasingly be ceding decision-making power to machines. Yet, for all the advancements in information technology over the past 50 years, humans still maintain tight control on judgement, recommendations and decisions. We’re wary of a manufacturing line that produces knowledge workers. Computers are just supposed to bring us data while people are supposed to bring us understanding.
The role of the Data Scientist was born of this mindset and most Data Scientists are performing a job formerly known as analyst with a little more Python, a lot more data and less domain expertise. As the title implies, they do fundamentally experimental work. It’s artisanal production, not industrial.
Feeding the Machine
To be clear, Data Scientists are not to be blamed for what they do. We have embraced the notion that the rise of machine learning needs people who are hybrid engineers/analysts/statisticians: jacks-of-all-trades exploring data with ad-hoc scripts. While all of these skills are necessary for the machine learning revolution, the way forward is specialization and automation. In fact, within the next decade we’ll see the work currently performed by Data Scientists routinized and automated to bring machine learning to its full potential. Moreover, the companies that win with machine learning will treat it as an engineering-heavy, almost manufacturing-like function.
Features are the inputs to machine learning algorithms, the raw materials of the manufacturing line. Bad inputs mean bad models, just as bad training means bad workers. Feature design is a creative, empathetic process. Doing it well depends more on domain expertise than data science skills, meaning business users often do it best. You must ask questions like, “How do I represent that fraudsters like to use expedited shipping?”
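As a rough illustration of how a question like the expedited-shipping one might become features, here is a minimal Python sketch. Everything in it is an assumption made for the example rather than anything described in the article: the `Order` record, its field names, the set of shipping labels and the two feature functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Order:
    """Hypothetical order record; the field names are illustrative assumptions."""
    shipping_method: str
    order_total: float
    shipping_cost: float


# Assumed labels for fast shipping options; a real catalog would differ.
EXPEDITED_METHODS = {"overnight", "next_day", "two_day"}


def is_expedited(order: Order) -> float:
    """Binary feature: 1.0 if the order uses an expedited method, else 0.0."""
    return 1.0 if order.shipping_method.lower() in EXPEDITED_METHODS else 0.0


def shipping_cost_ratio(order: Order) -> float:
    """Share of the order total spent on shipping (0.0 when the total is not positive)."""
    if order.order_total <= 0:
        return 0.0
    return order.shipping_cost / order.order_total


def featurize(order: Order) -> Dict[str, float]:
    """Collect the candidate features for one order into a name -> value mapping."""
    return {
        "is_expedited": is_expedited(order),
        "shipping_cost_ratio": shipping_cost_ratio(order),
    }
```

The point of the sketch is that each feature is a small, named function a domain expert could propose or review; whether either one actually helps identify fraud is something only the data can answer.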
In a manufacturing plant, throughput is limited by bottlenecks. Machine learning pipelines should be bottlenecked by feature design, since nearly everything else can be automated. Improving throughput means relieving bottlenecks. Although the production of models isn’t visualized the same way a physical assembly line is, the concept is fundamentally the same. In this regard the Data Scientist role is flawed as it is a bottleneck. We attempt to combine, in one person, domain, statistics and engineering expertise, when what we should be doing is engineering an automated, statistically rigorous machine learning pipeline that is only limited by the pace at which domain experts can feed it new feature ideas.
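One way to picture a pipeline limited only by new feature ideas is a registry that domain experts add to while everything downstream runs unchanged. The Python below is a sketch under that assumption; the `feature` decorator, `build_matrix`, `run_pipeline` and the record fields are hypothetical names invented for the example, and the training step is left as any callable the engineering team supplies.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
FeatureFn = Callable[[Record], float]

# Registry that domain experts extend; the downstream steps never change.
FEATURES: Dict[str, FeatureFn] = {}


def feature(name: str) -> Callable[[FeatureFn], FeatureFn]:
    """Decorator that registers a feature function under `name`."""
    def register(fn: FeatureFn) -> FeatureFn:
        FEATURES[name] = fn
        return fn
    return register


@feature("uses_expedited_shipping")
def uses_expedited_shipping(order: Record) -> float:
    # Hypothetical field and label, chosen only for illustration.
    return 1.0 if order.get("shipping_method") == "overnight" else 0.0


@feature("order_total")
def order_total(order: Record) -> float:
    return float(order.get("order_total", 0.0))


def build_matrix(orders: List[Record]) -> List[List[float]]:
    """Apply every registered feature to every record, columns sorted by name."""
    names = sorted(FEATURES)
    return [[FEATURES[n](o) for n in names] for o in orders]


def run_pipeline(orders: List[Record], labels: List[int],
                 train: Callable[[List[List[float]], List[int]], Any]) -> Any:
    """Automated step: featurize, then hand off to any `train(X, y)` callable."""
    return train(build_matrix(orders), labels)
```

Adding a new idea means writing one more decorated function; the matrix construction and the training call pick it up without further changes, which is the sense in which feature design becomes the only manual step.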
Industries and Companies Making Progress
Certain companies applying machine learning, by virtue of their massive quantities of data and uniquely quantifiable outcomes, have been forced to take an engineering-centric approach to it. AdTech is a notable area. Giants like Google and startups like Drawbridge and AppNexus track consumer browsing, shopping and buying habits across multiple devices with defined click-paths from point A to point D, capturing data from each step along the way. Human behavior and competitive bidding can change so quickly that these companies have no choice but to build highly engineered machine learning pipelines.
Outside of AdTech, progress has been slower but is now picking up. A defining characteristic of machine learning’s proliferation at this point seems to be that it is happening via focused services rather than generic platforms. Take, for example, Legalist, a new startup out of Y Combinator that provides third-party litigation financing, betting on cases that are likely to yield a profit. Legalist applies machine learning algorithms to calculate the likelihood of a particular lawsuit winning based on factors such as length of trial, judge caseload and past rulings. One could argue that law firms could have done this kind of analysis by hiring teams of data scientists, but they would be much less effective than a company that is engineering machine learning into its core systems.
Similarly, Chorus.ai uses machine learning to predict what works in sales calls, automatically analyzing the content of these calls with Natural Language Processing. Again, a large sales organization could theoretically employ data scientists to analyze its calls, but it will never be competitive with a model manufacturer working across potentially thousands of companies.
The Change Ahead
Machine learning has been around for many years, but only recently have we seen practical advances in massive, digital, data-driven applications. These advances have largely come from companies taking a distinctly industrial and engineering-centric approach to machine learning. In the coming years these capabilities will reach all areas of knowledge work.
While many companies doing machine learning will employ Data Scientists, there needs to be a rebalancing. Data Science work needs to come out of the lab and onto the factory floor, a process that will be driven by engineers. For Data Scientists, this means repurposing themselves as engineers, product managers or domain experts, and becoming masters of particular functions rather than jacks-of-all-trades.
Founded on August 28, 2011, insideBIGDATA is a news outlet that distills news, strategies, products and services in the world of Big Data for data scientists as well as IT and business professionals. Our editorial focus is big data, data science, AI, machine learning, and deep learning. insideBIGDATA is written and edited by big data professionals with the help of readers and occasional guest contributors.