Source: forbes.com
It’s been more than eight years since Facebook’s first data scientist told a Bloomberg reporter that “the best minds of my generation are thinking about how to make people click ads.”
By the time the data scientist, Jeff Hammerbacher, delivered that infamous criticism in 2011, he was already a few years removed from the data team at Facebook, where he had pored over enormous social data sets from 2006 to 2008. He was recruited by Facebook at 23 years old, fresh out of Harvard University and riding an early wave of bright data scientists who brought complex data tools and techniques out of academia and into the private sector.
Recently, TechRepublic reported that Facebook now employs more than 1,200 data workers (defined to include data scientists, data engineers, data architects, database administrators, machine learning experts, big data engineers and artificial intelligence specialists). Microsoft and Amazon each employ more than Facebook, while IBM tops the list at more than 2,500 data workers.
This growth has been part of a tumultuous decade for data. Tremendous democratizing advances like Amazon Web Services, along with a shift from centralized IT toward department-owned data, have been accompanied by considerable setbacks, many with broad societal implications. We have seen large-scale hacks of financial records, growing political polarization thanks to algorithmic echo chambers and coordinated interference in a U.S. presidential election. Data has been at the center of it all.
However, I believe our technological innovations are trending toward societal good — tools and services that make our communities safer, smarter and, yes, more efficient. After Facebook, Hammerbacher quickly channeled his data science prowess into cancer research, aiming to analyze large biological data sets and enable better treatments.
He’s not alone. In 2016, a 17-year-old high school student founded a startup that employs data science to help people identify dangerous breast tumors using their mobile devices. Data science tools are more accessible now than ever before. But even these benevolent endeavors are hampered by some of the same challenges that data brings to enterprises of all sizes.
It used to be that programmers wrote code to define rules. But now data scientists use vast data sets to train computers to recognize patterns. Many of the organizations I work with process thousands of documents each month, and the data scientists tasked with automating and expediting these processes through machine learning and artificial intelligence (AI) are burdened with extremely tedious and time-consuming tasks.
A machine learning-powered document extraction and classification system can take three-and-a-half months to develop. Nearly 75% of that time is spent transforming OCR data into a training data set, which requires data cleansing and exploration as well as feature extraction. Another month is spent training, testing and finalizing the model with different parameters and iterations. In the near future, the model-building process itself will also be improved through machine learning. Data scientists will build one model that adapts itself in minutes to serve a wide variety of business structures and enterprise functions.
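To make the preprocessing burden concrete, here is a minimal sketch of the stage described above: cleaning noisy OCR output and extracting simple bag-of-words features for a document classifier. The function names and the tiny two-document corpus are illustrative, not drawn from any particular product.

```python
import re
from collections import Counter

def clean_ocr_text(raw: str) -> str:
    """Normalize noisy OCR output: lowercase, strip punctuation and
    stray symbols, collapse runs of whitespace."""
    text = raw.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def extract_features(text: str) -> Counter:
    """Bag-of-words term counts: the simplest feature representation
    a document classifier can train on."""
    return Counter(clean_ocr_text(text).split())

# Illustrative OCR snippets standing in for thousands of real documents.
docs = {
    "invoice": "INV0ICE #123  Total due: $450.00",
    "receipt": "Receipt -- thank you for your purchase!",
}
training_set = {label: extract_features(raw) for label, raw in docs.items()}
print(training_set["invoice"])
```

In a real pipeline this cleaning step is repeated and refined across thousands of documents per month, which is exactly where most of the development time goes.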
But as the past decade has illustrated, our quest for greater efficiency may come with certain setbacks. Young people around the world have grown up in a society where privacy is increasingly fragile. Millennials and Gen Z kids have seen far more cyber warfare in their feeds than on-the-ground warfare via traditional news. And, as a result, they are more protective of their data. This is smart of them, of course. But it will inevitably make life harder for data scientists working in business-to-consumer (B2C) organizations. Indeed, privacy legislation like the General Data Protection Regulation and California Consumer Privacy Act is already putting a pinch on the enterprise data that's available to inform new machine learning and AI tools.
In turn, data scientists have come up with a workaround: fake data. We can train machines to analyze enormous sets of real data and create mock data sets that appear real enough to perform intended business functions. This is another step in data science's bold march forward. But like many steps before it over the past decade, fake data sets also bring potentially harmful side effects. For example, the data can appear so realistic that it could be used to open a fraudulent credit card if it fell into the wrong hands.
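The core idea can be sketched in a few lines: fit simple per-field statistics on a "real" data set, then sample mock records that mirror those distributions. The field names and records below are made up for illustration, and production systems use far richer generative models than this.

```python
import random
import statistics

# Tiny stand-in for a real customer data set.
real_records = [
    {"age": 34, "plan": "basic"},
    {"age": 41, "plan": "premium"},
    {"age": 29, "plan": "basic"},
    {"age": 52, "plan": "premium"},
]

def fit(records):
    """Learn simple per-field statistics from the real data."""
    ages = [r["age"] for r in records]
    return {
        "age_mean": statistics.mean(ages),
        "age_stdev": statistics.stdev(ages),
        # Sampling from the observed values preserves category frequencies.
        "plans": [r["plan"] for r in records],
    }

def sample_synthetic(model, n, seed=0):
    """Generate n mock records that follow the learned distributions."""
    rng = random.Random(seed)
    return [
        {
            "age": max(18, round(rng.gauss(model["age_mean"], model["age_stdev"]))),
            "plan": rng.choice(model["plans"]),
        }
        for _ in range(n)
    ]

model = fit(real_records)
fake = sample_synthetic(model, 3)
print(fake)
```

The fake records are statistically plausible but tied to no real person, which is precisely what makes them useful for training and, in the wrong hands, risky.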
So, data scientists, and all of us who work to innovate our industries, are charged with ensuring our latest innovations are safe for the businesses, communities and people they serve. This is not a new responsibility, but it is one we must continue to put at the forefront of our business practices. We know this. Young professionals and even young adults demand this.
Hammerbacher, a millennial himself, was correct that many of the top technical minds his age helped create the most targeted advertising ecosystem the world has ever seen. But as time passes, leaders from that same generation — and their successors — are now poised to do much more, helping to create a world that uses data to improve medicine, infrastructure, government and the environment. With data scientist now the best job in America for four years running, it’s safe to say we’ll see an influx of younger data scientists who are eager to apply data science in new, transformative ways.
In fact, it won’t be long before we have our first millennial president. I don’t see any reason why that president couldn’t be a data scientist.