Data professionals have always needed a good measure of flexibility. They need even more of it today in industries where big data infrastructure is driving equally big changes in business practices.
This is sometimes called digital disruption, and it is shaping how data engineering evolves, whether the data pro works for a disruptor or for a company that could be disrupted.
To get the disruptor's view, just ask Tarush Aggarwal, director of data engineering at WeWork Companies Inc. The startup builds out and rents temporary workspaces to other startups, a business model that could threaten conventional commercial real estate practices. Think of it as a kind of Uber: rather than sharing a ride in someone's car, you share office space, co-working with others as needed and on a somewhat open-ended basis.
Nimble big data infrastructure
If much of WeWork's effort goes into nimbly building out infrastructure, so does Aggarwal's. His work, though, centers on data infrastructure and on what is needed today to inform those working to grow the company.
“Our focus is on what the business is doing. A data science team can live six months in the future, but a data engineering team has to live right now,” he said. In the age of web-borne big data, living in the now means handling a lot of quickly arriving data.
For data ingestion, Aggarwal said, the emphasis falls more on extract, load and transform (ELT) than on traditional extract, transform and load (ETL) processes.
“The advantage that ELT gives you is it allows you to separate your ingestion from your transformation. That allows you to automate it completely,” he said.
Also, Aggarwal added, separation of ingestion from transformation means WeWork can apply different data transformations later on, should someone get a better idea of what to do with the data.
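In code terms, the pattern looks something like the minimal sketch below, which uses SQLite with its JSON1 extension as a stand-in warehouse; the table, event and function names are hypothetical, not WeWork's actual pipeline. The load step lands records untouched, so it can be fully automated, while transformations are separate, re-runnable queries over the raw table.

import json
import sqlite3

conn = sqlite3.connect("warehouse.db")

# Load step: land raw records exactly as they arrive, no business logic.
conn.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")

def ingest(records):
    # Fully automatable: no schema decisions happen at ingestion time.
    conn.executemany(
        "INSERT INTO raw_events (payload) VALUES (?)",
        [(json.dumps(r),) for r in records],
    )
    conn.commit()

# Transform step: derive a shaped table from the raw data on demand.
def transform_daily_signups():
    # Re-runnable; swap this query out later if a better idea comes along.
    conn.execute("DROP TABLE IF EXISTS daily_signups")
    conn.execute(
        """
        CREATE TABLE daily_signups AS
        SELECT json_extract(payload, '$.date') AS day, COUNT(*) AS signups
        FROM raw_events
        WHERE json_extract(payload, '$.event') = 'signup'
        GROUP BY day
        """
    )
    conn.commit()

ingest([{"event": "signup", "date": "2017-09-01"},
        {"event": "signup", "date": "2017-09-02"}])
transform_daily_signups()

Because the raw table keeps the data in its original form, a new transformation can always be run over the full history without re-ingesting anything.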
That is a disruptor's view. Aggarwal advises data engineers to spend time looking at how data in the organization is being used, to work toward optimizing access to that data, and then to add features on an ongoing basis.
Data reliability is important, he emphasized, “but not at the cost of flexibility.”
Cord cutters call the tune
To see today’s data engineering from another point of view, you could turn to Jeffrey Pinard. As vice president for data technology and engineering for advanced advertising initiatives at NBC, he is at the center of the 91-year-old peacock network’s efforts to respond to disruption in the television advertising business. Like Aggarwal, he spoke as part of this month’s Big Data Innovation Summit in Boston.
“We need to change the way NBC approaches advertising,” said Pinard.
In pursuing that objective, NBC set out to build a portfolio of audience analytics products called Audience Studio. There is plenty of data engineering involved.
“To support this, we needed to build a foundation from scratch — an infrastructure that was going to support our needs for the future,” he said.
That meant changes, as NBC was traditionally, in Pinard's words, “an on-premises organization.” The infrastructure build-out needed to be cost-effective and to support technology changes over time, he said, and the cloud came under consideration.
Pinard and his colleagues came up with an unusual approach: a cloud data lake. While it is somewhat in the spirit of Hadoop distributed processing, it actually forgoes Hadoop. Pinard described using Amazon Web Services' Simple Storage Service (S3), Apache Spark, Apache Parquet, Mesos and containers to build an on-cloud data lake that takes in ingested data, allows for elastic processing and supports data access according to end users' permissions and job needs.
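A minimal sketch of that Hadoop-free pattern follows, assuming a Spark cluster with the s3a connector configured; the bucket, paths and column names are hypothetical, not NBC's. Spark reads Parquet straight off object storage, so storage and elastic compute scale independently.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cloud-data-lake-sketch")  # hypothetical application name
    .getOrCreate()
)

# Read columnar Parquet data directly from object storage; no HDFS layer.
impressions = spark.read.parquet("s3a://example-data-lake/impressions/")

# Elastic processing: a typical aggregation over ingested ad data.
daily = impressions.groupBy("campaign_id", "event_date").count()

# Write results back to the lake, partitioned for efficient later reads.
(daily.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3a://example-data-lake/aggregates/daily_impressions/"))

In a setup like this, access control can lean on the object store itself, with bucket policies or roles scoped to end users' permissions and job needs.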
Moreover, the ability to store vast amounts of data lets end users trace data lineage, a useful trait in meetings that too often revolve around working out how somebody arrived at a certain data point.
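One lightweight way to make lineage traceable, sketched below with entirely hypothetical names, is to persist a small record beside each derived dataset naming its inputs and the transformation that produced it, so the question of where a number came from has a stored answer.

import json
import time
from pathlib import Path

def write_lineage(output_name, input_paths, transform_name):
    # Persist a lineage record for a derived dataset: what it was built
    # from, by which transformation, and when.
    record = {
        "output": output_name,
        "inputs": input_paths,
        "transform": transform_name,
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    path = Path("lineage") / (output_name + ".json")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, indent=2))

write_lineage(
    output_name="daily_impressions",
    input_paths=["s3a://example-data-lake/impressions/"],
    transform_name="daily_impression_counts",
)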
Data is central to transformation
Threads connect the disruptors' schemes with those of the companies they might disrupt. In either case, the ultimate point of an effective big data infrastructure is to understand how people use the data it holds.