Source: cio.economictimes.indiatimes.com
Artificial Intelligence (AI) is possibly the most widely discussed technology trend. There are strong views about its applications and benefits: some feel it will bring an apocalyptic day of machines controlling human beings, while others believe it will be extremely beneficial for humans. In the near term, AI and machine learning are becoming critical to solving important business and social problems. In many cases, machine intelligence is used as a supplement to human judgement, thereby "augmenting" human intelligence.
It is commonly believed that "mathematically complex" algorithms are the biggest hurdle to leveraging the power of machine intelligence. A detailed discussion with any practitioner, however, will reveal that this is mostly not the case. Integrating data across multiple sources and resolving data quality issues is usually the most important challenge, and it remains the biggest challenge even when one only wants to derive simple rules or insights from data.
Because of these data integration and data quality challenges, most data-driven initiatives within organisations take a very long time to demonstrate tangible results, causing budget escalation and frustration among senior leadership.
Traditionally, organisations adopted a linear approach to data integration: detailed Extract, Transform and Load (ETL) processes were created to obtain data from various systems and establish relationships among the hundreds of data tables belonging to the source systems. These data warehousing initiatives typically took at least 18 to 24 months to complete. In addition, it was often difficult to incorporate unstructured data such as images, text files and streaming data from sensors.
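As a rough illustration of what one such ETL job looks like (with sqlite3 standing in for the source and warehouse systems, and table and column names assumed purely for the example), a single step might extract raw orders, transform them to the grain the warehouse schema expects, and load the result into a fact table:

# A minimal sketch of one traditional ETL step; real warehouses involve
# hundreds of such jobs and carefully mapped relationships between tables.
import sqlite3

def etl_daily_sales(source_db: str, warehouse_db: str) -> None:
    src = sqlite3.connect(source_db)
    wh = sqlite3.connect(warehouse_db)

    # Extract: pull raw order rows from the source system.
    rows = src.execute("SELECT order_date, region, amount FROM orders").fetchall()

    # Transform: aggregate to the grain the warehouse schema expects.
    totals = {}
    for order_date, region, amount in rows:
        totals[(order_date, region)] = totals.get((order_date, region), 0) + amount

    # Load: write the summarised rows into the warehouse fact table.
    wh.execute(
        "CREATE TABLE IF NOT EXISTS fact_daily_sales "
        "(order_date TEXT, region TEXT, total_amount REAL)"
    )
    wh.executemany(
        "INSERT INTO fact_daily_sales VALUES (?, ?, ?)",
        [(d, r, t) for (d, r), t in totals.items()],
    )
    wh.commit()
    src.close()
    wh.close()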
Data from the warehouse is also rarely used in front-end applications. The warehouse is a single store of the entire data, and if a front-end application (e.g. a customer-facing app or a distributor portal) needs to fetch data from that large store, the response time is usually too high to be acceptable.
An agile approach to data integration attempts to change this paradigm and bring a use-case-driven approach to data integration. The key element of this approach is the creation of a data lake. As opposed to a data warehouse, a data lake stores data in its as-is format without performing any transformation. One performs only Extraction and Loading (EL) of the data, in its as-is form, from the source systems into a single repository. The data lake also stores unstructured data like text and images. Because there is no transformation or summarisation of the source data, there is no loss of information, and the dramatic reduction in storage costs has made it feasible to store data in this raw form.
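A minimal extract-and-load sketch, assuming a sqlite3 source and CSV files in the lake purely for illustration, would simply copy each source table into the lake untouched:

# Extract-and-load (EL): every source table is landed as-is in the lake,
# with no transformation or summarisation; paths and table names are
# illustrative assumptions.
import csv
import sqlite3
from pathlib import Path

def load_table_as_is(source_db: str, table: str, lake_root: str) -> Path:
    conn = sqlite3.connect(source_db)
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [col[0] for col in cursor.description]

    target = Path(lake_root) / "raw" / f"{table}.csv"
    target.parent.mkdir(parents=True, exist_ok=True)

    with target.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)   # keep the original column names
        writer.writerows(cursor)   # rows land in the lake untouched
    conn.close()
    return target

# Unstructured content (images, scanned PDFs, sensor feeds) is copied into
# the lake as files alongside these structured extracts.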
Using the right platform, a data lake can be created very quickly (within 4 to 6 weeks). Once the data lake has been created, an agile approach can be used to combine the few tables required for a particular use case into a mini data mart. For example, one may need to identify the relationships among only a handful of tables, say 10 to 15, to create the data mart required to analyse the effectiveness of salespersons. Each data mart supports a specific insight-generation or machine learning model development need. This approach ensures that organisations see initial results very quickly.
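A sketch of such a mini data mart, assuming pandas and illustrative file and column names, might join just the few tables the salesperson-effectiveness use case needs:

# Build a mini data mart for one use case by joining only the relevant
# extracts from the lake; file and column names are assumptions.
import pandas as pd

def build_sales_effectiveness_mart(lake_root: str) -> pd.DataFrame:
    visits = pd.read_csv(f"{lake_root}/raw/sales_visits.csv")
    orders = pd.read_csv(f"{lake_root}/raw/orders.csv")
    people = pd.read_csv(f"{lake_root}/raw/salespersons.csv")

    # Join only the tables this use case needs.
    mart = (
        visits.merge(orders, on=["salesperson_id", "customer_id"], how="left")
              .merge(people, on="salesperson_id", how="left")
    )

    # Summarise to the grain the analysis needs: one row per salesperson.
    return (
        mart.groupby(["salesperson_id", "salesperson_name"], as_index=False)
            .agg(visits=("visit_id", "nunique"),
                 revenue=("order_amount", "sum"))
    )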
Each data mart contains far fewer data elements than a warehouse; hence it can serve data to front-end applications within prescribed response times. Data from the data marts and the results of machine learning models are exposed as APIs (application programming interfaces) which can be consumed by various front-end applications.
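As a simple sketch, assuming FastAPI and an illustrative, pre-built mart file, an API over a data mart could look like this:

# Expose a small, pre-built data mart as an API; the file path, column names
# and endpoint are illustrative assumptions, not a prescribed design.
import csv
from fastapi import FastAPI

app = FastAPI()

# Load the small mart once at start-up and index it by salesperson.
with open("lake/marts/sales_effectiveness.csv", newline="") as f:
    MART = {row["salesperson_id"]: row for row in csv.DictReader(f)}

@app.get("/salespersons/{salesperson_id}/summary")
def salesperson_summary(salesperson_id: str) -> dict:
    # The mart is small, so a simple lookup stays well within the
    # response-time budget of a front-end application.
    row = MART.get(salesperson_id)
    if row is None:
        return {"salesperson_id": salesperson_id, "found": False}
    return {"found": True, **row}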
A key aspect of using data from the data marts in front-end applications is the frequency at which data from the source systems is refreshed into the data lake. An ideal solution is to perform a near-real-time refresh and to capture changes where the source data is overwritten. Specific capabilities such as change data capture and the ability to read updates from database logs are critical for this purpose.
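Log-based change data capture requires database-specific tooling; as a simpler stand-in, the sketch below uses a watermark approach, assuming the source table carries an updated_at column, to pull only the rows changed since the last refresh:

# Watermark-based incremental refresh (a simpler alternative to log-based
# change data capture); table name and updated_at column are assumptions.
import sqlite3

def incremental_refresh(source_db: str, table: str, last_refresh: str) -> list[tuple]:
    """Return rows changed since `last_refresh` (an ISO timestamp string)."""
    conn = sqlite3.connect(source_db)
    changed = conn.execute(
        f"SELECT * FROM {table} WHERE updated_at > ?", (last_refresh,)
    ).fetchall()
    conn.close()
    # The caller upserts these rows into the lake, so that overwritten source
    # records are captured rather than lost.
    return changed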
The data lake approach allows different data marts to apply different types of summarisation to the same base data. Because the granular as-is data is present in the data lake, one can derive different summarisations from the same data. Within a data lake framework, it is also critical to have utilities that can convert (or make sense of) unstructured data like images and scanned PDFs and combine it with traditional structured data.
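For example, assuming pandas and an illustrative orders extract in the lake, the same granular table can feed two marts with different summarisations:

# Two different summarisations of the same granular extract; file and column
# names are illustrative assumptions.
import pandas as pd

orders = pd.read_csv("lake/raw/orders.csv")

# Mart 1: monthly revenue by region (e.g. for a management dashboard).
by_region = (
    orders.assign(month=pd.to_datetime(orders["order_date"]).dt.to_period("M"))
          .groupby(["month", "region"], as_index=False)["order_amount"].sum()
)

# Mart 2: order counts by product (e.g. for a supply-planning model).
by_product = orders.groupby("product_id", as_index=False)["order_id"].count()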
Once a data mart has been created for one use case, the next set of data marts can be created whenever needed: the agile data journey. The data lake can also use an existing data warehouse as a data source, thereby reusing existing investments in the warehouse.