Data storage, management tooling and methodology are arguably going through their most interesting period of evolution yet. This is not just a result of the explosion in the amount of data being stored, but of the variety in types and volumes of source data, of cloud technology, and of the demands and opportunities coming from all parts of the business and its partners.

Companies are scrambling to build the best culture and platform for managing their core structured data, together with a facility for the unstructured, complementary data that can help build new business models and growth.

Over the past one to two decades, many companies have implemented one or more data solutions.  The first generation were data warehouses, which serviced business intelligence through de-normalised reporting constructs.  The second generation were big data lakes, used by specialised data engineers and scientists to ingest data at massive scale.  Both of these generations have had challenges.  What is important is that companies take another look at them to determine whether they are set up for business success:

  • Is there the right business engagement and ownership of data?

  • Is the approach for data management & integration clear and fit for purpose?

  • Are the ingestion, ETL and data warehouse/lake platforms fit for purpose and all-encompassing?

  • How will business and technology manage holistic data quality?

  • How is the business empowered to access relevant, real-time analytics?

We will review some of these aspects, how the cloud impacts the management of data and analytics, and the opportunities this presents.  Today, companies should take the opportunity to validate their whole approach to data and analytics (D&A) strategy - particularly the legacy on-premises warehouses.

Differentiating Data Warehouses and Data Lakes

Before getting into the current data opportunities, we wanted to give an overview of, and contrast between, the generation 1 and 2 data solutions introduced above.  Whilst they have traditionally had different purposes, data lakes and data warehouses are continuing to evolve with the benefit of the cloud.

The data warehouse concept originated before the data lake, and warehouses have traditionally been used for relational, structured data, bringing together data from multiple core systems for analytics and business intelligence purposes.  Reporting and analytics queries have often relied on ‘de-normalisation’ to improve processing performance, and on concepts like star schemas that support rich business analytics.

The data lake came about with the large increase in data volumes and in cloud processing opportunities.  Although they can store structured data, data lakes are particularly suited to unstructured data, where data is not stored in tables and rows but in collections and documents (similar to a JSON format).  What would be stored as attributes or fields in tables is instead stored as key-value pairs.  There are no, or very few, relations between collections, and the primary key is an object ID.  The strength of this form of storage is that it is very fast to load and can scale both vertically and horizontally to enable rapid processing.  It can cover NoSQL and other formats of data, such as large objects.
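As a rough illustration of the difference (the customer record below is hypothetical, not tied to any particular product), the same information can be held as a relational row with fixed columns or as a document of key-value pairs under a generated object ID:

```python
# Hypothetical example: the same customer record as a relational row
# versus a document of key-value pairs, as a lake / NoSQL store might hold it.
import uuid

# Relational row: fixed columns whose names and types come from the table schema.
columns = ("customer_id", "name", "country")
row = (1001, "Acme Ltd", "AU")

# Document: a collection entry keyed by a generated object ID, with nested,
# optional attributes and no enforced schema or relations to other collections.
document = {
    "_id": str(uuid.uuid4()),   # object ID acts as the primary key
    "customer_id": 1001,
    "name": "Acme Ltd",
    "country": "AU",
    "interactions": [           # nested data that would need extra tables relationally
        {"channel": "web", "sentiment": "positive"},
    ],
}

print(dict(zip(columns, row)))
print(document)
```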

The following table provides a comparison of data lakes and warehouses.


|                  | Data Lake | Data Warehouse |
|------------------|-----------|----------------|
| Data Structure   | Raw | Processed |
| Purpose of data  | Interrogation, integration, ingestion and pass-through | Integration, analytics and reporting |
| Users            | Data scientists, data engineers | Data engineers, business professionals |
| Accessibility    | Easier to ingest and update unstructured data | More difficult to ingest and update data due to structural constraints, but easier to consume |

What we are seeing today is that with cloud data warehouses, the ‘lake’ is often offered as something that is part of, or very close to, the warehouse.  Data arriving through data lake ingestion and integration is used to augment a set of core structured data.  Technologies like columnar storage and massively parallel processing enable more effective querying and reduce the need for data warehouses to adopt denormalised structures to achieve faster reporting and analytics.  In fact, where NoSQL data is given some structure, highly normalised data vault structures can be used as part of this augmentation, although they should not be applied to the entire data model.
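As a simplified sketch of why columnar storage helps analytics (the data and layout here are illustrative only, not how any particular warehouse stores data internally): an aggregate query only needs to scan the column it touches, rather than reading every full row.

```python
# Illustrative only: row-oriented vs column-oriented layout of the same sales data.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 200.0},
]

# Column-oriented layout: each attribute is stored contiguously.
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120.0, 80.0, 200.0],
}

# Row layout: every full record is read even though only 'amount' is needed.
total_from_rows = sum(r["amount"] for r in rows)

# Columnar layout: only the 'amount' column is scanned, which is what lets
# columnar warehouses aggregate quickly without denormalised reporting copies.
total_from_columns = sum(columns["amount"])

assert total_from_rows == total_from_columns == 400.0
```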

The Data Driven Company

Today more and more companies have a stated objective of changing their culture to become ‘data driven’.  What does this mean?  It is where companies create built-in collaboration between business, developers, data engineers, analysts and operations across the entire data lifecycle: the right data in the right hands at the right time, through self-service.  One of the key factors is allowing IT to give large numbers of people, anywhere in the organisation, access to trusted data in a managed and governed way.

To enable companies to be data driven, they are building data management platforms for multiple data management use cases, including data discovery and profiling, catalogue and metadata management, data quality management, big data and integration.  However, rather than wholly centralising data access and controls in this platform, organisations are finding success when they have autonomous domains (aligned to business function, or perhaps to source) that host and serve their respective domain datasets in an easily consumable way.  The physical storage and data management platform can still be centralised infrastructure, but the content and ownership of the datasets remain with the domains generating them.  This may duplicate data within each domain as it is transformed into a valuable form, but that is acceptable.

Some domains naturally align with the source, where the data originates. The source domain datasets represent the originating facts and reality of the business.  They capture the data that is mapped very closely to the form of the transaction in the source operational system.

Other domains align closely with consumption. The consumer domain datasets, and the teams who own them, have an objective of satisfying use cases that relate to consumer actions - they may, for instance, be set around customer view or sentiment.  These domains exist to reshape data and are better suited to going through structural change than source domains: they transform the source domain transactions into aggregate views and structures that fit a particular access model (e.g. graphing) or use case.
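A minimal sketch of this kind of reshaping, using made-up order transactions from a hypothetical source domain and aggregating them into a per-customer consumer view:

```python
# Hypothetical: transform source-domain transactions (close to the operational
# system's shape) into an aggregated consumer-domain view keyed by customer.
from collections import defaultdict

source_transactions = [
    {"txn_id": "t1", "customer_id": "c1", "amount": 50.0, "status": "complete"},
    {"txn_id": "t2", "customer_id": "c1", "amount": 30.0, "status": "complete"},
    {"txn_id": "t3", "customer_id": "c2", "amount": 75.0, "status": "refunded"},
]

def customer_view(transactions):
    """Aggregate raw transactions into a consumer-domain dataset per customer."""
    view = defaultdict(lambda: {"order_count": 0, "total_spend": 0.0, "refunds": 0})
    for txn in transactions:
        summary = view[txn["customer_id"]]
        summary["order_count"] += 1
        summary["total_spend"] += txn["amount"]
        if txn["status"] == "refunded":
            summary["refunds"] += 1
    return dict(view)

print(customer_view(source_transactions))
```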

While ownership of the datasets is delegated from the central platform to the domains, the need for cleansing, preparing, aggregating and serving data remains, and so does the use of data pipelines. In this architecture, a data pipeline is simply an internal implementation detail of a data domain and is handled within the domain with light oversight.

Having the domains ‘vested’ in the ownership of the data from source to quality to sharing is a strong model for establishing a data driven culture.

Organisation Data Capabilities

To start the journey in becoming a data driven company, organisations need to work on 5 important data capabilities.  These capabilities are:

1. Product Centred Data Domains: For the domains to be successful, they need to build product thinking into the capabilities they provide to the rest of the organization.  The domain teams provide these capabilities as building block APIs to the rest of the developers in the organization.  Data domains need to consider data assets as their products.

2. Multi-Dimensional Data Quality: A proactive approach to data quality allows you to check and measure how clean your data is before it reaches your core systems, and to reconcile it as it moves through a data pipeline.  The three areas of data quality to work on are making it pervasive, intelligent and collaborative.

  • Pervasive: This includes getting to data at the source - ensuring teams validate, monitor and analyse data at its origin (see the sketch after this list).  It also includes data stewardship - delegating the data to the people who know it best.  Productivity is improved by matching and merging data, resolving data errors, and certifying or cleansing source and other content.
  • Intelligent: Smarter tools such as machine learning can help with data quality - using smart semantics to accelerate data discovery, data linking and quality management, and using machine learning and advanced analytics to guide users by suggesting the next best actions.
  • Collaborative: To ensure its quality and value, data has to be owned and managed collaboratively by the whole organization.
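A minimal sketch of the ‘pervasive’ idea - validating records at the origin before they enter a pipeline.  The field names and rules below are illustrative assumptions, not any specific product's API:

```python
# Illustrative data quality check applied at the source, before records
# move into a pipeline. Rules and field names are assumptions for the sketch.
def validate_record(record, required_fields=("customer_id", "email", "country")):
    """Return a list of data quality issues for a single source record."""
    issues = []
    for field in required_fields:
        if not record.get(field):
            issues.append(f"missing or empty field: {field}")
    email = record.get("email", "")
    if email and "@" not in email:
        issues.append(f"malformed email: {email!r}")
    return issues

records = [
    {"customer_id": "c1", "email": "ops@example.com", "country": "AU"},
    {"customer_id": "c2", "email": "not-an-email", "country": ""},
]

for record in records:
    problems = validate_record(record)
    # Clean records continue into the pipeline; others go back to the data steward.
    print(record["customer_id"], "OK" if not problems else problems)
```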

3. Paramount data security and privacy:  Organizations should seek to drive alignment between the legal, compliance, privacy, and enterprise data management teams to reuse existing data governance artifacts to support data compliance.  In particular, organizations should define sensitive personal data elements for data sovereignty and map these attributes to applications in the metadata repository.

4. Data management innovation with cloud based architectures: Data management platforms with roots in cloud and open source can provide access to the latest innovations faster than competitors.  Many of the governance systems being integrated are open source, and your data management platform should include numerous connectors out of the box.  To scale the data integration strategy, take the pressure off IT and truly make data a team sport, it is necessary to automate the data lifecycle: collect, govern, transform and share.  A central, discoverable catalogue should house links to the data products, with each domain responsible for its own entries.
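As a sketch only (the structure below is an assumption, not a reference to any specific catalogue tool), each domain might publish an entry describing its data product, while the catalogue itself stays central and discoverable:

```python
# Hypothetical central catalogue: domains register links to their data products,
# and remain responsible for the entries they publish.
catalogue = {}

def register_data_product(domain, name, owner, endpoint, description):
    """Add or update a domain-owned data product entry in the central catalogue."""
    catalogue.setdefault(domain, {})[name] = {
        "owner": owner,
        "endpoint": endpoint,
        "description": description,
    }

register_data_product(
    domain="customer",
    name="customer_360",
    owner="customer-domain-team@example.com",
    endpoint="https://data.example.com/customer/customer_360",
    description="Aggregated customer view served by the customer domain",
)

# Discovery: any team can browse the catalogue to find trusted data products.
for domain, products in catalogue.items():
    for name, entry in products.items():
        print(domain, name, entry["endpoint"])
```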

5. Data storage and data sharing via APIs:  A storage platform that is scalable and can service the business needs is very important.  Modern Data as a Service (DaaS) platforms provide the technology that domains need to capture, process, store and serve their data products, and their self-service capabilities abstract the underlying infrastructure away from the domain teams.  Sharing data internally and externally has become strategic and opens opportunities for new products and services. But what is the best way to share data without compromising data security and quality? How can it be shared not only with internal partners, but external ones? Solving this integration challenge has traditionally led teams to implement point-to-point integrations within their organisation. There is a better way to share trusted data throughout the enterprise, in every application that needs it, at the moment it is needed: an approach built on APIs, called Data as a Service.
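A minimal sketch of serving a domain dataset over an API rather than through point-to-point integrations.  It uses Python's standard library HTTP server and a made-up dataset, so it illustrates the pattern rather than any production DaaS platform:

```python
# Illustrative Data-as-a-Service endpoint: a domain exposes its data product
# over a simple read-only JSON API instead of point-to-point integrations.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

CUSTOMER_VIEW = {  # made-up consumer-domain dataset
    "c1": {"order_count": 2, "total_spend": 80.0},
    "c2": {"order_count": 1, "total_spend": 75.0},
}

class DataProductHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /customers/c1 returns that customer's aggregated view.
        parts = self.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] == "customers" and parts[1] in CUSTOMER_VIEW:
            body = json.dumps(CUSTOMER_VIEW[parts[1]]).encode()
            self.send_response(200)
        else:
            body = json.dumps({"error": "not found"}).encode()
            self.send_response(404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), DataProductHandler).serve_forever()
```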

Summary

The evolution of data and analytics continues.  With the cloud data management tools available today, companies have the opportunity to build a renewed foundation and culture to get the most out of the data in their organisation.  Zhamak Dehghani, writing on Martin Fowler's site, calls the new operational model a ‘Data Mesh’, described as “Distributed data products oriented around domains and owned by independent cross-functional teams who have embedded data engineers and data product owners, using common cloud data infrastructure as a platform to host, prep and serve their data assets” - a model that can empower an organisation.

Many companies have embarked on, and are well advanced with, their generation 3.0 Data Mesh solutions.  It is definitely worth consideration, as there are real benefits.  Just as digital brought the business closer to applications through product-led initiatives, this type of model will bring the business closer to the data, improving the quality, availability and insights that benefit the organisation and its community.

Ben // AUTHOR

Ben is a passionate leader with over 25 years of experience in leveraging the latest technology to bring value-based outcomes and transformation to clients.
