Learn Data Architecture

Data Contracts - Don't leave quality to chance


In data management, the data mesh has emerged as a pivotal framework for understanding how data flows and is used across the different domains of an organization. At the heart of this framework lies the data contract, a mechanism designed to ensure seamless and reliable data exchange between source domains (the creators of data) and consumer domains (the users of data). For data scientists navigating modern data ecosystems, grasping these concepts is essential.

Data Contracts — Ensuring order in a decentralised framework like Data Mesh

Data contracts serve as the foundation for creating data products that are independent and autonomous, enabling a provider-consumer relationship between domains within an organization.

This autonomy, however, brings the challenge of ensuring that changes in one domain do not adversely affect another. For instance, if a source domain deletes a column from a dataset, applications or data pipelines in the consumer domain that depend on that column can fail. Data contracts mitigate such risks while preserving the autonomy of each domain.
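To make this failure mode concrete, here is a minimal sketch in Python of the kind of schema check an automated contract can run before a consumer reads a dataset. The column names are hypothetical, chosen only for illustration.

    # Columns the consumer domain depends on, as agreed in the contract.
    # These names are hypothetical examples.
    REQUIRED_COLUMNS = {"customer_id", "email", "signup_date"}

    def check_schema(dataset_columns: set[str]) -> None:
        """Fail fast if the source domain dropped a column the contract requires."""
        missing = REQUIRED_COLUMNS - dataset_columns
        if missing:
            raise ValueError(f"Data contract violation: missing columns {sorted(missing)}")

    # The source domain deleted the 'email' column; the check fails loudly
    # instead of letting a downstream pipeline break unexpectedly.
    check_schema({"customer_id", "signup_date"})  # raises ValueError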

Just as contracts govern real-life transactions such as buying an apartment or leasing a car, data contracts establish a set of rules designed to maintain harmony between domains. These contracts can be verbal, written, or automated, each with its own strengths and limitations.

Verbal, Written, and Automated Data Contracts

  • Verbal contracts involve informal agreements on data creation and maintenance to prevent pipeline disruptions. However, their reliability is often questionable in today’s fast-paced and complex data environments.
  • Written contracts provide a more tangible solution by documenting the agreed-upon schema and data types. These documents, maintained on platforms like wiki pages or Confluence, offer clarity but lack enforceability.
  • Automated contracts represent the optimal approach, acting as checks and balances that verify alignment with agreed-upon formats as data is created, written, and read. This automation prevents unplanned outages and ensures the integrity of data products.

An Example Data Contract in YAML

To see data contracts in practice, let's examine a sample contract expressed in a YAML-based declarative syntax (the example below resembles SodaCL, the Soda Checks Language). It defines validations for a timeseries dataset, covering both schema integrity and the reliability of its values.

    checks for timeseries:
      # Schema check: the dataset's structure must match the contract
      # (column and type declarations are omitted in this example)
      - schema:
      # No email address may deviate from a valid email format
      - invalid_count(email) = 0:
          valid format: email
      # At least one valid email address must be present
      - valid_count(email) > 0:
          valid format: email
      # Flag failing rows, sampling at most 50 of them
      - failed rows:
          samples limit: 50
          fail condition: x >= 3
      # The average difference between x and y must stay within [-1, 1]
      - correlation:
          - avg_x_minus_y between -1 and 1:
              avg_x_minus_y expression: AVG(x - y)

This YAML snippet outlines a series of validations:

  1. Email Validation Checks: These checks ensure that all email addresses in the dataset conform to a standard email format. The invalid_count(email) = 0 validation ensures there are no email addresses that deviate from the valid format, while valid_count(email) > 0 confirms the presence of at least one legitimate email address in the dataset.
  2. Failed Rows Check: This validation flags rows that satisfy the failure condition, here x >= 3, capturing at most 50 sample rows for inspection. Surfacing these samples helps spotlight potential outliers or anomalies within the dataset.
  3. Correlation Check: Focusing on the relationship between two variables, this validation ensures that the average difference between x and y (avg_x_minus_y) remains within a range of -1 to 1. This is crucial for analyses that rely on the stability or consistency of these relationships over time.

By incorporating such validations into data contracts, organizations can safeguard against data quality issues and foster a more reliable data ecosystem.
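For intuition about what an enforcement engine does with such a contract, here is a minimal sketch that implements the three checks directly in Python with pandas. Real contract tooling runs these checks for you; this code only illustrates their logic, and the email regex is a deliberately simplified assumption.

    import re
    import pandas as pd

    # Simplified email pattern, for illustration only.
    EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def enforce_contract(df: pd.DataFrame) -> list[str]:
        """Run the example contract's checks; return a list of violations."""
        violations = []

        # Email validation checks: no invalid addresses, at least one valid one.
        is_valid = df["email"].astype(str).map(lambda s: bool(EMAIL_RE.match(s)))
        if (~is_valid).sum() > 0:
            violations.append("invalid_count(email) = 0 failed")
        if is_valid.sum() == 0:
            violations.append("valid_count(email) > 0 failed")

        # Failed rows check: collect up to 50 sample rows where x >= 3.
        samples = df[df["x"] >= 3].head(50)
        if len(samples) > 0:
            violations.append(f"failed rows: {len(samples)} row(s) where x >= 3")

        # Correlation check: AVG(x - y) must stay within [-1, 1].
        avg_x_minus_y = (df["x"] - df["y"]).mean()
        if not -1 <= avg_x_minus_y <= 1:
            violations.append(f"avg_x_minus_y = {avg_x_minus_y:.2f} is outside [-1, 1]")

        return violations

    # Example usage on a tiny, made-up dataset:
    df = pd.DataFrame({
        "email": ["a@example.com", "not-an-email"],
        "x": [4, 1],
        "y": [3, 1],
    })
    print(enforce_contract(df))
    # ['invalid_count(email) = 0 failed', 'failed rows: 1 row(s) where x >= 3']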

The Future of Data Contracts

Looking ahead, advancements in technology might enable large language models to autonomously generate data contracts based on dataset observations. Enforcing these automated data contracts becomes a critical step in ensuring that data exchanged between domains complies with established formats and agreements.

Depiction of data contracts in a data pipeline

When a source domain contributes data to a data lake or lakehouse, that data must pass through these automated validations before it is accepted. Similarly, when the consumer domain retrieves data, it is verified again to guarantee alignment with the pre-established agreements.
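Here is a minimal sketch of how such gates might sit around the lake, reusing the enforce_contract runner sketched earlier; the lake path and function names are hypothetical rather than any specific platform's API.

    import pandas as pd

    def write_to_lake(df: pd.DataFrame, table: str) -> None:
        # Write-path gate: source data must satisfy the contract
        # before it lands in the lake or lakehouse.
        violations = enforce_contract(df)  # check runner sketched earlier
        if violations:
            raise RuntimeError(f"Write to '{table}' rejected: {violations}")
        df.to_parquet(f"/lake/{table}.parquet")  # hypothetical lake location

    def read_from_lake(table: str) -> pd.DataFrame:
        # Read-path gate: the consumer re-verifies retrieved data
        # against the same agreed contract.
        df = pd.read_parquet(f"/lake/{table}.parquet")
        violations = enforce_contract(df)
        if violations:
            raise RuntimeError(f"Read from '{table}' rejected: {violations}")
        return df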

Conclusion

For data scientists, understanding and implementing data contracts within a data mesh framework is crucial for navigating the complexities of data management and ensuring the integrity and reliability of data products. As the field continues to evolve, embracing automated data contracts will be key to fostering seamless collaboration between domains and enhancing the overall quality of data ecosystems.

Udemy Course - Lowest Price

Get the lowest price for my bestselling course "Data Architecture for Data Scientists" on Udemy by clicking the course thumbnail below.

[Course thumbnail]