Introduction:
In the field of data science, structured data plays a vital role in extracting valuable insights and making informed decisions. In this article, we will explore the typical flow of structured data before it reaches the hands of data scientists or business analysts. Understanding this flow is crucial for data scientists to effectively utilize structured data in their analyses and models. So let’s dive in and unravel the journey of structured data.
The Importance of Structured Data:
Structured data makes up only a fraction, roughly 20%, of the total data volume that organizations produce or can access. Yet despite the abundance of unstructured and semi-structured data, structured data remains the cornerstone of data science work, accounting for about 80% to 90% of use cases. That is why it pays to understand how data engineering teams manage structured data.
Managing Structured Data:
Structured data can be found in various formats, such as database tables, comma-separated value (CSV) files, and Microsoft Excel files. It is organized into rows and columns, where columns represent the features used to train models, and rows contain observations. The responsibility of a data architect or data engineering team is to extract structured data from different sources and load it into a data warehouse.
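To make the extract-and-load step concrete, here is a minimal sketch that reads a CSV export and loads it into a database table. It is an illustration under stated assumptions, not a production pipeline: the CSV content, the column names, the table name raw_customers, and the helper load_csv_into_table are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export from a source system.
# Columns are the features; each row is one observation.
RAW_CSV = """customer_id,country,monthly_spend
1,DE,120.50
2,US,80.00
3,FR,95.25
"""


def load_csv_into_table(conn, csv_text, table="raw_customers"):
    """Extract rows from CSV text and load them into a SQLite table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames
    # Every column is loaded as TEXT here; typing and cleansing come later.
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    rows = [tuple(row[c] for c in columns) for row in reader]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return len(rows)


# An in-memory database stands in for the real target system (assumption).
conn = sqlite3.connect(":memory:")
loaded = load_csv_into_table(conn, RAW_CSV)
print(loaded, "rows loaded")
```

The same pattern applies to other structured sources: each one is read into rows and columns and then written to a common target.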
Cleansing and Transformation:
To ensure the accuracy and completeness of data in the data warehouse, data engineers perform cleansing operations on structured data. Unclean or incomplete data is not permitted in the data warehouse. Therefore, structured data is initially loaded into staging tables, where data engineers clean the data by replacing null values and standardizing formats. Staging tables also serve as a place to join data from different sources and enrich existing data with external information.
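The cleansing steps described above can be sketched against a staging table as follows. This is only an illustration: the tables stg_orders and ref_countries, their columns, and the choice of default values for nulls are assumptions made for the example. In practice, how a null should be replaced is a decision for the data engineering team.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical staging table with messy values.
CREATE TABLE stg_orders (order_id INTEGER, country TEXT, amount REAL);
INSERT INTO stg_orders VALUES (1, ' de', 100.0), (2, NULL, 50.0), (3, 'US ', NULL);
-- Hypothetical external reference data used for enrichment.
CREATE TABLE ref_countries (country TEXT, region TEXT);
INSERT INTO ref_countries VALUES ('DE', 'EMEA'), ('US', 'AMER');
""")

# Standardize formats: trim whitespace and upper-case the country codes.
conn.execute(
    "UPDATE stg_orders SET country = UPPER(TRIM(country)) WHERE country IS NOT NULL"
)
# Replace null values with explicit defaults (the defaults are an assumption).
conn.execute("UPDATE stg_orders SET country = 'UNKNOWN' WHERE country IS NULL")
conn.execute("UPDATE stg_orders SET amount = 0.0 WHERE amount IS NULL")

# Enrich the cleaned rows by joining them with the reference data.
cleaned = conn.execute("""
    SELECT o.order_id, o.country, o.amount,
           COALESCE(r.region, 'UNKNOWN') AS region
    FROM stg_orders AS o
    LEFT JOIN ref_countries AS r ON r.country = o.country
    ORDER BY o.order_id
""").fetchall()

for row in cleaned:
    print(row)
```

A LEFT JOIN is used so that orders without a matching reference row are kept rather than silently dropped.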
Data Warehouse: The Single Source of Truth:
Once the data is clean, unified, and enriched, it is ready to be loaded into a data warehouse. A data warehouse serves as the single source of truth for the entire organization. It contains a comprehensive and reliable collection of structured data. For instance, when management needs to know the total sales for a particular month, business analysts can query the data warehouse for accurate information.
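As an illustration of the monthly sales question, the following sketch runs such a query. The table fact_sales, its columns, the sample rows, and the helper total_sales_for_month are hypothetical, and SQLite again stands in for a real data warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical warehouse table with one row per sale.
CREATE TABLE fact_sales (sale_date TEXT, amount REAL);
INSERT INTO fact_sales VALUES
    ('2024-01-15', 200.0),
    ('2024-01-28', 150.0),
    ('2024-02-03', 75.0);
""")


def total_sales_for_month(conn, year_month):
    """Return total sales for a month given as 'YYYY-MM' (0.0 if none)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0.0) FROM fact_sales "
        "WHERE strftime('%Y-%m', sale_date) = ?",
        (year_month,),
    ).fetchone()
    return row[0]


print(total_sales_for_month(conn, "2024-01"))  # 350.0
```

Because every analyst queries the same table, every analyst gets the same answer, which is the practical meaning of a single source of truth.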
Data Mart: Empowering Data Scientists:
Data scientists usually have only limited read access to the data warehouse itself, so they typically work with a subset of its data stored in a data mart. A data mart is a copy of the relevant data from the data warehouse, refreshed either on request or on a predefined schedule. Data scientists have full access to the data mart and can further preprocess and manipulate the data there to extract meaningful features for their use cases.
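A data mart refresh can be sketched as a "delete and reload" of the relevant warehouse rows. The table names dw_sales and mart_sales, the filter on region, and the helper refresh_data_mart are assumptions for this example. A real setup would normally keep the warehouse and the mart in separate databases, whereas here both live in one in-memory SQLite database for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical warehouse table.
CREATE TABLE dw_sales (sale_id INTEGER, region TEXT, amount REAL);
INSERT INTO dw_sales VALUES
    (1, 'EMEA', 200.0), (2, 'AMER', 150.0), (3, 'EMEA', 75.0);
""")


def refresh_data_mart(conn, region):
    """Reload the mart with a fresh copy of the relevant warehouse rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mart_sales "
        "(sale_id INTEGER, region TEXT, amount REAL)"
    )
    # Delete and reload in one transaction so readers never see a half-filled mart.
    with conn:
        conn.execute("DELETE FROM mart_sales")
        conn.execute(
            "INSERT INTO mart_sales "
            "SELECT sale_id, region, amount FROM dw_sales WHERE region = ?",
            (region,),
        )
    return conn.execute("SELECT COUNT(*) FROM mart_sales").fetchone()[0]


# Run on request, or from a scheduler for a predefined refresh cadence.
print(refresh_data_mart(conn, "EMEA"), "rows in the data mart")
```

Because the mart is a copy, new warehouse rows only appear in it after the next refresh.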
Model Training and Predictions:
Once the data scientists have identified the relevant features, they train their models using the data available in the data mart. Data marts also store the data required for making predictions. The predictions generated by the models are stored back in the data mart. From there, applications or dashboarding tools can retrieve the predictions for further analysis or visualization.
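The train, predict, and write-back loop might look like the sketch below. The model is deliberately trivial: a one-feature least-squares line fitted in plain Python, chosen only to keep the example self-contained. The tables mart_training, mart_to_score, and mart_predictions, their columns, and the sample values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical data mart tables.
CREATE TABLE mart_training (ad_spend REAL, sales REAL);
INSERT INTO mart_training VALUES (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0);
CREATE TABLE mart_to_score (row_id INTEGER, ad_spend REAL);
INSERT INTO mart_to_score VALUES (1, 5.0), (2, 6.0);
CREATE TABLE mart_predictions (row_id INTEGER, predicted_sales REAL);
""")


def fit_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# 1. Train on the feature and target stored in the data mart.
training = conn.execute("SELECT ad_spend, sales FROM mart_training").fetchall()
slope, intercept = fit_line(training)

# 2. Score the rows that need predictions.
to_score = conn.execute("SELECT row_id, ad_spend FROM mart_to_score").fetchall()
predictions = [(row_id, slope * x + intercept) for row_id, x in to_score]

# 3. Store the predictions back in the data mart for dashboards or applications.
with conn:
    conn.executemany("INSERT INTO mart_predictions VALUES (?, ?)", predictions)
```

A dashboarding tool would then read mart_predictions like any other table, without needing to know how the model was trained.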
Conclusion:
By understanding this flow, from the original sources through staging tables and the data warehouse to the data mart, data scientists can make effective use of structured data and derive insights that drive informed decisions. In future modules, we will delve deeper into data warehousing, where structured data continues to play a pivotal role. Stay tuned for more discussions in the upcoming modules.