Gathering data from multiple sources:
Internal Systems: ERP, CRM, HR, sales platforms.
External Data: Social media, IoT devices, market data.
Unstructured Data: Text, logs, images, emails.
Tools: ETL (Extract, Transform, Load), APIs, streaming pipelines.
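As an illustration of the ETL pattern named above, here is a minimal sketch using only the Python standard library. The CSV content, the orders table, and the function names extract, transform, and load are assumptions made for this example; a production pipeline would typically read from real source systems and write to a warehouse rather than an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export from a sales platform (illustrative data only).
RAW_CSV = """order_id,customer,amount
1001, alice ,19.99
1002,BOB,5.00
1003,alice,12.50
"""

def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim and lowercase customer names, convert field types."""
    return [
        (int(r["order_id"]), r["customer"].strip().lower(), float(r["amount"]))
        for r in rows
    ]

def load(conn, records):
    """Load: write the cleaned records into a target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

# An in-memory SQLite database stands in for the target system.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# Per-customer totals, computed on the loaded data.
totals = conn.execute(
    "SELECT customer, ROUND(SUM(amount), 2) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Keeping the three stages as separate functions makes each one testable on its own and lets the transform step be reused if the source or the target changes.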
Data Warehouse (DW): Central repository for structured, historical data.
Data Lake: Stores massive volumes of raw data in its native format, whether structured, semi-structured, or unstructured.
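Lakehouse: Hybrid architecture that combines warehouse-style management and querying with data lake storage (e.g., Databricks; Snowflake, which began as a cloud data warehouse, also supports this pattern).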
Cloud Storage: Amazon S3, Azure Data Lake Storage, Google Cloud Storage.
Structuring data for analysis using schemas:
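Star Schema (a central fact table joined to denormalized dimension tables).
Snowflake Schema (a star schema variant whose dimension tables are normalized into sub-dimension tables).
A well-chosen schema keeps data optimized for querying and reporting.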
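To make the star schema concrete, the following sketch builds one fact table and two dimension tables in an in-memory SQLite database and runs a typical reporting query. The table names (fact_sales, dim_product, dim_date), the columns, and the sample rows are all hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT,
    category   TEXT
);
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY,
    year    INTEGER,
    month   INTEGER
);
-- The fact table holds measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    revenue    REAL
);
INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Hardware'),
    (2, 'Mouse', 'Hardware'),
    (3, 'Antivirus', 'Software');
INSERT INTO dim_date VALUES
    (20240101, 2024, 1),
    (20240201, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 20240101, 2, 2000.0),
    (2, 20240101, 5, 100.0),
    (3, 20240201, 10, 300.0),
    (1, 20240201, 1, 1000.0);
""")

# Typical reporting query: join the fact table to its dimensions and aggregate.
revenue_by_category = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
print(revenue_by_category)
```

In a snowflake schema, the category column would move out of dim_product into its own table referenced by a key, which reduces redundancy at the cost of one more join per query.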
Policies and procedures to maintain data accuracy, security, privacy, and compliance.
Defines:
Data ownership (who controls what data).
Data stewardship (quality and lifecycle).
Access control (role-based access).
Supports compliance with regulations such as GDPR, HIPAA, and CCPA.
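Role-based access control can be sketched in a few lines: permissions attach to roles, users are assigned roles, and a request is allowed only if one of the user's roles grants it. The role names, user names, datasets, and the is_allowed helper below are hypothetical; real deployments usually rely on the access-control features of the database or cloud platform instead of application code like this.

```python
# Hypothetical mapping of roles to (dataset, action) permissions.
ROLE_PERMISSIONS = {
    "analyst": {("sales", "read")},
    "steward": {("sales", "read"), ("sales", "write")},
    "hr_admin": {("hr", "read"), ("hr", "write")},
}

# Hypothetical assignment of users to roles.
USER_ROLES = {
    "dana": {"analyst"},
    "lee": {"steward", "hr_admin"},
}

def is_allowed(user, dataset, action):
    """Return True if any role held by the user grants the action on the dataset."""
    return any(
        (dataset, action) in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )

print(is_allowed("dana", "sales", "read"))   # analyst role grants read on sales
print(is_allowed("dana", "sales", "write"))  # no role held by dana grants write
```

Unknown users and unknown roles fall through to an empty permission set, so the check denies by default.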
Data should be:
Accurate (values correctly describe the real-world entities they represent).
Consistent (uniform formats and definitions across systems).
Complete (no required values missing).
Timely (up-to-date for reporting).
Techniques: deduplication, validation, normalization, enrichment.
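Three of the techniques listed above (normalization, validation, and deduplication) can be sketched on a small set of records as follows. The customer records, the field names, and the simple email pattern are assumptions for illustration only; the pattern is deliberately loose and is not a complete email validator. Enrichment is left out because it depends on an external reference source.

```python
import re

# Hypothetical customer records with inconsistent formatting.
records = [
    {"id": 1, "email": "Ann@Example.com ", "country": "us"},
    {"id": 2, "email": "bob@example.com", "country": "DE"},
    {"id": 3, "email": "ann@example.com", "country": "US"},  # same person as id 1
    {"id": 4, "email": "not-an-email", "country": "FR"},
]

# Loose pattern: something@something.something, with no whitespace.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def normalize(rec):
    """Normalization: bring fields into one uniform format."""
    return {
        **rec,
        "email": rec["email"].strip().lower(),
        "country": rec["country"].strip().upper(),
    }

def is_valid(rec):
    """Validation: check a record against a simple rule."""
    return bool(EMAIL_RE.match(rec["email"]))

def deduplicate(recs, key="email"):
    """Deduplication: keep the first record seen for each key value."""
    seen = set()
    unique = []
    for rec in recs:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique

normalized = [normalize(r) for r in records]
valid = [r for r in normalized if is_valid(r)]
rejected = [r for r in normalized if not is_valid(r)]
clean = deduplicate(valid)

print([r["id"] for r in clean])
print([r["id"] for r in rejected])
```

Normalizing before deduplicating matters here: records 1 and 3 only match once case and whitespace differences have been removed.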