Data

Data

Glossary #

Data Warehouse #

Hierarchical DW - stores data in files or folders. Uses proprietary systems.

DataLake (Databrics propertary) #

Repositories for raw data in a variety of formats (structured, unstructuured, from audio, video, xml, csv, avro, parquet, compressed, chunked, from bytes to GBs). It is represented mostly as a storage but sometimes as an architecture (Kappa, Lambda, Delta) with processing segments (ETL or sometimes ELT) pipelines in place. Uses object storage, flat locations, tags, metadata, unique ID for performance improvements. Schema on read. Unstructured data support - good for ML. . Examples: HDFS, GCS, S3 Pros:

  • Uses cheap storage, open formats
  • Highly durable, low costs, scalable
  • ML friendly Cons:
  • Don’t support transactions
  • No data check, quality, consistency Good for:
  • Powering data science and machine learning
  • Centralization, consolidation, cataloging data

Lakehouse #

Datamesh #

Data Pipeline #

MPP #

BigData #

Analytics #

KPI Dashboard #

High level, strategic goals of the organization, and we need to figure out what data we want to use in order to make a decision makers to understand how well we are doing aganst those goals and how well as a business we are performing. What data is

Self service #

History #

Tags #

flexibility, performance, costs, ingestion, governance, policies, master data management, lineage, real time processing, streaming, messaging, volumes, formats, consistency, isolation, refinement, raw, intermediate, final, bronze, silver, gold, segmentation

Data pipeline #

Analogy to water pipelines #

Fetching data from lakes, rivers and ponds could take long distances and time. It was manual process but in time the demand was bigger and the water supply has been automated with the new technologies. Basics Data pipeline is a mechanism to transfer data from point A to point B through some intermediate points C,D and E where data processing takes place. Data pipeline receives data from the Data Producers and the result of the processing is used by the Data Consumers.

Responsibilities #

Ingestion Data Governance Master Data Management Lineage

Segmentation #

Bronze / Silver / Gold Data format Security

Usage #

Data Pipelines are used in the following fields: Business Analytics Reporting Data Science Machine Learning

Types of the Data Pipelines #

ETL, ELT, CDA Batch, Realtime

Architectures #

Kappa, Delta, Lambda Storage Raw

Silver Gold

Concepts #

ETL #

Extract Transform Load

Pros:

  • Easy to test
  • Unstructured data / schema in runtime
  • Low cost storage
  • Very flexible
  • Good for data streaming

Cons:

  • Poor performance
  • High Cost
  • Data move (network overhead)

ELT #

Pros:

  • Good performance
  • Costs optimized for compute

Cons:

  • Only table data
  • Storage can be expensive
  • Only batch

Fact Table #

Example: Operation in time

Dimension Table #

Lookup table / dictionary. This might change (add / modify) often. Example: Client Table

Normalisation #

Database Normal Forms #

Normalization is the process of organizing data in a database to reduce redundancy and improve data integrity. Normal forms are a series of rules applied to database tables to ensure consistency and efficiency.


1st Normal Form (1NF): Eliminate Repeating Groups #

Principle: Each column should hold atomic (indivisible) values, and there should be no repeating groups or arrays within a table. Each row must be unique.

  • Good Example:

    OrderIDProductQuantity
    1Apple10
    1Banana5
    2Orange3
  • Bad Example (Repeating groups):

    OrderIDProduct1Quantity1Product2Quantity2
    1Apple10Banana5

Issue: Difficult to query and maintain as the number of products grows.


2nd Normal Form (2NF): Eliminate Partial Dependencies #

Principle: Ensure that every non-key column is fully dependent on the whole primary key (for composite keys). Tables with a single-column primary key are automatically in 2NF if they meet 1NF.

  • Good Example: Orders Table:

    OrderIDCustomerID
    1101
    2102

    OrderDetails Table:

    OrderIDProductIDQuantity
    120110
    12025
  • Bad Example:

    OrderIDCustomerIDProductIDQuantity
    110120110
    11012025

Issue: “CustomerID” depends only on OrderID, not the combination of OrderID and ProductID. This creates redundancy.


3rd Normal Form (3NF): Eliminate Transitive Dependencies #

Principle: Ensure that non-key columns are only dependent on the primary key, not on other non-key columns.

  • Good Example: Orders Table:

    OrderIDCustomerIDOrderDate
    11012024-11-25

    Customers Table:

    CustomerIDCustomerName
    101John Doe
  • Bad Example:

    OrderIDCustomerIDCustomerName
    1101John Doe
    2101John Doe

Issue: “CustomerName” depends on CustomerID, not directly on OrderID, leading to redundancy and maintenance problems.


Boyce-Codd Normal Form (BCNF): Generalization of 3NF #

Principle: Every determinant (column or combination of columns that uniquely identify another column) must be a candidate key.

  • Good Example: Rooms Table:

    RoomIDRoomTypePricePerNight
    101Single100
    102Double150
  • Bad Example:

    RoomIDRoomTypeManager
    101SingleAlice
    102DoubleBob

Issue: “RoomType” determines “Manager,” but “RoomType” is not a candidate key (it’s not unique).


4th Normal Form (4NF): Eliminate Multi-Valued Dependencies #

Principle: A table should not have more than one multi-valued dependency. Multi-valued dependencies occur when one column can have multiple values independently of another.

  • Good Example: Courses Table:

    StudentIDCourseID
    1Math
    1Science
    2Math

    Hobbies Table:

    StudentIDHobby
    1Painting
    1Chess
    2Swimming
  • Bad Example:

    StudentIDCourseIDHobby
    1MathPainting
    1ScienceChess

Issue: Independent multi-valued facts (Courses and Hobbies) are combined, leading to redundancy.


5th Normal Form (5NF): Eliminate Join Dependencies #

Principle: A table is in 5NF if it cannot be decomposed into smaller tables without losing data.

  • Good Example: Suppliers Table:

    SupplierIDPartID
    1A
    1B
    2A

    Projects Table:

    ProjectIDPartID
    XA
    XB
    YA
  • Bad Example:

    SupplierIDPartIDProjectID
    1AX
    1BX
    2AY

Issue: Combining multiple relationships (Supplier-Part and Part-Project) creates redundancy.


Summary of Normal Forms #

Normal FormKey PrincipleMain Issue Resolved
1NFAtomic columns, no repeating groupsEliminates multi-value cells
2NFNo partial dependenciesRemoves dependencies on part of key
3NFNo transitive dependenciesRemoves non-key dependencies
BCNFEvery determinant is a candidate keyResolves key dependency violations
4NFNo multi-valued dependenciesHandles independent multi-values
5NFNo join dependenciesPrevents redundancy from complex joins

Here’s how the principles of normalization are practically implemented in real-world scenarios:


Scenario 1: 1NF Implementation #

Problem: A student database stores multiple phone numbers in a single column.

  • Initial Table (Not in 1NF):

    StudentIDNamePhoneNumbers
    1John123-456, 789-012
    2Alice345-678
  • Solution: Break the multi-valued column into atomic values.

  • Normalized Table:

    StudentIDNamePhoneNumber
    1John123-456
    1John789-012
    2Alice345-678

Scenario 2: 2NF Implementation #

Problem: A sales database mixes data about customers and orders in a single table.

  • Initial Table (Not in 2NF):

    OrderIDCustomerIDCustomerNameProductQuantity
    1101John DoeLaptop1
    2101John DoeSmartphone2

    Issue: CustomerName depends only on CustomerID, not on the composite primary key (OrderID, Product).

  • Solution: Split the table to remove partial dependencies.

    Orders Table:

    OrderIDCustomerID
    1101
    2101

    Customers Table:

    CustomerIDCustomerName
    101John Doe

    OrderDetails Table:

    OrderIDProductQuantity
    1Laptop1
    2Smartphone2

Scenario 3: 3NF Implementation #

Problem: A supplier database stores unnecessary attributes in a single table.

  • Initial Table (Not in 3NF):

    SupplierIDSupplierNameCityRegion
    1Supplier AChicagoMidwest
    2Supplier BDallasSouth

    Issue: Region depends on City, not on SupplierID.

  • Solution: Remove transitive dependencies by creating a separate table for cities.

    Suppliers Table:

    SupplierIDSupplierNameCity
    1Supplier AChicago
    2Supplier BDallas

    Cities Table:

    CityRegion
    ChicagoMidwest
    DallasSouth

Scenario 4: BCNF Implementation #

Problem: A university database assigns classrooms to courses but also tracks room managers.

  • Initial Table (Not in BCNF):

    RoomIDCourseIDManager
    101MathAlice
    101PhysicsAlice
    102ChemistryBob

    Issue: RoomID determines Manager, but RoomID is not a candidate key.

  • Solution: Split the table to eliminate dependency violations.

    Rooms Table:

    RoomIDManager
    101Alice
    102Bob

    RoomCourses Table:

    RoomIDCourseID
    101Math
    101Physics
    102Chemistry

Scenario 5: 4NF Implementation #

Problem: A hobby and courses database for students combines unrelated multi-valued facts.

  • Initial Table (Not in 4NF):

    StudentIDCourseHobby
    1MathPainting
    1ScienceChess

    Issue: Course and Hobby are independent of each other but combined in one table.

  • Solution: Split into two tables for independent facts.

    Courses Table:

    StudentIDCourse
    1Math
    1Science

    Hobbies Table:

    StudentIDHobby
    1Painting
    1Chess

Scenario 6: 5NF Implementation #

Problem: A supplier database combines relationships between suppliers, parts, and projects.

  • Initial Table (Not in 5NF):

    SupplierIDPartIDProjectID
    1AX
    1BX
    2AY

    Issue: Redundancy due to multiple independent relationships (Supplier-Part, Part-Project).

  • Solution: Decompose into three independent tables.

    SuppliersParts Table:

    SupplierIDPartID
    1A
    1B
    2A

    PartsProjects Table:

    PartIDProjectID
    AX
    BX
    AY

    SuppliersProjects Table:

    SupplierIDProjectID
    1X
    2Y

Refs: