Glossary #
Data Warehouse #
Hierarchical DW - stores data in files or folders. Uses proprietary systems.
DataLake (Databrics propertary) #
Repositories for raw data in a variety of formats (structured, unstructuured, from audio, video, xml, csv, avro, parquet, compressed, chunked, from bytes to GBs). It is represented mostly as a storage but sometimes as an architecture (Kappa, Lambda, Delta) with processing segments (ETL or sometimes ELT) pipelines in place. Uses object storage, flat locations, tags, metadata, unique ID for performance improvements. Schema on read. Unstructured data support - good for ML. . Examples: HDFS, GCS, S3 Pros:
- Uses cheap storage, open formats
- Highly durable, low costs, scalable
- ML friendly Cons:
- Don’t support transactions
- No data check, quality, consistency Good for:
- Powering data science and machine learning
- Centralization, consolidation, cataloging data
Lakehouse #
Datamesh #
Data Pipeline #
MPP #
BigData #
Analytics #
KPI Dashboard #
High level, strategic goals of the organization, and we need to figure out what data we want to use in order to make a decision makers to understand how well we are doing aganst those goals and how well as a business we are performing. What data is
Self service #
History #
Tags #
flexibility, performance, costs, ingestion, governance, policies, master data management, lineage, real time processing, streaming, messaging, volumes, formats, consistency, isolation, refinement, raw, intermediate, final, bronze, silver, gold, segmentation
Data pipeline #
Analogy to water pipelines #
Fetching data from lakes, rivers and ponds could take long distances and time. It was manual process but in time the demand was bigger and the water supply has been automated with the new technologies. Basics Data pipeline is a mechanism to transfer data from point A to point B through some intermediate points C,D and E where data processing takes place. Data pipeline receives data from the Data Producers and the result of the processing is used by the Data Consumers.
Responsibilities #
Ingestion Data Governance Master Data Management Lineage
Segmentation #
Bronze / Silver / Gold Data format Security
Usage #
Data Pipelines are used in the following fields: Business Analytics Reporting Data Science Machine Learning
Types of the Data Pipelines #
ETL, ELT, CDA Batch, Realtime
Architectures #
Kappa, Delta, Lambda Storage Raw
Silver Gold
Concepts #
ETL #
Extract Transform Load
Pros:
- Easy to test
- Unstructured data / schema in runtime
- Low cost storage
- Very flexible
- Good for data streaming
Cons:
- Poor performance
- High Cost
- Data move (network overhead)
ELT #
Pros:
- Good performance
- Costs optimized for compute
Cons:
- Only table data
- Storage can be expensive
- Only batch
Fact Table #
Example: Operation in time
Dimension Table #
Lookup table / dictionary. This might change (add / modify) often. Example: Client Table
Normalisation #
Database Normal Forms #
Normalization is the process of organizing data in a database to reduce redundancy and improve data integrity. Normal forms are a series of rules applied to database tables to ensure consistency and efficiency.
1st Normal Form (1NF): Eliminate Repeating Groups #
Principle: Each column should hold atomic (indivisible) values, and there should be no repeating groups or arrays within a table. Each row must be unique.
Good Example:
OrderID Product Quantity 1 Apple 10 1 Banana 5 2 Orange 3 Bad Example (Repeating groups):
OrderID Product1 Quantity1 Product2 Quantity2 1 Apple 10 Banana 5
Issue: Difficult to query and maintain as the number of products grows.
2nd Normal Form (2NF): Eliminate Partial Dependencies #
Principle: Ensure that every non-key column is fully dependent on the whole primary key (for composite keys). Tables with a single-column primary key are automatically in 2NF if they meet 1NF.
Good Example: Orders Table:
OrderID CustomerID 1 101 2 102 OrderDetails Table:
OrderID ProductID Quantity 1 201 10 1 202 5 Bad Example:
OrderID CustomerID ProductID Quantity 1 101 201 10 1 101 202 5
Issue: “CustomerID” depends only on OrderID
, not the combination of OrderID
and ProductID
. This creates redundancy.
3rd Normal Form (3NF): Eliminate Transitive Dependencies #
Principle: Ensure that non-key columns are only dependent on the primary key, not on other non-key columns.
Good Example: Orders Table:
OrderID CustomerID OrderDate 1 101 2024-11-25 Customers Table:
CustomerID CustomerName 101 John Doe Bad Example:
OrderID CustomerID CustomerName 1 101 John Doe 2 101 John Doe
Issue: “CustomerName” depends on CustomerID
, not directly on OrderID
, leading to redundancy and maintenance problems.
Boyce-Codd Normal Form (BCNF): Generalization of 3NF #
Principle: Every determinant (column or combination of columns that uniquely identify another column) must be a candidate key.
Good Example: Rooms Table:
RoomID RoomType PricePerNight 101 Single 100 102 Double 150 Bad Example:
RoomID RoomType Manager 101 Single Alice 102 Double Bob
Issue: “RoomType” determines “Manager,” but “RoomType” is not a candidate key (it’s not unique).
4th Normal Form (4NF): Eliminate Multi-Valued Dependencies #
Principle: A table should not have more than one multi-valued dependency. Multi-valued dependencies occur when one column can have multiple values independently of another.
Good Example: Courses Table:
StudentID CourseID 1 Math 1 Science 2 Math Hobbies Table:
StudentID Hobby 1 Painting 1 Chess 2 Swimming Bad Example:
StudentID CourseID Hobby 1 Math Painting 1 Science Chess
Issue: Independent multi-valued facts (Courses and Hobbies) are combined, leading to redundancy.
5th Normal Form (5NF): Eliminate Join Dependencies #
Principle: A table is in 5NF if it cannot be decomposed into smaller tables without losing data.
Good Example: Suppliers Table:
SupplierID PartID 1 A 1 B 2 A Projects Table:
ProjectID PartID X A X B Y A Bad Example:
SupplierID PartID ProjectID 1 A X 1 B X 2 A Y
Issue: Combining multiple relationships (Supplier-Part and Part-Project) creates redundancy.
Summary of Normal Forms #
Normal Form | Key Principle | Main Issue Resolved |
---|---|---|
1NF | Atomic columns, no repeating groups | Eliminates multi-value cells |
2NF | No partial dependencies | Removes dependencies on part of key |
3NF | No transitive dependencies | Removes non-key dependencies |
BCNF | Every determinant is a candidate key | Resolves key dependency violations |
4NF | No multi-valued dependencies | Handles independent multi-values |
5NF | No join dependencies | Prevents redundancy from complex joins |
Here’s how the principles of normalization are practically implemented in real-world scenarios:
Scenario 1: 1NF Implementation #
Problem: A student database stores multiple phone numbers in a single column.
Initial Table (Not in 1NF):
StudentID Name PhoneNumbers 1 John 123-456, 789-012 2 Alice 345-678 Solution: Break the multi-valued column into atomic values.
Normalized Table:
StudentID Name PhoneNumber 1 John 123-456 1 John 789-012 2 Alice 345-678
Scenario 2: 2NF Implementation #
Problem: A sales database mixes data about customers and orders in a single table.
Initial Table (Not in 2NF):
OrderID CustomerID CustomerName Product Quantity 1 101 John Doe Laptop 1 2 101 John Doe Smartphone 2 Issue:
CustomerName
depends only onCustomerID
, not on the composite primary key (OrderID, Product
).Solution: Split the table to remove partial dependencies.
Orders Table:
OrderID CustomerID 1 101 2 101 Customers Table:
CustomerID CustomerName 101 John Doe OrderDetails Table:
OrderID Product Quantity 1 Laptop 1 2 Smartphone 2
Scenario 3: 3NF Implementation #
Problem: A supplier database stores unnecessary attributes in a single table.
Initial Table (Not in 3NF):
SupplierID SupplierName City Region 1 Supplier A Chicago Midwest 2 Supplier B Dallas South Issue:
Region
depends onCity
, not onSupplierID
.Solution: Remove transitive dependencies by creating a separate table for cities.
Suppliers Table:
SupplierID SupplierName City 1 Supplier A Chicago 2 Supplier B Dallas Cities Table:
City Region Chicago Midwest Dallas South
Scenario 4: BCNF Implementation #
Problem: A university database assigns classrooms to courses but also tracks room managers.
Initial Table (Not in BCNF):
RoomID CourseID Manager 101 Math Alice 101 Physics Alice 102 Chemistry Bob Issue:
RoomID
determinesManager
, butRoomID
is not a candidate key.Solution: Split the table to eliminate dependency violations.
Rooms Table:
RoomID Manager 101 Alice 102 Bob RoomCourses Table:
RoomID CourseID 101 Math 101 Physics 102 Chemistry
Scenario 5: 4NF Implementation #
Problem: A hobby and courses database for students combines unrelated multi-valued facts.
Initial Table (Not in 4NF):
StudentID Course Hobby 1 Math Painting 1 Science Chess Issue:
Course
andHobby
are independent of each other but combined in one table.Solution: Split into two tables for independent facts.
Courses Table:
StudentID Course 1 Math 1 Science Hobbies Table:
StudentID Hobby 1 Painting 1 Chess
Scenario 6: 5NF Implementation #
Problem: A supplier database combines relationships between suppliers, parts, and projects.
Initial Table (Not in 5NF):
SupplierID PartID ProjectID 1 A X 1 B X 2 A Y Issue: Redundancy due to multiple independent relationships (Supplier-Part, Part-Project).
Solution: Decompose into three independent tables.
SuppliersParts Table:
SupplierID PartID 1 A 1 B 2 A PartsProjects Table:
PartID ProjectID A X B X A Y SuppliersProjects Table:
SupplierID ProjectID 1 X 2 Y
Refs: