Introduction To Database Normalization
Contents
- What is Normalization?
- The Problem
- First Normal Form (1NF)
- Second Normal Form (2NF)
- Third Normal Form (3NF)
What is normalization?
Database normalization is a design, or process of organizing data within a database to eliminate data redundancy that arranges data into tabular a form by removing duplicate data from relation tables, ensuring data dependencies make sense and data is logically stored.
The Problem
Data which is not normalized becomes difficult to manage. Simple Create, Read, Update and Delete (CRUD) tasks become laboious and slow and increase the risk of data loss. Data anomolies are frequent as demonstrated below.
Consider a table of school teachers. The table, in non normalized form, might looks like this..
Full Name | First Name | Surname | Address | Zip Code | Department |
---|---|---|---|---|---|
Emma Smith | Emma | Smith | 1 Smith St | 1234 | Physics |
Roger Ramjet | Roger | Ramjet | 2 Fast St | 2345 | Aeronautics |
Chris Pitt | Chris | Pitt | 4 Pit Lane | 3456 | Mechanics |
Emma Smith | Emma | Smith | 93 Jones St | 4567 | Biology |
Chen Lee | Chen | Lee | 101 Alphabet Rd | English |
In this simple example, we see the teacher Emma Smith teaches both Physics and Biology. Problems arise when we try to Create a record. If a teacher has not yet been assigned a subject to teach, the Department field will be NULL. Reading data from this type of table will be slower, as each record for Emma Smith is not related in any way. When updating a record for a teacher who teaches multiple subjects, each of the records needs to be updated, or the data will be inconsistent. Deleting a teachers record will become problematic as each record needs to be found and deleted.
To eliminate these anomolies, and potential data loss, Normalization is used. Normalization has four basic tenets.- First Normal Form (1NF)
- Second Normal Form (2NF)
- Third Normal Form (3NF)
- Boyce & Codd Normal Form (BCNF)
First Normal Form (1NF)
A row of data must not contain repeated data. This means each column must have a unique value. Each row of data must have a unique identifier, such as a Primary Key. In our Example Table we see several problems. There is not unique identifier, such as a Primary Key column, so this column cannot be guarenteed to be unique. The full name field is redundant, as this could be gained from the first name and surname fields, therefore the data is not atomic. There is no way of telling if the Emma Smith who teaches Physics, is the same Emma Smith who teaches Biology.
The solution to the problem of the Example Table to satisfy 1NF is to break the data into seperate tables and add unique identifiers.
Teacher ID | First Name | Surname | Address | Zip Code |
---|---|---|---|---|
1 | Emma | Smith | 1 Smith St | 1234 |
2 | Roger | Ramjet | 2 Fast St | 2345 |
3 | Chris | Pitt | 4 Pit Lane | 3456 |
4 | Chen | Lee | 101 Alphabet Rd | 4567 |
With the data seperated each record (row) has a unique identifier, the ID field, which is the Primary Key for this table in the database. The user Emma Smith now has the ID of 1. In the second table, the teacher ID of 1 is used for both Physics and Biology, thus the data is now normalized to First Normal Form (1NF).
But there is also another issue when a Primary Key for the second table is a concatenation of Department ID, and the Teacher ID.
Second Normal Form (2NF)
For a table to meet Second Normal Form (2NF), the data must satisfy the needs of 1NF, and there must not be any partial dependency of any column on primary key. The above, using a concatination of Department ID and Teacher ID does not satisy this second condition. There is no link between the Deparment Name, and the teacher.
To satisfy Second Normal Form (2NF), the tables are once again reduced to provide three tables.
- Teacher Table
- Department Table
- Teacher Department Link Table
Teacher ID | First Name | Surname | Address | Zip Code |
---|---|---|---|---|
1 | Emma | Smith | 1 Smith St | 1234 |
2 | Roger | Ramjet | 2 Fast St | 2345 |
3 | Chris | Pitt | 4 Pit Lane | 3456 |
4 | Chen | Lee | 101 Alphabet Rd | 4567 |
Department ID | Department |
---|---|
1 | Physics |
2 | Aeronautics |
3 | Mechanics |
4 | Biology |
5 | English |
Department ID | Teacher ID |
---|---|
1 | 1 |
2 | 2 |
3 | 3 |
4 | 1 |
5 | 4 |
With this third table in place, the tables now satisfy Second Normal Form (2NF). Each of the data rows is now accessable by its primary key (ID). But, I hear you say, are table JOINs not slow? No. JOINS on primary key fields are ATOMIC.
However, there is still a problem here. The address of the teacher depends upon the Zip Code field. This is called a "Transitive Functional Dependency".
Third Normal Form (3NF)
A table is said to satisfy Third Normal Form when it means the requirements of Second Normal Form, and has no Transitive Dependencies.
To fix the issue of the Transitive Dependency, one again the address data can be broken out into a seperate table. The zip code itself could be used as the primary key, or a seperate Primary Key field could be used. For the purpose of this tutorial, a seperate address table and ID field will be created with an address linking table to connect them.
Teacher ID | First Name | Surname | Address ID |
---|---|---|---|
1 | Emma | Smith | 101 |
2 | Roger | Ramjet | 102 |
3 | Chris | Pitt | 103 |
4 | Chen | Lee | 104 |
Department ID | Department |
---|---|
1 | Physics |
2 | Aeronautics |
3 | Mechanics |
4 | Biology |
5 | English |
Department ID | Teacher ID |
---|---|
1 | 1 |
2 | 2 |
3 | 3 |
4 | 1 |
5 | 4 |
Address ID | Address | Zip Code |
---|---|---|
101 | 1 Smith St | 1234 |
102 | 2 Fast St | 2345 |
103 | 4 Pit Lane | 3456 |
104 | 101 Alphabet Rd | 4567 |
The teacher table now relates to the address table via the Primary Key field Address IO. With this in place, the transitive dependency is gone and the table now satisfies third Normal Form (3NF).