Introduction To Database Normalization

By Kevin Waterson 8756 12 Comments

What is Normalization?
The Problem
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)

What is normalization?

Database normalization is a design, or process of organizing data within a database to eliminate data redundancy that arranges data into tabular a form by removing duplicate data from relation tables, ensuring data dependencies make sense and data is logically stored.

The Problem

Data which is not normalized becomes difficult to manage. Simple Create, Read, Update and Delete (CRUD) tasks become laboious and slow and increase the risk of data loss. Data anomolies are frequent as demonstrated below.

Consider a table of school teachers. The table, in non normalized form, might looks like this..

Full Name	First Name	Surname	Address	Zip Code	Department
Emma Smith	Emma	Smith	1 Smith St	1234	Physics
Roger Ramjet	Roger	Ramjet	2 Fast St	2345	Aeronautics
Chris Pitt	Chris	Pitt	4 Pit Lane	3456	Mechanics
Emma Smith	Emma	Smith	93 Jones St	4567	Biology
Chen Lee	Chen	Lee	101 Alphabet Rd	English

In this simple example, we see the teacher Emma Smith teaches both Physics and Biology. Problems arise when we try to Create a record. If a teacher has not yet been assigned a subject to teach, the Department field will be NULL. Reading data from this type of table will be slower, as each record for Emma Smith is not related in any way. When updating a record for a teacher who teaches multiple subjects, each of the records needs to be updated, or the data will be inconsistent. Deleting a teachers record will become problematic as each record needs to be found and deleted.

To eliminate these anomolies, and potential data loss, Normalization is used. Normalization has four basic tenets.

First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce & Codd Normal Form (BCNF)

First Normal Form (1NF)

A row of data must not contain repeated data. This means each column must have a unique value. Each row of data must have a unique identifier, such as a Primary Key. In our Example Table we see several problems. There is not unique identifier, such as a Primary Key column, so this column cannot be guarenteed to be unique. The full name field is redundant, as this could be gained from the first name and surname fields, therefore the data is not atomic. There is no way of telling if the Emma Smith who teaches Physics, is the same Emma Smith who teaches Biology.

The solution to the problem of the Example Table to satisfy 1NF is to break the data into seperate tables and add unique identifiers.

Teacher ID	First Name	Surname	Address	Zip Code
1	Emma	Smith	1 Smith St	1234
2	Roger	Ramjet	2 Fast St	2345
3	Chris	Pitt	4 Pit Lane	3456
4	Chen	Lee	101 Alphabet Rd	4567

Department IDTeacher IDDepartment 11Physics 22RamjetAeronautics 33Mechanics 41Biology 54English

With the data seperated each record (row) has a unique identifier, the ID field, which is the Primary Key for this table in the database. The user Emma Smith now has the ID of 1. In the second table, the teacher ID of 1 is used for both Physics and Biology, thus the data is now normalized to First Normal Form (1NF).

But there is also another issue when a Primary Key for the second table is a concatenation of Department ID, and the Teacher ID.

Second Normal Form (2NF)

For a table to meet Second Normal Form (2NF), the data must satisfy the needs of 1NF, and there must not be any partial dependency of any column on primary key. The above, using a concatination of Department ID and Teacher ID does not satisy this second condition. There is no link between the Deparment Name, and the teacher.

To satisfy Second Normal Form (2NF), the tables are once again reduced to provide three tables.

Teacher Table
Department Table
Teacher Department Link Table

Teacher ID	First Name	Surname	Address	Zip Code
1	Emma	Smith	1 Smith St	1234
2	Roger	Ramjet	2 Fast St	2345
3	Chris	Pitt	4 Pit Lane	3456
4	Chen	Lee	101 Alphabet Rd	4567

Department ID	Department
1	Physics
2	Aeronautics
3	Mechanics
4	Biology
5	English

Department ID	Teacher ID
1	1
2	2
3	3
4	1
5	4

With this third table in place, the tables now satisfy Second Normal Form (2NF). Each of the data rows is now accessable by its primary key (ID). But, I hear you say, are table JOINs not slow? No. JOINS on primary key fields are ATOMIC.

However, there is still a problem here. The address of the teacher depends upon the Zip Code field. This is called a "Transitive Functional Dependency".

Third Normal Form (3NF)

A table is said to satisfy Third Normal Form when it means the requirements of Second Normal Form, and has no Transitive Dependencies.

To fix the issue of the Transitive Dependency, one again the address data can be broken out into a seperate table. The zip code itself could be used as the primary key, or a seperate Primary Key field could be used. For the purpose of this tutorial, a seperate address table and ID field will be created with an address linking table to connect them.

Teacher ID	First Name	Surname	Address ID
1	Emma	Smith	101
2	Roger	Ramjet	102
3	Chris	Pitt	103
4	Chen	Lee	104

Department ID	Department
1	Physics
2	Aeronautics
3	Mechanics
4	Biology
5	English

Department ID	Teacher ID
1	1
2	2
3	3
4	1
5	4

Address ID	Address	Zip Code
101	1 Smith St	1234
102	2 Fast St	2345
103	4 Pit Lane	3456
104	101 Alphabet Rd	4567

The teacher table now relates to the address table via the Primary Key field Address IO. With this in place, the transitive dependency is gone and the table now satisfies third Normal Form (3NF).