PHPRO.ORG

Introduction To Database Normalization

Introduction To Database Normalization

Contents

  1. What is Normalization?
  2. The Problem
  3. First Normal Form (1NF)
  4. Second Normal Form (2NF)
  5. Third Normal Form (3NF)

What is normalization?

Database normalization is a design, or process of organizing data within a database to eliminate data redundancy that arranges data into tabular a form by removing duplicate data from relation tables, ensuring data dependencies make sense and data is logically stored.

The Problem

Data which is not normalized becomes difficult to manage. Simple Create, Read, Update and Delete (CRUD) tasks become laboious and slow and increase the risk of data loss. Data anomolies are frequent as demonstrated below.

Consider a table of school teachers. The table, in non normalized form, might looks like this..

Full NameFirst NameSurnameAddressZip CodeDepartment
Emma SmithEmmaSmith1 Smith St1234Physics
Roger RamjetRogerRamjet2 Fast St2345Aeronautics
Chris PittChrisPitt4 Pit Lane3456Mechanics
Emma SmithEmmaSmith93 Jones St4567Biology
Chen LeeChenLee101 Alphabet RdEnglish

In this simple example, we see the teacher Emma Smith teaches both Physics and Biology. Problems arise when we try to Create a record. If a teacher has not yet been assigned a subject to teach, the Department field will be NULL. Reading data from this type of table will be slower, as each record for Emma Smith is not related in any way. When updating a record for a teacher who teaches multiple subjects, each of the records needs to be updated, or the data will be inconsistent. Deleting a teachers record will become problematic as each record needs to be found and deleted.

To eliminate these anomolies, and potential data loss, Normalization is used. Normalization has four basic tenets.

  1. First Normal Form (1NF)
  2. Second Normal Form (2NF)
  3. Third Normal Form (3NF)
  4. Boyce & Codd Normal Form (BCNF)

First Normal Form (1NF)

A row of data must not contain repeated data. This means each column must have a unique value. Each row of data must have a unique identifier, such as a Primary Key. In our Example Table we see several problems. There is not unique identifier, such as a Primary Key column, so this column cannot be guarenteed to be unique. The full name field is redundant, as this could be gained from the first name and surname fields, therefore the data is not atomic. There is no way of telling if the Emma Smith who teaches Physics, is the same Emma Smith who teaches Biology.

The solution to the problem of the Example Table to satisfy 1NF is to break the data into seperate tables and add unique identifiers.

Teacher IDFirst NameSurnameAddressZip Code
1EmmaSmith1 Smith St1234
2RogerRamjet2 Fast St2345
3ChrisPitt4 Pit Lane3456
4ChenLee101 Alphabet Rd4567
Department IDTeacher IDDepartment 11Physics 22RamjetAeronautics 33Mechanics 41Biology 54English

With the data seperated each record (row) has a unique identifier, the ID field, which is the Primary Key for this table in the database. The user Emma Smith now has the ID of 1. In the second table, the teacher ID of 1 is used for both Physics and Biology, thus the data is now normalized to First Normal Form (1NF).

But there is also another issue when a Primary Key for the second table is a concatenation of Department ID, and the Teacher ID.

Second Normal Form (2NF)

For a table to meet Second Normal Form (2NF), the data must satisfy the needs of 1NF, and there must not be any partial dependency of any column on primary key. The above, using a concatination of Department ID and Teacher ID does not satisy this second condition. There is no link between the Deparment Name, and the teacher.

To satisfy Second Normal Form (2NF), the tables are once again reduced to provide three tables.

  1. Teacher Table
  2. Department Table
  3. Teacher Department Link Table
Teacher IDFirst NameSurnameAddressZip Code
1EmmaSmith1 Smith St1234
2RogerRamjet2 Fast St2345
3ChrisPitt4 Pit Lane3456
4ChenLee101 Alphabet Rd4567
Department IDDepartment
1Physics
2Aeronautics
3Mechanics
4Biology
5English
Department IDTeacher ID
11
22
33
41
54

With this third table in place, the tables now satisfy Second Normal Form (2NF). Each of the data rows is now accessable by its primary key (ID). But, I hear you say, are table JOINs not slow? No. JOINS on primary key fields are ATOMIC.

However, there is still a problem here. The address of the teacher depends upon the Zip Code field. This is called a "Transitive Functional Dependency".

Third Normal Form (3NF)

A table is said to satisfy Third Normal Form when it means the requirements of Second Normal Form, and has no Transitive Dependencies.

To fix the issue of the Transitive Dependency, one again the address data can be broken out into a seperate table. The zip code itself could be used as the primary key, or a seperate Primary Key field could be used. For the purpose of this tutorial, a seperate address table and ID field will be created with an address linking table to connect them.

Teacher IDFirst NameSurnameAddress ID
1EmmaSmith101
2RogerRamjet102
3ChrisPitt103
4ChenLee104
Department IDDepartment
1Physics
2Aeronautics
3Mechanics
4Biology
5English
Department IDTeacher ID
11
22
33
41
54
Address IDAddressZip Code
1011 Smith St1234
1022 Fast St2345
1034 Pit Lane3456
104101 Alphabet Rd4567

The teacher table now relates to the address table via the Primary Key field Address IO. With this in place, the transitive dependency is gone and the table now satisfies third Normal Form (3NF).