Database Normalization

Normalization is a process of organizing data in a database so that it is free from redundancy and dependency. It helps to eliminate data inconsistencies and anomalies, thereby improving data integrity. Normalization is a set of rules or guidelines to design a database schema in such a way that it avoids data duplication, data redundancy, and data inconsistency.

The main purpose of database normalization is to avoid complexities, eliminate duplicates, and organize data in a consistent way. Normalization reduces data redundancy and dependency, making the database more efficient, flexible, and scalable. It also helps in maintaining data consistency and accuracy, and ensures that updates and deletions are properly handled.

There are 6 + 3 normal forms, but this chapter focuses on the first 3 only, since the commonest normal form you’ll see out there is 3NF (third normal form).

Normalized tables are:

Easier to understand
Easier to enhance and extend
Protected from:
- insertion anomalies
- update anomalies
- deletion anomalies

Anomalies

Insertion anomalies

There are circumstances in which certain facts cannot be recorded at all. For example, each record in the “Faculty and Their Courses” relation might contain a Faculty ID, Faculty Name and Course Code. Therefore, the details of any faculty member who teaches at least one course can be recorded, but a newly hired faculty member who has not yet been assigned to teach any courses cannot be recorded, except by setting the course code to null.

Faculty ID	Faculty Name	Course Code
389	Dr. Headstrong	ENG-206
407	Dr. Mugwort	CMP-101
407	Dr. Mugwort	CMP-201

Faculty ID	Faculty Name	Course Code
389	Dr. Newone	???

Update anomalies

The same information can be expressed on multiple rows; therefore updates to the relation may result in logical inconsistencies. For example, each record in an “Employees’ Skills” relation might contain ah Employee ID, Employee Address, and Skill; thus a charge of address for a particular employee may need to be applied to multiple records (one for each skill). If the update is only partially successful – the employee’s address is updated on some records but not others – then the relation is left in an inconsistent state. Specifically, the relation provides conflicting answers to the question of what this particular employee’s address is.

Employee ID	Employee Address	Skill
426	230 Parkland Crescent	typing
426	230 Parkland Crescent	shorthand
529	158 Watkins Place	public speaking
529	108 Church Street	carpentry

Deletion anomalies

Under certain circumstances, the deletion of data representing certain facts necessitates the deletion of data representing completely different facts. The “Faculty and Their Courses” relation described in the previous example suffers from this type of anomaly, for if a faculty member temporary ceases to be assigned to any courses, the last of the records on which that faculty member appears must be deleted, effectively also deleting the faculty member, unless the Course Code field is set to null.

First normal form (1NF)

Data example

The table below represents players and their characters inventory, ordered by their characters’ levels.

username	inventory
prombery87	2 amulets, 3 rings
wheed1997	18 copper coins, 1 bow, 5 arrows
acen1999	1 sword, 1 shield, 4 rings

Requirements

Row order should not be used to convey information.
Data types should not be mixed within the same column (and the RDBMS won’t let you do in anyway).
Tables should have their primary keys.
Repeating groups should be eliminated.

Applying 1NF

First of all, using the rows order to represent the information violates 1NF. This can be easily fixed by adding additional column level to store the info about the character’s level. From this, the ordered sequences can be fetched by using the query:

SELECT username FROM player ORDER BY level;

The second rule is do not mix data types within the same column. As example, the characters’ level might be described as follows:

username	inventory	level
prombery87	2 amulets, 3 rings	9
wheed1997	18 copper coins, 1 bow, 5 arrows	almost 8
acen1999	1 sword, 1 shield, 4 rings	5.5

The data types are mixed in level column, which violates 1NF; and normally, RDBMS won’t let you do this.

The tables without a primary key violates 1NF as well. username field may be used as PK for the players relation example. This would prevent the data duplication, like having both wheed1997 players at level 7 and 8 at the same time, which makes no sense.

username	inventory	level
prombery87	2 amulets, 3 rings	9
wheed1997	18 copper coins, 1 bow, 5 arrows	8
wheed1997	18 copper coins, 1 bow, 5 arrows	7
acen1999	1 sword, 1 shield, 4 rings	5

At last, inventory column contains repeating groups of data (also called non-atomic data), which violates 1NF. To normalize the table, it should be altered in a way, to represent only one inventory item per row.

player_id	item_type	item_quantity
prombery87	amulet	2
prombery87	ring	3
wheed1997	copper coin	18
wheed1997	bow	1
wheed1997	arrow	5
acen1999	sword	1
acen1999	shield	1
acen1999	ring	4

The primary key for this table is the combination of player_id and item_type – a composite primary key.

Second normal form (2NF)

Here, 2NF starts. For example, let’s bring back level info to the table, representing players and theirs inventory.

player_id	item_type	item_quantity	player_level
prombery87	amulet	2	9
prombery87	ring	3	9
wheed1997	copper coin	18	6
wheed1997	bow	1	6
wheed1997	arrow	5	6
acen1999	sword	1	3
acen1999	shield	1	3
acen1999	ring	4	3

The player’s level should be duplicated for each row. This table design keeps the doors open for insertion, update and deletion anomalies.

Requirements

Each non-key attribute must depend on the entire primary key

Applying 2NF

The 2NF is all about “how non-key columns related to the primary key”. The example above has non-key attributes (which is another term for the “non-key columns”) are: item_quantity and player_level.

The relations between the primary key and non-key attributes can be described as:

{ player_id, item_type } -> { item_quantity }
{ player_id } -> { player_level }

The item_quantity column depends on the player_id and item_type columns, which creates a 1-to-1 dependency for this column on the entire primary key. This means, there is no way to place different values in the item_quantity column for the same primary key.

But, the player_level column depends on the player_id only. It’s a problem, because the player_id is not the primary key, but a part of a primary key. This violates the 2NF, and what’s why the table is vulnerable to the problems.

The player is the important concept for representing the data. So, it’s the best solution to store player related data in a separate table.

player
player_id	player_level
prombery87	9
wheed1997	6
acen1999	3

player_inventory
player_id	item_type	item_quantity
prombery87	amulet	2
prombery87	ring	3
wheed1997	copper coin	18
wheed1997	bow	1
wheed1997	arrow	5
acen1999	sword	1
acen1999	shield	1
acen1999	ring	4

Third normal form (3NF)

Let’s assume, the player table is extended with the player_rating column, which describes the players’ skills as “beginner”, “intermediate”, or “advanced”. And let’s assume, the skills are arranged in some like:

Skill Level	Beginner			Intermediate			Advanced
Player Level	1	2	3	4	5	6	7	8	9

player
player_id	player_level	player_rating
prombery87	9	advanced
wheed1997	6	intermediate
acen1999	3	beginner

The non-key attributes dependencies can be described as follows:

{ player_id } -> { player_level }
{ player_id } -> { player_level } -> { player_skill }

The player_rating does not depend on the primary key, but it depends on the player_level attribute, which is dependent on the primary key. These type of dependencies is called transitive dependency. Having these in a table opens a way for having data in an inconsistent form. E.g. let’s assume player “acen1999” have increase their level to 4, but something went wrong, and player_rating has not been updated.

acen1999

4

beginner

Requirements

Every non-key attribute in a table should depend on the key, the whole key, and nothing but the key (no transitive dependency).

Note

Boyce-Codd normal form

BCNF is the stronger version of 3NF. It also is called as 3.5NF (three-and-a-half normal form).

Every attribute in a table should depend on the key, the whole key, and nothing but the key.

Applying 3NF

Introducing the “player_rating_levels” will fix the problem.

player
player_id	player_level
prombery87	9
wheed1997	6
acen1999	3

player_rating_levels
player_level	player_rating
9	advanced
8	advanced
7	advanced
6	intermediate
5	intermediate
4	intermediate
3	beginner
2	beginner
1	beginner

Conclusion

Each next normal form includes all the requirements for the previous one.

Following the rules above will lead you to having a fully normalized tables in 99% of time.