.. meta:: :description: Database normalization :author: Serhii Horodilov :keywords: sql, db, database, normalization ******************************************************************************* Database Normalization ******************************************************************************* Normalization is a process of organizing data in a database so that it is free from redundancy and dependency. It helps to eliminate data inconsistencies and anomalies, thereby improving data integrity. Normalization is a set of rules or guidelines to design a database schema in such a way that it avoids data duplication, data redundancy, and data inconsistency. The main purpose of database normalization is to avoid complexities, eliminate duplicates, and organize data in a consistent way. Normalization reduces data redundancy and dependency, making the database more efficient, flexible, and scalable. It also helps in maintaining data consistency and accuracy, and ensures that updates and deletions are properly handled. There are 6 + 3 **normal forms**, but this chapter focuses on the first 3 only, since the commonest normal form you'll see out there is *3NF* (third normal form). Normalized tables are: - Easier to understand - Easier to enhance and extend - Protected from: - insertion anomalies - update anomalies - deletion anomalies Anomalies ========= Insertion anomalies ------------------- There are circumstances in which certain facts cannot be recorded at all. For example, each record in the "Faculty and Their Courses" relation might contain a *Faculty ID*, *Faculty Name* and *Course Code*. Therefore, the details of any faculty member who teaches at least one course can be recorded, but a newly hired faculty member who has not yet been assigned to teach any courses cannot be recorded, except by setting the course code to ``null``. +------------+----------------+-------------+ | Faculty ID | Faculty Name | Course Code | +============+================+=============+ | 389 | Dr. Headstrong | ENG-206 | +------------+----------------+-------------+ | 407 | Dr. Mugwort | CMP-101 | +------------+----------------+-------------+ | 407 | Dr. Mugwort | CMP-201 | +------------+----------------+-------------+ +------------+----------------+-------------+ | Faculty ID | Faculty Name | Course Code | +============+================+=============+ | 389 | Dr. Newone | ??? | +------------+----------------+-------------+ Update anomalies ---------------- The same information can be expressed on multiple rows; therefore updates to the relation may result in logical inconsistencies. For example, each record in an "Employees' Skills" relation might contain ah *Employee ID*, *Employee Address*, and *Skill*; thus a charge of address for a particular employee may need to be applied to multiple records (one for each skill). If the update is only partially successful -- the employee's address is updated on some records but not others -- then the relation is left in an inconsistent state. Specifically, the relation provides conflicting answers to the question of what this particular employee's address is. +-------------+-----------------------+-----------------+ | Employee ID | Employee Address | Skill | +=============+=======================+=================+ | 426 | 230 Parkland Crescent | typing | +-------------+-----------------------+-----------------+ | 426 | 230 Parkland Crescent | shorthand | +-------------+-----------------------+-----------------+ | 529 | 158 Watkins Place | public speaking | +-------------+-----------------------+-----------------+ | 529 | 108 Church Street | carpentry | +-------------+-----------------------+-----------------+ Deletion anomalies ------------------ Under certain circumstances, the deletion of data representing certain facts necessitates the deletion of data representing completely different facts. The "Faculty and Their Courses" relation described in the previous example suffers from this type of anomaly, for if a faculty member temporary ceases to be assigned to any courses, the last of the records on which that faculty member appears must be deleted, effectively also deleting the faculty member, unless the *Course Code* field is set to ``null``. First normal form (1NF) ======================= .. rubric:: Data example The table below represents players and their characters inventory, ordered by their characters' levels. +------------+----------------------------------------------------------------+ | username | inventory | +============+================================================================+ | prombery87 | 2 amulets, 3 rings | +------------+----------------------------------------------------------------+ | wheed1997 | 18 copper coins, 1 bow, 5 arrows | +------------+----------------------------------------------------------------+ | acen1999 | 1 sword, 1 shield, 4 rings | +------------+----------------------------------------------------------------+ Requirements ------------ - Row order should not be used to convey information. - Data types should not be mixed within the same column (and the RDBMS won't let you do in anyway). - Tables should have their primary keys. - Repeating groups should be eliminated. Applying 1NF ------------ First of all, using the rows order to represent the information violates 1NF. This can be easily fixed by adding additional column *level* to store the info about the character's level. From this, the ordered sequences can be fetched by using the query: .. code-block:: sql SELECT username FROM player ORDER BY level; The second rule is **do not** mix data types within the same column. As example, the characters' level might be described as follows: +------------+----------------------------------+-----------------------------+ | username | inventory | level | +============+==================================+=============================+ | prombery87 | 2 amulets, 3 rings | 9 | +------------+----------------------------------+-----------------------------+ | wheed1997 | 18 copper coins, 1 bow, 5 arrows | almost 8 | +------------+----------------------------------+-----------------------------+ | acen1999 | 1 sword, 1 shield, 4 rings | 5.5 | +------------+----------------------------------+-----------------------------+ The data types are mixed in *level* column, which violates 1NF; and normally, RDBMS won't let you do this. The tables without a **primary key** violates 1NF as well. *username* field may be used as **PK** for the players relation example. This would prevent the data duplication, like having both ``wheed1997`` players at level 7 and 8 at the same time, which makes no sense. +------------+----------------------------------+-----------------------------+ | username | inventory | level | +============+==================================+=============================+ | prombery87 | 2 amulets, 3 rings | 9 | +------------+----------------------------------+-----------------------------+ | wheed1997 | 18 copper coins, 1 bow, 5 arrows | 8 | +------------+----------------------------------+-----------------------------+ | wheed1997 | 18 copper coins, 1 bow, 5 arrows | 7 | +------------+----------------------------------+-----------------------------+ | acen1999 | 1 sword, 1 shield, 4 rings | 5 | +------------+----------------------------------+-----------------------------+ At last, *inventory* column contains repeating groups of data (also called *non-atomic data*), which violates 1NF. To normalize the table, it should be altered in a way, to represent only one inventory item per row. +------------+-------------+---------------+ | player_id | item_type | item_quantity | +============+=============+===============+ | prombery87 | amulet | 2 | +------------+-------------+---------------+ | prombery87 | ring | 3 | +------------+-------------+---------------+ | wheed1997 | copper coin | 18 | +------------+-------------+---------------+ | wheed1997 | bow | 1 | +------------+-------------+---------------+ | wheed1997 | arrow | 5 | +------------+-------------+---------------+ | acen1999 | sword | 1 | +------------+-------------+---------------+ | acen1999 | shield | 1 | +------------+-------------+---------------+ | acen1999 | ring | 4 | +------------+-------------+---------------+ The primary key for this table is the combination of *player_id* and *item_type* -- a composite primary key. Second normal form (2NF) ======================== Here, 2NF starts. For example, let's bring back *level* info to the table, representing players and theirs inventory. +------------+-------------+---------------+--------------+ | player_id | item_type | item_quantity | player_level | +============+=============+===============+==============+ | prombery87 | amulet | 2 | 9 | +------------+-------------+---------------+--------------+ | prombery87 | ring | 3 | 9 | +------------+-------------+---------------+--------------+ | wheed1997 | copper coin | 18 | 6 | +------------+-------------+---------------+--------------+ | wheed1997 | bow | 1 | 6 | +------------+-------------+---------------+--------------+ | wheed1997 | arrow | 5 | 6 | +------------+-------------+---------------+--------------+ | acen1999 | sword | 1 | 3 | +------------+-------------+---------------+--------------+ | acen1999 | shield | 1 | 3 | +------------+-------------+---------------+--------------+ | acen1999 | ring | 4 | 3 | +------------+-------------+---------------+--------------+ The player's *level* should be duplicated for each row. This table design keeps the doors open for **insertion**, **update** and **deletion** anomalies. Requirements ------------ - Each non-key attribute must depend on the entire primary key Applying 2NF ------------ The 2NF is all about "how non-key columns related to the primary key". The example above has *non-key attributes* (which is another term for the "non-key columns") are: *item_quantity* and *player_level*. The relations between the *primary key* and *non-key attributes* can be described as: :: { player_id, item_type } -> { item_quantity } { player_id } -> { player_level } The *item_quantity* column depends on the *player_id* and *item_type* columns, which creates a 1-to-1 dependency for this column on the entire primary key. This means, there is no way to place different values in the *item_quantity* column for the same primary key. But, the *player_level* column depends on the *player_id* only. It's a problem, because the *player_id* is not the primary key, but a part of a primary key. This violates the 2NF, and what's why the table is vulnerable to the problems. The **player** is the important concept for representing the data. So, it's the best solution to store player related data in a separate table. .. table:: player +------------+--------------+ | player_id | player_level | +============+==============+ | prombery87 | 9 | +------------+--------------+ | wheed1997 | 6 | +------------+--------------+ | acen1999 | 3 | +------------+--------------+ .. table:: player_inventory +------------+-------------+---------------+ | player_id | item_type | item_quantity | +============+=============+===============+ | prombery87 | amulet | 2 | +------------+-------------+---------------+ | prombery87 | ring | 3 | +------------+-------------+---------------+ | wheed1997 | copper coin | 18 | +------------+-------------+---------------+ | wheed1997 | bow | 1 | +------------+-------------+---------------+ | wheed1997 | arrow | 5 | +------------+-------------+---------------+ | acen1999 | sword | 1 | +------------+-------------+---------------+ | acen1999 | shield | 1 | +------------+-------------+---------------+ | acen1999 | ring | 4 | +------------+-------------+---------------+ Third normal form (3NF) ======================= Let's assume, the player table is extended with the *player_rating* column, which describes the players' skills as "beginner", "intermediate", or "advanced". And let's assume, the skills are arranged in some like: +--------------+-----------+--------------+-----------+ | Skill Level | Beginner | Intermediate | Advanced | +--------------+---+---+---+----+----+----+---+---+---+ | Player Level | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | +--------------+---+---+---+----+----+----+---+---+---+ .. table:: player +------------+--------------+---------------+ | player_id | player_level | player_rating | +============+==============+===============+ | prombery87 | 9 | advanced | +------------+--------------+---------------+ | wheed1997 | 6 | intermediate | +------------+--------------+---------------+ | acen1999 | 3 | beginner | +------------+--------------+---------------+ The *non-key attributes* dependencies can be described as follows: :: { player_id } -> { player_level } { player_id } -> { player_level } -> { player_skill } The *player_rating* does not depend on the primary key, but it depends on the *player_level* attribute, which is dependent on the primary key. These type of dependencies is called **transitive dependency**. Having these in a table opens a way for having data in an inconsistent form. E.g. let's assume player "acen1999" have increase their level to 4, but something went wrong, and *player_rating* has not been updated. +------------+--------------+---------------+ | acen1999 | 4 | beginner | +------------+--------------+---------------+ Requirements ------------ - Every non-key attribute in a table should depend on the key, the whole key, and nothing but the key (no transitive dependency). .. note:: **Boyce-Codd normal form** :abbr:`BCNF (Boyce-Codd Normal Form)` is the stronger version of 3NF. It also is called as 3.5NF (three-and-a-half normal form). **Every attribute** in a table should depend on the key, the whole key, and nothing but the key. Applying 3NF ------------ Introducing the "player_rating_levels" will fix the problem. .. table:: player +------------+--------------+ | player_id | player_level | +============+==============+ | prombery87 | 9 | +------------+--------------+ | wheed1997 | 6 | +------------+--------------+ | acen1999 | 3 | +------------+--------------+ .. table:: player_rating_levels +--------------+---------------+ | player_level | player_rating | +==============+===============+ | 9 | advanced | +--------------+---------------+ | 8 | advanced | +--------------+---------------+ | 7 | advanced | +--------------+---------------+ | 6 | intermediate | +--------------+---------------+ | 5 | intermediate | +--------------+---------------+ | 4 | intermediate | +--------------+---------------+ | 3 | beginner | +--------------+---------------+ | 2 | beginner | +--------------+---------------+ | 1 | beginner | +--------------+---------------+ Conclusion ========== Each next normal form includes all the requirements for the previous one. Following the rules above will lead you to having a fully normalized tables in 99% of time.