Database Normalization

In database design, database normalization is the process of structuring tables in a relational database system. A well-normalized database can reduce data redundancy and eliminate data anomalies, though it may sacrifice some querying performance: it involves decomposing a larger table into smaller, more manageable tables, and then establishing relationships between them. 

There is a series of structured guidelines, called normal forms, that help organize data to reduce redundancy, improve data integrity and prevent anomalies:

  • First Normal Form (1NF): Each column contains atomic values and prohibits arrays in a cell; a table is two-dimensional. Also, each row should be uniquely identified.
  • Second Normal Form (2NF): Extends 1NF by requiring that all non-prime columns (columns not part of the primary key) have full functional dependencies on the entire primary key columns, not just part of it.
  • Third Normal Form (3NF): Further builds upon 2NF by eliminating any transitive dependencies. This ensures that non-prime columns depend only on the primary key and not on other non-key columns.
  • Boyce-Codd Normal Form (BCNF): Similar to 3NF but stricter, aimed at eliminating structures that could introduce rare logical inconsistencies in the data. It ensures that every non-trivial functional dependency of columns is on a superkey.
  • Forth Normal Form (4NF): Enhances BCNF by addressing certain types of complex multi-valued dependencies that are not handled by 3NF, ensuring every multi-valued dependency is dependent on a superkey.
  • Fifth Normal Form (5NF): 4NF plus every join dependency for the table is a result of the candidate keys. It addresses where certain join dependencies exist that are not covered by other normal forms.

While achieving higher normal forms is desired depending on specific needs, in many practical cases, 3NF is typically considered as fully-normalized. 

1NF: First Normal Form

The First Normal Form (1NF) is the most basic level of normalization. It ensures that a table's structure is simple and that all of its data are stored in a way that allows for efficient querying. The key requirement for a table to be in 1NF is that all its columns must contain atomic values and each row must be uniquely identified. Let's look at an example to illustrate the concept of 1NF.

Job History Table:

Employee ID Start Date End Date Job ID Department ID Salary
101 21-SEP-07, 28-OCT-11 27-OCT-11, 15-MAR-15 AC_ACCOUNT, AC_MGR 110 17000, 17000
102 13-JAN-11 24-JUL-16 IT_PROG 60 17000
114 24-MAR-16 31-DEC-17 ST_CLERK 50 11000
122 01-JAN-17 31-DEC-17 ST_CLERK 50 7900
176 24-MAR-16, 01-JAN-1731-DEC-16, 31-DEC-17 SA_REP, SA_MAN 80 8600, 8600
20017-SEP-05, 01-JUL-12 17-JUN-11, 31-DEC-16AD_ASST, AC_ACCOUNT 110 4400, 4400
201 17-FEB-14 19-DEC-17 MK_REP 20 13000

This table violates 1NF. Three columns - Start Date, End Date, and Job ID - contain non-atomic values that could be further separated by commas. As a result, the Employee ID is not a unique identifier for each row: there can be multiple Start Date, End Date, and Job ID values associated with an Employee ID value.

To achieve 1NF, you should restructure this table, ensuring each column contains indivisible values. Specifically, separate the inner arrays in the Start Date, End Date, and Job ID columns into individual rows.

Job History Table (1NF):

Employee ID  Start DateEnd Date Job ID Department ID Salary
10121-SEP-0727-OCT-11AC_ACCOUNT11017000
101 28-OCT-1115-MAR-15 AC_MGR 110 17000
102 13-JAN-1124-JUL-16 IT_PROG 60 17000
114 24-MAR-16 31-DEC-17 ST_CLERK 50 11000
122 01-JAN-17 31-DEC-17 ST_CLERK 50 7900
176 24-MAR-16 31-DEC-16 SA_REP 80 8600
176 01-JAN-17 31-DEC-17 SA_MAN 80 8600
200 17-SEP-05 17-JUN-11 AD_ASST 110 4400
200 01-JUL-12 31-DEC-16 AC_ACCOUNT 110 4400
201 17-FEB-14 19-DEC-17 MK_REP 20 13000

This table now satisfies 1NF. Each cell in any of the columns contains indivisible atomic values. Also, the combination of the three columns (Employee ID, Start Date, End Date) can serve as a unique identifier for each row: there is no two rows where all three column values are equal. This simplifies queries as you don't need complex logic to handle multiple values in a cell. 

2NF: Second Normal Form

In relational databases, functional dependency is the fundamental concept that describes the relationship between columns in a table. It is a constraint that dictates how the values in one column determine the values in another column. Put simply, a functional dependency states that the value of one set of columns uniquely determines the value of another column within the same table. For example, in the 1NF table we've seen earlier, one set of columns (Employee ID, Start Date, End Date) uniquely determines the value of Job ID: any one employee can hold one and only one job title during a given period.

The Second Normal Form (2NF) is built upon the foundation of 1NF. 2NF eliminates any partial dependencies in a table by ensuring that all non-prime columns are fully functionally dependent on the entire primary key. Here, partial dependency refers to a situation where a non-key column (column that is not part of the primary key) depends on only a part of the primary key, rather than on the entire primary key. For example, in the 1NF table mentioned earlier, the Salary column depends only on Employee ID, not on the entire primary key (which is the combination of Employee ID, Start Date, and End Date.) This is the scenario where we refer to as a partial dependency occurs. 

To move this table into 2NF, we should decompose it into two separate tables as follows.

Job History Table (2NF):

Employee ID  Start DateEnd Date Job ID Department ID
10121-SEP-0727-OCT-11AC_ACCOUNT110
101 28-OCT-1115-MAR-15 AC_MGR 110
102 13-JAN-1124-JUL-16 IT_PROG 60
114 24-MAR-16 31-DEC-17 ST_CLERK 50
122 01-JAN-17 31-DEC-17 ST_CLERK 50
176 24-MAR-16 31-DEC-16 SA_REP 80
176 01-JAN-17 31-DEC-17 SA_MAN 80
200 17-SEP-05 17-JUN-11 AD_ASST 90
200 01-JUL-12 31-DEC-16 AC_ACCOUNT 110
201 17-FEB-14 19-DEC-17 MK_REP 20

Salary Table (2NF):

Employee ID  Salary
101 17000
101 17000
102 17000
114 11000
122 7900
176 8600
176 8600
200 4400
200 4400
201 13000

Now, both tables are in 2NF. The 1NF and 2NF particularly useful for heavy read-only cases. For systems with millions of records, lots of complex querying and no need to add new data or update existing records, 1NF and 2NF offer faster record retrieval.

3NF: Third Normal Form

In the context of database normalization, a transitive dependency occurs when a non-prime column (a column that is not part of the primary key) depends on another non-prime column. To be more specific, consider three columns: A, B, and C. Given A acts as the primary key, A determines B and C. However, if the column B can also fully determine the column C independently, we say that there is a transitive dependency.

Third Normal Form (3NF) is built upon 2NF and eliminates any transitive dependencies in a 2NF table, ensuring that non-prime attributes are transitively dependent only on the primary key. For example, let's consider the 2NF job history table we made earlier.

Job History Table (2NF):

Employee ID  Start DateEnd Date Job ID Department ID
10121-SEP-0727-OCT-11AC_ACCOUNT110
101 28-OCT-1115-MAR-15 AC_MGR 110
102 13-JAN-1124-JUL-16 IT_PROG 60
114 24-MAR-16 31-DEC-17 ST_CLERK 50
122 01-JAN-17 31-DEC-17 ST_CLERK 50
176 24-MAR-16 31-DEC-16 SA_REP 80
176 01-JAN-17 31-DEC-17 SA_MAN 80
200 17-SEP-05 17-JUN-11 AD_ASST 90
200 01-JUL-12 31-DEC-16 AC_ACCOUNT 110
201 17-FEB-14 19-DEC-17 MK_REP 20

In this table, notice that the Department ID is fully dependent on Job ID, whereas the composite primary key consists of Employee ID, Start Date, and End Date. That is, any Job ID value belongs to one and only one Department ID column value. Thus, knowing the job ID allows us to determine the corresponding department ID. This situation is referred to as a transitive dependency, and thereby the table violates 3NF. 

To achieve 3NF, you should decompose the table as follows.

Job History Table (3NF):

Employee ID  Start DateEnd Date Job ID
10121-SEP-0727-OCT-11AC_ACCOUNT
101 28-OCT-1115-MAR-15 AC_MGR
102 13-JAN-1124-JUL-16 IT_PROG
114 24-MAR-16 31-DEC-17 ST_CLERK
122 01-JAN-17 31-DEC-17 ST_CLERK
176 24-MAR-16 31-DEC-16 SA_REP
176 01-JAN-17 31-DEC-17 SA_MAN
200 17-SEP-05 17-JUN-11 AD_ASST
200 01-JUL-12 31-DEC-16 AC_ACCOUNT
201 17-FEB-14 19-DEC-17 MK_REP

Job ID Table (3NF):

Job ID Department ID
AC_ACCOUNT110
AC_MGR 110
AD_ASST 110
IT_PROG 60
SA_MAN 80
SA_REP 80
ST_CLERK 50
MK_REP 20

The 3NF tables are ideal when lots of data entry and updates are involved, as the 3NF design minimizes the possible duplication of data and the associated risk of data conflicts. However, if your goal is to build a read-heavy table, you may consider 1NF or 2NF, as 3NF requires table joins to retrieve the whole information.

Denormalization

Database normalization offers many benefits, but it also introduces some drawbacks. It minimizes data redundancy, improves data integrity, and enhances data security. However, normalized databases often necessitate many table joins for data retrieval, as information is distributed across multiple tables. This complexity may potentially impact query performance, particularly with more complex queries, due to the added overhead of joining tables.

Denormalization is the process of introducing intentional data redundancy by reversing the database normalization. This is usually employed to avoid expensive SQL joins between 3NF tables. The cost of denormalization is the advantages of normalization we have to give up. So, one should apply denormalization selectively as a last resort if the querying performance problems are experienced.

Rather than denormalizing the actual tables, one might consider introducing a materialized view. It is a database object that provides precomputed snapshot of the result set of a complex query. The database objects from which a materialized view retrieves data-such as tables, views, and other materialized views-are referred to as detailed tables[1][2]. 

For data warehousing purposes, which involve maintaining a central repository of integrated data from multiple sources within an organization, commonly created materialized views include materialized aggregate views, single-table materialized aggregate views, and materialized join views. All these types of materialized views can leverage query rewrite, an optimization technique that transforms a user request written in terms of detailed tables into a semantically equivalent request that incorporates one or more materialized views.


[1] Term used in the context of data warehousing purpose. Database objects where a materialized view query data from is called master tables when it is for replication purpose.  
[2] Materialized views can also serve as read-only copies of remote data on a local machine. For the purpose, primary key, rowid, object, and/or subqueries are commonly created as materialized views in replication environments.  
[3] And thereby it is distinguished from   

Post a Comment

0 Comments