Normalization in SQL

AJAY NEGI
8 min readNov 22, 2020

Data in the database is stored in terms of enormous quantity, retrieving certain data will be a tedious task if the data is not organized correctly. We can achieve all this with the help of normalization, and also reduce the redundant data. The technical definition could be :

The systematic approach of decomposing the table in order to eliminate anomalies is referred as Normalization. It is a multi-step process that puts data into tabular form removing the duplicated data from its relational tables.

Two purpose of Normalization

1. It is used to eliminate repeated data, having repeated data in the system not only makes the process slow, but will cause trouble during the later part of the transactions.

2. To ensure the data dependencies make some logical sense, usually the data is stored in database with certain logic, huge datasets without any purpose are completely waste, its like having an abundant resource without any application.

Normalization came into existence because of the problems that occurred on data, now lets look at those problems which are technically called as data anomalies. If a table is not properly normalized and has data redundancy then it will not only take up the extra memory space, but will also make it difficult to handle and update the database.

Types of anomalies

Insertion anomaly

Suppose for a new position in a company Mr. Atul is selected, but the department has not been allocated for him, in that case if we want to update his information to the database, we need to set the department information as null, similarly if we have to insert data of thousand employees who are in similar situation then the department information will be repeated for al those thousand employees, this scenario is the classical example of insertion anomaly.

Update anomaly

What if Mr Atul leaves the company or is no longer the head of the development department, in that case all the employee records will have to be updated and if by mistake if miss any record, it will dead to data inconsistency, this type of problem is nothing but updation anomaly.

Deletion anomaly

Assume in our employee table two different pieces of information are kept together that is employee information and department information, hence at the end of financial year, if employee records are deleted, we will also lose the department information, this is nothing but deletion anomaly.

So these were some of the problems that occurred while managing the data, to eliminate all these anomalies, normalization came into picture.

There are many normal many forms which are still under development but lets focus on the very basic and the essential ones only, so we will be talking about first normal form (1NF), second normal form(2NF), third normal form(3NF) and Boyce-codd normal form. So without wasting any further time lets proceed to understand first normal form(1NF).

Different types of Normal forms

INF

In first normal form, we tackle the problem of atomicity, here atomicity means values in the table should not be further divided in simple terms, a single cell cannot hold multiple values, if a table contains a composite or multi-valued attributes, it violates the first normal form.

So the following functions are performed in first normal form :-

1. Removes repeating groups from the table.

2. Creates a separate table for each set of related data.

3. Identify each set of related data with a primary key.

To understand this in a better way, lets take a example by using this given table.

In this table we have employee Id, employee name, phone number and salary as columns. You can clearly see that the phone number column has two values, thus it violates the 1NF. Now if we apply the first normal form to the above table, we get the following result:-

In this table each and every row is distinct, that is no cell has multiple values, the table has achieved atomicity.

First normal form is simple and can be easily identified. Now in the table you can clearly see there is no multiple values in each and every column, that means we have achieved the first normal form(1NF).

2NF

A table is said to be in second normal form only when it fulfills the follwoing condition

  1. It has to be in 1NF.
  2. 2. Table also should not contain partial dependency. Here partial dependency means, the proper subset of a candidate key determines a non-prime attribute. Now you must be thinking, what is non prime attribute, lets understand this in a simple way, attributes that form a candidate key in a table are called “Prime attributes” and the rest of the attributes of the relation are called “Non-prime attributes”. For eg :- In a table, prime attributes can be like employee-id and department-id, and the non-prime attributes can be like office location.

To understand 2NF, lets consider the following table:

This table has a composite primary key, that is employee-id and department-id together makes a primary key, the non key attribute is the office location. In this case, office location only depends on department-id, which is the part of primary key, therefore this table does not satisfy the second normal form. So you must be thinking, “what to do in such scenarios?”.

The answer is simple, split the table accordingly. To bring this table to 2NF, we need to break the table into two parts which will give the following tables, the first table has employee ID and department ID as columns, the second one has department Id and office location as columns. As you can see, we have removed the partial functional dependency that we initially had. Now in the table, the column office location is fully dependent on the primary key of that table, which is nothing but department-id.

3NF

Third normal form is the normal form that is used in normalizing the table to reduce the duplication of data and ensure referential integrity. The following condition has to be met by the table to be in third normal form:-

1. It has to be in 2NF form.

2. There should be no transitive dependency for non-prime attributes.

The 3nf was designed, firstly to eliminate undesirable data anomalies, and secondly to reduce the need for restructuring over time, and finally to make the data model more informative.

Lets look at an example to understand 3NF more clearly:

In the above table student-id determines subject-Id and subject-id determines subject, therefore student-id determines subject via subject-id, this shows that we have transitive functional dependency, so this table does not satisfy the 3NF.

Now in order to achieve the 3NF, we need to divide the table as shown below:-

Firstly lets divide the table and store student-id, student name, subject-id and address in it, all the columns are referring to the primary key, which is student-id. Let the second table have subject-id and subject column, so that subject is dependent only on subject-id and not on student-id. As you can see from the above table, all the non-key attributes are now fully functionally dependent only on the primary key. In the first table, column such as student name, subject-id and address are only dependent on student-id whereas in the second table, subject is only dependent on subject-id.

BCNF(Boyce Codd Normal Form)

BCNF is basically a more developed form of 3NF, which was developed to address certain types of anomalies which were not dealt with 3NF. A table can only be said to be in BCNF, if it satisfies these conditions:

1. It has to be in 3rd Normal form.

2. Every functional dependency, that is A implies B, then A has to be the super key of that particular table. So you must be thinking “What is a super key”?. A super key is a group of single or multiple keys which are capable of identifying rows in a table.

Lets take a example to clearly understand BCNF.

In the give table, one student can enroll for multiple subjects, there can be multiple professor teaching one subject, and for each subject, a professor is assigned to the student, these are the necessary condition of this table. In this table all the normal forms are satisfied, except BCNF, as you can see that student-id an subject form the primary key, which means that the subject column is prime attribute, but there is one more dependency, that is professor is depending on subject, and well subject is a prime attribute whereas professor is non-prime attribute, which is not allowed by BCNF. Now in order to satisfy the BCNF, we will be dividing the table into two parts.

The table at the top will hold student ID which already exists, and we will create a new column that is professor-id and in the second table, which is below, we will have the columns:- professor-id, professor and subject columns. You must be thinking now “Why do we need to have a new column that is professor-id”. By doing this we are removing non prime attributes functional dependency, in the second table, professor-id will be the super key of that table and remaining column will be functionally dependent on it, by doing this we are satisfying BCNF.

Conclusion.

This brings us to the end of the blog, I hope you have clearly understood the concept of normalization and its different types. Lets meet on my next blog. See ya.

--

--

AJAY NEGI

Software Engineer Trainee at Mount blue Technologies.