Two terms come up constantly in RDBMS design: “normalization” and “denormalization”. Normalization is the process of eliminating redundant data by decomposing tables into smaller relations that are linked through keys. Denormalization is the reverse: normalized relations are deliberately combined back toward a pre-normalized state. This stores more redundant data on disk, but it can cut processing time for reads because fewer joins are needed. There are three basic normal forms (1NF, 2NF, and 3NF), each one stricter than the previous one. Every query that was possible in 1NF remains possible in 3NF, although it may need additional joins, so the application’s queries may have to change accordingly.
The basic concepts introduced here will be helpful when we go through some examples later in this article.
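To make the trade-off between the two approaches concrete, here is a minimal sketch using Python's built-in sqlite3 module. The student, department, and student_flat tables and all of their rows are hypothetical, invented only for illustration. It stores the same data in normalized and denormalized form and shows that both answer the same question, the normalized version by way of a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: each department name is stored exactly once.
    CREATE TABLE department (
        dept_id   INTEGER PRIMARY KEY,
        dept_name TEXT NOT NULL
    );
    CREATE TABLE student (
        roll_no INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES department(dept_id)
    );
    INSERT INTO department VALUES (1, 'Physics'), (2, 'History');
    INSERT INTO student VALUES
        (20205, 'John', 1), (20206, 'Mary', 1), (20207, 'Alex', 2);

    -- Denormalized: the joined result is stored, repeating department names.
    CREATE TABLE student_flat AS
        SELECT s.roll_no, s.name, d.dept_name
        FROM student AS s JOIN department AS d ON d.dept_id = s.dept_id;
""")

# The normalized tables need a join to answer "who studies what?".
normalized_rows = conn.execute("""
    SELECT s.roll_no, s.name, d.dept_name
    FROM student AS s JOIN department AS d ON d.dept_id = s.dept_id
    ORDER BY s.roll_no
""").fetchall()

# The denormalized table answers the same question without a join.
flat_rows = conn.execute(
    "SELECT roll_no, name, dept_name FROM student_flat ORDER BY roll_no"
).fetchall()

# Count how many times the string 'Physics' is stored in each design.
physics_copies_normalized = conn.execute(
    "SELECT COUNT(*) FROM department WHERE dept_name = 'Physics'"
).fetchone()[0]
physics_copies_flat = conn.execute(
    "SELECT COUNT(*) FROM student_flat WHERE dept_name = 'Physics'"
).fetchone()[0]
```

Both queries return the same rows; the difference is that the flat table stores the department name once per student, while the normalized design stores it once in total.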
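Database design has several goals, and normalization serves two of them: storing each fact only once and keeping the data consistent when it changes.
To understand the basic concepts behind data normalization, let’s take an example.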
Suppose we are given student data described by two attributes, student name and roll number, and suppose the same name and roll number pairs are kept in more than one table. We want to be able to look up this data efficiently whenever it is required. It is better to combine these tables into a single table called StudentRollNumber with just two columns, student name and roll number, in which each roll number, and therefore each student’s record, appears exactly once. This eliminates the redundant copies and lets queries run against one authoritative table.
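The single-table idea can be sketched with Python's built-in sqlite3 module. The table name StudentRollNumber comes from the example above; the column names RollNumber and StudentName, the helper functions, and the sample rows are assumptions made for illustration. Declaring the roll number as the primary key is what guarantees that each roll number appears only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE StudentRollNumber (
        RollNumber  INTEGER PRIMARY KEY,  -- each roll number may appear once
        StudentName TEXT NOT NULL
    )
""")


def add_student(roll_number, student_name):
    """Insert one student; return False if the roll number already exists."""
    try:
        conn.execute(
            "INSERT INTO StudentRollNumber VALUES (?, ?)",
            (roll_number, student_name),
        )
        return True
    except sqlite3.IntegrityError:
        # The primary key rejected a duplicate roll number.
        return False


def name_for(roll_number):
    """Look up a student name by roll number; return None if not found."""
    row = conn.execute(
        "SELECT StudentName FROM StudentRollNumber WHERE RollNumber = ?",
        (roll_number,),
    ).fetchone()
    return row[0] if row else None


first_insert = add_student(20205, "John")
duplicate_insert = add_student(20205, "Johnny")  # same roll number again
```

Here the second insert is refused, so the table never holds two conflicting names for the same roll number.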
Data normalization is the process of converting relations into relations that follow certain basic rules called normal forms, viz. 1st Normal Form (1NF), 2nd Normal Form (2NF), and 3rd Normal Form (3NF). A relation that violates a normal form can still be stored and queried, but it is prone to redundancy and to insert, update, and delete anomalies, so it should first be converted to a form that satisfies the normal form we are working with. To understand this better, let’s look at what exactly we mean by data redundancy.
Let us take an example:
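Suppose there is a relation R(A1, A2, ..., An), where n > 1, whose attributes A1 through An each draw their values from some domain. Each row of R is called a tuple and holds one value for every attribute. R is said to be in 1NF if and only if every value in every tuple is atomic, i.e., a single indivisible value rather than a list or a nested group.
For example, with R(A1, A2) where A1 is the roll number and A2 is the name, the tuple (20205, “John”) states that the student with roll number 20205 has the name “John”.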
Invalid Relation or Data Redundancy
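TL;DR – A relation holds redundant data when the same fact is stored more than once, which means you’re not deduping your data properly. Packing several values into one attribute, or repeating an attribute as a group of near-identical columns, additionally violates the First Normal Form (1NF).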
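Why redundancy matters can be shown with a small sketch in plain Python. All of the records below (department names, phone numbers, and the second student) are made up for illustration. When a fact is copied onto several rows, updating only one copy leaves the data contradicting itself; when the fact is stored once, that cannot happen.

```python
# Redundant design: the department phone is repeated on every student row.
students_flat = [
    {"roll_no": 20205, "name": "John", "dept": "Physics", "dept_phone": "555-0100"},
    {"roll_no": 20206, "name": "Mary", "dept": "Physics", "dept_phone": "555-0100"},
]

# An update that touches only one of the copies...
students_flat[0]["dept_phone"] = "555-0199"

# ...leaves two different phone numbers on record for the same department.
phones_flat = {row["dept_phone"] for row in students_flat if row["dept"] == "Physics"}

# Deduplicated design: the phone is stored once and referenced by name.
departments = {"Physics": {"phone": "555-0100"}}
students = [
    {"roll_no": 20205, "name": "John", "dept": "Physics"},
    {"roll_no": 20206, "name": "Mary", "dept": "Physics"},
]

# A single update is now visible to every student row.
departments["Physics"]["phone"] = "555-0199"
phones_normalized = {departments[row["dept"]]["phone"] for row in students}
```

After the partial update, phones_flat contains two conflicting values, whereas phones_normalized contains only the new number.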
First Normal Form (1NF)
A relation schema R is said to be in 1NF if and only if the following conditions are satisfied:
All attributes are atomic; group (composite or multi-valued) attributes are not allowed. E.g.: College (Name, Address, Head of the Department Name, no. of faculty) instead of College (Name, Address, Head of the Department Name, Faculty’s name).
In this case Faculty’s name holds a whole group of names rather than one atomic value, so the second schema violates 1st Normal Form; no. of faculty is a single number and is therefore atomic.
There are no repeating groups of attributes.
There are no multi-valued attributes: each attribute holds exactly one value per tuple.
TL;DR – Make sure every row is uniquely identified by a primary key and every column holds a single atomic value per row.
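The conversion to 1NF can be sketched in a few lines of Python. A field that holds several faculty names at once is the kind of multi-valued attribute that 1NF rules out, so the sketch moves those names into a relation of their own with one name per row. The function name, the dictionary keys, and the sample colleges are all hypothetical; this is one possible way to do the split, not a prescribed procedure.

```python
def to_first_normal_form(colleges):
    """Split the multi-valued faculty list out into its own relation.

    Returns (college_rows, faculty_rows), where every value in every
    row is a single atomic string.
    """
    college_rows = []
    faculty_rows = []
    for college in colleges:
        college_rows.append(
            (college["name"], college["address"], college["head_of_department"])
        )
        # One row per faculty member, keyed by the college name.
        for faculty_name in college["faculty_names"]:
            faculty_rows.append((college["name"], faculty_name))
    return college_rows, faculty_rows


# Not in 1NF: "faculty_names" holds a list instead of one atomic value.
unnormalized = [
    {
        "name": "Science College",
        "address": "1 Main Road",
        "head_of_department": "Dr. Rao",
        "faculty_names": ["A. Khan", "B. Lee", "C. Diaz"],
    },
    {
        "name": "Arts College",
        "address": "9 Lake Street",
        "head_of_department": "Dr. Silva",
        "faculty_names": ["D. Roy"],
    },
]

college_rows, faculty_rows = to_first_normal_form(unnormalized)
```

The result is two relations, College (Name, Address, Head of the Department Name) and a faculty relation of (college name, faculty name) pairs, and the number of faculty per college can still be derived by counting rows.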
Second Normal Form (2NF)
A relation schema R is said to be in 2NF if and only if the following conditions are satisfied:
R is in 1NF, and every non-key attribute is fully functionally dependent on the whole of every candidate key. In other words, no partial dependencies exist: no non-key attribute is determined by only part of a composite key. As before, if attributes A1, A2, ..., An have domains D1, D2, ..., Dn respectively, every value of Ai must be drawn from Di for all 1 <= i <= n. A related term is transitive dependency, in which a non-key attribute depends on the key only through another non-key attribute; removing transitive dependencies is the requirement of 3NF and is not needed to satisfy 2NF.
TL;DR – All partial dependencies must be removed, and the table should follow domain integrity.
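2NF is commonly defined in terms of partial dependencies, where a non-key attribute depends on only part of a composite key. The sketch below illustrates this with a hypothetical enrollment relation whose key is (roll_no, course_id); the relation, its rows, and the helper determines are assumptions made for this example. Note that sample data can only refute a functional dependency, never prove it, since a dependency is a statement about every possible state of the relation.

```python
def determines(rows, lhs, rhs):
    """Return True if, in these rows, equal `lhs` values always come with
    equal `rhs` values, i.e. the data is consistent with lhs -> rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # Remember the first value seen for this key; a later mismatch
        # refutes the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Key is (roll_no, course_id); student_name and grade are non-key attributes.
enrollment = [
    {"roll_no": 20205, "course_id": "DB101", "student_name": "John", "grade": "A"},
    {"roll_no": 20205, "course_id": "OS201", "student_name": "John", "grade": "B"},
    {"roll_no": 20206, "course_id": "DB101", "student_name": "Mary", "grade": "A"},
]

# student_name depends on roll_no alone: a partial dependency.
name_on_part_of_key = determines(enrollment, ["roll_no"], ["student_name"])
# grade is not fixed by roll_no alone; it needs the whole key.
grade_on_part_of_key = determines(enrollment, ["roll_no"], ["grade"])
grade_on_whole_key = determines(enrollment, ["roll_no", "course_id"], ["grade"])

# Decomposition: move the partially dependent attribute to its own relation.
students = sorted({(r["roll_no"], r["student_name"]) for r in enrollment})
grades = [(r["roll_no"], r["course_id"], r["grade"]) for r in enrollment]
```

After the split, each student name is stored once in students, and grades keeps only the attribute that depends on the full key.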
What is domain integrity?
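Domain integrity requires that every value in a column of a relation/table belongs to that column’s domain (the list of allowed values). Tables that don’t exhibit this property contain what are called ‘domain anomalies.’ Example: City (StateCodeNo, Name)
If StateCodeNo is defined as a numeric state code, a tuple such as (‘CA’, ‘Los Angeles’) violates domain integrity, because ‘CA’ is not a value from that column’s domain and is therefore not allowed in the City table. Keeping the valid state codes in a table of their own, which City then refers to, makes the rule straightforward to enforce, so the design is split into two tables, one for states and another for cities, as below: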
State (StateCodeNo, Name)
City (StateCodeNo, Name, Population)
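The two tables above could be declared as follows using Python's built-in sqlite3 module. This is a sketch under several assumptions that are not in the original: StateCodeNo is treated as an integer code, the code 6 for California and the population figure are placeholder values, the composite primary key and the non-negative population check are illustrative choices, and try_insert_city is a hypothetical helper. Note that SQLite only enforces foreign keys after the PRAGMA shown below has been issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default
conn.executescript("""
    CREATE TABLE State (
        StateCodeNo INTEGER PRIMARY KEY,
        Name        TEXT NOT NULL
    );
    CREATE TABLE City (
        -- Every city must refer to a state code that exists in State.
        StateCodeNo INTEGER NOT NULL REFERENCES State(StateCodeNo),
        Name        TEXT NOT NULL,
        -- Domain rule (assumed): a population cannot be negative.
        Population  INTEGER NOT NULL CHECK (Population >= 0),
        PRIMARY KEY (StateCodeNo, Name)
    );
    INSERT INTO State VALUES (6, 'California');  -- placeholder state code
""")


def try_insert_city(state_code, name, population):
    """Insert a city; return False if a constraint rejects the row."""
    try:
        conn.execute(
            "INSERT INTO City VALUES (?, ?, ?)", (state_code, name, population)
        )
        return True
    except sqlite3.IntegrityError:
        return False


valid_row = try_insert_city(6, "Los Angeles", 1000)      # placeholder population
text_code = try_insert_city("CA", "Los Angeles", 1000)   # 'CA' is not a known code
unknown_state = try_insert_city(7, "Nowhere", 10)        # no state with code 7
negative_population = try_insert_city(6, "Tiny", -5)     # fails the CHECK
```

Only the first insert succeeds; the other three are rejected by the foreign key or the check constraint, which is how the database itself keeps out-of-domain values from reaching the City table.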
Normal Form 2NF
This “refers to the application of general normalization procedures to a table so that each attribute corresponds directly to an identifiable attribute of the original relation”, i.e., there are no repeating groups, only atomic columns.
As you can see, merely following these two rules does not necessarily leave your data fully normalized. There are several levels of normalization, and how far you go depends on the type of queries you run on the table, how it is designed, and so on.