Database Normalization

February 19, 2021 (last updated September 15, 2022)

Introduction

Database normalization can be complicated. The definitions of the various "normal forms" are academic and not very practical, and all of this can leave one feeling lost. Hopefully, this guide can help: it aims to be a practical, succinct guide to database normalization for the working software engineer. In what follows we will look at the first few normal forms and their properties in simple language. (If you just want a table of normal forms and their properties, check out this table.)

Basic columns and special columns

In what follows I will refer to table columns as either "basic columns" or "special columns". When I say "basic", I really just mean "not special", which raises the core question: what is meant by "special column"? Simply put, a special column is a column you could use to uniquely identify rows within a table. (The technical version is that when I say "special column" I am mostly referring to an attribute, isolated by the column name, that is a member of a candidate key, a.k.a. a minimal superkey, of a relational schema, isolated by the table name. Sometimes it instead refers to a member of a non-minimal superkey.) It is of critical importance in database normalization to identify all the special columns of a table. This might naively seem easy to do, but determining the special columns of a table can quickly become a subtle and debatable business.
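
As a concrete sketch (the table and column names here are hypothetical), this is how you might declare a set of special columns to PostgreSQL with a composite UNIQUE constraint:

CREATE TABLE loans (
  id SERIAL PRIMARY KEY,  -- the serial id column, discussed next
  name text,
  book text,
  UNIQUE (name, book)     -- name and book together uniquely identify a row,
                          -- i.e., they form a candidate key
);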

That said, there is one special column that is always straightforward and will show up in essentially every production database table you will ever interact with: the serial id column. It's basically the most special column of all, because it exists for the sole purpose of uniquely identifying rows in a table. It's so special, in fact, that it's really too special, and so we will completely ignore it when discussing database normalization. I discuss this oddity in more detail in the next section.

Ignoring the serial id column

The reason we ignore the serial id column is that what we are really looking for is the naturally special columns. We are looking for the fundamental attributes, or properties, of the entities in question. We want to normalize the "real" data, the data you need in order to appropriately describe the entities. The serial id column isn't "real" data; it's not a property of the entity, it's just tacked on to a row to guarantee unique identification. Only by normalizing against the "real" data will you actually gain the benefits of normalization, which include less duplication and less redundancy.

So, while I will include the id column in examples, because that's the most practical way to interact with tables, please keep in mind that (1) we will be ignoring it, and (2) why we are choosing to do so (which, as a reminder, is because we are only concerned with "real" data).

First Normal Form, 1NF

First Normal Form, simply put, means don't put lists in table cells. (The technical version is: The data-domain of each attribute contains only atomic values.)

Putting a list in a cell is strongly discouraged unless you have a very good reason to do so. There are multiple datatypes that will allow you to do this, for example the ARRAY keyword in SQL, but just because you can doesn't mean you should.
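
For illustration, here is a hypothetical sketch of what this anti-pattern looks like with PostgreSQL's ARRAY type (the table and data are made up):

-- A list in a cell, made explicit with an array column (discouraged):
CREATE TABLE users_with_arrays (
  id SERIAL PRIMARY KEY,
  name text,
  books text[]  -- multiple values in a single cell, violating 1NF
);
INSERT INTO users_with_arrays (name, books) VALUES
  ('Austin', ARRAY['East of Eden', 'Green Eggs and Ham']);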

Before I discuss the most common way I see this violation occur, I want to briefly discuss JSON and JSONB data in table cells. JSON(B) is a valid data type, and has many justifiable use cases, including NoSQL-like data storage, but it should still be avoided unless you have a good reason to use it. Section 8.14.2 of the PostgreSQL docs states the following on this topic:

Ideally, JSON documents should each represent an atomic datum that business rules dictate cannot reasonably be further subdivided into smaller datums that could be modified independently.
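
To make that concrete, here is a hypothetical sketch of a justifiable use: a payload that is always read and written as a single unit, never updated key-by-key. The table and payload are made up for illustration:

-- The whole jsonb document is treated as one atomic value.
CREATE TABLE events (
  id SERIAL PRIMARY KEY,
  payload jsonb
);
INSERT INTO events (payload) VALUES ('{"type": "login", "success": true}');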

Now, probably the most common way you'll see 1NF being violated is with cells containing comma-separated values as follows:

 id |  name  |               books
----+--------+------------------------------------
  1 | Austin | East of Eden,Green Eggs and Ham
  2 | Bertha | Grapes of Wrath,The Cat in the Hat

The solution to the problem is to create a row per book instead:

 id |  name  |        book
----+--------+--------------------
  1 | Austin | East of Eden
  2 | Austin | Green Eggs and Ham
  3 | Bertha | Grapes of Wrath
  4 | Bertha | The Cat in the Hat

See Appendix A for a SQL script that transforms the unnormalized example data into the 1NF data shown above. It is in a format that's easy to copy and paste and experiment with yourself, which I encourage you to do. (For this and every other script, I will be using PL/pgSQL and writing .sql files against a local PostgreSQL instance.)

Second Normal Form, 2NF

Second Normal Form, simply put, means that, in addition to being in 1NF, every basic column is determined by all of the special columns. (To phrase it in the negative: no basic column can be determined by just some of the special columns.) (The technical version is: no non-prime attribute is functionally determined by a proper subset of any candidate key.)

An example 1NF table that violates 2NF is as follows:

 id |  name  |        book        | price
----+--------+--------------------+-------
  1 | Austin | East of Eden       |    25
  2 | Austin | Green Eggs and Ham |    10
  3 | Bertha | East of Eden       |    25
  4 | Bertha | Grapes of Wrath    |    20

The price column is determined completely by the book column, and nothing else. But the special columns are both book and name. This partial determination of price by just book, and not also name, is a violation of 2NF. The logic here is that if a special column has no impact on a basic column, then the basic column probably shouldn't be in the table at all.
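
One practical way to test such a claim against the data is to check whether any book maps to more than one price. This sketch assumes the table is named users_and_books, as it is in Appendix B:

-- If price were not determined by book alone, some book would show more
-- than one distinct price and this query would return rows.
SELECT book, count(DISTINCT price) AS distinct_prices
FROM users_and_books
GROUP BY book
HAVING count(DISTINCT price) > 1;
-- Zero rows back is evidence that book alone determines price, which,
-- since name is also a special column, is the 2NF-violating partial
-- dependency.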

The solution is separating out the book details into a second table:

books:

 id |       title        | price
----+--------------------+-------
  1 | East of Eden       |    25
  2 | Grapes of Wrath    |    20
  3 | Green Eggs and Ham |    10

users:

 id |  name  | book_id
----+--------+---------
  1 | Austin |       1
  2 | Austin |       3
  3 | Bertha |       1
  4 | Bertha |       2

This is sufficient to satisfy Second Normal Form. (Note that if this data were, for example, representing the books owned by each user, you'd be better off creating a join table, users_books, with foreign key references to both users and books; a minimal sketch follows this paragraph. In that situation, the last table above could represent half of a full join on users and users_books. I'm otherwise ignoring join tables in this guide. I will perhaps include them later, since they are pertinent to good database maintenance.)
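
For completeness, here is that minimal join-table sketch, assuming users and books tables whose serial id columns are primary keys (the names are hypothetical):

CREATE TABLE users_books (
  user_id integer REFERENCES users ON DELETE CASCADE,
  book_id integer REFERENCES books ON DELETE CASCADE,
  PRIMARY KEY (user_id, book_id)  -- one row per (user, book) pairing
);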

See Appendix B for a SQL script that transforms the 1NF table into 2NF tables as shown above. It is in a format that's easy to copy and paste and experiment with yourself, which I encourage you to do.

Third Normal Form, 3NF

Third Normal Form, simply put, means that, in addition to being in 2NF, every basic column is directly determined by the special columns. (To phrase it in the negative: no basic column is determined by the special columns through another basic column.) (The technical version is: no non-prime attribute is transitively functionally determined through another non-prime attribute. Equivalently: in every non-trivial functional dependency, either the determinant is a superkey or the determined attribute is prime.) This might sound confusingly similar to 2NF, but the key difference is direct determination: 2NF still allows for indirect determination, otherwise known as transitive dependency.

An example 2NF table that violates 3NF is as follows:

 id | loan_date  | loanee | loanee_dob | book_id
----+------------+--------+------------+---------
  1 | 2021-03-03 | Bertha | 1902-02-02 |       3
  2 | 2021-03-03 | Bertha | 1902-02-02 |       7
  3 | 2021-02-02 | Austin | 1900-01-01 |       3

The problem is that loanee_dob is determined by loanee, but loanee is not one of the special columns, which are book_id and loan_date. It is true that if you give me the special columns, book_id and loan_date, I will find the loanee_dob unique to that row, but the "truth" of loanee_dob depends on the loanee. This indicates that loanee_dob is indirectly (i.e., transitively) dependent on the special columns, through the loanee column. This is the kind of indirectness we are trying to avoid.
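
You can hunt for this kind of transitive dependency the same way as before. This sketch assumes the table is named loans, as it is in Appendix C:

-- If loanee did not determine loanee_dob, some loanee would show more
-- than one distinct date of birth and this query would return rows.
SELECT loanee, count(DISTINCT loanee_dob) AS distinct_dobs
FROM loans
GROUP BY loanee
HAVING count(DISTINCT loanee_dob) > 1;
-- Zero rows back is evidence that loanee determines loanee_dob; since
-- loanee is a basic column, loanee_dob depends on the special columns
-- only transitively, violating 3NF.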

The solution is extracting the transitively dependent columns into their own table. For example:

users:

 id |  name  |    dob
----+--------+------------
  1 | Bertha | 1902-02-02
  2 | Austin | 1900-01-01

loans:

 id | loan_date  | loanee_id | book_id
----+------------+-----------+---------
  1 | 2021-03-03 |         1 |       3
  2 | 2021-03-03 |         1 |       7
  3 | 2021-02-02 |         2 |       3

See Appendix C for a SQL script that transforms the 2NF table into 3NF tables as shown above. It is in a format that's easy to copy and paste and experiment with yourself, which I encourage you to do.

A note about Boyce-Codd Normal Form, BCNF, 3.5NF

There are edge cases in the technical definition of 3NF that BCNF attempts to address. (The best articulation of the difference between 3NF and BCNF/3.5NF that I've read occurs near the end of this section.) To the best of my knowledge, the simple takeaway is essentially the same as it was for 3NF, i.e., every basic column is directly determined by the special columns.
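
For a flavor of the edge case, here is a sketch of the classic textbook situation (the table and column names are hypothetical). Every column belongs to some candidate key, so 3NF has nothing to complain about, yet one dependency still escapes the special columns:

-- Assume each tutor teaches exactly one subject (tutor determines subject)
-- and each student has at most one tutor per subject ((student, subject)
-- determines tutor). The candidate keys are (student, subject) and
-- (student, tutor), so every column is prime and 3NF is satisfied, but
-- tutor -> subject has a determinant that is not a superkey, violating BCNF.
CREATE TABLE tutoring (
  student text,
  subject text,
  tutor   text,
  PRIMARY KEY (student, subject),
  UNIQUE (student, tutor)
);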

Fourth Normal Form, 4NF

Fourth Normal Form, in a sense, cannot be put simply. If I had to try, I'd say that 4NF, "simply put", means that, in addition to being in 3NF, if a group of cells across one or more columns can determine a group of cells in one or more other columns, then the original group of columns must at least be among the special columns. Since that "simple" sentence barely makes sense, I'll just include the technical version here instead of hiding it in a footnote. The technical version is: in every non-trivial multivalued dependency, the determining attributes form a superkey, minimal or otherwise.

I have not found a dataset that, in my opinion, is unequivocally both in 3NF and in violation of 4NF, so at the moment I do not have an example or a script to transform the relation. The Wikipedia pages for 3NF, BCNF, and 4NF are good starting points for learning more about 4NF. When I find a good example, I will include it here.

Fifth Normal Form and beyond

Things get quite academic and arguably non-practical after 4NF, and maybe even after 3NF, although it does seem that 4NF violations occur in practice.

I have chosen not to cover 5NF and the other higher-order properties that can be found in this table (which I also linked to in the intro). It is an interesting field of research, so I encourage you to dig deeper and look into these properties yourself.

Conclusion

I hope you found this guide helpful. If you have any questions or feedback, feel free to reach out! Happy coding :D

Appendices

Appendix A: SQL script converting unnormalized data to 1NF

This is probably not the best solution, but it is a fully contained one. You are encouraged to copy and paste it and experiment with it. For example, if you saved this file as example.sql, you could run the following from the command line (assuming you are using PostgreSQL):

createdb test
psql -d test -f example.sql

Similar functionality can be achieved with other data types. For example, if your data was in a text ARRAY data type, you could use unnest(books) instead of regexp_split_to_table(); a short sketch of that variant follows the script below.

DROP TABLE IF EXISTS users;
CREATE TABLE users (id SERIAL PRIMARY KEY, name text, books text);
INSERT INTO users (name, books) VALUES
  ('Austin', 'East of Eden,Green Eggs and Ham'),
  ('Bertha', 'Grapes of Wrath,The Cat in the Hat');

ALTER TABLE users ADD COLUMN book text;

DO $$
  DECLARE
    user_record record;
    user_book text;
  BEGIN
    -- For each original row, split the comma-separated books column and
    -- insert one new row per book. The FOR loop's cursor takes its snapshot
    -- up front, so the rows inserted here are not themselves re-visited.
    FOR user_record IN
      SELECT * FROM users
    LOOP
      FOR user_book IN
        SELECT regexp_split_to_table(user_record.books, ',')
      LOOP
        INSERT INTO users (name, book) VALUES (user_record.name, user_book);
      END LOOP;
    END LOOP;
  END;
$$;

DELETE FROM users WHERE book IS NULL;
ALTER TABLE users DROP COLUMN books;

SELECT * FROM users;
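
As mentioned above, had the books been stored in a text ARRAY column, the inner loop could have used unnest() instead. A hypothetical fragment showing how the loop would change, plus a standalone demonstration of unnest():

-- With a column declared as books text[], the inner loop becomes:
--   FOR user_book IN SELECT unnest(user_record.books) LOOP ... END LOOP;
SELECT unnest(ARRAY['East of Eden', 'Green Eggs and Ham']) AS book;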

Appendix B: SQL script converting 1NF to 2NF

This is probably not the best solution, but it is a fully contained one. You are encouraged to copy and paste it and experiment with it. For example, if you saved this file as example.sql, you could run the following from the command line (assuming you are using PostgreSQL):

createdb test
psql -d test -f example.sql
DROP TABLE IF EXISTS users, books, users_and_books;
CREATE TABLE users_and_books (
  id SERIAL PRIMARY KEY,
  name text, book text,
  price integer
);
INSERT INTO users_and_books (name, book, price) VALUES
  ('Austin', 'East of Eden', 25),
  ('Austin', 'Green Eggs and Ham', 10),
  ('Bertha', 'East of Eden', 25),
  ('Bertha', 'Grapes of Wrath', 20);

CREATE TABLE books (id SERIAL PRIMARY KEY, title text, price integer);
CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  name text,
  book_id integer REFERENCES books ON DELETE CASCADE
);

WITH
  books_rows AS (SELECT DISTINCT book AS title, price FROM users_and_books)
INSERT INTO books(title, price) SELECT title, price FROM books_rows;

DO $$
  DECLARE
    r users_and_books%ROWTYPE;
    the_book_id books.id%type;
  BEGIN
    -- For each original row, look up the id of that row's book, then insert
    -- a user row that references the book by id rather than repeating it.
    FOR r IN
      SELECT * FROM users_and_books
    LOOP
      SELECT id INTO the_book_id FROM books WHERE title = r.book;
      INSERT INTO users(name, book_id) VALUES (r.name, the_book_id);
    END LOOP;
  END;
$$
LANGUAGE plpgsql;

DROP TABLE users_and_books;
SELECT * FROM users;
SELECT * FROM books;

Appendix C: SQL script converting 2NF to 3NF

This is probably not the best solution, but it is a fully contained one. You are encouraged to copy and paste it and experiment with it. For example, if you saved this file as example.sql, you could run the following from the command line (assuming you are using PostgreSQL):

createdb test
psql -d test -f example.sql
DROP TABLE IF EXISTS loans, users;
CREATE TABLE loans (
  id SERIAL PRIMARY KEY,
  book_id integer,
  loan_date DATE,
  loanee text,
  loanee_dob DATE
);

INSERT INTO loans (book_id, loan_date, loanee, loanee_dob) VALUES
  (3, '2021-03-03', 'Bertha', '1902-02-02'),
  (7, '2021-03-03', 'Bertha', '1902-02-02'),
  (3, '2021-02-02', 'Austin', '1900-01-01');

CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  name text,
  dob DATE
);

INSERT INTO users(name, dob)
  SELECT DISTINCT loanee AS name, loanee_dob AS dob
  FROM loans;

ALTER TABLE loans
  ADD COLUMN user_id integer REFERENCES users ON DELETE RESTRICT;

DO $$
  DECLARE
    r users%ROWTYPE;
  BEGIN
    -- Point each loan at its user row, matching on the (soon-to-be-dropped)
    -- loanee and loanee_dob columns.
    FOR r IN
      SELECT * FROM users
    LOOP
      UPDATE loans SET user_id = r.id
      WHERE loanee = r.name AND loanee_dob = r.dob;
    END LOOP;
  END;
$$
LANGUAGE plpgsql;

ALTER TABLE loans
  DROP COLUMN loanee,
  DROP COLUMN loanee_dob;

SELECT * FROM users;
SELECT * FROM loans;