Database Basics #5 - Schema Design

Last Edited: 4/24/2025

This blog post introduces concepts regarding schema design in database systems.


So far, we have been discussing SQL statements and features that allow us to perform CRUD operations while maintaining reliability and security as much as possible. However, without proper schema design, operations can become less performant and data integrity can be undermined. Therefore, in this article, we will cover two important concepts of schema design: normalization and indexing.

Normalization

In the context of relational databases, normalization refers to the process of organizing data for performance and reliability by applying a series of rules and arriving at well-structured logical schemas called normal forms. There are normal forms from the first to the fifth, ranging from least to most strict. The first normal form (1NF) prohibits using tuple orders to convey information about the data, having divisible attribute values, lacking a primary key, and repeating attributes (like item_1, item_2, and so on as columns).

Customer_Product

customer_name (in the order of date registered) | item_1 | item_2 | item_3 | loyalty_level (1~10) | status
A | "name: pen, price: 30, quantity: 1" | NULL | NULL | 10 | "vip"
C | "name: pencil, price: 10" | "name: eraser, price: 5, quantity: 1" | "name: notebook, price: 15, quantity: 1" | 5 | "moderate"
B | NULL | NULL | NULL | 1 | "standard"

The above relation, Customer_Product, is defined based on a schema that violates every rule of 1NF. We can clearly see how difficult it is to operate on (e.g., querying the customers who bought more than 10 pencils in descending order of customer registration date) and to maintain data integrity (e.g., the quantity of pencils bought by customer C is missing, and we would need to add more columns for a customer who bought more than three categories of items). The following shows the Customer_Product relation in 1NF.

Customer_Product

customer_name | item_name | item_price | item_quantity | date_registered | loyalty_level (1~10) | status
A | pen | 30 | 1 | 2021/04/05 | 10 | vip
C | pencil | 10 | 5 | 2021/04/10 | 5 | moderate
C | eraser | 5 | 1 | 2021/04/10 | 5 | moderate
C | notebook | 15 | 1 | 2021/04/10 | 5 | moderate

Here, the combination of customer_name and item_name is set as the primary key, which avoids duplicates and satisfies the rules of 1NF. This already looks much better to work with than the previous version (see the query sketch below), though there are still improvements we can make. For example, if customer A cancels an order, all of customer A's information (like date_registered, loyalty_level, and status) will be lost (as has already happened to customer B, who made no purchases). Also, this information is duplicated for customer C. These problems occur because the non-key attributes (date_registered, loyalty_level, and status) depend only on customer_name, yet they are placed in a table whose primary key is customer_name and item_name. Hence, the second normal form (2NF) builds upon 1NF and requires non-key attributes to depend on the entire key.
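
As a rough sketch against the 1NF table above, the query that was awkward before (customers who bought more than 10 pencils, in descending order of registration date) becomes a straightforward SELECT:

-- Assumes the 1NF Customer_Product table above; names are illustrative.
SELECT customer_name
FROM Customer_Product
WHERE item_name = 'pencil'
  AND item_quantity > 10
ORDER BY date_registered DESC;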

Customer
customer_namedate_registeredloyalty_levelstatus
A2021/04/0510vip
B2021/04/211standard
C2021/04/105moderate
Customer_Product
customer_nameitem_nameitem_priceitem_quantity
Apen301
Cpencil105
Ceraser51
Cnotebook151

The above shows the modified relations that follow 2NF. To adhere to the rule, we derived a new relation called Customer with a primary key of customer_name. We can also introduce a foreign key constraint on the customer_name attribute for data integrity. We can now recover customer B, who made no purchases, and guarantee that customer information will not be deleted even if a purchase is rolled back. We can also avoid the information duplication we saw in the previous version. At first glance, the above appears to have no problems, but there is still a risk: status depends only on loyalty_level rather than directly on the key (a transitive dependency).
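
A minimal sketch of the 2NF decomposition in SQL (the column types and constraints are assumptions, not part of the original example):

-- Customer information lives in its own relation.
CREATE TABLE Customer (
    customer_name   TEXT PRIMARY KEY,
    date_registered DATE NOT NULL,
    loyalty_level   INTEGER NOT NULL CHECK (loyalty_level BETWEEN 1 AND 10),
    status          TEXT NOT NULL
);

-- Purchases reference the customer via a foreign key.
CREATE TABLE Customer_Product (
    customer_name TEXT NOT NULL REFERENCES Customer (customer_name),
    item_name     TEXT NOT NULL,
    item_price    NUMERIC NOT NULL,
    item_quantity INTEGER NOT NULL,
    PRIMARY KEY (customer_name, item_name)
);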

This is because if we adjust a customer's loyalty level, we need to ensure that the status is also properly updated when appropriate, which we might forget. We can enforce consistent updates using transactions and triggers, though triggers add some overhead and might not be visible to every user of the schema. The dependency might also be so complicated that maintaining it with transactions and triggers becomes complex. Hence, the third normal form (3NF) aims to eliminate this risk by requiring non-key attributes to depend directly, and only, on the entire key. In this case, we can set up another relation called Loyalty_Status that maps every loyalty level to the corresponding status.
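
A minimal sketch of that 3NF refinement (the ALTER statements assume the Customer table sketched above):

-- Each loyalty level maps to exactly one status.
CREATE TABLE Loyalty_Status (
    loyalty_level INTEGER PRIMARY KEY CHECK (loyalty_level BETWEEN 1 AND 10),
    status        TEXT NOT NULL
);

-- Customer keeps only the loyalty level; status is obtained via a join.
ALTER TABLE Customer DROP COLUMN status;
ALTER TABLE Customer
    ADD FOREIGN KEY (loyalty_level) REFERENCES Loyalty_Status (loyalty_level);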

4th & 5th Normal Forms

If a schema is in 3NF, it's usually easy and intuitive to operate on while maintaining consistency (the schema examples in the past articles have been at least in 3NF), though there are still some non-trivial risks. For example, when we set up a table of smartphones with various models, colorways, and storage capacities, we might be tempted to create a single table with model, color, and storage together as the primary key. Such a table satisfies 3NF, but adding a new color to a model involves inserting every combination of the new color and the storage capacities, which is prone to error. The problem arises because we have two multivalued dependencies within the key (between model and color, and between model and storage).

Fourth normal form (4NF) restricts such non-trivial multivalued dependencies to the key only. To bring the schema into 4NF, we can create separate tables, Model_Color and Model_Storage, which allows us to add a new color in a single insert (as sketched below). The fifth normal form (5NF) enforces an even stricter rule: a table cannot be describable as the logical result of joining other tables together, meaning that tables must be split as much as possible. (For the Customer_Product example, we should have another table, Product, with the product-related attributes, and multivalued dependencies should be split further.)
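
A minimal sketch of the 4NF decomposition (the column types and the sample values are assumptions):

-- Each multivalued dependency gets its own table.
CREATE TABLE Model_Color (
    model TEXT NOT NULL,
    color TEXT NOT NULL,
    PRIMARY KEY (model, color)
);

CREATE TABLE Model_Storage (
    model   TEXT NOT NULL,
    storage TEXT NOT NULL,
    PRIMARY KEY (model, storage)
);

-- Adding a new color for a model is now a single-row insert:
INSERT INTO Model_Color VALUES ('phone_x', 'blue');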

While it's safest to always build schemas in 4NF and 5NF, there are scenarios where 4NF is overkill. For example, when storing data about a few courses and their instructors that barely changes and raises no serious concerns about accuracy, having an extra Course_Instructor table storing the course and instructor IDs just for 4NF might be redundant. Hence, it's important to analyze whether 4NF and 5NF are appropriate for the scenario at hand.

Indexing

To optimize the performance of these operations, a DBMS utilizes data structures like the B-tree or B+ tree with O(log n) time complexity. For example, PostgreSQL and many other DBMSs automatically set up indexes for primary keys (and unique constraints) and build B-trees (by default) over the indexed columns to achieve fast insertion, querying, and deletion. For a table with millions of rows, there can be orders of magnitude of difference between queries based on indexed columns and those based on non-indexed columns (a linear scan with O(n) time complexity).
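
As a quick illustration (the users table and its columns here are hypothetical), EXPLAIN shows which plan PostgreSQL chooses for each kind of query:

-- id is the primary key and therefore indexed automatically;
-- email has no index in this sketch.
EXPLAIN SELECT * FROM users WHERE id = 12345;              -- typically an Index Scan
EXPLAIN SELECT * FROM users WHERE email = 'a@example.com'; -- typically a Seq Scan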

Aside from the automatically generated indexes, we can set up additional indexes on columns that we expect to be queried frequently. For example, we can set up a custom index on the last_modified_at column of the Blog table, knowing that we will display the most recently modified blog articles on the home page. This can be done with CREATE INDEX last_modified_idx ON Blog (last_modified_at). The DBMS then builds a B-tree based on the index for faster queries (other index types, like GiST for nearest-neighbor searches over 2D geometric datatypes and GIN for operations on arrays, can be selected depending on the use case).
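
For instance, a non-default index type can be selected with the USING clause (the tags array column on Blog is a hypothetical addition for illustration):

-- GIN index for containment queries on an array column.
CREATE INDEX tags_idx ON Blog USING gin (tags);

-- The planner can then use it for queries like:
SELECT * FROM Blog WHERE tags @> ARRAY['postgresql'];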

We can also create an index on multiple columns, which conceptually involves concatenating the attribute values. Multicolumn indexes can be well-suited for multicolumn queries and slightly reduce memory overhead compared to setting up multiple single-column indexes, though the PostgreSQL documentation generally recommends single-column indexes for robustness and simplicity. You can check the indexes created for a table with SELECT * FROM pg_indexes WHERE tablename = '<table_name>'; and drop an index with DROP INDEX <idx_name>;. Choosing appropriate indexes for the right columns and tables, and constructing statements that can use them, can result in faster performance with minimal memory overhead.
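
A brief sketch (author_id is a hypothetical column on the Blog table from above):

-- Multicolumn index; useful when queries filter on author_id
-- and also filter or sort on last_modified_at.
CREATE INDEX author_modified_idx ON Blog (author_id, last_modified_at);

-- Inspect and remove indexes (pg_indexes stores unquoted names in lowercase):
SELECT * FROM pg_indexes WHERE tablename = 'blog';
DROP INDEX author_modified_idx;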

Conclusion

In this article, we covered database normalization and indexing, which are important concepts for designing schemas with high data integrity and better performance. It's crucial to keep them in mind and apply them appropriately to the scenario when we design database systems and interact with them as users. In the next article, we'll move closer to the application side and discuss concepts around virtual schemas and users.
