Database Basics #7

So far in this series, we have discussed the logical and virtual schema of relational DBMSs, specifically PostgreSQL, which we chose for its robustness and rich features. While a relational DBMS with SQL is often the safest choice, there is an alternative class of databases, called NoSQL, that drops the relational data model in order to better serve specific purposes. In this article, we will briefly cover four major types of NoSQL databases: key-value stores, wide-column databases, document-oriented databases, and graph databases.

Key-Value Stores

A key-value store is a database that stores data as simple key-value pairs, where the values can be data structures such as lists, sorted sets, and maps, which makes it possible to organize complex data flexibly. Thanks to its simplicity, flexibility, and indexing support, it is often used for caching, where data is temporarily kept in memory for fast responses, and sometimes as an in-memory database with persistence and high availability. Redis is the primary example of a key-value store: it supports both caching and persistent in-memory database use cases, and it is also offered as a cloud service.

However, using a key-value store as the primary database is less common because of its limited query capabilities and feature set, and migrating to another database later can be difficult. Hence, Redis is mostly used as a cache in front of a database such as PostgreSQL. The application requests data from the cache first and responds from the cache on a cache hit. On a cache miss, it sends the request to the database, stores the response in the cache, and returns it to the user. To keep the cache up-to-date and as lean as possible, we set expiration times on entries so that stale data is dropped, and we configure an eviction policy that removes the least frequently used (LFU) or least recently used (LRU) entries when memory runs short.
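The read path and the expiration and eviction rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: TTLCache is a hypothetical in-process stand-in for Redis, fetch_user and query_database are made-up names, and a plain dictionary plays the role of PostgreSQL.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Hypothetical in-process stand-in for a cache such as Redis."""

    def __init__(self, max_items, ttl_seconds, clock=time.monotonic):
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # key -> (value, expires_at), ordered from least to most recently used
        self._data = OrderedDict()

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]  # expired entry: treat it as a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl_seconds)
        self._data.move_to_end(key)
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict the least recently used entry


def fetch_user(user_id, cache, query_database):
    """Cache-aside read: try the cache, fall back to the database on a miss."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached, "hit"
    row = query_database(user_id)  # assumed to return the row for this id
    cache.set(key, row)
    return row, "miss"


# A dictionary stands in for the relational database in this sketch.
database = {1: {"name": "Ada"}, 2: {"name": "Linus"}}
cache = TTLCache(max_items=1, ttl_seconds=60)
print(fetch_user(1, cache, database.get))  # first call is a miss
print(fetch_user(1, cache, database.get))  # second call is a hit
```

With a real Redis instance, the expiration time and the eviction policy would normally be configured on the server or per key rather than written by hand. The sketch only shows the order of operations.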

Wide-Column Databases

A wide-column database is a column-oriented database that stores columns (attributes) as rows, making it easier and faster to query a subset of attributes across a large number of entities, a characteristic highly relevant for analytics. Since columns are organized as rows, they can be stored separately on disk and distributed across different servers, a process called sharding, which simplifies horizontal scaling when setting up distributed database systems. Wide-column databases are also suited to applications that need high write throughput and/or handle time-series data, such as gaming and e-commerce.
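To make the idea of storing attributes as rows more concrete, here is a minimal Python sketch. It is not how any particular wide-column database is implemented: the sample records, to_columns, and shard_for are all hypothetical, and hashing the key with CRC32 is just one simple way to assign data to a server.

```python
from zlib import crc32

# Row-oriented records, as a relational table would present them.
rows = [
    {"id": 1, "country": "JP", "amount": 120},
    {"id": 2, "country": "US", "amount": 80},
    {"id": 3, "country": "JP", "amount": 50},
]


def to_columns(records):
    """Pivot row-oriented records into one list per attribute."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def shard_for(key, shard_count):
    """Pick a shard by hashing the key (a simple, deterministic scheme)."""
    return crc32(str(key).encode("utf-8")) % shard_count


columns = to_columns(rows)

# An analytical query only has to read the one attribute it needs.
total_amount = sum(columns["amount"])
print(total_amount)  # 250

# Distribute the records across two hypothetical servers.
shards = {number: [] for number in range(2)}
for record in rows:
    shards[shard_for(record["id"], 2)].append(record)
```

Summing the amounts touches a single list instead of every full record, which is the property that makes this layout attractive for analytics.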

Apache Cassandra is the primary example of a free and open-source wide-column DBMS. It and other solutions like ScyllaDB can be used as a data warehouse, where data from multiple sources is stored in a central repository and organized into columns for business intelligence and analytics. (A data lake stores raw data, while a data warehouse structures the data.) However, a wide-column database is not a substitute for a relational database system: it has limited query capabilities and features, and it is less suited to modeling complex data structures and relationships.

Document-Oriented Databases

While key-value stores and wide-column databases are not primarily designed as direct alternatives to relational database systems (though Redis offers an in-memory database solution that can be used as a primary database), document-oriented databases, which store data as documents, are designed to be a direct alternative as general-purpose database systems. These documents are typically written in formats such as XML, YAML, and JSON, which are human-readable and map naturally onto the complex objects used by many client applications, leading to a smooth developer experience.

They also do not require schemas, offering flexibility that is well-suited to modern agile development and that simplifies complex queries in some cases. The primary example of a document-oriented database is MongoDB, which stores documents in BSON, a binary encoding of JSON-like documents that can be read back as JSON. It offers self-hosted and cloud solutions with sharding for horizontal scalability and various other useful features (optional schema validation, transactions, etc.). With the rising popularity of highly abstracted solutions that prioritize developer experience and fast development, MongoDB has been gaining popularity as well.
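As a rough illustration of what schema-free documents look like in practice, the following Python sketch keeps two differently shaped documents in the same collection and filters them by a nested field. It uses plain dictionaries rather than a MongoDB driver, and the users collection, get_path, and find are hypothetical names chosen for this example.

```python
import json

# Two documents in the same hypothetical "users" collection with different shapes.
users = [
    {"_id": 1, "name": "Ada", "address": {"city": "London"}, "tags": ["admin"]},
    {"_id": 2, "name": "Linus", "email": "linus@example.com"},
]


def get_path(document, dotted_path, default=None):
    """Follow a dotted path such as 'address.city' through nested dicts."""
    current = document
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


def find(collection, dotted_path, expected):
    """Return the documents whose value at dotted_path equals expected."""
    return [doc for doc in collection if get_path(doc, dotted_path) == expected]


londoners = find(users, "address.city", "London")
print(json.dumps(londoners[0]))  # the document serializes directly to JSON
```

The second document has no address at all, and nothing stops it from being stored. That is convenient early on, but every reader of the data has to handle the missing field.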

Though MongoDB and other document-oriented databases offer flexibility, which is especially beneficial in the early stages of a project, this comes with a tradeoff. The rigid schema of relational databases provides consistency and data integrity and reduces duplication, all of which become important in later stages. Hence, despite the scalability of document-oriented databases, developers might end up having to refactor applications and migrate to a relational database, which can be costly in the long run. Therefore, it's important to analyze whether a document-oriented database is appropriate and to compare costs for the use case at hand. (Side note: PostgreSQL supports JSON column types, allowing us to use it partly as a document-oriented database.)

Graph Databases

The relational data model is not well-suited for dealing with graph data with complex relationships, and the demand for storing graphs (social networks, geospatial graphs, etc.) has been rising quickly as graph algorithms and graph machine learning enable more sophisticated operations and analysis (recommender systems, anomaly detection, etc.). You can check the GL series on this blog to explore GNNs and their capabilities in various graph-related tasks.

In response to this demand, graph databases with a data model based on nodes and edges and a dedicated graph query language are emerging. Neo4j is the primary example (Neo4j created a graph query language called Cypher, which allows us to easily query graph data like MATCH (p:Person)-[:LIVES_IN]->(c:City) RETURN p.first_name, c.name). Beyond the obvious benefits for analytics and the intuitive querying experience, graph databases do not require us to perform table joins for every query that involves a relationship, unlike a relational DBMS. They allow us to reach nodes (even those multiple hops away) by simply following edges with minimal overhead.
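The idea of following edges instead of joining tables can be sketched with a small adjacency list in Python. This is a toy model, not how Neo4j stores or executes anything: the nodes, the KNOWS edge, and the helper names lives_in_pairs and within_hops are all made up for illustration.

```python
from collections import deque

# Nodes carry a label and properties; edges are directed and typed.
nodes = {
    "alice": {"label": "Person", "first_name": "Alice"},
    "bob": {"label": "Person", "first_name": "Bob"},
    "tokyo": {"label": "City", "name": "Tokyo"},
}
edges = {
    "alice": [("LIVES_IN", "tokyo"), ("KNOWS", "bob")],
    "bob": [("LIVES_IN", "tokyo")],
    "tokyo": [],
}


def lives_in_pairs():
    """Rough equivalent of the LIVES_IN pattern in the Cypher query above."""
    result = []
    for source, outgoing in edges.items():
        for edge_type, target in outgoing:
            if (edge_type == "LIVES_IN"
                    and nodes[source]["label"] == "Person"
                    and nodes[target]["label"] == "City"):
                result.append((nodes[source]["first_name"], nodes[target]["name"]))
    return result


def within_hops(start, max_hops):
    """Breadth-first search: nodes reachable in at most max_hops edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for _, target in edges[current]:
            if target not in distance:
                distance[target] = distance[current] + 1
                queue.append(target)
    del distance[start]
    return distance


print(lives_in_pairs())
print(within_hops("alice", 1))
```

Each hop is a lookup in the adjacency list of the current node, so a multi-hop query grows with the number of edges actually visited rather than with the size of whole tables.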

Conclusion

In this article, we covered four types of NoSQL databases: key-value stores, wide-column databases, document-oriented databases, and graph databases. All of them serve different purposes and can be used alongside relational databases (though using a document-oriented database together with a relational database is relatively rare). Aside from these, there are other NoSQL databases like vector stores, which store data with a vector index for performing similarity searches (relevant for recommender systems and RAG). Therefore, I recommend always analyzing the benefits and drawbacks of the various database types and their implementations to choose the best one for each use case.

Resources

McQuillan, R. 2023. What is a Wide-Column Database?. Budibase.
MongoDB. n.d. What is a Key Value Database?. MongoDB.
MongoDB. n.d. Why Use MongoDB and When to Use It?. MongoDB.
Neo4j. 2021. What is a graph database? (in 10 minutes). YouTube.
ScyllaDB. n.d. Wide-column Database. ScyllaDB.