AWS DynamoDB vs AWS RDS (SQL vs NoSQL)?
This post is a draft.
I have noticed a frequent crossroads in AWS land: store data in DynamoDB (a fully managed key-value store) or in one of the relational engines offered by RDS (a managed relational database service). The comparisons I have found focus on technical details, but I was interested in a higher-level, use-case-driven comparison. Here is what I have gathered so far:
Design principles. Dynamo was designed for massive scale, massive throughput, low latency, and key-value access. A Dynamo database consists of a table partitioned by a primary key. The RDBMSs offered by RDS were designed to provide transactional OLTP operations on structured data.
Query patterns. The advice for Dynamo goes: "You shouldn't start designing your schema for DynamoDB until you know the questions it will need to answer." When query access patterns are known, Dynamo shines. When query access patterns are unknown, RDS shines, and Dynamo struggles.
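To make "designing for the questions" concrete, here is a minimal sketch of a key design for one hypothetical access pattern: "fetch all orders for a customer". The table layout, key format, and attribute names are assumptions for illustration, not a prescription.

```python
# Hypothetical single-table key design for the access pattern
# "fetch all orders for a customer". Key formats are assumptions.

def customer_pk(customer_id: str) -> str:
    """Partition key: groups every item belonging to one customer."""
    return f"CUSTOMER#{customer_id}"

def order_sk(order_date: str, order_id: str) -> str:
    """Sort key: orders items chronologically within the partition."""
    return f"ORDER#{order_date}#{order_id}"

# The access pattern maps directly to a single Query, roughly:
#   Key('pk').eq(customer_pk('42')) & Key('sk').begins_with('ORDER#')
# (boto3 key-condition syntax shown as a comment to keep this runnable offline)

print(customer_pk("42"))            # CUSTOMER#42
print(order_sk("2024-05-01", "7"))  # ORDER#2024-05-01#7
```

An access pattern the keys weren't designed for (say, "all orders above $100 across customers") would require a scan or a new index, which is exactly where Dynamo struggles and RDS does not.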
Joins. Dynamo doesn't support joins out-of-the-box; joins need to be implemented by denormalizing data. RDS supports joins out-of-the-box. The downside of RDS is that joins can become expensive as the dataset scales.
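Here is a sketch of the same "order with its line items" read in both models; the table, column, and attribute names are hypothetical.

```python
# RDS: normalized tables, the join happens at query time.
RDS_QUERY = """
SELECT o.id, o.placed_at, li.sku, li.quantity
FROM orders o
JOIN line_items li ON li.order_id = o.id
WHERE o.id = %s;
"""

# Dynamo: no joins, so the line items are denormalized into the
# order item itself and the whole aggregate is fetched in one read.
dynamo_order_item = {
    "pk": "ORDER#7",
    "placed_at": "2024-05-01",
    "line_items": [  # embedded copy; duplicated if needed elsewhere
        {"sku": "A-100", "quantity": 2},
        {"sku": "B-200", "quantity": 1},
    ],
}

# The application reads the aggregate without any join:
skus = [li["sku"] for li in dynamo_order_item["line_items"]]
print(skus)  # ['A-100', 'B-200']
```

The denormalized copy makes the read cheap, but the application now owns the job of keeping duplicated data consistent on writes.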
Scalability & latency. Dynamo offers massive scalability out-of-the-box and single-digit millisecond latency, easily supporting 100k reads per second. RDS can be scaled up and out, but the scalability of Dynamo is out of reach for RDS.
Database migrations. RDS supports database migrations with mature, battle-tested tools such as Flyway. Migrations let you version-control the database schema and evolve the data structures. Migrating data in Dynamo is currently left to the developer (at least I haven't found an out-of-the-box, battle-tested tool that provides data migrations for Dynamo; if you know of any, please let me know). So if the shape of the data is expected to evolve, there's a beaten path on the RDS turf.
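What a hand-rolled Dynamo migration looks like in practice: the sketch below splits a legacy `name` attribute into `first_name`/`last_name`. In production this transform would run inside a paginated Scan + UpdateItem loop via boto3; here it is a pure function so the logic is testable offline. The attribute names are assumptions.

```python
def migrate_customer(item: dict) -> dict:
    """Return the item in the new shape; idempotent on migrated items."""
    if "name" not in item:
        return item  # already migrated
    first, _, last = item["name"].partition(" ")
    migrated = {k: v for k, v in item.items() if k != "name"}
    migrated["first_name"] = first
    migrated["last_name"] = last
    return migrated

old = {"pk": "CUSTOMER#42", "name": "Ada Lovelace"}
print(migrate_customer(old))
# {'pk': 'CUSTOMER#42', 'first_name': 'Ada', 'last_name': 'Lovelace'}
```

Note what the developer now owns that Flyway would handle for free: tracking which items are migrated, making the transform idempotent, and re-running safely after a partial failure.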
Data validation. Dynamo doesn't support data validation out-of-the-box; validation is left to the application. Dynamo supports a limited set of types: strings, numbers, binary, booleans, and collections of these (lists, maps, and sets). RDS comes with richer types, such as enums and timestamps. RDS also supports validation checks on data entry via constraints. In my experience, loose constraints (exemplified by the 57 ways Philadelphia is spelled in PPP loan data) are one of the root causes of bugs and confusion in business.
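With Dynamo, the equivalent of an RDS CHECK constraint or enum has to live in application code. A minimal sketch, with a hypothetical allowed-values set and attribute names:

```python
# Application-side stand-in for an RDS enum/CHECK constraint:
# reject unknown city spellings before the item is written.
ALLOWED_CITIES = {"Philadelphia", "Pittsburgh"}

def validate_loan(item: dict) -> dict:
    city = item.get("city", "")
    if city not in ALLOWED_CITIES:
        raise ValueError(f"unknown city spelling: {city!r}")
    return item

validate_loan({"pk": "LOAN#1", "city": "Philadelphia"})  # passes
try:
    validate_loan({"pk": "LOAN#2", "city": "Philadephia"})  # typo
except ValueError as e:
    print(e)  # unknown city spelling: 'Philadephia'
```

The catch is that every write path (every service, script, and ad-hoc fix) must go through this check, whereas a database constraint is enforced regardless of who writes.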
Views & aggregations. RDS supports views and materialized views out-of-the-box. However, as data scales, refreshing a materialized view can be costly. In Dynamo, views need to be implemented on the application side, for example via DynamoDB Streams. In RDS, creating and modifying views is easy; in Dynamo, it is costly. Once a view is in place in Dynamo, though, the cost of keeping it fresh is negligible.
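A sketch of such an application-maintained "view": folding stream records into a running order count per customer. The record shape loosely mirrors the DynamoDB Streams event format; the handler wiring (Lambda, checkpointing, error handling) is omitted, and the key format is a hypothetical one.

```python
def apply_stream_records(view: dict, records: list) -> dict:
    """Fold stream records into a {customer_id: order_count} view."""
    for record in records:
        if record["eventName"] != "INSERT":
            continue  # this toy view only counts new orders
        pk = record["dynamodb"]["Keys"]["pk"]["S"]
        customer_id = pk.removeprefix("CUSTOMER#")
        view[customer_id] = view.get(customer_id, 0) + 1
    return view

records = [
    {"eventName": "INSERT", "dynamodb": {"Keys": {"pk": {"S": "CUSTOMER#42"}}}},
    {"eventName": "INSERT", "dynamodb": {"Keys": {"pk": {"S": "CUSTOMER#42"}}}},
    {"eventName": "MODIFY", "dynamodb": {"Keys": {"pk": {"S": "CUSTOMER#42"}}}},
]
print(apply_stream_records({}, records))  # {'42': 2}
```

Each incoming record touches the view incrementally, which is why the refresh cost stays negligible once the plumbing exists; the up-front cost is writing and operating that plumbing.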
Analytics, OLAP & Data Warehouses. In my experience, integrating a BI tool with an RDS instance is straightforward; integrating a BI tool with Dynamo can be tricky. However, a BI tool can be connected to a data warehouse, such as AWS Redshift or Google BigQuery. In this case, there are ETL SaaS solutions that load the data from RDS and Dynamo into the data warehouse.
Security, backups, maintenance. Both services offer backups, point-in-time recovery, and encryption. Both services are managed: Dynamo provisioning is no-ops, while RDS might require tweaking as data scales.
Boring technology. Adding new technology to the stack is costly. The technologies already in use and the team's experience need to be considered.
Company maturity. If a company is in the early stage or experimenting with product-market fit, data access patterns might still be in flux. For this company, RDS offers the flexibility to explore the problem space. On the other hand, a mature company struggling with scaling issues is in a different position, and Dynamo could offer a better fit.
Data lifespan. Data can be short-lived and temporary, such as customer session data valid for seven days. Or data can be long-lived and persistent, such as customer details. Changing the data schema in Dynamo can be done by supporting two code paths within the application. For short-lived data, this can be fine; for long-lived data, this can become a nightmare. In contrast, changing the data schema in RDS can be done with migrations and a single code path. But at massive scale, changing the schema in RDS can be costly.
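The "two code paths" look like this in practice: a reader that accepts both the old and the new item shape while old items are still in the table. The attribute names are hypothetical; after an RDS migration has rewritten every row, the equivalent reader needs only the second branch.

```python
def full_name(item: dict) -> str:
    """Read a customer name from either the old or the new item shape."""
    if "name" in item:  # old shape, still present in the table
        return item["name"]
    # new shape, written after the schema change
    return f"{item['first_name']} {item['last_name']}"

print(full_name({"name": "Ada Lovelace"}))                        # Ada Lovelace
print(full_name({"first_name": "Ada", "last_name": "Lovelace"}))  # Ada Lovelace
```

For seven-day session data the old branch can simply be deleted after a week; for customer details it lingers indefinitely, and each subsequent schema change multiplies the branches.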
Some final thoughts
Of the systems I have worked on, most didn't require massive scale or massive throughput. These include data-science/machine-learning applications in an airline company, business-line applications for small and medium businesses, and aircraft/IoT/device monitoring systems. Most of the data for these systems would, in fact, fit on a single unoptimized machine.
However, I have extensively witnessed another problem: loose data constraints leading to poor data quality, as the Philadelphia example shows. Poor data quality, in its many incarnations, further leads to a proliferation of bugs, unexpected behavior, and confusion.
Opinionated exec summary
So if I were asked for advice, I'd position RDS, with a focus on data quality, as the default option. And I'd position Dynamo as an alternative contender that needs to be well justified to be selected over the default RDS option.
Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we think about the problem that we are solving.
—Martin Kleppmann, Designing Data-Intensive Applications
Thanks to Peter, Bram and Daan for reading this and making the post better.