DynamoDB: key technical concepts & features
This week I attended a knowledge-sharing session about DynamoDB. I knew the basic concepts, since I've used other NoSQL databases before, such as Redis, MongoDB, and HBase. But I was curious about the technical details of DynamoDB because it was new territory for me. So here's what I've learned:
- Dynamo emerged to the public in 2007 when Amazon engineers published the Dynamo paper, where they presented a solution to the ever-increasing demands on Amazon's key-value store. The paper was first mentioned on the blog of Werner Vogels, the CTO of Amazon.
- As is typical for NoSQL databases, Dynamo is a key-value store optimized for high read/write throughput and low latency. Dynamo is also fully managed by AWS: there are no servers to look after, and high resilience and auto-scaling come out of the box, so there's no need to provision for peak load. Dynamo is a public service, that is, it's not hosted within a VPC.
- Dynamo organizes data into tables. A table consists of items (rows) and attributes (columns). Key attributes support three types: String, Number, and Binary. When creating a table, we don't specify the schema; we specify only the primary key (see the next point).
- Each table has a primary key that uniquely identifies each item in the table. It consists of a mandatory partition key and an optional sort key. The partition key determines data distribution (see the next point) and is identified as `HASH` in code. The sort key enables richer querying and is identified as `RANGE` in code. A primary key consisting only of a partition key is called a simple primary key; a primary key consisting of both a partition key and a sort key is called a composite primary key. (Phew, this took me a while to grasp and explain.)
- Tables are internally stored in partitions. Which partition an item lands in is determined by the primary key, specifically by the partition key. Once a table is created, its primary key cannot be changed, and the partitions themselves are managed by Dynamo rather than by you.
- Data is read with two types of operations: Query and Scan. A query searches within a single partition; a scan reads all partitions. If speed matters, use queries.
- To minimize latency or cost, consider your query patterns, that is, how you read the data. Then align primary keys with those patterns. For query patterns that emerge later, consider secondary indexes.
- Dynamo charges by read capacity units (RCU) and write capacity units (WCU). A capacity unit puts a price on reading or writing a given amount of data per second. Capacity units are allocated in one of two modes: provisioned capacity or on-demand capacity. With provisioned capacity, the units are fixed: if you exceed them, requests are throttled, that is, Dynamo rejects them with an exception and a rejected write is not stored. There is no server-side queue for rejected requests or anything like that, so the client has to retry. With on-demand capacity, units scale with demand. So on-demand is handy, but it comes with a risk of high cost in case of an unexpected event (such as a DDoS attack).
- Dynamo supports TTLs. When an item's time to live expires, the item is deleted automatically (by a background process, so not necessarily at that exact moment). Handy for temporary data, such as sessions or event logs.
- Similar to RDS, Dynamo features backups, point-in-time recovery, and data encryption at rest.
- Dynamo supports an in-memory cache for improved performance, called DAX (DynamoDB Accelerator). Useful for read-intensive apps that require extremely low latency, such as trading or gaming.
- By default, Dynamo offers eventually consistent reads. It's possible to request a strongly consistent read, at the cost of slower reads and a higher price.
- Dynamo also supports event-driven integration via streams, where a change to an item invokes AWS Lambda or Kinesis Firehose. Streams are like triggers or stored procedures in an RDBMS, but disconnected from the table space. Here are posts with use-cases. Streams capture inserts, updates, and deletes.
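To make the points above about partition keys, sort keys, Query, and Scan more concrete, here's a toy in-memory model in Python. It is not how Dynamo is implemented: the `ToyTable` class, the md5-modulo hashing, and the `charger_id`, `start_date`, and `kwh` attributes are all my own assumptions for illustration. It only shows why a query touches one partition while a scan touches all of them.

```python
import hashlib
from collections import defaultdict


class ToyTable:
    """Toy model of a table with a composite primary key (illustrative only)."""

    def __init__(self, partition_key, sort_key, num_partitions=4):
        self.partition_key = partition_key
        self.sort_key = sort_key
        self.num_partitions = num_partitions
        # partition index -> {partition key value -> items sorted by sort key}
        self.partitions = defaultdict(lambda: defaultdict(list))

    def _partition_for(self, key_value):
        # Assumption: a stable hash modulo the partition count stands in
        # for Dynamo's internal hash function.
        digest = hashlib.md5(str(key_value).encode()).hexdigest()
        return int(digest, 16) % self.num_partitions

    def put_item(self, item):
        pk = item[self.partition_key]
        bucket = self.partitions[self._partition_for(pk)][pk]
        # An item with the same primary key replaces the existing one.
        bucket[:] = [i for i in bucket if i[self.sort_key] != item[self.sort_key]]
        bucket.append(item)
        bucket.sort(key=lambda i: i[self.sort_key])

    def query(self, pk_value, sort_from=None, sort_to=None):
        """Read a single partition, optionally narrowed by a sort key range."""
        partition = self.partitions.get(self._partition_for(pk_value), {})
        items = partition.get(pk_value, [])
        return [i for i in items
                if (sort_from is None or i[self.sort_key] >= sort_from)
                and (sort_to is None or i[self.sort_key] <= sort_to)]

    def scan(self, predicate=lambda item: True):
        """Read every item in every partition, then filter."""
        return [item
                for partition in self.partitions.values()
                for bucket in partition.values()
                for item in bucket
                if predicate(item)]


table = ToyTable(partition_key="charger_id", sort_key="start_date")
table.put_item({"charger_id": "A", "start_date": 100, "kwh": 7})
table.put_item({"charger_id": "A", "start_date": 200, "kwh": 9})
table.put_item({"charger_id": "B", "start_date": 150, "kwh": 3})

recent_a = table.query("A", sort_from=150)         # one partition only
big_sessions = table.scan(lambda i: i["kwh"] > 5)  # all partitions, then filter
```

The scan has to visit every item before filtering, which is why aligning the primary key with your query patterns matters so much.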
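The capacity unit arithmetic is easier to follow with numbers. As far as I understand the AWS documentation, one RCU covers one strongly consistent read per second of an item up to 4 KB (or two eventually consistent reads), and one WCU covers one write per second of an item up to 1 KB. The helper names below are hypothetical, and the sketch ignores transactional requests.

```python
import math

READ_UNIT_BYTES = 4 * 1024   # one RCU: a strongly consistent read of up to 4 KB
WRITE_UNIT_BYTES = 1024      # one WCU: a write of up to 1 KB


def read_capacity_units(item_bytes, reads_per_second, strongly_consistent=False):
    """Estimate the RCUs needed for a steady read rate."""
    units_per_read = max(1, math.ceil(item_bytes / READ_UNIT_BYTES))
    total = units_per_read * reads_per_second
    if strongly_consistent:
        return total
    # Eventually consistent reads cost half as much.
    return math.ceil(total / 2)


def write_capacity_units(item_bytes, writes_per_second):
    """Estimate the WCUs needed for a steady write rate."""
    return max(1, math.ceil(item_bytes / WRITE_UNIT_BYTES)) * writes_per_second


item_size = 6 * 1024  # a 6 KB item
strong = read_capacity_units(item_size, 10, strongly_consistent=True)
eventual = read_capacity_units(item_size, 10)
writes = write_capacity_units(item_size, 10)
```

Here a 6 KB item rounds up to two read units per read, so ten strongly consistent reads per second need 20 RCUs, while the same load with eventually consistent reads needs 10. Ten writes per second of that item need 60 WCUs.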
Thanks a lot to Evgeny for sharing most of these tips.
Side note on knowledge sharing sessions
I'm discovering more and more that I enjoy knowledge-sharing sessions. I learn a lot in a very short time (it's efficient), I can ask questions (it's interactive), and I can discuss the topic with others (it's collaborative). The same applies when I host a session: I can solidify my knowledge, discover gaps, and discuss the topic with others.
Here's an example of a Dynamo table provisioned via a CloudFormation Serverless template. The template provisions a table called `charge_sessions` with an attribute `start_date` set as a simple primary key. The table is configured with provisioned capacity units. The template also contains an example Lambda function that accesses the Dynamo table.
```yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  ChargeSessionsTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: 'charge_sessions'
      AttributeDefinitions:
        - AttributeName: 'start_date'
          AttributeType: 'N' # type Number
      KeySchema:
        - AttributeName: 'start_date'
          KeyType: 'HASH' # simple primary key
      ProvisionedThroughput:
        ReadCapacityUnits: 10
        WriteCapacityUnits: 10

  AddChargeSessionLambda:
    Type: AWS::Serverless::Function
    Properties:
      Environment:
        Variables:
          # !Ref on a table resolves to the table name
          CHARGE_SESSION_DDB_TABLE: !Ref ChargeSessionsTable
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref ChargeSessionsTable
      # CodeUri, etc...
```
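The template doesn't show the function code, so here's a minimal sketch of what the handler behind `AddChargeSessionLambda` might look like in Python. Only the `CHARGE_SESSION_DDB_TABLE` environment variable and the `start_date` key come from the template; the handler name, the event shape, and the `kwh` attribute are my assumptions. `boto3` is imported lazily so that the item-building logic can be tried out without AWS access.

```python
import os
import time


def build_item(event, now=None):
    """Build the item to store; start_date (a Number) is the partition key."""
    default_start = now if now is not None else time.time()
    item = {"start_date": int(event.get("start_date", default_start))}
    # Hypothetical non-key attribute: no schema is needed for these.
    # Note: boto3 expects Decimal rather than float for Number attributes.
    if "kwh" in event:
        item["kwh"] = event["kwh"]
    return item


def handler(event, context, table=None):
    """Lambda entry point; `table` can be injected for local testing."""
    if table is None:
        import boto3  # provided by the AWS Lambda Python runtime
        table_name = os.environ["CHARGE_SESSION_DDB_TABLE"]
        table = boto3.resource("dynamodb").Table(table_name)
    item = build_item(event)
    table.put_item(Item=item)
    return {"statusCode": 200, "start_date": item["start_date"]}
```

Because the table has a simple primary key, writing a second item with the same `start_date` overwrites the first one. If two sessions can start in the same second, a composite key would probably be a better fit.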