Deep Dive on Amazon DynamoDB

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Deep Dive: Amazon DynamoDB
Sean Shriver
NoSQL Solutions Architect
Amazon Web Services

Relational (SQL) vs.
nonrelational (NoSQL)

Relational vs. nonrelational databases
[Diagram: a traditional SQL database scales up on a primary/secondary DB pair; a NoSQL database scales out across many DB nodes.]

SQL (Relational)
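[Diagram: a Products table stored as rows and columns with a primary key index. Columns: Product ID, Type, Price, Desc. Rows: 1 Book, 2 Album, 3 Movie. Prices shown: $11.50, $8.99, $14.95. Descriptions shown: "One of 2 major …", "The Partitas", "Chaplin’s first …".]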

SQL (Relational)
[Diagram: the same Products table (Product ID, Type, Price, Desc.), normalized so that each product row references a per-type table.]
• Books (Book ID, Title, Author, Date): 1, Odyssey, Homer, 1871
• Albums (Album ID, Title, Artist): 2, 6 Partitas, Bach
• Movies (Movie ID, Title, Genre, Director): 3, The Kid, "Drama, Comedy", Chaplin

SQL (Relational) vs. NoSQL (Non-relational)
[Diagram: the normalized SQL tables (Products, Books, Albums, Movies) shown next to a single NoSQL Products table.]
NoSQL Products table: the primary key is a partition key (Product ID) plus a sort key (Type); everything else is an attribute.
Schema is defined per item
Items:
• 1, Book ID: Odyssey, Homer, 1871
• 2, Album ID: 6 Partitas, Bach
• 2, Album ID:Track ID: Partita No. 1
• 3, Movie ID: The Kid, "Drama, Comedy", Chaplin

Why NoSQL?
SQL vs. NoSQL
• SQL is optimized for storage; NoSQL is optimized for compute
• SQL is normalized/relational; NoSQL is denormalized/hierarchical
• SQL serves ad hoc queries; NoSQL serves instantiated views
• SQL scales vertically; NoSQL scales horizontally
• SQL is good for OLAP; NoSQL is built for OLTP at scale

Agenda
Tables, API, data types, indexes
Replication
Data modeling [1:1, 1:M]
DynamoDB Streams
Use cases
Reference architecture

Amazon DynamoDB
Managed NoSQL database service
Simple and powerful API
Supports both document and key-value data models
Highly scalable
Consistent, single-digit millisecond latency at any scale
Highly available—3x replication

Table
A table contains items; each item is a set of attributes.
Partition key (mandatory)
• Key-value access pattern
• Determines data distribution
Sort key (optional)
• Model 1:N relationships
• Enables rich query capabilities:
  • All items for a partition key
  • ==, <, >, >=, <=
  • “begins with”
  • “between”
  • Sorted results
  • Counts
  • Top/bottom N values
  • Paged responses

APIs
DynamoDB table API: CreateTable, UpdateTable, DeleteTable, DescribeTable, ListTables, UpdateTimeToLive, DescribeTimeToLive
DynamoDB item API (reads): GetItem, Query, Scan, BatchGetItem
DynamoDB item API (writes): PutItem, UpdateItem, DeleteItem, BatchWriteItem
Stream API: ListStreams, DescribeStream, GetShardIterator, GetRecords

Data types
String (S)
Number (N)
Binary (B)
String Set (SS)
Number Set (NS)
Binary Set (BS)
Map (M)
List (L)
Boolean (BOOL)
Null (NULL)
Used for storing nested JSON documents

Partition table
Partition key uniquely identifies an item
Partition key is used for building an unordered hash index
Table can be partitioned for scale
[Diagram: the hashed key space runs from 00 to FF and is split into the ranges 00 to 54, 55 to A9, and AA to FF.]
• Id = 1, Name = Jim: Hash (1) = 7B
• Id = 2, Name = Andy, Dept = Engg: Hash (2) = 48
• Id = 3, Name = Kim, Dept = Ops: Hash (3) = CD
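A minimal sketch of the key-space lookup on this slide, assuming the three ranges 00 to 54, 55 to A9, and AA to FF shown in the diagram. DynamoDB's internal hash function is not described in the deck, so `hash_byte` uses MD5 purely as a stand-in; the function names are hypothetical.

```python
import hashlib

# Inclusive upper bounds of the three key-space ranges from the slide:
# 00-54, 55-A9, AA-FF.
PARTITION_UPPER_BOUNDS = [0x54, 0xA9, 0xFF]


def hash_byte(partition_key: str) -> int:
    """Map a partition key to one byte (00-FF). MD5 is only a stand-in hash."""
    return hashlib.md5(partition_key.encode("utf-8")).digest()[0]


def partition_for_hash(h: int) -> int:
    """Return the 1-based partition whose range contains hash value h."""
    for index, upper in enumerate(PARTITION_UPPER_BOUNDS, start=1):
        if 0 <= h <= upper:
            return index
    raise ValueError(f"hash {h:#x} is outside the 00-FF key space")


# Hash values taken from the slide: Hash(1) = 7B, Hash(2) = 48, Hash(3) = CD.
for item_id, h in [(1, 0x7B), (2, 0x48), (3, 0xCD)]:
    print(f"Id = {item_id}: hash {h:02X} -> partition {partition_for_hash(h)}")
```

Because the hash is unordered, neighbouring Id values land in unrelated partitions, which is what spreads load across the table.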

Partitions are three-way replicated
[Diagram: Partition 1, Partition 2, through Partition N are each stored on Replica 1, Replica 2, and Replica 3, so the items (Id = 1, Name = Jim), (Id = 2, Name = Andy, Dept = Engg), and (Id = 3, Name = Kim, Dept = Ops) each exist in three copies.]

Partition-sort key table
Partition key and sort key together uniquely identify an Item
Within unordered partition key-space, data is sorted by the sort key
No limit on the number of items (∞) per partition key
• Except if you have local secondary indexes
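[Diagram: the key space is split into Partition 1 (00:0 to 54:∞), Partition 2 (55:0 to A9:∞), and Partition 3 (AA:0 to FF:∞); within each partition key, items are sorted by Order#.]
• Partition 1, Hash (2) = 48: Customer# = 2, Order# = 10, Item = Pen; Customer# = 2, Order# = 11, Item = Shoes
• Partition 2, Hash (1) = 7B: Customer# = 1, Order# = 10, Item = Toy; Customer# = 1, Order# = 11, Item = Boots
• Partition 3, Hash (3) = CD: Customer# = 3, Order# = 10, Item = Book; Customer# = 3, Order# = 11, Item = Paper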

Global secondary index (GSI)
Alternate partition (+sort) key
Index is across all table partition keys
RCU/WCU provisioned separately for GSIs
Online indexing
[Diagram: a table keyed on A1 (partition) with attributes A2, A3, A4, A5, and three GSIs, one per projection type.]
• KEYS_ONLY: A2 (partition), A1 (table key)
• INCLUDE A3: A5 (partition), A4 (sort), A1 (table key), A3 (projected)
• ALL: A4 (partition), A5 (sort), A1 (table key), A2 (projected), A3 (projected)

How do GSI updates work?
[Diagram: a client writes to a primary table partition; the global secondary index is then brought up to date by step 3, an asynchronous update (shown in progress).]
If GSIs don’t have enough write capacity, table writes will be throttled!

Local secondary index (LSI)
Alternate sort key attribute
Index is local to a partition key
[Diagram: a table keyed on A1 (partition) and A2 (sort) with attributes A3, A4, A5, and three LSIs, one per projection type.]
• KEYS_ONLY: A1 (partition), A3 (sort), A2 (table key)
• INCLUDE A3: A1 (partition), A4 (sort), A2 (table key), A3 (projected)
• ALL: A1 (partition), A5 (sort), A2 (table key), A3 (projected), A4 (projected)
10 GB max per partition key, therefore LSIs limit the # of sort keys!

Scaling: Throughput
Provisioned at the table level
Read and write throughput limits are independent
Tables re-partition according to provisioned throughput
• Write capacity units (WCUs): one WCU is one write per second for an item up to 1 KB
• Read capacity units (RCUs): one RCU is one read per second for an item up to 4 KB
• RCUs measure strongly consistent reads
• Eventually consistent reads cost 1/2 of strongly consistent reads
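The unit sizes above can be turned into a small calculator. This is a sketch under the stated sizes (1 KB per WCU, 4 KB per RCU, eventually consistent reads at half cost); the function names are hypothetical, and item sizes are rounded up to the next whole unit.

```python
import math

WCU_SIZE_BYTES = 1024      # one WCU covers a write of up to 1 KB
RCU_SIZE_BYTES = 4 * 1024  # one RCU covers a read of up to 4 KB


def write_capacity_units(item_size_bytes: int) -> int:
    """WCUs consumed by one write of an item of the given size."""
    return max(1, math.ceil(item_size_bytes / WCU_SIZE_BYTES))


def read_capacity_units(item_size_bytes: int, strongly_consistent: bool = True) -> float:
    """RCUs consumed by one read; eventually consistent reads cost half."""
    units = max(1, math.ceil(item_size_bytes / RCU_SIZE_BYTES))
    return float(units) if strongly_consistent else units / 2


print(write_capacity_units(2560))                            # 2.5 KB item
print(read_capacity_units(6144))                             # 6 KB item, strongly consistent
print(read_capacity_units(6144, strongly_consistent=False))  # same item, eventually consistent
```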

Scaling: Disk Size
Virtually limitless storage
DynamoDB automatically splits partitions after 10GB
Items are limited to 400KB in size

Partitioning math
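$$\#\,\text{of partitions}_{(\text{for size})} = \left\lceil \frac{\text{table size in GB}}{10\ \text{GB}} \right\rceil$$
$$\#\,\text{of partitions}_{(\text{for throughput})} = \left\lceil \frac{\text{RCU}_{\text{for reads}}}{3000\ \text{RCU}} + \frac{\text{WCU}_{\text{for writes}}}{1000\ \text{WCU}} \right\rceil$$
$$\#\,\text{of partitions}_{(\text{total})} = \max\left(\#\,\text{of partitions}_{(\text{for throughput})},\ \#\,\text{of partitions}_{(\text{for size})}\right)$$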
In the future, these details might change…
Not used until table is created

Partitioning example
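Table size = 8 GB, RCUs = 5000, WCUs = 500
$$\#\,\text{of partitions}_{(\text{for size})} = \left\lceil \frac{8\ \text{GB}}{10\ \text{GB}} \right\rceil = \lceil 0.8 \rceil = 1$$
$$\#\,\text{of partitions}_{(\text{for throughput})} = \left\lceil \frac{5000\ \text{RCU}}{3000\ \text{RCU}} + \frac{500\ \text{WCU}}{1000\ \text{WCU}} \right\rceil = \lceil 2.17 \rceil = 3$$
$$\#\,\text{of partitions}_{(\text{total})} = \max(3,\ 1) = 3$$
RCUs per partition = 5000/3 = 1666.67
WCUs per partition = 500/3 = 166.67
Data/partition = 8/3 = 2.67 GB
RCUs and WCUs are uniformly spread across partitions on table creation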
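The arithmetic in this example can be checked with a few lines of Python. This is a sketch of the slide's rule of thumb only (10 GB, 3000 RCU, and 1000 WCU per partition, rounded up, then the larger of the two counts); the function names are hypothetical and the deck itself notes that these details might change.

```python
import math


def partitions_for_size(table_size_gb: float) -> int:
    """Partitions needed to hold the data at 10 GB per partition."""
    return max(1, math.ceil(table_size_gb / 10))


def partitions_for_throughput(rcus: int, wcus: int) -> int:
    """Partitions needed at 3000 RCU and 1000 WCU per partition."""
    return max(1, math.ceil(rcus / 3000 + wcus / 1000))


def total_partitions(table_size_gb: float, rcus: int, wcus: int) -> int:
    """The larger of the size-driven and throughput-driven counts."""
    return max(partitions_for_size(table_size_gb),
               partitions_for_throughput(rcus, wcus))


# The example from the slide: 8 GB, 5000 RCUs, 500 WCUs.
n = total_partitions(8, 5000, 500)
print("partitions:", n)
print("RCUs per partition:", round(5000 / n, 2))
print("WCUs per partition:", round(500 / n, 2))
print("GB per partition:", round(8 / n, 2))
```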

Getting the most out of DynamoDB throughput
“To get the most out of DynamoDB throughput, create tables where the partition key has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.” (DynamoDB Developer Guide)
1. Space
• Key choice: high key cardinality
• Uniform access: access is evenly spread over the key-space
2. Time: requests arrive evenly spaced in time

Example: Key Choice or Uniform Access
[Heat map: partitions plotted against time, shaded by heat.]

How does DynamoDB handle bursts?
DynamoDB saves 300 seconds of unused capacity per
partition
Bursting is best effort!

Burst capacity is built-in
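[Chart: capacity units (0 to 1600) over time, provisioned vs. consumed. Unused capacity is “saved up” while consumption is below the provisioned level, and the saved-up capacity is consumed when traffic rises above it.]
Burst: 300 seconds (1200 × 300 = 360k CU)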

Burst capacity may not be sufficient
[Chart: capacity units (0 to 1600) over time, provisioned vs. consumed vs. attempted. Attempted requests that exceed the provisioned and saved-up capacity are throttled.]
Don’t completely depend on burst capacity… provision sufficient throughput
Burst: 300 seconds (1200 × 300 = 360k CU)
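The save-up and consume behaviour on these two slides resembles a capped bucket. The sketch below is a simplified model, not DynamoDB's actual mechanism: it assumes one bucket, one-second ticks, and that the whole saved balance can be spent in a single second. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

BURST_WINDOW_SECONDS = 300  # from the slide: 300 seconds of unused capacity


@dataclass
class BurstBucket:
    provisioned: int  # capacity units per second
    saved: int = 0    # unused capacity currently banked

    def tick(self, attempted: int) -> Tuple[int, int]:
        """Process one second of traffic; return (consumed, throttled)."""
        available = self.provisioned + self.saved
        consumed = min(attempted, available)
        throttled = attempted - consumed
        # Whatever was not used is banked, up to the 300-second cap.
        cap = self.provisioned * BURST_WINDOW_SECONDS
        self.saved = min(available - consumed, cap)
        return consumed, throttled


bucket = BurstBucket(provisioned=1200)
for _ in range(300):          # an idle period fills the bucket
    bucket.tick(attempted=0)
print(bucket.saved)           # 1200 x 300 capacity units banked
print(bucket.tick(attempted=1600))                         # burst served from savings
print(BurstBucket(provisioned=1200).tick(attempted=1600))  # no savings: excess is throttled
```

The last line is the situation in the second chart: with nothing saved up, anything above the provisioned level is throttled.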

What causes throttling?
Non-uniform workloads
• Hot keys/hot partitions
• Large collection sizes [100s of MBs under one partition key]
Dilution of throughput across partitions caused by mixing hot data with cold data
• Use a table per time period for storing time series data so WCUs and RCUs are applied to the hot data set

Data Modeling
Store data based on how you will access it!

1:1 relationships or key-values
Use a table or GSI with a partition key
Use GetItem or BatchGetItem API
Example: Given a user or email, get attributes
Users Table
Partition key Attributes
UserId = bob Email = bob@gmail.com, JoinDate = 2016-11-15
UserId = fred Email = fred@yahoo.com, JoinDate = 2016-12-01
Users-Email-GSI
Partition key Attributes
Email = bob@gmail.com UserId = bob, JoinDate = 2016-11-15
Email = fred@yahoo.com UserId = fred, JoinDate = 2016-12-01

1:N relationships or parent-children
Use a table or GSI with partition and sort key
Use Query API
Example: Given a device, find all readings
between epoch X, Y
Device-measurements
Part. Key Sort key Attributes
DeviceId = 1 epoch = 5513A97C Temperature = 30, pressure = 90
DeviceId = 1 epoch = 5513A9DB Temperature = 30, pressure = 90
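The device-readings example maps onto a single Query with a key condition on both keys. The sketch below only builds the request parameters as a plain dictionary, so it runs without AWS access. It assumes that DeviceId and epoch are stored as numbers and that the table is named Device-measurements as on the slide; the function name is hypothetical.

```python
def build_readings_query(device_id: int, epoch_from: int, epoch_to: int) -> dict:
    """Build Query parameters for all readings of one device between two epochs."""
    return {
        "TableName": "Device-measurements",
        # Equality on the partition key, range on the sort key.
        "KeyConditionExpression": "#d = :device AND #e BETWEEN :start AND :end",
        "ExpressionAttributeNames": {"#d": "DeviceId", "#e": "epoch"},
        "ExpressionAttributeValues": {
            ":device": {"N": str(device_id)},
            ":start": {"N": str(epoch_from)},
            ":end": {"N": str(epoch_to)},
        },
    }


# The two epochs from the slide, written there in hexadecimal.
params = build_readings_query(1, int("5513A97C", 16), int("5513A9DB", 16))
print(params["KeyConditionExpression"])
```

With credentials configured, a dictionary like this could be passed to a low-level client as `client.query(**params)`.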

N:M relationships
Use a table and GSI with partition and sort key elements
switched
Use Query API
Example: Given a user, find all games. Or given a game,
find all users.
User-Games-Table
Part. Key Sort key
UserId = bob GameId = Game1
UserId = fred GameId = Game2
UserId = bob GameId = Game3
Game-Users-GSI
Part. Key Sort key
GameId = Game1 UserId = bob
GameId = Game2 UserId = fred
GameId = Game3 UserId = bob

Rich expressions
Projection expression
• Query/Get/Scan: ProductReviews.FiveStar[0]
Filter expression
• Query/Scan: #V > :num (#V is a placeholder for the keyword VIEWS)
Conditional expression
• Put/Update/DeleteItem: attribute_not_exists (#pr.FiveStar)
Update expression
• UpdateItem: set Replies = Replies + :num
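To show where each expression goes in a request, the sketch below builds Scan and UpdateItem parameters as plain dictionaries using the expressions from this slide. The table name `Thread`, the key attribute `Id`, and the function names are assumptions for illustration; only the parameter names are part of the DynamoDB API.

```python
def build_popular_scan(min_views: int) -> dict:
    """Scan parameters combining a projection and a filter expression."""
    return {
        "TableName": "Thread",  # hypothetical table name
        "ProjectionExpression": "ProductReviews.FiveStar[0]",
        "FilterExpression": "#V > :num",
        "ExpressionAttributeNames": {"#V": "VIEWS"},  # placeholder for a keyword
        "ExpressionAttributeValues": {":num": {"N": str(min_views)}},
    }


def build_reply_increment(item_id: str, increment: int = 1) -> dict:
    """UpdateItem parameters combining an update and a conditional expression."""
    return {
        "TableName": "Thread",            # hypothetical table name
        "Key": {"Id": {"S": item_id}},    # hypothetical key attribute
        "UpdateExpression": "SET Replies = Replies + :num",
        "ConditionExpression": "attribute_not_exists(#pr.FiveStar)",
        "ExpressionAttributeNames": {"#pr": "ProductReviews"},
        "ExpressionAttributeValues": {":num": {"N": str(increment)}},
    }


print(build_popular_scan(100)["FilterExpression"])
print(build_reply_increment("thread-1")["UpdateExpression"])
```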

DynamoDB Streams
Ordered stream of item changes
Exactly once, strictly ordered by key
Highly durable, scalable
24 hour retention
Sub-second latency
Compatible with Kinesis Client Library
Shards have a lineage and automatically close after time or when the associated DynamoDB partition splits
[Diagram: updates to an Amazon DynamoDB table (Partitions A, B, C) flow into the shards of its DynamoDB stream; KCL workers in an Amazon Kinesis Client Library application read the shards with GetRecords.]

Event Logging
Storing time series data

Time series tables
Current table (hot data)
• Events_table_2016_April: Event_id (partition key), Timestamp (sort key), Attribute1 …. Attribute N; RCUs = 10000, WCUs = 10000
Older tables (cold data)
• Events_table_2016_March: Event_id (partition key), Timestamp (sort key); RCUs = 1000, WCUs = 100
• Events_table_2016_February: Event_id (partition key), Timestamp (sort key); RCUs = 100, WCUs = 1
• Events_table_2016_January: Event_id (partition key), Timestamp (sort key); RCUs = 10, WCUs = 1
Don’t mix hot and cold data; archive cold data to Amazon S3

DynamoDB TTL
Use DynamoDB TTL and Streams to archive
Current table (hot data)
• Events_table_2016_April: Event_id (partition key), Timestamp (sort key), myTTL = 1489188093, …. Attribute N; RCUs = 10000, WCUs = 10000
Archive table (cold data)
• Events_Archive: Event_id (partition key), Timestamp (sort key); RCUs = 100, WCUs = 1
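The myTTL attribute on this slide holds an expiry time in epoch seconds. A minimal sketch of how such a value could be computed when an item is written follows; the 30-day retention period and the function names are assumptions, and the expiry check only mimics what the service does on its own schedule.

```python
from datetime import datetime, timedelta, timezone


def ttl_epoch(created_at: datetime, retention: timedelta) -> int:
    """Epoch-seconds value to store in the TTL attribute."""
    return int((created_at + retention).timestamp())


def is_expired(item: dict, now: datetime, ttl_attribute: str = "myTTL") -> bool:
    """True once the item's TTL value is in the past."""
    return ttl_attribute in item and item[ttl_attribute] <= int(now.timestamp())


created = datetime(2017, 2, 8, 23, 21, 33, tzinfo=timezone.utc)
item = {"Event_id": "e-1", "myTTL": ttl_epoch(created, timedelta(days=30))}
print(item["myTTL"])  # the value shown on the slide
print(is_expired(item, datetime(2017, 3, 11, tzinfo=timezone.utc)))
```

Items removed by TTL appear in the table's stream, which is what lets a stream consumer copy them into the archive table.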

Isolate cold data from hot data
Pre-create daily, weekly, monthly tables
Provision required throughput for current table
Writes go to the current table
Turn off (or reduce) throughput for older tables
OR move items to separate table with TTL
Dealing with time series data
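Routing writes to the current table comes down to deriving the table name from the event time. The sketch below assumes monthly tables named like the Events_table_2016_April examples earlier in the deck; the helper names are hypothetical.

```python
from datetime import datetime

MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def events_table_name(ts: datetime) -> str:
    """Name of the monthly table that holds events for the given time."""
    return f"Events_table_{ts.year}_{MONTH_NAMES[ts.month - 1]}"


def is_current_table(table_name: str, now: datetime) -> bool:
    """True for the table that should carry the full provisioned throughput."""
    return table_name == events_table_name(now)


now = datetime(2016, 4, 15)
print(events_table_name(now))
print(is_current_table("Events_table_2016_March", now))
```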

Product Catalog
Popular items (read)

Scaling bottlenecks
[Diagram: shoppers read Product A and Product B from the ProductCatalog table, which is spread over 50 partitions (Partition 1, Partition K, Partition M, through Partition 50) with 2000 RCUs each.]
$$\frac{100{,}000\ \text{RCU}}{50\ \text{partitions}} \approx 2000\ \text{RCU per partition}$$
SELECT Id, Description, ...
FROM ProductCatalog
WHERE Id="POPULAR_PRODUCT"

[Chart: request distribution per partition key. X-axis: item primary key; y-axis: requests per second; series: DynamoDB requests.]

DAX
DynamoDB Accelerator (DAX)
In-memory write-through caching cluster built specifically for DynamoDB
[Diagram: App → DAX → DynamoDB]

What is DAX?
The benefits of DAX:
• Latency: <1 ms
• Throughput: millions of RPS for a 3-node (highly available) cluster
• Simplified caching: DynamoDB API, AWS integration
• Hot-key/cost savings: reduce over-provisioning for frequently accessed data

DAX Latency
[Diagram: without DAX, the app reads from DynamoDB in ~5 ms. With DAX, the app reads from DAX in ~400 µs, and DAX reads from DynamoDB in ~5 ms.]

How DAX works
Reads: Check DAX first, then read from DynamoDB
[Diagram: inside an Amazon VPC, the app on EC2 issues GetItem through the DAX SDK (steps 1 to 4, synchronous); DAX answers from its cache or reads from DynamoDB.]

How DAX works
Writes: DAX is a write-through cache
[Diagram: inside an Amazon VPC, the app on EC2 issues PutItem through the DAX SDK (steps 1 to 4); the write passes through DAX to DynamoDB.]

How DAX works
DAX is API compatible with DynamoDB
Read APIs: GetItem, BatchGetItem, Query, Scan
Modify APIs: PutItem, UpdateItem, DeleteItem,
BatchWriteItem
Control plane APIs: Not supported (CreateTable,
DeleteTable, etc.)
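The read and write paths described on the preceding slides follow the general write-through cache pattern. The sketch below illustrates that pattern with in-memory dictionaries; it is not DAX's implementation, and the class and method names are hypothetical stand-ins for GetItem and PutItem.

```python
class WriteThroughCache:
    """Toy write-through cache in front of a backing store (a plain dict)."""

    def __init__(self, backing_store: dict):
        self.store = backing_store  # stands in for the DynamoDB table
        self.cache = {}             # stands in for the cache cluster
        self.hits = 0
        self.misses = 0

    def get_item(self, key):
        """Check the cache first, then fall back to the backing store."""
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        item = self.store.get(key)
        if item is not None:
            self.cache[key] = item  # populate the cache for later reads
        return item

    def put_item(self, key, item):
        """Write to the backing store and keep the cache in step."""
        self.store[key] = item
        self.cache[key] = item


table = {"POPULAR_PRODUCT": {"Description": "bestseller"}}
dax = WriteThroughCache(table)
dax.get_item("POPULAR_PRODUCT")  # miss: read from the table
dax.get_item("POPULAR_PRODUCT")  # hit: served from the cache
print(dax.hits, dax.misses)
```

Repeated reads of a hot key are served from the cache, which is how a cache in front of the table relieves the per-partition read bottleneck shown earlier.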

[Chart: request distribution per partition key. X-axis: item primary key; y-axis: requests per second; series: DynamoDB requests and cache hits.]

Requirements for promotion tracking
Monitor customer promotion redemption
Support ad hoc queries from Hive
Archive historical promotion redemption in Redshift
BI requirements are not known in advance

1. Create Orders
2. Capture User Activity
CUSTOMER-ID ORDER-ITEM-ID PROMOTION-ID STATUS
CUST-1 ITEM-a PROMO-PRIMEDAY OK
CUST-1 ITEM-b PROMO-GOLDBOX OK
CUST-1 ITEM-c PROMO-GOLDBOX OK
CUST-3 ITEM-a PROMO-PRIMEDAY RETURNED

BI Questions
1. Which promotions were valued by customers aged 18-24 y/o?
2. How many PrimeDay promotions were returned, by country?
3. How did the PrimeDay sales compare to Black Friday last year?
[Diagram: data available for BI: Amazon S3, Redshift, and other data sources.]

1. Keep all promo redemption history in Redshift
2. Support ad hoc queries with HQL
[Diagram: architecture built from Amazon DynamoDB, DynamoDB Streams, Amazon Kinesis Firehose, Redshift, Amazon EMR, Amazon S3, and other data sources.]

Use DynamoDB Streams to record changes
Use S3 as intermediary data store
Use Hive at low priority for ad hoc queries
Analyze Items out of band
Queries are unknown at design
[Diagram: Amazon DynamoDB, DynamoDB Streams, Amazon S3, Redshift, and other data sources.]

Real-Time Voting
Write-heavy items

Requirements for voting
Allow each person to vote only once
No changing votes
Real-time aggregation
Voter analytics, demographics

Real-time voting architecture
[Diagram: Voters → Voting App → RawVotes Table and AggregateVotes Table]

Scaling bottlenecks
Provision 200,000 WCUs
[Diagram: voters write to Candidate A and Candidate B in the Votes table; Partition 1, Partition K, Partition M, through Partition N each have 1000 WCUs.]

Write sharding
[Diagram: a voter writes to the Votes table, where each candidate is spread across sharded keys Candidate A_1 through Candidate A_8 and Candidate B_1 through Candidate B_8.]

Write sharding
UpdateItem: “CandidateA_” + rand(0, 10)
ADD 1 to Votes
[Diagram: a voter's UpdateItem call lands on one randomly chosen sharded key for the candidate in the Votes table.]

Shard aggregation
[Diagram: a periodic process reads the sharded keys for a candidate from the Votes table, then (1) sums them and (2) stores the result, e.g. Candidate A Total: 2.5M.]

Trade off read cost for write scalability
Consider throughput per partition key and per partition
Shard write-heavy partition keys
Your write workload is not horizontally
scalable
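Write sharding and shard aggregation can be sketched together with an in-memory dictionary standing in for the Votes table. The shard count of 10 follows the `rand(0, 10)` call on the slide, read here as an exclusive upper bound, which is an assumption; the function names are hypothetical.

```python
import random

SHARD_COUNT = 10  # assumption: rand(0, 10) yields suffixes 0 through 9


def sharded_key(candidate: str, rng: random.Random) -> str:
    """Pick one of the candidate's shard keys at random."""
    return f"{candidate}_{rng.randrange(SHARD_COUNT)}"


def cast_vote(votes_table: dict, candidate: str, rng: random.Random) -> None:
    """Equivalent of UpdateItem ... ADD 1 to Votes on a random shard."""
    key = sharded_key(candidate, rng)
    votes_table[key] = votes_table.get(key, 0) + 1


def total_votes(votes_table: dict, candidate: str) -> int:
    """Periodic aggregation: sum every shard that belongs to the candidate."""
    return sum(votes_table.get(f"{candidate}_{i}", 0) for i in range(SHARD_COUNT))


rng = random.Random(42)
votes = {}
for _ in range(1000):
    cast_vote(votes, "CandidateA", rng)
for _ in range(400):
    cast_vote(votes, "CandidateB", rng)
print(total_votes(votes, "CandidateA"), total_votes(votes, "CandidateB"))
```

Each write touches one shard, but reading a total touches every shard for that candidate, which is the read-cost trade-off named above.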

Correctness in voting
1. Record vote and de-dupe; retry
2. Increment candidate counter
RawVotes Table
UserId Candidate Date
Alice A 2016-10-02
Bob B 2016-10-02
Eve B 2016-10-02
Chuck A 2016-10-02
AggregateVotes Table
Segment Votes
A_1 23
B_2 12
B_1 14
A_2 25
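The two steps can be sketched with dictionaries standing in for the RawVotes and AggregateVotes tables. In DynamoDB, step 1 could be a PutItem with a condition such as `attribute_not_exists(UserId)`; here the membership check plays that role. The class and method names are hypothetical.

```python
class VoteRecorder:
    """Toy model of: 1. record vote and de-dupe, 2. increment counter."""

    def __init__(self):
        self.raw_votes = {}  # UserId -> (candidate, date)
        self.aggregate = {}  # segment, e.g. "A_1" -> vote count

    def record_vote(self, user_id: str, candidate: str, date: str, segment: str) -> bool:
        # Step 1: a conditional write that fails if the user already voted.
        if user_id in self.raw_votes:
            return False
        self.raw_votes[user_id] = (candidate, date)
        # Step 2: increment the counter for the candidate's segment.
        self.aggregate[segment] = self.aggregate.get(segment, 0) + 1
        return True


recorder = VoteRecorder()
print(recorder.record_vote("Alice", "A", "2016-10-02", "A_1"))  # first vote accepted
print(recorder.record_vote("Alice", "B", "2016-10-02", "B_1"))  # duplicate rejected
print(recorder.aggregate)
```

In this toy both steps run in one call. Against real tables they are two separate writes, so a failure between them leaves the counter behind the raw votes.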

Correctness in aggregation?
RawVotes Table
UserId Candidate Date
Alice A 2016-10-02
Bob B 2016-10-02
Eve B 2016-10-02
Chuck A 2016-10-02
AggregateVotes Table
Segment Votes
A_1 23
B_2 12
B_1 14
A_2 25

DynamoDB Streams
Stream of updates to a table
Asynchronous
Exactly once
Strictly ordered
• Per item
Highly durable
• Scale with table
24-hour lifetime
Sub-second latency

View types
UpdateItem (Name = John, Destination = Pluto)
What each view type writes to the stream:
• Old image (before update): Name = John, Destination = Mars
• New image (after update): Name = John, Destination = Pluto
• Old and new images: Name = John, Destination = Mars; Name = John, Destination = Pluto
• Keys only: Name = John
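The four view types differ only in which images are attached to each stream record. The sketch below reproduces the John example; the `Keys`, `OldImage`, and `NewImage` field names and the view-type constants follow the DynamoDB Streams API, while plain Python values stand in for typed attribute values and the function name is hypothetical.

```python
VIEW_TYPES = ("KEYS_ONLY", "NEW_IMAGE", "OLD_IMAGE", "NEW_AND_OLD_IMAGES")


def stream_record(view_type: str, keys: dict, old_image: dict, new_image: dict) -> dict:
    """Return the part of a stream record that the view type exposes."""
    if view_type not in VIEW_TYPES:
        raise ValueError(f"unknown view type: {view_type}")
    record = {"Keys": keys}  # keys are always present
    if view_type in ("OLD_IMAGE", "NEW_AND_OLD_IMAGES"):
        record["OldImage"] = old_image
    if view_type in ("NEW_IMAGE", "NEW_AND_OLD_IMAGES"):
        record["NewImage"] = new_image
    return record


keys = {"Name": "John"}
old = {"Name": "John", "Destination": "Mars"}
new = {"Name": "John", "Destination": "Pluto"}
for view_type in VIEW_TYPES:
    print(view_type, stream_record(view_type, keys, old, new))
```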

DynamoDB Streams and Amazon Kinesis Client Library
[Diagram: a DynamoDB client application sends updates to a table with Partitions 1 to 4; the table's stream has Shards 1 to 4, each read by a KCL worker in an Amazon Kinesis Client Library application.]

Cross-region replication
[Diagram: DynamoDB Streams feeds the open source cross-region replication library, which keeps replicas in sync across US East (N. Virginia), Asia Pacific (Sydney), and EU (Ireland).]

Real-time voting architecture (improved)
[Diagram: Voters → Voting App → RawVotes Table → RawVotes DynamoDB Stream → your Amazon Kinesis-enabled app → AggregateVotes Table, Amazon Redshift, and Amazon EMR.]

Analytics with DynamoDB Streams
Collect and de-dupe data in DynamoDB
Aggregate data in-memory and flush periodically
Performing real-time aggregation and analytics
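The aggregate-in-memory-and-flush step can be sketched as a small consumer. To keep the example deterministic it flushes after a fixed number of records rather than on a timer, which is an assumption, as are the simplified record shape (`{"candidate": ...}`), the dictionary standing in for the AggregateVotes table, and all names.

```python
from collections import Counter


class VoteAggregator:
    """Buffer vote counts in memory and flush them to a sink in batches."""

    def __init__(self, sink: dict, flush_every: int):
        self.sink = sink                # stands in for the AggregateVotes table
        self.flush_every = flush_every  # records to buffer before flushing
        self.pending = Counter()
        self.buffered = 0

    def on_record(self, record: dict) -> None:
        """Handle one (simplified) stream record for a new vote."""
        self.pending[record["candidate"]] += 1
        self.buffered += 1
        if self.buffered >= self.flush_every:
            self.flush()

    def flush(self) -> None:
        """Apply the buffered counts to the sink with one update per candidate."""
        for candidate, count in self.pending.items():
            self.sink[candidate] = self.sink.get(candidate, 0) + count
        self.pending.clear()
        self.buffered = 0


totals = {}
aggregator = VoteAggregator(totals, flush_every=3)
for candidate in ["A", "B", "A", "A"]:
    aggregator.on_record({"candidate": candidate})
print(totals)       # only the first three records have been flushed
aggregator.flush()  # flush the remainder
print(totals)
```

Batching this way turns many per-vote counter writes into one write per candidate per flush.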
