In November 2015, Capital Games launched a mobile game accompanying a major feature film release. The back end of the game is hosted in AWS and uses big data services like Amazon Kinesis, Amazon EC2, Amazon S3, Amazon Redshift, and AWS Data Pipeline. Capital Games will describe some of the challenges in their initial setup and usage of Amazon Redshift and Amazon EMR. They will then go over their engagement with AWS Partner 47Lining and talk about specific best practices regarding solution architecture, data transformation pipelines, and system maintenance using AWS big data services. Attendees of this session should expect a candid view of the process of implementing a big data solution, from problem statement identification to visualizing data, with an in-depth look at the technical challenges and hurdles along the way.
3. What to Expect from the Session
• Analytics Architecture
• Challenges
• Effective patterns for ingest, de-dup, aggregate, and vacuum into Redshift
• How to balance rapid ingest and query speeds
• Strategies for data partitioning / orchestration
• Best practices for schema optimization, performant data summaries, and incremental updates
• And how we built a Redshift solution to ingest 1 billion rows of data per day
6. Life Before Redshift
• External solutions
• “One size fits all” for processing all games
• Serves the needs of central teams, but no focus on the game team and no dedicated resources for us
• Lack of depth in data
• Client driven
7. Vision
• Discover how players play our game
• Drive better feature development
• Healthier operations through data
• Rapid iteration and evolution of telemetry gathering
• Decoupled from game server
• Frictionless access to data
• Easily query-able data
• Wall displays
9. Architecture – Persisting to S3
[Diagram: Game Clients put events to Game Servers, which publish them to Amazon Kinesis; an S3 worker consumes the stream and writes batches to an S3 bucket]
10. Architecture – Game Client
• iOS/Android clients
• Produces client specific events like screen transitions
• Events are batched up and sent to the game server every minute
• In between flushes to the server, events are persisted to disk
• If the client crashes, events will be sent on the next session
11. Architecture - Game Server
• EC2 Instances - Tomcat/Java
• Produces the majority of events
• Events are sent asynchronously to Kinesis
• An ActiveMQ broker is responsible for the durability of the message
• Persisted to disk until sent
• Retries with exponential backoff
• Dead Letter Queue
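The retry behavior above is a standard delivery loop. Below is a minimal Python sketch of it (the real broker is ActiveMQ on the Java server); `publish` and `dead_letter` are hypothetical callables standing in for the Kinesis client and the dead letter queue.

```python
import time

def send_with_backoff(publish, record, max_attempts=5, base_delay=0.5,
                      dead_letter=None, sleep=time.sleep):
    """Retry delivery with exponential backoff, then hand the record to a
    dead-letter queue. All parameter defaults are illustrative."""
    for attempt in range(max_attempts):
        try:
            publish(record)
            return True
        except Exception:
            # Backoff doubles each attempt: 0.5s, 1s, 2s, 4s, ...
            sleep(base_delay * (2 ** attempt))
    if dead_letter is not None:
        dead_letter(record)
    return False
```

Records that exhaust their retries land in the dead letter queue instead of being dropped, so transient Kinesis throttling never loses events.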
12. Architecture - Kinesis
• One Kinesis stream with 10 shards partitioned by event UUID
• 24 hour retention
• Provides fault tolerance to the game server: Redshift can be offline and the game server isn't impacted
• The game server batches many events into one Kinesis record on every client request
• Records are compressed
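Packing many events into one compressed record keeps each put well under the 1 MB Kinesis record limit and cuts shard throughput usage. A sketch of the pack/unpack pair, assuming gzip over JSON (the talk does not specify the actual encoding):

```python
import gzip
import json

def pack_events(events):
    """Batch a list of event dicts into one compressed Kinesis record
    payload. Encoding is an assumption (gzip over a JSON array)."""
    return gzip.compress(json.dumps(events).encode("utf-8"))

def unpack_events(blob):
    """Inverse of pack_events, as the S3 worker would run it."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))
```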
13. Architecture – S3 Kinesis Worker
• Elastic Beanstalk
• Decompress records
• Transform hierarchical JSON structure into a flat structure
• Patch missing data (e.g., PlayerId)
• Clean/truncate data (e.g., 0/1 -> true/false)
• Filter out unrepairable data (e.g., bad timestamps)
• Report operational metrics
• Write to S3 when thresholds are met
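The worker's repair steps (flatten, patch, clean, filter) can be sketched as a single transform. Field names like `player_id` and the `_flag` suffix convention are illustrative assumptions, not the production schema:

```python
def flatten(obj, prefix=""):
    """Flatten hierarchical JSON into a flat column namespace."""
    flat = {}
    for key, value in obj.items():
        name = prefix + "_" + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

def transform(raw, fallback_player_id=None, min_ts=0, max_ts=2**31):
    """Sketch of the worker's pipeline: flatten, patch a missing PlayerId,
    normalize 0/1 flags to booleans, and drop rows whose timestamp is
    unrepairable. Returns None for rows that should be filtered out."""
    row = flatten(raw)
    if not row.get("player_id"):
        row["player_id"] = fallback_player_id      # patch missing data
    for k, v in row.items():
        if isinstance(v, int) and k.endswith("_flag"):
            row[k] = bool(v)                       # 0/1 -> true/false
    ts = row.get("timestamp")
    if not isinstance(ts, (int, float)) or not (min_ts <= ts <= max_ts):
        return None                                # filter unrepairable data
    return row
```

Rows returned as `None` are dropped before the batch is written to S3, so only loadable data ever reaches the COPY step.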
15. Architecture - S3 to Redshift
[Diagram: an Ingest Data Pipeline copies data from S3 into Amazon Redshift; an Elastic Beanstalk app runs the DeDupe & Analyze and Vacuum steps; a separate Data Pipeline runs the SQL ETL]
16. Architecture – Ingest Data Pipeline
• Every hour, a Data Pipeline job bulk inserts all S3 files into the EventsDups table
• Copy EventsDups from s3://sw-prod-kinesis-events/#{format(minusHours(@scheduledStartTime,1),'YYYY/MM/dd/HH/')}
• Monitor for failures!
• Consider manifest-driven ingest next time!
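The Data Pipeline expression above resolves to the previous hour's S3 prefix. A Python equivalent makes the resolution explicit; the COPY statement here is a sketch (credentials and format options omitted):

```python
from datetime import datetime, timedelta

def ingest_copy_sql(scheduled_start, bucket="sw-prod-kinesis-events",
                    table="EventsDups"):
    """Build the hourly COPY statement that the pipeline expression
    #{format(minusHours(@scheduledStartTime,1),'YYYY/MM/dd/HH/')}
    resolves to. Illustrative only, not the exact production statement."""
    hour = scheduled_start - timedelta(hours=1)
    prefix = hour.strftime("%Y/%m/%d/%H/")
    return f"COPY {table} FROM 's3://{bucket}/{prefix}'"
```

Prefix-based COPY loads every object under the hour folder; a manifest file (as the slide suggests for next time) would instead pin the exact set of files, making reruns deterministic.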
18. Architecture – Deduper
• Why deduplication?
• Redshift doesn't enforce uniqueness constraints
• Distributed systems are complicated; allow for retries when in doubt
• Data Pipeline jobs can fail; allow one to rerun ingest
19. Architecture – Deduper Implementation
• It is critical to create a proper definition of duplicates
• Not based on all columns being the same
• Using the unique set of event-identifying columns, events can be deduplicated both in the ingest table and against the events table
20. Architecture – Deduper Implementation
• A Beanstalk web app polls the EventsDups table for work
• Deduplication is performed using the following columns to establish uniqueness:
• Raw Event Timestamp: timestamp for the event
• User Id: player identifier
• Session Id: unique to each session
• Step: each event gets a unique number generated from a memcached increment operation
• Event Type: integer unique to each event type
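The identity-tuple idea is the important part: two rows are duplicates when these five columns match, even if other columns differ. A Python sketch of that rule (in production this is SQL against Redshift; column names here are illustrative):

```python
# The identifying columns from the slide, as a fixed tuple
DEDUPE_COLUMNS = ("raw_event_timestamp", "user_id", "session_id",
                  "step", "event_type")

def dedupe(rows, seen=None):
    """Deduplicate events on the identifying-column tuple, not on
    full-row equality. `seen` can carry keys already present in the
    main events table, so the same rule covers both cases."""
    seen = set() if seen is None else set(seen)
    unique = []
    for row in rows:
        key = tuple(row[c] for c in DEDUPE_COLUMNS)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```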
21. Schema : Events
• Ingest Time: Unix time UTC when the event is captured on Kinesis
• Stat Date: raw event timestamp in yyyy-mm-dd format
• Player Id: randomly generated UUID (distribution key)
• Raw Event Timestamp: Unix time UTC when the event is triggered on the server
• Event Type: integer unique to each event type (2924 = BattleSummaryEvent)
• Standard Fields: Country, Device, Network, Platform...
• Event Value 1-10: for each event type, a set of 10 fields can be set
22. Architecture – Vacuum
• Why Vacuum?
• Reclaim and reuse space that is freed when you delete and update rows ... we only insert ...
• Ensure new data is properly sorted with the existing table
• This is important in providing quality statistics to the query optimizer
• We vacuum once a day, which balances the time to vacuum against the ability to provide performant statistics
23. Architecture – Analyze
• Why Analyze?
• Any time one adds (modifies, or deletes) a significant number of rows, you should run the ANALYZE command to maintain the query optimizer's statistics
• This occurs when the table is vacuumed
• We analyze on every 4th successful deduper run
• Analyze is resource intensive; balance the time to analyze against the optimizer's ability to generate good plans
24. Architecture – ETL – User Retention Daily
• Data Pipeline scheduled once an hour, along with many other aggregate tables
• Upsert into the table, looking back a week into the events table
• Executed after the users aggregate table is updated
• PlayerId: randomly generated UUID (distribution key)
• Platform: Apple/Google
• Country: US, GB...
• Stat Date: row for every day the player has played
• Days In Game: number of days in game
• Revenue: summary revenue data
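Redshift had no native UPSERT, so the usual implementation of an hourly upsert like this is the stage-delete-insert merge pattern: load new rows into a staging table, delete matching rows from the target, insert the stage, all in one transaction. A sketch that generates that SQL (table and key names are illustrative, not EA's actual DDL):

```python
def merge_sql(target, stage, keys):
    """Generate the classic Redshift merge (upsert) transaction.
    Assumes `stage` has already been populated by the aggregate query."""
    on = " AND ".join(f"{target}.{k} = {stage}.{k}" for k in keys)
    return (
        f"BEGIN;\n"
        f"DELETE FROM {target} USING {stage} WHERE {on};\n"
        f"INSERT INTO {target} SELECT * FROM {stage};\n"
        f"DROP TABLE {stage};\n"
        f"END;"
    )
```

Because the delete and insert are in one transaction, readers never see the table with the week's rows missing.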
25. Architecture – Scaling Growth
[Growth chart: event volume climbing to 1 billion events, annotated with the worldwide launch of The Force Awakens]
29. Goals of Sorting
• Physically sort data within blocks and throughout a table
• Enable range-restricted scans (block rejection) to prune blocks by leveraging zone maps
• Optimal SORTKEY is dependent on:
• Query patterns
• Data profile
• Business requirements
30. Choosing a SORTKEY
COMPOUND
• Most common
• Well-defined filter criteria
• Time-series data
• Primarily as a query predicate (date, identifier, ...)
• Optionally, choose a column frequently used for aggregates
• For time-series data, optimally the newest data lands at the "end" of the table
INTERLEAVED
• Edge cases
• Large tables (> billion rows)
• No common filter criteria
• Non-time-series data
31. Best Practices for Time-Series Data
http://docs.aws.amazon.com/redshift/latest/dg/vacuum-load-in-sort-key-order.html
It is important to have sort keys that ensure that new data is "located", per sort key order, at the end of the time-series table.
32. Events Out of Time
[Diagram: events arriving out of time order in the ingest table end up scattered throughout the events table post-vacuum]
33. Altered Timestamp
By creating a synthetic timestamp sort key, the incoming rows all vacuum to the end of the main events table.
[Diagram: ingest table rows flowing to the tail of the events table]
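One way to realize the synthetic timestamp (consistent with slide 38's "same truncated synthetic timestamp") is to derive the leading sort value from ingest time, truncated to a window, so every incoming batch sorts after all existing rows. The window size and exact derivation are assumptions; a minimal sketch:

```python
def synthetic_sort_ts(ingest_ts, window=3600):
    """Derive the leading sort key from the ingest timestamp, truncated
    to a window (hourly here; the real window is an assumption). All
    rows in the same ingest window share one value, and every new batch
    sorts after existing data, so vacuum only touches the table's tail."""
    return ingest_ts - (ingest_ts % window)
```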
34. Best Practice for Sort Key Selection
http://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html
Compound Sort Key:
A compound key is made up of all of the columns listed in the sort key definition, in the order they are listed. A compound sort key is most useful when a query's filter applies conditions, such as filters and joins, that use a prefix of the sort keys. The performance benefits of compound sorting (may) decrease when queries depend only on secondary sort columns, without referencing the primary columns.
38. Balancing Vacuum and Query Speed
• Four ingest batches come in with the same truncated synthetic timestamp
• After vacuum, the secondary and tertiary sort keys reorder the rows, improving sorting power for these later sort keys
• Vacuum time grows as the number of overlapping batches increases
• Improved grouping of the secondary and tertiary sort key values improves query speed where these are used
[Diagram: pre-vacuum vs. post-vacuum block layout]
40. Goals of Distribution
• Distribute data evenly for parallel processing
• Minimize data movement
• Co-located joins
• Localized aggregations
[Diagram of the three distribution styles across node slices:]
• Key: rows with the same distribution key go to the same slice
• All: a full copy of the table is placed on the first slice of every node
• Even: round-robin distribution across all slices
41. Choosing a Distribution Style
Key
• Large FACT tables
• Rapidly changing tables used
in joins
• Localize columns used within
aggregations
All
• Have slowly changing data
• Reasonable size (i.e., a few million rows, but not hundreds of millions)
• No common distribution key
for frequent joins
• Typical use case: joined
dimension table without a
common distribution key
Even
• Tables not frequently joined or
aggregated
• Large tables without
acceptable candidate keys
42. Best Practice for Distribution
http://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html
Data redistribution can account for a substantial portion of the cost of a query plan, and the network traffic it generates can affect other database operations and slow overall system performance.
1. Distribute the workload uniformly among the nodes in the cluster. Uneven distribution, or data distribution skew, forces some nodes to do more work than others, which impairs query performance.
2. Minimize data movement during query execution. If the rows that participate in joins or aggregates are already collocated on the nodes with their joining rows in other tables, the optimizer does not need to redistribute as much data during query execution.
43. Unauthenticated (Anonymous) Events
[Diagram: events table spread across Slice 0 ... Slice N, showing node-level and slice-level skew]
A small percentage of unauthenticated events located on a single slice of a large cluster leads to significant skew, because anonymous events share the same placeholder PlayerId and therefore hash to the same slice.
44. Split Events Tables
[Diagram: authenticated and unauthenticated events in separate tables, each spread across Slice 0 ... Slice N]
By splitting events into two tables, query speed was improved because unauthenticated events no longer unbalance the distribution.
A UNION ALL view can be used to query all event data when needed.
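The UNION ALL view mentioned on the slide is straightforward; a sketch that generates it (view and table names are illustrative):

```python
def union_view_sql(view, tables):
    """Generate a UNION ALL view over the split event tables, so queries
    that need all events can use one name. Assumes the tables share a
    schema, which a split of a single events table guarantees."""
    selects = "\nUNION ALL\n".join(f"SELECT * FROM {t}" for t in tables)
    return f"CREATE VIEW {view} AS\n{selects};"
```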
45. Long Deduplication Time
Incoming events need to be scrubbed to prevent duplicate events, and duplicates are removed from the incoming data. Scanning the full events table for deduplication slows as the events table grows.
[Diagram: ingest table checked against the full events table]
46. Time-Restricted Deduplication
Incoming events are evaluated for ranges on specific columns, so the scan of the main events table can be limited to the range of the incoming events.
[Diagram: ingest table checked against only a time slice of the events table]
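The range restriction can be sketched as: compute the min/max of the batch's timestamp column, then bound the duplicate check with a BETWEEN predicate so the sort key's zone maps prune most of the events table. Column and table names below are illustrative:

```python
def dedupe_scan_range(incoming_rows, ts_column="raw_event_timestamp"):
    """Compute the timestamp range of an incoming batch so the duplicate
    check against the main events table can be range-restricted instead
    of a full-table scan."""
    stamps = [r[ts_column] for r in incoming_rows]
    return min(stamps), max(stamps)

def range_restricted_delete_sql(lo, hi):
    """Sketch of the range-restricted duplicate delete, using the
    identifying columns from slide 20 (illustrative SQL)."""
    return (f"DELETE FROM events_dups USING events\n"
            f"WHERE events.raw_event_timestamp BETWEEN {lo} AND {hi}\n"
            f"  AND events.raw_event_timestamp = events_dups.raw_event_timestamp\n"
            f"  AND events.user_id = events_dups.user_id\n"
            f"  AND events.session_id = events_dups.session_id\n"
            f"  AND events.step = events_dups.step\n"
            f"  AND events.event_type = events_dups.event_type;")
```

Because the events table is sorted by time, the BETWEEN predicate lets Redshift reject whole blocks via zone maps, keeping deduplication time roughly proportional to the batch window rather than to table size.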