Grokking Techtalk #37: Data intensive problem

Ho Nguyen
• Senior Software Engineer
• Technical Interests:
• Solution & code design
• Distributed systems
• Video/Image encoding
• Hobbies
• Movies & music
• Manga & anime (One Piece, Dragon Ball...)
• Coffee lover

Data-intensive problem
Ho Nguyen
Senior Software Engineer

Outline
• Simple problem
• When the data is big
• More problems
• Approaches

How big is the data?
• A data set of 2 billion records of
unique URLs
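• Assuming the previous program
needs 2 seconds per URL and handles one URL at a time =>
Concurrency number (throughput) = 0.5 URL/s
• Time to process the whole data set sequentially:
$$\frac{2 \times 2 \times 10^{9}}{3600 \times 24} \approx 46296 \text{ days} \approx 127 \text{ years}$$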
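The estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide; the function name `sequential_days` is an illustrative choice, not something from the talk.

```python
SECONDS_PER_DAY = 3600 * 24


def sequential_days(total_urls: int, seconds_per_url: float) -> float:
    """Days needed when URLs are processed strictly one after another."""
    return total_urls * seconds_per_url / SECONDS_PER_DAY


# 2 billion URLs at 2 seconds each, as stated on the slide.
days = sequential_days(2_000_000_000, 2.0)
years = days / 365
print(f"{days:.0f} days, about {years:.0f} years")
```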

What is the concurrency number we need to
complete the dataset in X days?

What is the concurrency number we need?
• Goal: X = 7 days
• 2 billion URLs
• Current concurrency: 0.5 URL/s
$$\frac{2 \times 10^{9}}{X \times 3600 \times 24} = \frac{2 \times 10^{9}}{7 \times 3600 \times 24} \approx 3307 \text{ URLs/s}$$
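The same calculation as a small helper, so other deadlines can be tried. The names `required_throughput` and `speedup` are illustrative; the inputs (2 billion URLs, 7 days, 0.5 URL/s) come from the slide.

```python
SECONDS_PER_DAY = 3600 * 24


def required_throughput(total_urls: int, days: float) -> float:
    """URLs per second needed to finish total_urls within the given days."""
    return total_urls / (days * SECONDS_PER_DAY)


target = required_throughput(2_000_000_000, 7)
# How many times faster than the current 0.5 URL/s we have to become.
speedup = target / 0.5
print(f"target: {target:.0f} URLs/s, speedup: {speedup:.0f}x")
```

Changing `days` shows how quickly the required rate grows as the deadline shrinks.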

How to increase concurrency?
• Optimize code performance
• Increase hardware resources (CPU,
RAM, disk, network…), aka scale-up
• Scale-out
• Cloning to multiple processes
(X-Axis)
• Splitting by functions (Y-Axis)
• Data partitioning (Z-Axis)

Optimize code
• Pros
• Most effective if we find a bottleneck whose fix raises throughput
about 6,614 times (from 0.5 to 3307 URLs/s, an increase of roughly 661,300%)
• Save infrastructure cost
• Cons
• Time consuming and uncertain

Scale-up
• Pros
• Easy to apply
• Cons
• Takes time to find a suitable
hardware configuration
• Expensive and limited
• Still need to optimize code and
redesign to take advantage of
hardware resources once we cannot scale up any further

Scale-out by cloning (X-Axis)
• Pros
• Can use all hardware
resources
• Not limited by hardware
• Cons
• More complex than scale-up
• Concurrency problems
[Diagram: the same service cloned onto Node 1, Node 2 and Node 3]
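A minimal sketch of cloning, assuming threads in one process stand in for the cloned nodes. `process_url` is a hypothetical placeholder for the real download and face-detection work.

```python
from concurrent.futures import ThreadPoolExecutor


def process_url(url: str) -> str:
    """Hypothetical stand-in for the real per-URL work."""
    return f"processed:{url}"


def run_clones(urls, clones: int):
    # Every worker runs the same code; the pool submits each URL exactly once.
    with ThreadPoolExecutor(max_workers=clones) as pool:
        return list(pool.map(process_url, urls))


results = run_clones([f"https://example.com/{i}.jpg" for i in range(6)], clones=3)
```

Inside one process the pool decides which worker gets which URL. Across separate machines nothing does that automatically, which is where the concurrency problems listed above come from.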

Scale-out by Splitting (Y-Axis)
Review the workflow

• Download and resize image using CPU
• Face detection on GPU is faster
Reference: https://sites.google.com/site/facedetectionongpu/

[Diagram: X-axis (cloning) shows three "Download and Process Image" instances; Y-axis (splitting) shows two "Face Detection" instances]

• Pros
• Each step runs on the hardware that suits it (CPU for download and resize, GPU for face detection)
• Cons
• Complex
• Concurrency problems
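The split into two functions can be sketched as a two-stage pipeline. This is an illustration under assumptions: both stages are threads in one process, the work is faked with strings, and the names `downloader`, `face_detector` and `run_pipeline` are hypothetical.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream


def downloader(urls, out_q):
    """Stage 1 (CPU in the talk): pretend to download and resize each image."""
    for url in urls:
        out_q.put({"url": url, "image": f"resized({url})"})
    out_q.put(SENTINEL)


def face_detector(in_q, results):
    """Stage 2 (GPU in the talk): pretend to detect faces on each image."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        results.append((item["url"], "faces-detected"))


def run_pipeline(urls):
    q = queue.Queue(maxsize=100)  # bounded, so stage 1 cannot run far ahead
    results = []
    stages = [
        threading.Thread(target=downloader, args=(urls, q)),
        threading.Thread(target=face_detector, args=(q, results)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return results


output = run_pipeline(["a.jpg", "b.jpg"])
```

Because the stages only share the queue, each one could be moved to its own machine type and scaled separately.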

Scale-out by data-partitioning (Z-Axis)
Data schema
ID | URL | Done
--- | --- | ---
1 | https://abc.com/image1.jpg | 1

[Diagram: the table split into two partitions by key hashing, and into two partitions by range (range based)]

• Pros
• Increase database performance
• Reduce locking, or avoid it entirely (non-locking)
• Cons
• Increase maintenance and infrastructure cost
• Hard to scale automatically
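The two partitioning schemes named above can be sketched as functions that map a record ID to a partition number. The assumptions here are mine: IDs are 0-based integers, a plain modulo serves as the hash, and `total` is at least `partitions`.

```python
def hash_partition(record_id: int, partitions: int) -> int:
    """Key hashing: neighbouring IDs are spread over different partitions."""
    return record_id % partitions


def range_partition(record_id: int, total: int, partitions: int) -> int:
    """Range based: consecutive IDs land in the same partition."""
    size = total // partitions  # assumes total >= partitions
    # The last partition also absorbs the remainder.
    return min(record_id // size, partitions - 1)


by_hash = [hash_partition(i, 3) for i in range(10)]
by_range = [range_partition(i, 10, 3) for i in range(10)]
```

Hashing balances load evenly but scatters neighbouring IDs; ranges keep neighbours together, which makes the boundaries easy to reason about.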

Summary
• Skip the code optimization approach
• Skip the scale-up approach
• Focus on scale-out approaches
• We can increase the number of
processes/machines to increase the
concurrency number
• We can split into 2 services: Downloader and
Face Detector
• We may need data partitioning to optimize
database performance

Race condition
• Cause
• The same URL is processed twice or
more
• Impact
• Waste of resources
• Data corruption
• Faking concurrency

Race condition: How to solve?
• Distributed locks
  • Pros
    • N/a
  • Cons
    • Pessimistic locking hurts
      performance
    • Hard to apply because we need to
      synchronize multiple nodes
    • Poor fault tolerance
• Data sharding
  • Pros
    • High performance because the load
      is shared (physical shard)
  • Cons
    • Hard to scale
    • Increases maintenance & infrastructure
      cost
• Queue/Worker
  • Pros
    • Easy to implement
    • Easy to scale
    • Good fault tolerance
    • Reusable communication
  • Cons
    • The load concentrates on the
      queue, so it can become a
      bottleneck

Race condition: root cause
The race condition only occurs
between Downloaders
=> If we find a way to
give each Downloader its own
unique URLs, the race condition
is solved for the whole
system.
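One way to hand each URL to exactly one Downloader is the Queue/Worker option. The sketch below assumes an in-process `queue.Queue` standing in for a real message queue; the worker names and the `distribute` helper are hypothetical.

```python
import queue
import threading


def worker(name, tasks, handled, lock):
    """Take URLs off the shared queue until it is empty."""
    while True:
        try:
            url = tasks.get_nowait()  # each item is removed by exactly one worker
        except queue.Empty:
            return
        with lock:
            handled.append((name, url))


def distribute(urls, workers=3):
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)
    handled, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(f"downloader-{i}", tasks, handled, lock))
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled


urls = [f"https://example.com/{i}.jpg" for i in range(50)]
handled = distribute(urls)
```

No URL appears twice in `handled`, however the workers interleave, because taking an item from the queue removes it.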

Fault Tolerance
• Faults
• Network fault
• Network interruption
• IP Blocking
• Service crash
• Problems
• Can data be lost?
• Can the service restart and
continue to work on remaining
tasks?

Fault Tolerance criteria
Given | When | Then
--- | --- | ---
A service crashed | It restarted | No rework (continue on remaining items only)
Downloader service is running | It crashed | No downloaded image should be lost
FaceDetector service is running | It crashed | No detection result should be lost
Downloader is downloading an image | A network error happens | Retry
Downloader retries downloading an image | The network error is IP blocking | Should rotate proxy to change the IP
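The last two criteria (retry on network errors, rotate the proxy on IP blocking) can be sketched as follows. The exception classes, the `fetch(url, proxy)` callable and the proxy names are all hypothetical placeholders, and a real version would likely add a delay between attempts.

```python
import itertools


class NetworkError(Exception):
    """Any transient network failure."""


class IPBlocked(NetworkError):
    """The remote side rejected our current IP."""


def download_with_retry(url, fetch, proxies, max_attempts=5):
    """Retry on network errors; switch proxy when the error is an IP block."""
    proxy_cycle = itertools.cycle(proxies)
    proxy = next(proxy_cycle)
    last_error = None
    for _ in range(max_attempts):  # assumes max_attempts >= 1
        try:
            return fetch(url, proxy)
        except IPBlocked as err:
            last_error = err
            proxy = next(proxy_cycle)  # rotate to change the IP
        except NetworkError as err:
            last_error = err  # plain retry through the same proxy
    raise last_error


calls = []


def fake_fetch(url, proxy):
    """Test double: the first proxy is blocked, the second works."""
    calls.append(proxy)
    if proxy == "proxy-1":
        raise IPBlocked(proxy)
    return b"image-bytes"


data = download_with_retry("https://abc.com/image1.jpg", fake_fetch, ["proxy-1", "proxy-2"])
```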

Service communication
• How do the services communicate?
• Do we need a load balancer?

Service communication methods
Type | Method | Pros | Cons
--- | --- | --- | ---
Synchronous | HTTP | Familiar and simple to use | Needs a load balancer; tight coupling; the calling thread blocks while waiting for the response
Synchronous | RPC | Higher performance than HTTP |
Asynchronous | Queue messaging (one-to-one) | High performance; failure isolation; acts as a load balancer; reduced coupling | Extra maintenance cost; the queue may become a bottleneck
Asynchronous | Publish/Subscribe (one-to-many) | | We only need one-to-one communication

Summary
• Find an approach to distribute unique URLs to downloaders
• The approach should pass the fault tolerance criteria
• We can use the communication methods table to choose
the final solution

Approach 1: Range based physical shard
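$n$: total URLs
$m$: number of partitions
$i \in [0 \ldots m-1]$: partition number
$k = \left\lfloor \frac{n}{m} \right\rfloor$: number of URLs in a partition
$$\text{start}_i = k \cdot i$$
$$\text{end}_i = \begin{cases} \text{start}_i + k, & 0 \le i < m-1 \\ \text{start}_i + k + (n \bmod m), & i = m-1 \end{cases}$$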
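The range formulas translate directly into code. This sketch assumes integer division for k and treats each range as half-open, [start, end); the function name `partition_bounds` is illustrative.

```python
def partition_bounds(n: int, m: int):
    """Return (start, end) for each of the m partitions over n URLs."""
    k = n // m  # URLs per partition, ignoring the remainder
    bounds = []
    for i in range(m):
        start = k * i
        end = start + k
        if i == m - 1:
            end += n % m  # the last partition takes the leftover URLs
        bounds.append((start, end))
    return bounds


bounds = partition_bounds(10, 3)
print(bounds)
```

With 10 URLs and 3 partitions, k is 3 and the last partition picks up the one leftover URL.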

Approach 1: Range based physical shard
Solve race condition | Fault tolerance | Communication types | Notes
--- | --- | --- | ---
Solved | + No rework; + Need to download the image again if a crash happens during face detection; + A partition can be abandoned | HTTP/gRPC | Pros: non-locking at DB level. Cons: takes time to prepare; hard to scale out/adjust; needs a load balancer

Approach 2: Logical shard
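$n$: total URLs
$m$: number of processes
$\mathit{id} \in [0 \ldots m-1]$: process id
$k = \left\lfloor \frac{n}{m} \right\rfloor$
$i$: index of the URL being processed within a process
$$i \in \begin{cases} [0 \ldots k-1], & \mathit{id} < m-1 \\ [0 \ldots k-1+(n \bmod m)], & \mathit{id} = m-1 \end{cases}$$
$f(i)$: the id of the URL we need to pick
$$\Rightarrow f(i) = \mathit{id} \cdot k + i$$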
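A small sketch of the logical shard: each process computes its own URL ids from f(i) without touching a shared lock. Integer division for k and 0-based ids are assumptions; `url_ids_for_process` is an illustrative name.

```python
def url_ids_for_process(n: int, m: int, pid: int):
    """Yield f(i) = pid * k + i for every i owned by process pid."""
    k = n // m
    # The last process also handles the n mod m leftover URLs.
    count = k + (n % m if pid == m - 1 else 0)
    for i in range(count):
        yield pid * k + i


shards = [list(url_ids_for_process(10, 3, pid)) for pid in range(3)]
print(shards)
```

Every id appears in exactly one shard, which is what removes the race between processes.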

Approach 2: Logical shard
Solve race condition | Fault tolerance | Communication types | Notes
--- | --- | --- | ---
Solved | + No rework; + Need to download the image again if a crash happens during face detection; + A partition can be abandoned | HTTP/gRPC | Pros: non-locking at DB level; simple implementation. Cons: hard to scale out/adjust; high throughput on the database; extra state to maintain (total URLs, current URL id, …)

Approach 3: Queue/Worker x Logical Sharding
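$n$: total URLs
$m$: number of processes
$\mathit{id} \in [0 \ldots m-1]$: process id
$k = \left\lfloor \frac{n}{m} \right\rfloor$
$i$: index of the URL being processed within a process
$$i \in \begin{cases} [0 \ldots k-1], & \mathit{id} < m-1 \\ [0 \ldots k-1+(n \bmod m)], & \mathit{id} = m-1 \end{cases}$$
$f(i)$: the id of the URL we need to pick
$$\Rightarrow f(i) = \mathit{id} \cdot k + i$$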
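The slides do not spell out how the queue and the logical shards are combined. One possible reading, assumed here, is that shard ids travel through the queue and whichever worker picks one up processes the URL ids f(i) of that shard. The in-process `queue.Queue` and all names below are hypothetical.

```python
import queue
import threading


def shard_url_ids(n: int, m: int, shard: int):
    """URL ids f(i) = shard * k + i belonging to one logical shard."""
    k = n // m
    count = k + (n % m if shard == m - 1 else 0)
    return range(shard * k, shard * k + count)


def run(n: int, m: int, workers: int):
    tasks = queue.Queue()
    for shard in range(m):
        tasks.put(shard)  # one message per logical shard

    processed, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                shard = tasks.get_nowait()
            except queue.Empty:
                return
            ids = list(shard_url_ids(n, m, shard))
            with lock:
                processed.extend(ids)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(processed)


done = run(n=10, m=4, workers=2)
```

Under this reading no shard is tied to a particular worker, so any worker can take over a shard that another one never picked up.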

Approach 3: Queue/Worker x Logical Sharding
Solve race condition | Fault tolerance | Communication types | Notes
--- | --- | --- | ---
Solved | + No rework; + Failure isolation; + Nodes are replaceable | Messaging | Pros: easy to scale; easy fault tolerance; failure isolation; asynchronous. Cons: extra infrastructure; high throughput on the queue

Questions
• How to measure and debug service?
• What is deployment process?

Q&A
THANK YOU FOR YOUR ATTENTION
