When you're collecting data from hundreds of millions of devices simultaneously, things get noisy. We go over key problems and solutions for collecting and validating data at scale.
2. Doing things at scale is noisy
• Code is supposed to run the same way every time, but if you run the same loop a million times on a million different machines, how confident are you that it will always behave the same?
3. Data from phones is noisier
• The code runs on tens of thousands of different platforms, with hundreds of thousands of different software configurations, across hundreds of millions of phones
• Platforms can have the craziest settings
4. How data can get messed up
• HTTP requests get mangled in transit
• Phone might not get the acknowledgement from the server
• People’s clocks are off
• People are running weird versions of Android
• Memory/disk corruption
• Gamma ray events
6. Problem: Data gets mangled in transit
• Parameters from POST requests get dropped
• Within a parameter, a chunk of the data may never reach the server
7. Solution: Checksumming
• Send a checksum that’s a function of all the fields
• If the checksum is wrong or missing, the server knows it hasn’t received all the data, and it tells the phone the upload wasn’t successful
• The phone then attempts to re-upload the data
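A minimal sketch of the checksum idea, in Python. The slides do not say how the checksum is computed; MD5 over the sorted, length-prefixed fields is an assumption made here for illustration, and the names compute_checksum, verify_upload, and the "checksum" parameter are hypothetical.

```python
import hashlib


def compute_checksum(fields: dict[str, str]) -> str:
    """Hash every field name and value (assumed scheme, not the original SDK's)."""
    h = hashlib.md5()
    # Sort keys so the client and the server hash fields in the same order.
    for key in sorted(fields):
        # Length-prefix each part so field boundaries are unambiguous.
        for part in (key, fields[key]):
            data = part.encode("utf-8")
            h.update(len(data).to_bytes(4, "big"))
            h.update(data)
    return h.hexdigest()


def verify_upload(params: dict[str, str]) -> bool:
    """Server side: accept only if the checksum is present and matches."""
    received = params.get("checksum")
    if received is None:
        return False  # checksum parameter was dropped in transit
    fields = {k: v for k, v in params.items() if k != "checksum"}
    return compute_checksum(fields) == received


# Client side: attach the checksum to the upload.
payload = {"event": "app_open", "device_id": "abc123"}
request = dict(payload, checksum=compute_checksum(payload))
```

When verify_upload returns False, the server would report a failed upload so that the phone retries. A dropped parameter, a truncated value, and a missing checksum all fail the same check.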
8. Problem: Client sends the same data twice
• How does the phone know that the server has received the data, so that it doesn’t re-upload the same piece of data twice? It gets an acknowledgement back
• How does the server know that the phone has received the acknowledgement? It doesn’t!
• This is equivalent to the Two Generals’ Problem
• About 5% of the requests that the server receives successfully fail to deliver an acknowledgement back to the phone
• That means all counts are inflated by about 5%!
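As a rough check on the 5% figure, here is a small calculation. It assumes (this is not stated in the slides) that the phone keeps re-uploading until it sees an acknowledgement and that each acknowledgement is lost independently with the same probability; expected_copies is a hypothetical helper name.

```python
def expected_copies(ack_failure_rate: float) -> float:
    """Expected number of times the server stores one event without deduplication.

    Each lost acknowledgement triggers one more upload, so the expected count is
    1 + p + p^2 + ... = 1 / (1 - p) for an acknowledgement loss rate p.
    """
    return 1.0 / (1.0 - ack_failure_rate)


inflation = expected_copies(0.05) - 1.0
print(f"{inflation:.1%}")  # prints 5.3%
```

Under these assumptions the inflation is slightly above 5%, because a retried upload can lose its acknowledgement too.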
9. Solution: Deduplication
• Your system must be idempotent at the event level: it must be able to receive an event it has received before without changing its state
• Create a unique key for every event that is sent
• When you see an event, check your list of keys; if the key is already present, discard the event
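The steps above can be sketched as follows. This is an in-memory illustration only: the EventStore class, the event_id field, and the use of a random UUID as the key are assumptions, and a real system would keep the keys in persistent storage.

```python
import uuid


def new_event(name: str) -> dict:
    """Client side: assign the unique key once, when the event is created."""
    return {"event_id": str(uuid.uuid4()), "name": name}


class EventStore:
    """Server side: receiving the same event twice leaves the state unchanged."""

    def __init__(self) -> None:
        self.seen_keys: set[str] = set()
        self.events: list[dict] = []

    def receive(self, event: dict) -> bool:
        key = event["event_id"]
        if key in self.seen_keys:
            return False  # duplicate upload: discard it
        self.seen_keys.add(key)
        self.events.append(event)
        return True


store = EventStore()
event = new_event("app_open")
first = store.receive(event)   # stored
second = store.receive(event)  # re-upload after a lost acknowledgement, discarded
```

The key has to be generated when the event is created, not when it is uploaded, so that a retry carries the same key as the original attempt.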
10. Problem: Clocks are off
• Phones are often offline, so an analytics SDK needs to cache data locally before uploading, including the time the event occurred
• But people’s clocks are often off, occasionally by years!
• We can’t simply use the upload time as the timestamp either: 5% of data is uploaded more than 24 hours after the event happened
11. Solution: Get an estimate of the actual time an event was logged
• Have the phone timestamp the upload as well, using its own clock
• For each event, let’s compare:
  • The difference between the phone event timestamp and the server upload time
  • The difference between the phone upload timestamp and the server upload time
14. Solution: Get an estimate of the actual time an event was logged
• From each event timestamp, subtract the phone’s clock offset, which is the difference between the phone’s upload time and the server’s upload time
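A worked sketch of the correction, using Unix timestamps in seconds. The function and parameter names are hypothetical, and the sketch assumes that network latency is negligible and that the phone’s clock was not changed between the event and the upload.

```python
def corrected_event_time(phone_event_ts: float,
                         phone_upload_ts: float,
                         server_upload_ts: float) -> float:
    """Estimate when an event really happened, in server time."""
    # How far ahead (positive) or behind (negative) the phone's clock is.
    clock_offset = phone_upload_ts - server_upload_ts
    return phone_event_ts - clock_offset


# Example: the phone's clock runs one hour fast, and the event
# happened ten minutes before the upload.
server_upload = 1_000_000
phone_upload = server_upload + 3600   # phone clock is 3600 s ahead
phone_event = phone_upload - 600      # event logged 600 s before upload
estimate = corrected_event_time(phone_event, phone_upload, server_upload)
print(estimate)  # prints 999400, which is 600 s before the server upload time
```

The same subtraction works when the phone’s clock is behind, since the offset is then negative.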
15. Other Problems
• People are running weird versions of Android
• MD5 library
• Memory/disk corruption
• Gamma ray events
17. Questions?
Always happy to talk about analytics problems!
spenser@amplitude.com
blog.amplitude.com
twitter: @amplitudemobile
MOBILE ANALYTICS FOR DECISION MAKERS