Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animations

DNA
Big Data at Ancestry
Bill Yetman, Sr. Director Commerce, Data, Analytics
March 22, 2013

Agenda
• Introduction
• Understanding Big Data Scale
• Big Data Story – How it works
• What is Big Data?
• Ancestry’s Big Data
• Three Big Data Examples
– DNA Matching
– Machine Learning
– Data Mining

Understanding Big Data Scale
Dollar Amounts                                  Data Sizes
$1                                              Byte
$1,000 (Thousands)                              Kilobyte (KB: 10^3)
$1,000,000 (Millions)                           Megabyte (MB: 10^6)
$1,000,000,000 (Billions)                       Gigabyte (GB: 10^9)
$1,000,000,000,000 (Trillions)                  Terabyte (TB: 10^12)
$1,000,000,000,000,000 (Quadrillion)            Petabyte (PB: 10^15)
$1,000,000,000,000,000,000 (Quintillion)        Exabyte (EB: 10^18)
• Before the bubble, $100 million used to
be big; now $1 trillion is routine
• Data has gone through a similar trend
"640K ought to be enough for anybody."
Attributed to Bill Gates, 1981

Big Data at Work
Bill’s Facebook Story
– Signed up in early 2007
– Must have been a visionary
– Not really…
1 Friend for 2 ½ Years
Daughter joined in 2010
…
• Back in 2007, Facebook did not care!
• How Big Data has changed FB …

Big Data at Work (cont’d)
What happens today…
– Fill out your profile and that data is used to hook you into the “social graph”
– FB wants you to make 10 friend requests within the first 5 minutes of signing up
– Within 15 minutes you understand intrinsically how the service works
– Social graph (a huge dataset) is used to engage the individual user

What is Big Data?
• Volume (amount of data)
– Exceeds the processing capacity of conventional database systems
• Velocity (speed of data in/out)
– Needs high performance data pipeline and massively parallel processing
• Variety (range of data types and sources)
– Fluid data requirements where up-front schema design is a poor fit
Big Data Characteristics: Volume, Variety, Velocity

Big Data: Why now?
• The simplified version of Moore’s Law states that processor speed, or overall
processing power for computers, will double every two years.
– Intel co-founder Gordon Moore
• Over the last 30 years, storage space per unit cost has doubled roughly every 14 months
(increasing by an order of magnitude every 48 months).*
– In early 2010, terabyte drives were available at a cost of roughly $100 per TB, or about $0.10 per GB
* http://www.mkomo.com/cost-per-gigabyte
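The two storage figures above (doubling every 14 months, an order of magnitude every 48 months) can be checked against each other with a few lines of arithmetic. This is only a sketch of that check; the variable names are illustrative and the only inputs are the numbers quoted on the slide.

```python
import math

DOUBLING_MONTHS = 14  # doubling period quoted on the slide

# Months needed for a tenfold increase:
# 2 ** (t / 14) == 10  =>  t = 14 * log2(10)
months_per_10x = DOUBLING_MONTHS * math.log2(10)

# Growth factor over the 48 months quoted on the slide
growth_48_months = 2 ** (48 / DOUBLING_MONTHS)

# Growth factor over the 30 years (360 months) the slide refers to
growth_30_years = 2 ** (360 / DOUBLING_MONTHS)

print(f"{months_per_10x:.1f} months for a 10x increase")
print(f"{growth_48_months:.1f}x increase in 48 months")
print(f"{growth_30_years:.2e}x increase in 30 years")
```

The tenfold point falls a little short of 48 months, so the "order of magnitude every 48 months" figure is a reasonable rounding of the 14-month doubling period.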

Big Data Technologies
What is Hadoop?
1. Hadoop* is an open-source platform for processing large
amounts of data in a scalable, fault-tolerant, affordable fashion
2. Hadoop specifies a distributed file system called HDFS
3. Hadoop supports a processing methodology known as
MapReduce
4. Many tools are built on top of Hadoop, such as HBase, Hive, and
Flume
* http://en.wikipedia.org/wiki/Apache_Hadoop
4 Hadoop definition points are from a presentation by Jeremy Pollack, Ancestry.com 3/2/2013
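To make the MapReduce idea concrete, here is a minimal single-process sketch of the classic word-count example in plain Python. It does not use Hadoop itself; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and only mimic the three stages a MapReduce framework runs across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data at Ancestry", "big data scale"])
print(counts)
```

In a real cluster the map calls run in parallel on blocks of a file stored in HDFS, and the shuffle moves data between machines; the logic per key stays this simple.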

What is Ancestry’s Big Data?
• Family Trees: 44 million+
• Profiles on the Family Trees: 4 billion+
• Records Attached to Family Trees: 2 billion+
• Photographs, Scanned Documents and Written Stories:
185 million+
• Total # of Records: 11 billion+
• Total # of Titles: 30,000
• Total # of DNA Samples: 100,000+
• 4 Petabytes of structured and unstructured data

DNA Matching
• Framing the Problem
– 46 chromosomes, 23 pairs, 3 billion+ base pairs, AGTC
– 99.9% of human DNA is shared
– Academic programs work at small scale (GermLine)
– New samples are matched against all previous samples
• Use Big Data technologies to create a scalable matching platform
– Hadoop and MapReduce
– HBase
http://www1.cs.columbia.edu/~gusev/germline/

DNA Matching
• Raw Data
3 123456789_RZZZZ2_XXXXXXH3Q7U7Q2B_YYYY84598-DNA 0 0 0 -9 C C G G G G G G A A A A C C G G A
A A A C C G G G G A A G G G A A A G G A G A A C C A A A A G G A A A G G G G G C C G G A A G G G G G G
G A A A A C G A A A A G A G A A A A G G G G G G A G G G G G G G … (continues for 700,000+ SNPs)
• Map File
0 rs10005853 0 0
0 rs10015934 0 0
0 rs1004236 0 0
0 rs10059646 0 0
0 rs10085382 0 0
0 rs10123921 0 0
0 rs10127827 0 0
0 rs10155688 0 0
0 rs10162780 0 0
0 rs1017484 0 0
0 rs10188129 0 0
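The raw data and map file shown above look like the text PED/MAP layout used by the PLINK toolset: six leading columns per sample followed by two alleles per marker, and four columns per marker in the map file. That reading is an assumption, not something the slide states. Under that assumption, a minimal parsing sketch could look like this (the function names and the `SAMPLE_ID` placeholder are hypothetical):

```python
def parse_map_line(line):
    """Parse one map-file row: chromosome, SNP id, genetic distance, position."""
    chromosome, snp_id, genetic_distance, position = line.split()
    return {
        "chromosome": chromosome,
        "snp_id": snp_id,
        "genetic_distance": float(genetic_distance),
        "position": int(position),
    }


def parse_ped_line(line):
    """Parse one raw-data row: six header fields, then allele pairs."""
    fields = line.split()
    header, alleles = fields[:6], fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two alleles per marker")
    return {
        "family_id": header[0],
        "sample_id": header[1],
        "father_id": header[2],
        "mother_id": header[3],
        "sex": header[4],
        "phenotype": header[5],
        # One (allele, allele) tuple per marker, in map-file order
        "genotypes": list(zip(alleles[0::2], alleles[1::2])),
    }


sample = parse_ped_line("3 SAMPLE_ID 0 0 0 -9 C C G G G G A A")
markers = [parse_map_line(row) for row in ["0 rs10005853 0 0", "0 rs10015934 0 0"]]
print(sample["genotypes"])
print([m["snp_id"] for m in markers])
```

The n-th genotype pair in a sample row corresponds to the n-th row of the map file, which is why the two files are always used together.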

DNA Matching
• Walk through an example of how the Open Source
algorithm Germline works
– http://www1.cs.columbia.edu/~gusev/germline/
• Highly simplified
– Simple, small example
– Uses names from Battlestar Galactica
• Two key steps when comparing two samples
– Initial “word match”
– Second “fuzzy logic” step
• Worst case
– Identical twins or two samples from the same user
Kara Thrace
Admiral Adama
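The two steps named above can be sketched in a few lines of Python. This is a heavily simplified illustration in the spirit of the talk, not the actual Germline implementation: the word size, the mismatch tolerance, the toy sequences, and the function names are all assumptions made for the example.

```python
WORD_SIZE = 4        # markers per word (hypothetical, sized for the toy data)
MAX_MISMATCHES = 1   # tolerance per word in the fuzzy step (hypothetical)


def split_words(sequence, size=WORD_SIZE):
    """Cut a sequence into consecutive fixed-size words."""
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, size)]


def mismatches(word_a, word_b):
    return sum(a != b for a, b in zip(word_a, word_b))


def match_segments(seq_a, seq_b):
    words_a, words_b = split_words(seq_a), split_words(seq_b)
    n = min(len(words_a), len(words_b))
    # Step 1, "word match": positions where both samples have the same word
    seeds = [i for i in range(n) if words_a[i] == words_b[i]]
    segments, covered = [], set()
    for seed in seeds:
        if seed in covered:
            continue
        start = end = seed
        # Step 2, "fuzzy" step: grow the match over nearly identical neighbours
        while start > 0 and mismatches(words_a[start - 1], words_b[start - 1]) <= MAX_MISMATCHES:
            start -= 1
        while end + 1 < n and mismatches(words_a[end + 1], words_b[end + 1]) <= MAX_MISMATCHES:
            end += 1
        covered.update(range(start, end + 1))
        segments.append((start * WORD_SIZE, (end + 1) * WORD_SIZE))
    return segments


kara = "ACGTACGTTTGACCAGGATC"   # made-up toy sequence
adama = "ACGTACCTTTGAGGTTGATC"  # made-up toy sequence
print(match_segments(kara, adama))
print(match_segments(kara, kara))
```

Comparing a sample with itself, as in the second call, is the worst case mentioned above: every word is a seed and the whole sequence comes back as a single segment.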

DNA Matching with Big Data
• Map phase “adds words” to the HBase table for each new
sample and saves data to a “fuzzy match” table
• Reduce phase uses those tables to create segment
matches
• Matched Segments are analyzed statistically to determine
relationship distance (M0 to M11)
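The map and reduce roles described above can be illustrated with an in-memory stand-in for the HBase word table. This is a rough sketch under stated assumptions: a Python dict replaces HBase, only exact word hits are tracked (the "fuzzy match" table is left out), and the class and function names are hypothetical.

```python
from collections import defaultdict

WORD_SIZE = 4  # markers per word (hypothetical)


class WordTable:
    """Dict stand-in for the word table: (word index, word) -> sample ids."""

    def __init__(self):
        self.rows = defaultdict(list)

    def add_sample(self, sample_id, sequence):
        """Map step: add the new sample's words, return hits on earlier samples."""
        hits = []  # (other sample id, word index)
        for index in range(len(sequence) // WORD_SIZE):
            word = sequence[index * WORD_SIZE:(index + 1) * WORD_SIZE]
            row = self.rows[(index, word)]
            hits.extend((other, index) for other in row)
            row.append(sample_id)
        return hits


def reduce_hits(hits):
    """Reduce step: merge consecutive word hits per sample into segments."""
    by_sample = defaultdict(list)
    for other, index in hits:
        by_sample[other].append(index)
    segments = {}
    for other, indexes in by_sample.items():
        runs = []
        for index in sorted(indexes):
            if runs and index == runs[-1][1] + 1:
                runs[-1][1] = index          # extend the current run
            else:
                runs.append([index, index])  # start a new run
        segments[other] = [(s * WORD_SIZE, (e + 1) * WORD_SIZE) for s, e in runs]
    return segments


table = WordTable()
first_hits = table.add_sample("adama", "ACGTACCTTTGAGGTTGATC")
hits = table.add_sample("kara", "ACGTACGTTTGACCAGGATC")
print(reduce_hits(hits))
```

The point of the table is that a new sample is only compared against rows it actually shares a word with, so adding one sample does not require rescanning every earlier sample in full.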

DNA Matching – Big Data Results
Chart: Projected Process vs. Big Data Process (annotation: Introduced Big Data Hadoop Matching Process)

Machine Learning – Automated Content Pipeline
• OCR to extract the text from an image
• Use machine learning to categorize the
words and phrases to extract names, dates,
place, relationships, and other information
• Natural language processing using
supervised machine learning
– Use a set of data that is annotated for learning
– Validation set to test
– Run against the extracted text from a collection
• Result is a fully automated content delivery
process
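The supervised workflow above (annotated training data, a validation set, then a run over extracted text) can be shown with a deliberately tiny stand-in. The classifier here is a majority-label-per-feature lookup, not the natural language processing models used in the pipeline, and the feature rules, labels, and example tokens are all made up for illustration.

```python
from collections import Counter, defaultdict


def feature(token):
    """One hand-picked feature per token (hypothetical rules)."""
    if token.isdigit() and len(token) == 4:
        return "four_digits"
    if token[:1].isupper():
        return "capitalized"
    return "lowercase"


def train(annotated):
    """Learn the most common label for each feature value."""
    counts = defaultdict(Counter)
    for token, label in annotated:
        counts[feature(token)][label] += 1
    return {feat: labels.most_common(1)[0][0] for feat, labels in counts.items()}


def predict(model, token):
    return model.get(feature(token), "OTHER")


def accuracy(model, annotated):
    correct = sum(predict(model, token) == label for token, label in annotated)
    return correct / len(annotated)


training_set = [("John", "NAME"), ("Mary", "NAME"), ("Smith", "NAME"),
                ("Ohio", "PLACE"), ("1850", "DATE"), ("1912", "DATE"),
                ("born", "OTHER"), ("in", "OTHER")]
validation_set = [("William", "NAME"), ("1880", "DATE"),
                  ("married", "OTHER"), ("Utah", "PLACE")]

model = train(training_set)
score = accuracy(model, validation_set)
labels = [(token, predict(model, token)) for token in "Thomas born 1843".split()]
print(score, labels)
```

The validation set exposes the weakness of such a crude feature: capitalized place names are labeled as names, which is the kind of error a held-out set is meant to surface before running against a full collection.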

Data Mining: Tree Sizes
Chart: x-axis is tree size, y-axis is the number of trees of that size
There are more small trees than large trees, but the system also has
extremely large trees (size > 500,000)
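The distribution described above is a histogram of tree sizes. A minimal sketch of how such a histogram could be computed is shown below; the input sizes are invented for the example and the function name is hypothetical. At production scale this count would be a natural fit for a MapReduce job keyed on tree size.

```python
from collections import Counter


def tree_size_histogram(tree_sizes):
    """Return (tree size, number of trees of that size) pairs, smallest first."""
    return sorted(Counter(tree_sizes).items())


# Made-up sizes: many small trees and one extremely large one
sizes = [1, 1, 1, 2, 2, 5, 5, 5, 5, 120, 500_001]
histogram = tree_size_histogram(sizes)
largest = max(sizes)
print(histogram)
print(largest)
```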

Data Mining: What do trees look like?

Final Thoughts/Questions
• Just three examples of Big Data processing and
technologies in use at Ancestry
– Hinting
– Search
– Trees
– More…
• Big Data technologies to watch
– Hadoop and HDFS move from batch to real-time processing
– Impala and Apache Drill
• Questions?
