DNA
Big Data at Ancestry
Bill Yetman, Sr. Director Commerce, Data, Analytics
March 22, 2013
Agenda
• Introduction
• Understanding Big Data Scale
• Big Data Story – How it works
• What is Big Data?
• Ancestry’s Big Data
• Three Big Data Examples
– DNA Matching
– Machine Learning
– Data Mining
Understanding Big Data Scale
Dollar Amounts                                Data Sizes
$1                                            Byte
$1,000 (Thousands)                            Kilobyte (KB: 10^3)
$1,000,000 (Millions)                         Megabyte (MB: 10^6)
$1,000,000,000 (Billions)                     Gigabyte (GB: 10^9)
$1,000,000,000,000 (Trillions)                Terabyte (TB: 10^12)
$1,000,000,000,000,000 (Quadrillion)          Petabyte (PB: 10^15)
$1,000,000,000,000,000,000 (Quintillion)      Exabyte (EB: 10^18)
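To make the scale table concrete, here is a minimal sketch that renders a raw byte count in the decimal units listed above. The function name `human_size` and the unit abbreviations are illustrative choices, not something taken from the talk.

```python
# Decimal (SI) units from the table: each step up is a factor of 1,000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]


def human_size(num_bytes: int) -> str:
    """Render a byte count using the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


print(human_size(1_500))        # 1.5 KB
print(human_size(4 * 10**15))   # 4 PB
```

The sketch uses powers of 1,000 to match the table; operating systems often report sizes in powers of 1,024 instead.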
• Before the bubble, $100 million used to be big; now $1 trillion is routine
• Data has gone through a similar trend
"640K ought to be enough for anybody."
Attributed to Bill Gates, 1981 (Gates has denied saying it)
Big Data at Work
Bill’s Facebook Story
– Signed up in early 2007
– Must have been a visionary
– Not really…
1 Friend for 2 ½ Years
Daughter joined in 2010
…
• Back in 2007, Facebook did not care!
• How Big Data has changed FB …
Big Data at Work (cont’d)
What happens today…
– Fill out your profile and that data is used to hook you into the “social graph”
– FB wants you to make 10 friend requests within the first 5 minutes of signing up
– Within 15 minutes you understand intrinsically how the service works
– Social graph (a huge dataset) is used to engage the individual user
What is Big Data?
• Volume (amount of data)
– Exceeds the processing capacity of conventional database systems
• Velocity (speed of data in/out)
– Needs a high-performance data pipeline and massively parallel processing
• Variety (range of data types and sources)
– Fluid data requirements where up-front schema design is a poor fit
Big Data Characteristics: Volume, Variety, Velocity
Big Data: Why now?
• The simplified version of Moore’s Law states that the overall processing power of computers will double every two years. (The original observation concerned the number of transistors on a chip.)
– Intel co-founder Gordon Moore
• Over the last 30 years, space per unit cost has doubled roughly every 14 months
(increasing by an order of magnitude every 48 months).*
– In early 2010, terabyte drives were available at a cost of roughly $100 per TB, or about $0.10 per GB
* http://www.mkomo.com/cost-per-gigabyte
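The two storage figures on this slide (doubling every 14 months, an order of magnitude every 48 months) can be checked against each other with a little arithmetic. This is a sketch of that check; the function names are illustrative.

```python
import math

DOUBLING_MONTHS = 14  # doubling period quoted on the slide


def growth_factor(months: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """How many times larger capacity per dollar becomes after `months`."""
    return 2 ** (months / doubling_months)


def months_to_grow(factor: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed for capacity per dollar to grow by `factor`."""
    return doubling_months * math.log2(factor)


print(round(months_to_grow(10), 1))  # months for a 10x increase
print(round(growth_factor(48), 1))   # growth over 48 months
```

A 14-month doubling period gives a tenfold increase in about 46.5 months, so "every 48 months" is a reasonable rounding.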
Big Data Technologies
What is Hadoop?
1. Hadoop* is an open-source platform for processing large amounts of data in a scalable, fault-tolerant, affordable fashion
2. Hadoop specifies a distributed file system called HDFS
3. Hadoop supports a processing methodology known as MapReduce
4. Many tools are built on top of Hadoop, such as HBase, Hive, and Flume
* http://en.wikipedia.org/wiki/Apache_Hadoop
4 Hadoop definition points are from a presentation by Jeremy Pollack, Ancestry.com 3/2/2013
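For readers new to MapReduce, the processing model in point 3 can be sketched in plain Python. This is not the Hadoop API; it only imitates the three stages (map, shuffle, reduce) in a single process, using the classic word-count example. All function names are illustrative.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (key, value) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all values by key (Hadoop does this between the phases)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "big trees"]))
```

In a real cluster the map and reduce calls run in parallel on different machines, with the input split stored in HDFS; the single-process version above only shows the data flow.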
What is Ancestry’s Big Data?
• Family Trees: 44 million+
• Profiles on the Family Trees: 4 billion+
• Records Attached to Family Trees: 2 billion+
• Photographs, Scanned Documents and Written Stories:
185 million+
• Total # of Records: 11 billion+
• Total # of Titles: 30,000
• Total # of DNA Samples: 100,000+
• 4 Petabytes of structured and unstructured data
DNA Matching
• Framing the Problem
– 46 chromosomes, 23 pairs, 3 billion+ base pairs, AGTC
– 99.9% of human DNA is shared
– Academic programs work at small scale (GermLine)
– New samples are matched against all previous samples
• Use Big Data technologies to create a scalable matching platform
– Hadoop and MapReduce
– HBase
http://www1.cs.columbia.edu/~gusev/germline/
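Matching every new sample against all previous samples is what makes scale a problem: the total number of pairwise comparisons grows with the square of the sample count. A small sketch of that arithmetic (the sample counts in the loop are arbitrary illustrations, not Ancestry figures):

```python
def comparisons_for_new_sample(existing_samples: int) -> int:
    """A new sample is compared once against every sample already stored."""
    return existing_samples


def total_comparisons(n: int) -> int:
    """Total distinct pairs among n samples: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} samples -> {total_comparisons(n):,} pairwise comparisons")
```

Growing the database tenfold multiplies the total work by roughly one hundred, which is the motivation for moving the matching onto a horizontally scalable platform.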
DNA Matching
• Raw Data
3 123456789_RZZZZ2_XXXXXXH3Q7U7Q2B_YYYY84598-DNA 0 0 0 -9 C C G G G G G G A A A A C C G G A
A A A C C G G G G A A G G G A A A G G A G A A C C A A A A G G A A A G G G G G C C G G A A G G G G G G
G A A A A C G A A A A G A G A A A A G G G G G G A G G G G G G G … (continues for 700,000+ SNPs)
• Map File
0 rs10005853 0 0
0 rs10015934 0 0
0 rs1004236 0 0
0 rs10059646 0 0
0 rs10085382 0 0
0 rs10123921 0 0
0 rs10127827 0 0
0 rs10155688 0 0
0 rs10162780 0 0
0 rs1017484 0 0
0 rs10188129 0 0
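The raw data and map file above resemble the PLINK PED/MAP text layout: six leading columns (family ID, sample ID, father, mother, sex, phenotype) followed by two alleles per SNP, and a map file with one SNP per row. That reading is an assumption, as are the class and function names in this parsing sketch.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    family_id: str
    sample_id: str
    genotypes: list[tuple[str, str]]  # one (allele, allele) pair per SNP


def parse_ped_line(line: str) -> Sample:
    """Parse one raw-data row, assuming a PLINK-style PED layout."""
    fields = line.split()
    family_id, sample_id = fields[0], fields[1]
    alleles = fields[6:]  # skip father, mother, sex, and phenotype columns
    if len(alleles) % 2:
        raise ValueError("expected an even number of alleles")
    pairs = list(zip(alleles[0::2], alleles[1::2]))
    return Sample(family_id, sample_id, pairs)


def parse_map_line(line: str) -> str:
    """Return the SNP identifier from one map-file row (assumed 4 columns)."""
    chromosome, snp_id, genetic_distance, position = line.split()
    return snp_id


sample = parse_ped_line("3 SAMPLE-1 0 0 0 -9 C C G G A G")
print(sample.sample_id, sample.genotypes)
print(parse_map_line("0 rs10005853 0 0"))
```

The Nth allele pair in the raw data corresponds to the Nth row of the map file, which is how each genotype is tied back to a named SNP.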
DNA Matching
• Walk through an example of how the Open Source
algorithm Germline works
– http://www1.cs.columbia.edu/~gusev/germline/
• Highly simplified
– Simple, small example
– Uses names from Battlestar Galactica
• Two key steps when comparing two samples
– Initial “word match”
– Second “fuzzy logic” step
• Worst case
– Identical twins or two samples from the same user
Example samples: Kara Thrace, Admiral Adama
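The two steps named above can be illustrated with a toy comparison of two samples. This is a heavily simplified sketch and not GERMLINE's actual code: each sample is reduced to one letter per marker, the word size is arbitrary, and the "fuzzy" rule (admit a neighbouring word with at most one mismatch) is an assumption made for illustration.

```python
WORD_SIZE = 4  # hypothetical; kept small so the example is readable


def split_words(markers: str, size: int = WORD_SIZE) -> list[str]:
    """Cut a marker string into consecutive fixed-size words."""
    return [markers[i:i + size] for i in range(0, len(markers), size)]


def mismatches(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def matching_segments(s1: str, s2: str, max_mismatch: int = 1) -> list[tuple[int, int]]:
    """Return (first_word, last_word) index ranges shared by two samples.

    Step 1 ("word match"): words that are identical at the same position.
    Step 2 ("fuzzy"): neighbouring words with at most `max_mismatch`
    differing markers extend the run. A run is kept only if it contains
    at least one exact word match.
    """
    w1, w2 = split_words(s1), split_words(s2)
    n = min(len(w1), len(w2))
    exact = [w1[i] == w2[i] for i in range(n)]
    close = [mismatches(w1[i], w2[i]) <= max_mismatch for i in range(n)]
    segments = []
    i = 0
    while i < n:
        if not close[i]:
            i += 1
            continue
        j = i
        while j + 1 < n and close[j + 1]:
            j += 1
        if any(exact[i:j + 1]):
            segments.append((i, j))
        i = j + 1
    return segments


kara = "AAAACCCCGGGGTTTT"
adama = "AAAACCCTGGGGACAC"
print(matching_segments(kara, adama))                  # fuzzy step bridges word 1
print(matching_segments(kara, adama, max_mismatch=0))  # exact words only
print(matching_segments(kara, kara))                   # worst case: identical samples
```

The identical-sample case shows why twins or duplicate submissions are the worst case: every word matches, so the whole genome becomes one long segment.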
DNA Matching with Big Data
• Map phase “adds words” to the HBase table for each new sample and saves data to a “fuzzy match” table
• Reduce phase uses those tables to create segment matches
• Matched Segments are analyzed statistically to determine relationship distance (M0 to M11)
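The incremental flow described above can be sketched with an in-memory dictionary standing in for the HBase word table. This is an illustration under stated assumptions, not Ancestry's implementation: the table layout, the class `WordTable`, the word size, and the omission of the "fuzzy match" table are all simplifications.

```python
from collections import defaultdict

WORD_SIZE = 4  # hypothetical word length


class WordTable:
    """In-memory stand-in for the word table: (position, word) -> sample ids."""

    def __init__(self) -> None:
        self.rows: dict[tuple[int, str], list[str]] = defaultdict(list)

    def add_sample(self, sample_id: str, markers: str) -> dict[str, list[int]]:
        """'Map' step: store the new sample's words and return, for each
        earlier sample, the word positions it shares with the new one."""
        hits: dict[str, list[int]] = defaultdict(list)
        for start in range(0, len(markers), WORD_SIZE):
            key = (start // WORD_SIZE, markers[start:start + WORD_SIZE])
            for other in self.rows[key]:
                hits[other].append(key[0])
            self.rows[key].append(sample_id)
        return dict(hits)


def to_segments(positions: list[int]) -> list[tuple[int, int]]:
    """'Reduce' step: merge consecutive word positions into segments."""
    segments: list[tuple[int, int]] = []
    for p in sorted(positions):
        if segments and p == segments[-1][1] + 1:
            segments[-1] = (segments[-1][0], p)
        else:
            segments.append((p, p))
    return segments


table = WordTable()
table.add_sample("a", "AAAACCCCGGGGTTTT")
hits = table.add_sample("b", "AAAACCCCGGGTTTTT")
print({other: to_segments(pos) for other, pos in hits.items()})
```

The point of the design is that a new sample only looks up its own words in the table; it never has to be re-compared marker by marker against every stored sample.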
DNA Matching – Big Data Results
[Chart: Projected Process vs. Big Data Process, annotated where the Big Data Hadoop matching process was introduced]
Machine Learning – Automated Content Pipeline
• OCR to extract the text from an image
• Use machine learning to categorize the words and phrases to extract names, dates, places, relationships, and other information
• Natural language processing using supervised machine learning
– Use a set of data that is annotated for learning
– Validation set to test
– Run against the extracted text from a collection
• Result is a fully automated content delivery process
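The supervised workflow above (learn from annotated data, check against a validation set, then run on new text) can be shown with a deliberately tiny token tagger. Everything here is a made-up illustration: the annotated tokens, the label names, and the shape-based feature are assumptions, and a production pipeline would use a real NLP model.

```python
from collections import Counter, defaultdict

# Hand-made annotated tokens (hypothetical data and labels).
ANNOTATED = [
    ("John", "NAME"), ("Smith", "NAME"), ("born", "OTHER"), ("1850", "DATE"),
    ("in", "OTHER"), ("Ohio", "PLACE"), ("Mary", "NAME"), ("Smith", "NAME"),
    ("died", "OTHER"), ("1902", "DATE"), ("in", "OTHER"), ("Utah", "PLACE"),
]


def feature(token: str) -> str:
    """Reduce a token to a coarse shape feature."""
    if token.isdigit():
        return "digits"
    if token[:1].isupper():
        return "capitalized"
    return "lowercase"


def train(examples: list[tuple[str, str]]) -> dict[str, str]:
    """Learn the most common label for each feature value."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for token, label in examples:
        counts[feature(token)][label] += 1
    return {feat: c.most_common(1)[0][0] for feat, c in counts.items()}


def predict(model: dict[str, str], token: str) -> str:
    return model.get(feature(token), "OTHER")


def accuracy(model: dict[str, str], examples: list[tuple[str, str]]) -> float:
    correct = sum(predict(model, tok) == label for tok, label in examples)
    return correct / len(examples)


training, validation = ANNOTATED[:8], ANNOTATED[8:]
model = train(training)
print(accuracy(model, validation))             # validation set tests the model
print([predict(model, t) for t in "Ann wed 1877".split()])  # run on new text
```

The validation score is below 1.0 because this crude feature cannot tell a place from a name, which is exactly the kind of weakness a held-out validation set is meant to expose before the model runs over a whole collection.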
Data Mining: Tree Sizes
[Chart: x-axis: tree size; y-axis: number of trees of that size]
There are more small trees than large trees, but the system also holds extremely large trees, with sizes above 500,000
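The chart described above is a frequency distribution: for each tree size, how many trees have that size. A minimal sketch of computing it follows; the sizes in the sample list are invented for illustration and are not Ancestry data.

```python
from collections import Counter


def tree_size_histogram(tree_sizes: list[int]) -> Counter:
    """Map each tree size to the number of trees of that size."""
    return Counter(tree_sizes)


# Invented sample: many small trees and one extremely large one.
sizes = [1, 1, 1, 2, 2, 5, 5, 40, 600_000]
hist = tree_size_histogram(sizes)
for size, count in sorted(hist.items()):
    print(f"size {size:>7,}: {count} tree(s)")
```

With a long tail like this, plotting both axes on a logarithmic scale is usually the only way to see the small and the very large trees on one chart.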
Data Mining: What do trees look like?
Final Thoughts/Questions
• These are just three examples of Big Data processing and technologies in use at Ancestry; others include:
– Hinting
– Search
– Trees
– More…
• Big Data technologies to watch
– Hadoop and HDFS move from batch to real-time processing
– Impala and Apache Drill
• Questions?

Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animations

  • 1. DNA Big Data at Ancestry Bill Yetman, Sr. Director Commerce, Data, Analytics March 22, 2013
  • 2. Agenda • Introduction • Understanding Big Data Scale • Big Data Story – How it works • What is Big Data? • Ancestry’s Big Data • Three Big Data Examples – DNA Matching – Machine Learning – Data Mining
  • 3. Understanding Big Data Scale
      Dollar Amounts                              Data Sizes
      $1                                          Byte
      $1,000 (Thousands)                          Kilobyte (KB: 10^3)
      $1,000,000 (Millions)                       Megabyte (MB: 10^6)
      $1,000,000,000 (Billions)                   Gigabyte (GB: 10^9)
      $1,000,000,000,000 (Trillions)              Terabyte (TB: 10^12)
      $1,000,000,000,000,000 (Quadrillions)       Petabyte (PB: 10^15)
      $1,000,000,000,000,000,000 (Quintillions)   Exabyte (EB: 10^18)
      • Before the bubble, $100 million used to be big; now $1 trillion is routine
      • Data has gone through a similar trend
      "640K ought to be enough for anybody." (attributed to Bill Gates, 1981)
  • 4. Big Data at Work: Bill’s Facebook Story – Signed up in early 2007 – Must have been a visionary – Not really… 1 friend for 2½ years; daughter joined in 2010 … • Back in 2007, Facebook did not care! • How Big Data has changed FB …
  • 5. Big Data at Work (cont’d) What happens today… – Fill out your profile and that data is used to hook you into the “social graph” – FB wants you to make 10 friend requests within the first 5 minutes of signing up – Within 15 minutes you understand intrinsically how the service works – Social graph (a huge dataset) is used to engage the individual user
  • 6. What is Big Data? Big Data Characteristics: Volume, Variety, Velocity • Volume (amount of data) – Exceeds the processing capacity of conventional database systems • Velocity (speed of data in/out) – Needs a high-performance data pipeline and massively parallel processing • Variety (range of data types and sources) – Fluid data requirements where up-front schema design is a poor fit
  • 7. Big Data: Why now? • The simplified version of Moore’s Law states that processor speed, or overall processing power for computers, will double every two years. – Intel co-founder Gordon Moore • Over the last 30 years, space per unit cost has doubled roughly every 14 months (increasing by an order of magnitude every 48 months).* – In early 2010, Terabyte drives were introduced at a cost of roughly $100 per TB, or $0.10 per GB * http://www.mkomo.com/cost-per-gigabyte
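The two growth figures on this slide can be checked against each other: if capacity per dollar doubles every 14 months, a tenfold increase takes 14 × log2(10) months, which is about 46.5 months and consistent with the slide's rounded "every 48 months". A minimal sketch of that arithmetic (the function name is illustrative, not from the talk):

```python
import math


def months_to_grow(factor: float, doubling_months: float) -> float:
    """Months needed to grow by `factor` when capacity doubles every `doubling_months`."""
    return doubling_months * math.log2(factor)


# Tenfold growth at a 14-month doubling period.
tenfold = months_to_grow(10, 14)
print(f"{tenfold:.1f} months for an order of magnitude")
```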
  • 8. Big Data Technologies: What is Hadoop? 1. Hadoop* is an open-source platform for processing large amounts of data in a scalable, fault-tolerant, affordable fashion 2. Hadoop specifies a distributed file system called HDFS 3. Hadoop supports a processing methodology known as MapReduce 4. Many tools are built on top of Hadoop, such as HBase, Hive, and Flume * http://en.wikipedia.org/wiki/Apache_Hadoop (The 4 Hadoop definition points are from a presentation by Jeremy Pollack, Ancestry.com, 3/2/2013)
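To make the MapReduce idea in point 3 concrete, here is a minimal in-memory sketch of the classic word-count job. It is plain Python, not the Hadoop API: the map, shuffle, and reduce steps that Hadoop would distribute across a cluster are ordinary functions here, and all names are illustrative.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data at ancestry", "big data scale"])
print(counts)
```

Because each `map_phase` call looks at one line only and each `reduce_phase` call looks at one key only, both steps can run in parallel on different machines, which is what makes the approach scale.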
  • 9. What is Ancestry’s Big Data? • Family Trees: 44 million+ • Profiles on the Family Trees: 4 billion+ • Records Attached to Family Trees: 2 billion+ • Photographs, Scanned Documents and Written Stories: 185 million+ • Total # of Records: 11 billion+ • Total # of Titles: 30,000 • Total # of DNA Samples: 100,000+ • 4 Petabytes of structured and unstructured data
  • 10. DNA Matching • Framing the Problem – 46 chromosomes, 23 pairs, 3 billion+ base pairs, AGTC – 99.9% of human DNA is shared – Academic programs work at small scale (GermLine, http://www1.cs.columbia.edu/~gusev/germline/) – New samples are matched against all previous samples • Use Big Data technologies to create a scalable matching platform – Hadoop and MapReduce – HBase
  • 11. DNA Matching
      • Raw Data
        3 123456789_RZZZZ2_XXXXXXH3Q7U7Q2B_YYYY84598-DNA 0 0 0 -9 C C G G G G G G A A A A C C G G A A A A C C G G G G A A G G G A A A G G A G A A C C A A A A G G A A A G G G G G C C G G A A G G G G G G G A A A A C G A A A A G A G A A A A G G G G G G A G G G G G G G … (continues for 700,000+ SNPs)
      • Map File
        0 rs10005853 0 0
        0 rs10015934 0 0
        0 rs1004236 0 0
        0 rs10059646 0 0
        0 rs10085382 0 0
        0 rs10123921 0 0
        0 rs10127827 0 0
        0 rs10155688 0 0
        0 rs10162780 0 0
        0 rs1017484 0 0
        0 rs10188129 0 0
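The raw data and map file on this slide resemble the PLINK text formats (.ped and .map). That is an assumption based on the column layout, since the slide does not name the format. Under that assumption, a raw-data line has six leading columns (family ID, sample ID, father, mother, sex, phenotype) followed by allele pairs, and each map line has four columns with the marker ID in the second. A minimal parsing sketch with hypothetical names and a shortened sample line:

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    genotypes: list[tuple[str, str]]  # one allele pair per marker


def parse_ped_line(line: str) -> Sample:
    """Parse one raw-data line, assuming the PLINK .ped column layout."""
    fields = line.split()
    sample_id, alleles = fields[1], fields[6:]
    if len(alleles) % 2:
        raise ValueError("allele columns must come in pairs")
    # Pair up consecutive alleles: (a1, a2), (a3, a4), ...
    return Sample(sample_id, list(zip(alleles[0::2], alleles[1::2])))


def parse_map_line(line: str) -> str:
    """Return the marker ID (second column), assuming the PLINK .map layout."""
    return line.split()[1]


# Shortened, made-up sample line; the marker IDs are taken from the slide.
sample = parse_ped_line("3 SAMPLE-1 0 0 0 -9 C C G G A G")
markers = [parse_map_line(row) for row in
           ["0 rs10005853 0 0", "0 rs10015934 0 0", "0 rs1004236 0 0"]]
by_marker = dict(zip(markers, sample.genotypes))
print(by_marker)
```

The map file supplies the column order for the genotype data, so the n-th allele pair in a raw-data line belongs to the n-th marker in the map file.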
  • 12. DNA Matching • Walk through an example of how the open-source algorithm Germline works – http://www1.cs.columbia.edu/~gusev/germline/ • Highly simplified – Simple, small example – Uses names from Battlestar Galactica (Kara Thrace, Admiral Adama) • Two key steps when comparing two samples – Initial “word match” – Second “fuzzy logic” step • Worst case – Identical twins or two samples from the same user
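The two steps named on this slide can be sketched in a few lines. This is a highly simplified illustration, not the Germline implementation: it treats each sample as a single string of equal length, cuts it into fixed-size words, uses exact word matches as seeds, and then extends each seed across neighboring words that differ in at most `tolerance` positions. The word size, the tolerance rule, and the sequences are all assumptions made for the example.

```python
def split_words(seq: str, size: int) -> list[str]:
    """Cut a marker string into fixed-size, non-overlapping words."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def mismatches(a: str, b: str) -> int:
    """Count positions where two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def match_segments(a: str, b: str, size: int = 4,
                   tolerance: int = 1) -> list[tuple[int, int]]:
    """Return matched segments as inclusive (first_word, last_word) index pairs."""
    words_a, words_b = split_words(a, size), split_words(b, size)
    n = min(len(words_a), len(words_b))
    segments = []
    i = 0
    while i < n:
        # Step 1: "word match" - only an exact word match can seed a segment.
        if words_a[i] != words_b[i]:
            i += 1
            continue
        # Step 2: "fuzzy" extension - grow the seed left and right across
        # neighboring words that are within the mismatch tolerance.
        start = end = i
        while start > 0 and mismatches(words_a[start - 1], words_b[start - 1]) <= tolerance:
            start -= 1
        while end + 1 < n and mismatches(words_a[end + 1], words_b[end + 1]) <= tolerance:
            end += 1
        segments.append((start, end))
        i = end + 1
    return segments


# Made-up sequences; the names follow the slide's Battlestar Galactica theme.
thrace = "AAAACCCCGGGGTTTT"
adama = "AAATCCCCGGAATTTT"
print(match_segments(thrace, adama))
print(match_segments(thrace, thrace))
```

Comparing a sample with itself yields one segment covering every word, which is the "identical twins or two samples from the same user" worst case: nothing can be skipped.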
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21. DNA Matching with Big Data • Map phase “adds words” to the HBase table for each new sample and saves data to a “fuzzy match” table • Reduce phase uses those tables to create segment matches • Matched segments are analyzed statistically to determine relationship distance (M0 to M11)
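A rough sketch of why this design scales: each new sample is looked up in an index of words that earlier samples have already contributed, instead of being compared pairwise against every stored sample. The slide does not describe the actual HBase schema, so the code below uses a plain dictionary as a stand-in table keyed by (word position, word), and it leaves out the "fuzzy match" table entirely. All names are hypothetical.

```python
from collections import defaultdict

# Stand-in for the HBase word table: (word position, word) -> sample IDs.
WordIndex = dict[tuple[int, str], set[str]]


def map_sample(index: WordIndex, sample_id: str, seq: str,
               size: int = 4) -> dict[str, list[int]]:
    """Add a new sample's words to the index; return word hits per earlier sample."""
    hits: dict[str, list[int]] = defaultdict(list)
    for pos in range(0, len(seq) - size + 1, size):
        key = (pos // size, seq[pos:pos + size])
        for other in index.get(key, ()):
            if other != sample_id:
                hits[other].append(pos // size)
        index.setdefault(key, set()).add(sample_id)
    return dict(hits)


def reduce_hits(positions: list[int]) -> list[tuple[int, int]]:
    """Merge consecutive word positions into inclusive (start, end) segments."""
    segments: list[tuple[int, int]] = []
    for p in sorted(positions):
        if segments and p == segments[-1][1] + 1:
            segments[-1] = (segments[-1][0], p)
        else:
            segments.append((p, p))
    return segments


index: WordIndex = {}
first = map_sample(index, "adama", "ACGTACGTTTTTACGT")    # nothing to match yet
second = map_sample(index, "thrace", "ACGTACGTGGGGACGT")  # matched against adama
segments = {other: reduce_hits(pos) for other, pos in second.items()}
print(segments)
```

The cost of adding a sample depends on the number of words it contains and the samples that share them, not on a full comparison against every sample already stored.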
  • 22. DNA Matching – Big Data Results (chart: Projected Process vs. Big Data Process; annotation: Introduced Big Data Hadoop Matching Process)
  • 23. Machine Learning – Automated Content Pipeline • OCR to extract the text from an image • Use machine learning to categorize the words and phrases to extract names, dates, places, relationships, and other information • Natural language processing using supervised machine learning – Use a set of data that is annotated for learning – Validation set to test – Run against the extracted text from a collection • Result is a fully automated content delivery process
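The supervised workflow on this slide (learn from annotated data, test on a validation set, then run on new text) can be illustrated with a toy token classifier. This is a sketch under strong assumptions, not Ancestry's pipeline: the "model" simply memorizes the most common label for each (previous word, token shape) feature, and the labels and examples are invented.

```python
from collections import Counter, defaultdict

Example = tuple[str, str, str]  # (previous word, token, label)


def shape(token: str) -> str:
    """Coarse token shape used as a feature."""
    if token.isdigit():
        return "digits"
    return "capitalized" if token[:1].isupper() else "lower"


def feature(prev: str, token: str) -> tuple[str, str]:
    return prev.lower(), shape(token)


def train(examples: list[Example]) -> dict[tuple[str, str], str]:
    """Learn the most frequent label for each feature in the annotated set."""
    counts: dict[tuple[str, str], Counter] = defaultdict(Counter)
    for prev, token, label in examples:
        counts[feature(prev, token)][label] += 1
    return {feat: c.most_common(1)[0][0] for feat, c in counts.items()}


def predict(model: dict, prev: str, token: str, default: str = "OTHER") -> str:
    return model.get(feature(prev, token), default)


def accuracy(model: dict, examples: list[Example]) -> float:
    correct = sum(predict(model, p, t) == label for p, t, label in examples)
    return correct / len(examples)


annotated = [("of", "John", "NAME"), ("in", "Ohio", "PLACE"), ("in", "1850", "DATE"),
             ("of", "Mary", "NAME"), ("in", "Utah", "PLACE"), ("the", "farm", "OTHER")]
validation = [("of", "Sarah", "NAME"), ("in", "Texas", "PLACE"), ("in", "1901", "DATE"),
              ("a", "house", "OTHER"), ("in", "March", "DATE")]

model = train(annotated)
score = accuracy(model, validation)
print(score, predict(model, "in", "Idaho"))
```

The validation set exposes a weakness the training data hides: "in March" is labeled as a place because the toy features cannot tell month names from place names. Catching this kind of error before running against a whole collection is what the validation step is for.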
  • 24. Data Mining: Tree Sizes (x-axis: tree size; y-axis: number of trees of a specific size) More small trees than large trees, but we also have extremely large trees in the system (> 500,000)
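The chart described here is a histogram of tree sizes. A minimal sketch of how such counts could be produced, using made-up sizes: the first function counts trees per exact size (the axes described on the slide), and the second groups sizes by order of magnitude, which is one way to keep a few extremely large trees visible next to many small ones. The bucketing choice is an assumption, not something the slide specifies.

```python
from collections import Counter


def size_histogram(tree_sizes: list[int]) -> Counter:
    """Number of trees for each exact tree size."""
    return Counter(tree_sizes)


def magnitude_buckets(tree_sizes: list[int]) -> Counter:
    """Number of trees per order of magnitude (0 -> sizes 1-9, 1 -> 10-99, ...)."""
    buckets: Counter = Counter()
    for size in tree_sizes:
        if size < 1:
            raise ValueError("tree size must be a positive integer")
        # For positive integers, the digit count minus one equals floor(log10(size)).
        buckets[len(str(size)) - 1] += 1
    return buckets


sizes = [1, 1, 2, 3, 3, 3, 40, 500_000]  # made-up example data
print(size_histogram(sizes))
print(magnitude_buckets(sizes))
```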
  • 25. Data Mining: What do trees look like?
  • 26. Data Mining: What do trees look like?
  • 27. Final Thoughts/Questions • Just three examples of Big Data processing and technologies in use at Ancestry – Hinting – Search – Trees – More… • Big Data technologies to watch – Hadoop and HDFS move from batch to real-time processing – Impala and Apache Drill • Questions?