SlideShare ist ein Scribd-Unternehmen logo
1 von 37
Awesome Big Data Algorithms




                 http://xkcd.com/1185/
Awesome Big Data
   Algorithms
                C. Titus Brown
                ctb@msu.edu
   Asst Professor, Michigan State University
(Microbiology, Computer Science, and BEACON)
Welcome!
• More of a computational scientist than a
  computer scientist; will be using simulations
  to demo & explore algorithm behavior.


• Send me questions/comments @ctitusbrown,
  or ctb@msu.edu.
“Features”
• I will be using Python rather than C++, because Python
  is easier to read.


• I will be using IPython Notebook to demo.


• I apologize in advance for not covering your favorite
  data structure or algorithm.
Outline
• The basic idea
• Three examples
   – Skip lists (a fast key/value store)
   – HyperLogLog Counting (counting discrete elements)
   – Bloom filters and CountMin Sketches

• Folding, spindling, and mutilating DNA sequence
• References and further reading
The basic idea
• Problem: you have a lot of data to count, track, or otherwise
   analyze.
• This data is Data of Unusual Size, i.e. you can’t just brute force the
   analysis.
• For example,
    – Count the approximate number of distinct elements in a very large
      (infinite?) data set
    – Optimize queries by using an efficient but approximate prefilter
    – Determine the frequency distribution of distinct elements in a very
      large data set.
Online and streaming vs. offline
          “Large is hard; infinite is much easier.”
• Offline algorithms analyze an entire data set all at
  once.
• Online algorithms analyze data serially, one piece at a
  time.
• Streaming algorithms are online algorithms that can
  be used for very memory & compute limited analysis.
Exact vs random or probabilistic
• Often an approximate answer is sufficient,
  esp if you can place bounds on how wrong the
  approximation is likely to be.
• Often random algorithms or probabilistic data
  structures can be found with good typical
  behavior but bad worst case behavior.
For one (stupid) example
You can trim 8 bits off of integers for the purpose of averaging them
Skip lists
  A randomly indexed improvement on linked lists.

Each node can belong to one or more vertical “levels”,
which allow fast search/insertion/deletion – ~O(log(n))
                       typically!




                                                          wikipedia
Skip lists
        A randomly indexed improvement on linked lists.

     Very easy to implement; asymptotically good behavior.




From reddit, “if someone held a gun to my head and asked
me to implement an efficient set/map storage, I would
implement a skip list.”

(Response: “does this happen to you a lot??”)                wikipedia
Channel randomness!

• If you can construct or rely on randomness,
  then you can easily get good typical behavior.



• Note, a good hash function is essentially the
  same as a good random number generator…
HyperLogLog cardinality counting

• Suppose you have an incoming stream of
 many, many “objects”.

• And you want to track how many distinct
 items there are, and you want to accumulate
 the count of distinct objects over time.
Relevant digression:
• Flip some unknown number of coins. Q: what is
  something simple to track that will tell you
  roughly how many coins you’ve flipped?


• A: longest run of HEADs. Long runs are very rare
  and are correlated with how many coins you’ve
  flipped.
Cardinality counting with HyperLogLog

• Essentially, use longest run of 0-bits observed
  in a hash value.
• Use multiple hash functions so that you can
  take the average.
• Take harmonic mean + low/high sampling
  adjustment => result.
Bloom filters

• A set membership data structure that is
  probabilistic but only yields false positives.

• Trivial to implement; hash function is main
  cost; extremely memory efficient.
My research applications
    Biology is fast becoming a data-driven science.




                      http://www.genome.gov/sequencingcosts/
Shotgun sequencing analogy:
  feeding books into a paper shredder,
digitizing the shreds, and reconstructing
                 the book.




   Although for books, we often know the language and not just the alphabet 
Shotgun sequencing is --


• Randomly ordered.

• Randomly sampled.

• Too big to efficiently do multiple passes
Shotgun sequencing




   “Coverage” is simply the average number of reads that overlap
                    each true base in genome.

Here, the coverage is ~10 – just draw a line straight down from the top
                       through all of the reads.
Random sampling => deep sampling needed




   Typically 10-100x needed for robust recovery (300 Gbp for human)
Random sampling => deep sampling needed




   Typically 10-100x needed for robust recovery (300 Gbp for human)

     But this data is massively redundant!! Only need 5x systematic!
              All the stuff above the red line is unnecessary!
Streaming algorithm to do so:
    digital normalization
Digital normalization
Digital normalization
Digital normalization
Digital normalization
Digital normalization
Storing data this way is better than best-
 possible information-theoretic storage.




                                 Pell et al., PNAS 2012
Use Bloom filter to store graphs
    Graphs only gain nodes because of Bloom filter false positives.




                                                         Pell et al., PNAS 2012
Some assembly details
• This was completely intractable.
• Implemented in C++ and Python; “good practice” (?)
• We’ve changed scaling behavior from data to information.
• Practical scaling for ~soil metagenomics is 10x:
    – need < 1 TB of RAM for ~2 TB of data, ~2 weeks.
    – Before, ~10TB.
• Smaller problems are pretty much solved.
• Just beginning to explore threading, multicore, etc. (BIG DATA grant
   proposal)
• Goal is to scale to 50 Tbp of data (~5-50 TB RAM currently)
Concluding thoughts
• Channel randomness.
• Embrace streaming.
• Live with minor uncertainty.
• Don’t be afraid to discard data.


  (Also, I’m an open source hacker who can confer PhDs, in
    exchange for long years of low pay living in Michigan.
E-mail me! And don’t talk to Brett Cannon about PhDs first.)
References
SkipLists: Wikipedia, and John Shipman’s code:
http://infohost.nmt.edu/tcc/help/lang/python/examples/pyskip/pyskip.pdf

HyperLogLog: Aggregate Knowledge’s blog,
http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/

And: https://github.com/svpcom/hyperloglog

Bloom Filters: Wikipedia


 Our work: http://ivory.idyll.org/blog/ and http://ged.msu.edu/interests.html
                                                                  ctb@msu.edu

Weitere ähnliche Inhalte

Andere mochten auch

13th Annual Seminar on Professional Responsibility
13th Annual Seminar on Professional Responsibility13th Annual Seminar on Professional Responsibility
13th Annual Seminar on Professional ResponsibilityKegler Brown Hill + Ritter
 
Theoretical framework d1 2016 11-18
Theoretical framework d1 2016 11-18Theoretical framework d1 2016 11-18
Theoretical framework d1 2016 11-18Zafar Ahmad
 
Undangan (Kak Melly n Kak Dicky)
Undangan (Kak Melly n Kak Dicky)Undangan (Kak Melly n Kak Dicky)
Undangan (Kak Melly n Kak Dicky)@rtNya
 
19th annual professional responsibility seminar
19th annual professional responsibility seminar19th annual professional responsibility seminar
19th annual professional responsibility seminarKegler Brown Hill + Ritter
 
2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assemblyc.titus.brown
 
Children Consultation Report
Children Consultation ReportChildren Consultation Report
Children Consultation ReportZafar Ahmad
 
Rational App Scan&Policy Tester
Rational App Scan&Policy TesterRational App Scan&Policy Tester
Rational App Scan&Policy TesterKristina O'Regan
 
R E V I V A L C O L L E G E S A Presentation
R E V I V A L   C O L L E G E  S A PresentationR E V I V A L   C O L L E G E  S A Presentation
R E V I V A L C O L L E G E S A PresentationIvin
 
2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-researchc.titus.brown
 
Nor Cal Pacific Dimensions Presentation
Nor Cal Pacific Dimensions PresentationNor Cal Pacific Dimensions Presentation
Nor Cal Pacific Dimensions Presentationlmeneley
 
Presentatie De Salesmanagers
Presentatie De SalesmanagersPresentatie De Salesmanagers
Presentatie De Salesmanagersrwarntjes
 
Preserve Plan 4
Preserve Plan 4Preserve Plan 4
Preserve Plan 4lmeneley
 

Andere mochten auch (20)

What Is Eric
What Is EricWhat Is Eric
What Is Eric
 
13th Annual Seminar on Professional Responsibility
13th Annual Seminar on Professional Responsibility13th Annual Seminar on Professional Responsibility
13th Annual Seminar on Professional Responsibility
 
Br10 tekniske installationer
Br10 tekniske installationerBr10 tekniske installationer
Br10 tekniske installationer
 
Theoretical framework d1 2016 11-18
Theoretical framework d1 2016 11-18Theoretical framework d1 2016 11-18
Theoretical framework d1 2016 11-18
 
2014 davis-talk
2014 davis-talk2014 davis-talk
2014 davis-talk
 
Underage Drinking Parties in San Antonio 2016
Underage Drinking Parties in San Antonio 2016Underage Drinking Parties in San Antonio 2016
Underage Drinking Parties in San Antonio 2016
 
IT & Business Centre
IT & Business CentreIT & Business Centre
IT & Business Centre
 
Undangan (Kak Melly n Kak Dicky)
Undangan (Kak Melly n Kak Dicky)Undangan (Kak Melly n Kak Dicky)
Undangan (Kak Melly n Kak Dicky)
 
19th annual professional responsibility seminar
19th annual professional responsibility seminar19th annual professional responsibility seminar
19th annual professional responsibility seminar
 
2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assembly
 
Children Consultation Report
Children Consultation ReportChildren Consultation Report
Children Consultation Report
 
Ramifications of New USEPA
Ramifications of New USEPARamifications of New USEPA
Ramifications of New USEPA
 
Rational App Scan&Policy Tester
Rational App Scan&Policy TesterRational App Scan&Policy Tester
Rational App Scan&Policy Tester
 
R E V I V A L C O L L E G E S A Presentation
R E V I V A L   C O L L E G E  S A PresentationR E V I V A L   C O L L E G E  S A Presentation
R E V I V A L C O L L E G E S A Presentation
 
2014 pycon-talk
2014 pycon-talk2014 pycon-talk
2014 pycon-talk
 
2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-research
 
Nor Cal Pacific Dimensions Presentation
Nor Cal Pacific Dimensions PresentationNor Cal Pacific Dimensions Presentation
Nor Cal Pacific Dimensions Presentation
 
Presentatie De Salesmanagers
Presentatie De SalesmanagersPresentatie De Salesmanagers
Presentatie De Salesmanagers
 
Preserve Plan 4
Preserve Plan 4Preserve Plan 4
Preserve Plan 4
 
SocialMediaMagic slides
SocialMediaMagic slidesSocialMediaMagic slides
SocialMediaMagic slides
 

Ähnlich wie Awesome Big Data Algorithms Insights

2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-datac.titus.brown
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibilityc.titus.brown
 
2014 manchester-reproducibility
2014 manchester-reproducibility2014 manchester-reproducibility
2014 manchester-reproducibilityc.titus.brown
 
2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4c.titus.brown
 
2013 bms-retreat-talk
2013 bms-retreat-talk2013 bms-retreat-talk
2013 bms-retreat-talkc.titus.brown
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012c.titus.brown
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudJan Aerts
 
Probabilistic breakdown of assembly graphs
Probabilistic breakdown of assembly graphsProbabilistic breakdown of assembly graphs
Probabilistic breakdown of assembly graphsc.titus.brown
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinarc.titus.brown
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and HadoopDonald Miner
 
HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017philippbayer
 
Reproducible research - to infinity
Reproducible research - to infinityReproducible research - to infinity
Reproducible research - to infinityPeterMorrell4
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...huguk
 
Python as the Zen of Data Science
Python as the Zen of Data SciencePython as the Zen of Data Science
Python as the Zen of Data ScienceTravis Oliphant
 
How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)Ali-ziane Myriam
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMfnothaft
 

Ähnlich wie Awesome Big Data Algorithms Insights (20)

2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-data
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
2014 manchester-reproducibility
2014 manchester-reproducibility2014 manchester-reproducibility
2014 manchester-reproducibility
 
2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
2013 bms-retreat-talk
2013 bms-retreat-talk2013 bms-retreat-talk
2013 bms-retreat-talk
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloud
 
Probabilistic breakdown of assembly graphs
Probabilistic breakdown of assembly graphsProbabilistic breakdown of assembly graphs
Probabilistic breakdown of assembly graphs
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinar
 
2014 nci-edrn
2014 nci-edrn2014 nci-edrn
2014 nci-edrn
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017
 
Reproducible research - to infinity
Reproducible research - to infinityReproducible research - to infinity
Reproducible research - to infinity
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
 
Python as the Zen of Data Science
Python as the Zen of Data SciencePython as the Zen of Data Science
Python as the Zen of Data Science
 
How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)
 
Vaex talk-pydata-paris
Vaex talk-pydata-parisVaex talk-pydata-paris
Vaex talk-pydata-paris
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAM
 

Mehr von c.titus.brown

Mehr von c.titus.brown (20)

2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynote
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-review
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbug
 
2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
2015 pag-chicken
2015 pag-chicken2015 pag-chicken
2015 pag-chicken
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
 
2014 nyu-bio-talk
2014 nyu-bio-talk2014 nyu-bio-talk
2014 nyu-bio-talk
 
2014 bangkok-talk
2014 bangkok-talk2014 bangkok-talk
2014 bangkok-talk
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streaming
 
2014 abic-talk
2014 abic-talk2014 abic-talk
2014 abic-talk
 

Kürzlich hochgeladen

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 

Kürzlich hochgeladen (20)

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 

Awesome Big Data Algorithms Insights

  • 1. Awesome Big Data Algorithms http://xkcd.com/1185/
  • 2. Awesome Big Data Algorithms C. Titus Brown ctb@msu.edu Asst Professor, Michigan State University (Microbiology, Computer Science, and BEACON)
  • 3. Welcome! • More of a computational scientist than a computer scientist; will be using simulations to demo & explore algorithm behavior. • Send me questions/comments @ctitusbrown, or ctb@msu.edu.
  • 4. “Features” • I will be using Python rather than C++, because Python is easier to read. • I will be using IPython Notebook to demo. • I apologize in advance for not covering your favorite data structure or algorithm.
  • 5. Outline • The basic idea • Three examples – Skip lists (a fast key/value store) – HyperLogLog Counting (counting discrete elements) – Bloom filters and CountMin Sketches • Folding, spindling, and mutilating DNA sequence • References and further reading
  • 6. The basic idea • Problem: you have a lot of data to count, track, or otherwise analyze. • This data is Data of Unusual Size, i.e. you can’t just brute force the analysis. • For example, – Count the approximate number of distinct elements in a very large (infinite?) data set – Optimize queries by using an efficient but approximate prefilter – Determine the frequency distribution of distinct elements in a very large data set.
  • 7. Online and streaming vs. offline “Large is hard; infinite is much easier.” • Offline algorithms analyze an entire data set all at once. • Online algorithms analyze data serially, one piece at a time. • Streaming algorithms are online algorithms that can be used for very memory & compute limited analysis.
  • 8. Exact vs random or probabilistic • Often an approximate answer is sufficient, esp if you can place bounds on how wrong the approximation is likely to be. • Often random algorithms or probabilistic data structures can be found with good typical behavior but bad worst case behavior.
  • 9. For one (stupid) example You can trim 8 bits off of integers for the purpose of averaging them
  • 10. Skip lists A randomly indexed improvement on linked lists. Each node can belong to one or more vertical “levels”, which allow fast search/insertion/deletion – ~O(log(n)) typically! wikipedia
  • 11.
  • 12. Skip lists A randomly indexed improvement on linked lists. Very easy to implement; asymptotically good behavior. From reddit, “if someone held a gun to my head and asked me to implement an efficient set/map storage, I would implement a skip list.” (Response: “does this happen to you a lot??”) wikipedia
  • 13. Channel randomness! • If you can construct or rely on randomness, then you can easily get good typical behavior. • Note, a good hash function is essentially the same as a good random number generator…
  • 14. HyperLogLog cardinality counting • Suppose you have an incoming stream of many, many “objects”. • And you want to track how many distinct items there are, and you want to accumulate the count of distinct objects over time.
  • 15. Relevant digression: • Flip some unknown number of coins. Q: what is something simple to track that will tell you roughly how many coins you’ve flipped? • A: longest run of HEADs. Long runs are very rare and are correlated with how many coins you’ve flipped.
  • 16.
  • 17. Cardinality counting with HyperLogLog • Essentially, use longest run of 0-bits observed in a hash value. • Use multiple hash functions so that you can take the average. • Take harmonic mean + low/high sampling adjustment => result.
  • 18.
  • 19. Bloom filters • A set membership data structure that is probabilistic but only yields false positives. • Trivial to implement; hash function is main cost; extremely memory efficient.
  • 20.
  • 21. My research applications Biology is fast becoming a data-driven science. http://www.genome.gov/sequencingcosts/
  • 22. Shotgun sequencing analogy: feeding books into a paper shredder, digitizing the shreds, and reconstructing the book. Although for books, we often know the language and not just the alphabet 
  • 23. Shotgun sequencing is -- • Randomly ordered. • Randomly sampled. • Too big to efficiently do multiple passes
  • 24. Shotgun sequencing “Coverage” is simply the average number of reads that overlap each true base in genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
  • 25. Random sampling => deep sampling needed Typically 10-100x needed for robust recovery (300 Gbp for human)
  • 26. Random sampling => deep sampling needed Typically 10-100x needed for robust recovery (300 Gbp for human) But this data is massively redundant!! Only need 5x systematic! All the stuff above the red line is unnecessary!
  • 27. Streaming algorithm to do so: digital normalization
  • 33. Storing data this way is better than best- possible information-theoretic storage. Pell et al., PNAS 2012
  • 34. Use Bloom filter to store graphs Graphs only gain nodes because of Bloom filter false positives. Pell et al., PNAS 2012
  • 35. Some assembly details • This was completely intractable. • Implemented in C++ and Python; “good practice” (?) • We’ve changed scaling behavior from data to information. • Practical scaling for ~soil metagenomics is 10x: – need < 1 TB of RAM for ~2 TB of data, ~2 weeks. – Before, ~10TB. • Smaller problems are pretty much solved. • Just beginning to explore threading, multicore, etc. (BIG DATA grant proposal) • Goal is to scale to 50 Tbp of data (~5-50 TB RAM currently)
  • 36. Concluding thoughts • Channel randomness. • Embrace streaming. • Live with minor uncertainty. • Don’t be afraid to discard data. (Also, I’m an open source hacker who can confer PhDs, in exchange for long years of low pay living in Michigan. E-mail me! And don’t talk to Brett Cannon about PhDs first.)
  • 37. References SkipLists: Wikipedia, and John Shipman’s code: http://infohost.nmt.edu/tcc/help/lang/python/examples/pyskip/pyskip.pdf HyperLogLog: Aggregate Knowledge’s blog, http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/ And: https://github.com/svpcom/hyperloglog Bloom Filters: Wikipedia Our work: http://ivory.idyll.org/blog/ and http://ged.msu.edu/interests.html ctb@msu.edu

Hinweis der Redaktion

  1. Obviously not all algorithms can be made into online algorithms, much less streaming.