SlideShare a Scribd company logo
1 of 34
Next-Gen Sequencing: Data Management Guy Coates Wellcome Trust Sanger Institute [email_address]
About the Institute ,[object Object]
~700 employees. ,[object Object],[object Object]
We have active cancer, malaria, pathogen and genomic variation studies. ,[object Object],[object Object]
Previously...at BioIT Europe:
The Scary Graph Instrument upgrades Peak Yearly capillary sequencing
The Scary Graph
Managing Growth ,[object Object]
Sequencing cost: T d =12 months
Classic Sanger “Stealth project” ,[object Object]
Classic Sanger “Stealth project” ,[object Object]
What we learned... ,[object Object]
Nobody stops to tidy up until they have no more disk space. ,[object Object],[object Object]
BAM only. ,[object Object],[object Object]
Historically sequencing and IT were budgeted separately.
Makes Pis aware of the IT costs, even if it does not cover 100%.
Flexible Infrastructure ,[object Object]
Assume from day 1 we will be adding more.
Expand simply by adding more blocks. ,[object Object],[object Object],[object Object],[object Object]
Currently using LSF to manage workflow.  LSF Fast scratch disk Archival / Warehouse disk Network
Our Modules: ,[object Object]
Simple might not be so robust, but it is much simpler and faster to fix if it breaks. More reliable in practice. ,[object Object],[object Object],[object Object],[object Object]
50-100TB chunks. ,[object Object],[object Object],[object Object],[object Object]
Data management ,[object Object],#df -h Filesystem  Size  Used Avail Use% Mounted on lus02-mds1:/lus02  108T  107T  1T  99% /lustre/scratch102 #df -i  Filesystem  Inodes  IUsed  IFree IUse% Mounted on lus02-mds1:/lus02  300296107 136508072 163788035 45% /lustre/scratch102
Sequencing data flow. Automated processing and data management Sequencer Analysis/ alignment Internal  repository EGA / SRA (EBI) compute-farm High-performance storage Manual data movement
Unmanaged data ,[object Object]
Are we keeping control of our “private” datasets?
Managing unstructured data ,[object Object]
Works well for the pipelines where it is currently used. ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
50% reduction in disk utilisation. ,[object Object],[object Object]
Bottlenecks: ,[object Object]
As data sizes increase,  even “smal datal” groups get hit. ,[object Object],[object Object],[object Object],[object Object]
Groups need to exchange data.
Small groups do not have the manpower to hack something together. ,[object Object]

More Related Content

What's hot

Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsYasin Memari
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit
 
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...Amazon Web Services
 
VariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn LangitVariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn LangitData Con LA
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynotec.titus.brown
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...Bonnie Hurwitz
 
Cluster Filesystems and the next 1000 human genomes
Cluster Filesystems and the next 1000 human genomesCluster Filesystems and the next 1000 human genomes
Cluster Filesystems and the next 1000 human genomesGuy Coates
 
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...Spark Summit
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational ScienceChelle Gentemann
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big DataArjen de Vries
 
VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...Denis C. Bauer
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Andy Petrella
 
Whitepaper : CHI: Hadoop's Rise in Life Sciences
Whitepaper : CHI: Hadoop's Rise in Life Sciences Whitepaper : CHI: Hadoop's Rise in Life Sciences
Whitepaper : CHI: Hadoop's Rise in Life Sciences EMC
 
The Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific DiscoveryThe Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific DiscoveryIntel IT Center
 

What's hot (20)

Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data Genomics
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van Ham
 
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
 
VariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn LangitVariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn Langit
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynote
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 
Cluster Filesystems and the next 1000 human genomes
Cluster Filesystems and the next 1000 human genomesCluster Filesystems and the next 1000 human genomes
Cluster Filesystems and the next 1000 human genomes
 
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...
 
Genome Big Data
Genome Big DataGenome Big Data
Genome Big Data
 
2014 bangkok-talk
2014 bangkok-talk2014 bangkok-talk
2014 bangkok-talk
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big Data
 
VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
 
Whitepaper : CHI: Hadoop's Rise in Life Sciences
Whitepaper : CHI: Hadoop's Rise in Life Sciences Whitepaper : CHI: Hadoop's Rise in Life Sciences
Whitepaper : CHI: Hadoop's Rise in Life Sciences
 
The Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific DiscoveryThe Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
 

Viewers also liked

NGS: bioinformatic challenges
NGS: bioinformatic challengesNGS: bioinformatic challenges
NGS: bioinformatic challengesLex Nederbragt
 
통계유전학워크샵
통계유전학워크샵통계유전학워크샵
통계유전학워크샵Hong ChangBum
 
Genomics in the Cloud
Genomics in the CloudGenomics in the Cloud
Genomics in the CloudMatt Wood
 
Next-generation sequencing - variation discovery
Next-generation sequencing - variation discoveryNext-generation sequencing - variation discovery
Next-generation sequencing - variation discoveryJan Aerts
 
Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...QBiC_Tue
 
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Manikhandan Mudaliar
 
Multiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotationsMultiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotationsThomas Keane
 
Quality Control of NGS Data
Quality Control of NGS Data Quality Control of NGS Data
Quality Control of NGS Data Surya Saha
 
NGS Data Preprocessing
NGS Data PreprocessingNGS Data Preprocessing
NGS Data PreprocessingcursoNGS
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Maté Ongenaert
 
Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Thomas Keane
 
File formats for Next Generation Sequencing
File formats for Next Generation SequencingFile formats for Next Generation Sequencing
File formats for Next Generation SequencingPierre Lindenbaum
 
Unit 9 - DNA, RNA, and Proteins Notes
Unit 9  - DNA, RNA, and Proteins NotesUnit 9  - DNA, RNA, and Proteins Notes
Unit 9 - DNA, RNA, and Proteins Notesasteinman
 
보건산업 진흥 전략(배성윤, 2013.12.11)
보건산업 진흥 전략(배성윤, 2013.12.11)보건산업 진흥 전략(배성윤, 2013.12.11)
보건산업 진흥 전략(배성윤, 2013.12.11)Sung Yoon Bae
 

Viewers also liked (20)

NGS: bioinformatic challenges
NGS: bioinformatic challengesNGS: bioinformatic challenges
NGS: bioinformatic challenges
 
Ngs ppt
Ngs pptNgs ppt
Ngs ppt
 
Introduction to next generation sequencing
Introduction to next generation sequencingIntroduction to next generation sequencing
Introduction to next generation sequencing
 
통계유전학워크샵
통계유전학워크샵통계유전학워크샵
통계유전학워크샵
 
Genomics in the Cloud
Genomics in the CloudGenomics in the Cloud
Genomics in the Cloud
 
Next-generation sequencing - variation discovery
Next-generation sequencing - variation discoveryNext-generation sequencing - variation discovery
Next-generation sequencing - variation discovery
 
Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...
 
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
 
Multiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotationsMultiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotations
 
Quality Control of NGS Data
Quality Control of NGS Data Quality Control of NGS Data
Quality Control of NGS Data
 
Introduction to Linux
Introduction to LinuxIntroduction to Linux
Introduction to Linux
 
NGS Data Preprocessing
NGS Data PreprocessingNGS Data Preprocessing
NGS Data Preprocessing
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2
 
Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1
 
NGS - QC & Dataformat
NGS - QC & Dataformat NGS - QC & Dataformat
NGS - QC & Dataformat
 
File formats for Next Generation Sequencing
File formats for Next Generation SequencingFile formats for Next Generation Sequencing
File formats for Next Generation Sequencing
 
Unit 9 - DNA, RNA, and Proteins Notes
Unit 9  - DNA, RNA, and Proteins NotesUnit 9  - DNA, RNA, and Proteins Notes
Unit 9 - DNA, RNA, and Proteins Notes
 
ICT in Healthcare
ICT in HealthcareICT in Healthcare
ICT in Healthcare
 
Health 4.0
Health 4.0Health 4.0
Health 4.0
 
보건산업 진흥 전략(배성윤, 2013.12.11)
보건산업 진흥 전략(배성윤, 2013.12.11)보건산업 진흥 전략(배성윤, 2013.12.11)
보건산업 진흥 전략(배성윤, 2013.12.11)
 

Similar to Next-generation sequencing: Data mangement

Clouds, Grids and Data
Clouds, Grids and DataClouds, Grids and Data
Clouds, Grids and DataGuy Coates
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big DataOmnia Safaan
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with SparkArjen de Vries
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxAnkitChauhan817826
 
seed block algorithm
seed block algorithmseed block algorithm
seed block algorithmDipak Badhe
 
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopA Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopIJTET Journal
 
60141457-Oracle-Golden-Gate-Presentation.ppt
60141457-Oracle-Golden-Gate-Presentation.ppt60141457-Oracle-Golden-Gate-Presentation.ppt
60141457-Oracle-Golden-Gate-Presentation.pptpadalamail
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xNPN Training
 
Google File System
Google File SystemGoogle File System
Google File Systemvivatechijri
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformGeekNightHyderabad
 
Bodleian Library's DAMS system
Bodleian Library's DAMS systemBodleian Library's DAMS system
Bodleian Library's DAMS systembenosteen
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.pptRutujaPatil247341
 
My other computer_is_a_datacentre
My other computer_is_a_datacentreMy other computer_is_a_datacentre
My other computer_is_a_datacentreSteve Loughran
 
User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network ProcessingRyousei Takano
 

Similar to Next-generation sequencing: Data mangement (20)

Clouds, Grids and Data
Clouds, Grids and DataClouds, Grids and Data
Clouds, Grids and Data
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
seed block algorithm
seed block algorithmseed block algorithm
seed block algorithm
 
A mathematical appraisal
A mathematical appraisalA mathematical appraisal
A mathematical appraisal
 
A mathematical appraisal
A mathematical appraisalA mathematical appraisal
A mathematical appraisal
 
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopA Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
 
60141457-Oracle-Golden-Gate-Presentation.ppt
60141457-Oracle-Golden-Gate-Presentation.ppt60141457-Oracle-Golden-Gate-Presentation.ppt
60141457-Oracle-Golden-Gate-Presentation.ppt
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
 
Google File System
Google File SystemGoogle File System
Google File System
 
4 026
4 0264 026
4 026
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data Platform
 
Bodleian Library's DAMS system
Bodleian Library's DAMS systemBodleian Library's DAMS system
Bodleian Library's DAMS system
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.ppt
 
My other computer_is_a_datacentre
My other computer_is_a_datacentreMy other computer_is_a_datacentre
My other computer_is_a_datacentre
 
User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network Processing
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 

Recently uploaded

Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 

Recently uploaded (20)

Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 

Next-generation sequencing: Data mangement

  • 1. Next-Gen Sequencing: Data Management Guy Coates Wellcome Trust Sanger Institute [email_address]
  • 2.
  • 3.
  • 4.
  • 6. The Scary Graph Instrument upgrades Peak Yearly capillary sequencing
  • 8.
  • 9. Sequencing cost: T d =12 months
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15. Historically sequencing and IT were budgeted separately.
  • 16. Makes Pis aware of the IT costs, even if it does not cover 100%.
  • 17.
  • 18. Assume from day 1 we will be adding more.
  • 19.
  • 20. Currently using LSF to manage workflow. LSF Fast scratch disk Archival / Warehouse disk Network
  • 21.
  • 22.
  • 23.
  • 24.
  • 25. Sequencing data flow. Automated processing and data management Sequencer Analysis/ alignment Internal repository EGA / SRA (EBI) compute-farm High-performance storage Manual data movement
  • 26.
  • 27. Are we keeping control of our “private” datasets?
  • 28.
  • 29.
  • 30.
  • 31.
  • 32.
  • 33. Groups need to exchange data.
  • 34.
  • 35. Sequencing data flow. Automated processing and data management Manual Sequencer Analysis/ alignment Internal repository EGA / SRA (EBI) compute-farm High-performance storage Managed data movement
  • 36.
  • 37. iRODS ICAT Catalogue database Rule Engine Implements policies Irods Server Data on disk User interface WebDAV, icommands,fuse Irods Server Data in database Irods Server Data in S3
  • 38.
  • 40.
  • 41. First implementation Automated processing and data management Manual Sequencer Analysis/ alignment Internal repository EGA / SRA (EBI) compute-farm High-performance storage
  • 42.
  • 43.
  • 44.
  • 45. Example access: $ icd /seq/5307 $ ils /seq/5307: 5307_1.bam 5307_2.bam 5307_3.bam $ ils -l 5307_1.bam srpipe 0 res-g2 1987106409 2010-09-24.13:35 & 5307_1.bam srpipe 1 res-r2 1987106409 2010-09-24.13:36 & 5307_1.bam
  • 46. Metadata imeta ls -d /seq/5307/5307_1.bam AVUs defined for dataObj /seq/5307/5307_1.bam: attribute: type value: bam units: ---- attribute: sample value: BG81 units: ---- attribute: id_run value: 5307 units: ---- attribute: lane value: 1 units: ---- attribute: study value: TRANSCRIPTION FACTORS IN HAEMATOPOIESIS - MOUSE units: ---- attribute: library value: BG81 449223 units:
  • 47. Query imeta qu -d study = "TRANSCRIPTION FACTORS IN HAEMATOPOIESIS - MOUSE" collection: /seq/5307 dataObj: 5307_1.bam ---- collection: /seq/5307 dataObj: 5307_2.bam ---- collection: /seq/5307 dataObj: 5307_3.bam ----
  • 49. Next steps Sanger iRODs Datacentre 2 Datacentre 1 Replicate EGA/ERA Automated release/purge Collaborator iRODs Federate
  • 50. Wishlist: HPC Integration Data is staged in/out to filesystem Archive / Metadata system Fast Storage / POSIX filesystem Compute farm Fast Storage / POSIX filesystem + Metadata sytem Compute farm System can do rule/metadata based ops and standard POSIX ops too.
  • 52.
  • 53. Storage and servers spread across several locations. Fast link Storage Storage Storage Storage CPU CPU CPU CPU CPU medium link slow link
  • 54.
  • 55.
  • 56.
  • 57.
  • 58. Hot datasets change over time.
  • 59.
  • 60.
  • 62.
  • 65.
  • 67. Da Xu