Golden Helix develops genetic association software and provides analytic services to hundreds of users worldwide. Its products include SNP & Variation Suite for analyzing SNP, CNV, and NGS data; VarSeq for annotating and filtering variants from gene panels, exomes, and genomes; and GenomeBrowse for visualizing genomic data. The company is building tools such as VarSeq and VSReports to perform clinical-grade tertiary analysis of genomic data and to create repeatable workflows for clinical labs.
1. Background
Golden Helix
- Founded in 1998
- Genetic association software
- Analytic services
- Hundreds of users worldwide
- Over 900 customer citations in scientific journals
Products I Build with My Team
- SNP & Variation Suite (SVS)
  - SNP, CNV, NGS tertiary analysis
  - Import and deal with all flavors of upstream data
- VarSeq
  - Annotate and filter variants in gene panels, exomes and genomes for clinical labs and researchers
- GenomeBrowse (Free!)
  - Visualization of everything with genomic coordinates; all standardized file formats
VarSeq

VSReports
- Tertiary analysis to report in one click
- Focused and actionable data
- Modeled on ACMG guidelines
- Hereditary and cancer templates
- OMIM included

VSPipeline
- Command-line runner
- Integrates with your current bioinformatics pipeline
- Create repeatable clinical workflows for CLIA- and CAP-certified analysis
- Supports high-throughput scenarios

2. Database Trends
90s - SQL
- Transactions
- Disk-structure optimized
- Fixed schema
- SQL matures
- Small memory footprints
- Master <-> slaves
- Threaded / locking
- Expensive large mainframes/servers

00s - NoSQL
- “Web scale” - distributed
- Eventually consistent
- Schema-less, key-based
- Avoid joins
- Peer-to-peer
- Memory cheap
- Many cheap commodity servers in datacenter configurations

10s - NewSQL
- Scale out
- First-class sharding
- Utilize cheap memory
- Don’t let disk be the bottleneck
- Support streaming / distributed analytics
> SELECT * FROM trends GROUP BY decade;
3. The “Database” Market in Thirds
OLTP
- ACID / transactions
- “Traditional” row-based: MySQL, Postgres, Oracle, MSSQL
- NewSQL: VoltDB (scale-out), Google Spanner/F1, MemSQL, Clustrix

Key and Hierarchical Based
- Wide columnar stores: BigTable / HBase, Cassandra
- Hierarchical / document: MongoDB, Couchbase
- Key-value stores: Redis, Memcached, FoundationDB
- Tuple/triple-stores
- Other

Data Warehousing
- Query optimized: Amazon Redshift, HP Vertica, Infobright, Google BigQuery, Teradata, Cloudera Impala, Hadoop+Hive
http://www.se-radio.net/2013/12/episode-199-michael-stonebraker/
6. Big Data, Small Analytics => Don’t use MapReduce
http://www.slideshare.net/Hapyrus/amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive
7. Data Warehousing / Scientific Analysis => Columnar
“You’ve got to know what regression means, what Naïve Bayes means, what k-Nearest Neighbors means. It’s all statistics. All of that stuff turns out to be defined on arrays. It’s not defined on tables. The tools of future data scientists are going to be array-based tools. Those may live on top of relational database systems. They may live on top of an array database system, or perhaps something else. It’s completely open.”
• Columns -> faster queries
• Divide columns into chunks
• Compress chunks (better ratios than rows)
• Pre-compute chunk-level attributes (min/max etc.)
• Flexible storage layer
• Distributed
• Encodings (Parquet, ORC/Hive, custom)
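As a minimal sketch of how those chunk-level attributes pay off (all names and the layout here are invented for illustration, not any particular engine's format): a range scan can skip decompressing whole chunks using the precomputed min/max.

import struct
import zlib

CHUNK_SIZE = 4096

def build_chunks(values):
    """Split a column into chunks; keep compressed bytes plus min/max stats."""
    chunks = []
    for i in range(0, len(values), CHUNK_SIZE):
        part = values[i:i + CHUNK_SIZE]
        raw = struct.pack(f"{len(part)}i", *part)
        chunks.append({"min": min(part), "max": max(part),
                       "data": zlib.compress(raw)})  # similar values compress well
    return chunks

def range_scan(chunks, lo, hi):
    """Return values in [lo, hi], decompressing only chunks that can match."""
    out = []
    for c in chunks:
        if c["max"] < lo or c["min"] > hi:
            continue  # pruned via the precomputed chunk-level attributes
        raw = zlib.decompress(c["data"])
        vals = struct.unpack(f"{len(raw) // 4}i", raw)
        out.extend(v for v in vals if lo <= v <= hi)
    return out

positions = list(range(1_000_000))             # e.g. sorted variant positions
col = build_chunks(positions)
print(len(range_scan(col, 250_000, 250_100)))  # 101 hits, only one chunk touched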
8. Extract, Transform, Load (ETL)
“Dimensional Modeling”
- Fact tables & dimension tables
- Fact tables are often measurements over time
- Dimension tables go into item details
- Denormalized data, complexity hidden
- Often many sources loaded into the same warehouse
  - Logs
  - One or more relational databases (sales, customer-facing, etc.)
  - Vendor / payment information

Example
- “Like” table: datetime, user_id, post_id, client_data
- “User” table: user_id, subscription_type, last_paid, has_android_app
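A toy version of this star layout, using the slide's example tables (everything beyond their column names, including the sample rows, is illustrative):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Fact table: one row per "like" event, measurements over time.
CREATE TABLE likes (datetime TEXT, user_id INTEGER, post_id INTEGER, client_data TEXT);
-- Dimension table: denormalized per-user details.
CREATE TABLE users (user_id INTEGER PRIMARY KEY, subscription_type TEXT,
                    last_paid TEXT, has_android_app INTEGER);
""")
con.executemany("INSERT INTO users VALUES (?, ?, ?, ?)",
                [(1, "premium", "2015-01-01", 1), (2, "free", None, 0)])
con.executemany("INSERT INTO likes VALUES (?, ?, ?, ?)",
                [("2015-02-01T12:00", 1, 10, "ios"),
                 ("2015-02-01T12:05", 2, 10, "android"),
                 ("2015-02-02T09:30", 1, 11, "web")])

# Typical warehouse query: join the fact table to a dimension and aggregate.
for row in con.execute("""
        SELECT u.subscription_type, COUNT(*) AS n_likes
        FROM likes l JOIN users u ON u.user_id = l.user_id
        GROUP BY u.subscription_type"""):
    print(row)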
12. Genomics is Big Data
5,000 public data repositories

Broad Institute:
- Processes 40K samples/year
- 1,000 people
- 51 high-throughput sequencers
- 10+ PB of storage

One genome in data:
- ~300GB compressed sequence data
- ~150MB compressed variant data
- Sequence data went through 5-6 processing steps
13. We Want Variants
Differences between your DNA and a reference come in many sizes:
- Single-letter substitutions are called Single Nucleotide Polymorphisms (SNPs)
- Small “length polymorphisms” are called insertions/deletions (InDels)
- Large duplications/deletions are called Copy Number Variations (CNVs)

The average European has ~3 million small variations relative to the reference; 100K of those fall in the 30K “gene coding” regions (~2% of the genome).
14. Next Generation Sequencing Analysis
Primary
Analysis
Secondary
Analysis
Tertiary
Analysis
“Sense Making”
Analysis of hardware generated data, software built by vendors
Use FPGA and GPUs to handle real-time optical or eletrical signals
from sequencing hardware
Filtering/clipping of “reads” and their qualities
Alignment/Assembly of reads
Recalibrating, de-duplication, variant calling on aligned reads
QA and filtering of variant calls
Annotation (querying) variants to databases, filtering on results
Merging/comparing multiple samples (multiple files)
Visualization of variants in genomic context
Statistics on matrixes
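A toy example of that “statistics on matrices” endpoint (the 0/1/2 alternate-allele-count encoding is a common convention; the missing-value sentinel is our choice for illustration, not a Golden Helix format):

import numpy as np

# Rows = variants, columns = samples; entries count alternate alleles
# (0, 1, or 2), with -1 as a missing-genotype sentinel.
geno = np.array([[0, 1, 2, 0],
                 [1, 1, -1, 0],
                 [2, 2, 1, 1]])

called = geno >= 0                                 # mask out missing calls
alt_freq = (geno * called).sum(axis=1) / (2 * called.sum(axis=1))
print(alt_freq)                                    # per-variant alternate allele frequency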
15. Applications of NGS Data in the Clinic
- Carrier screening – prenatal and standard
- Lifetime risk prediction
- Genetic disorder diagnostics
- Oncology care
- PGx – dosage and care
16. Public Annotations – Left Joins
Exact matching of “variants”
- “Population catalogs”
  - 1000 Genomes (84M variants)
  - NHLBI 6,500 exomes (2M variants)
  - ExAC 61,486 exomes (10M variants)
- Clinical classifications
- Precomputed predictions / scores
  - dbNSFP – 89.6M predictions

Algorithmic classification
- How a variant interacts with genes (85K transcripts)

Region based
- Disease regions
- Gene lists
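A sketch of the left-join framing: every sample variant is kept, and catalog fields attach only on an exact (chrom, pos, ref, alt) match. The catalog contents below are invented for illustration.

# Annotation catalog keyed on exact variant identity, as in a population catalog.
catalog = {
    ("1", 861332, "G", "A"): {"alt_freq": 0.012},
    ("2", 4823921, "T", "C"): {"alt_freq": 0.430},
}

sample_variants = [
    ("1", 861332, "G", "A"),
    ("1", 905373, "C", "T"),     # not in the catalog: annotation stays None
]

# Left join: keep every sample variant, attach catalog data where it matches.
annotated = [(v, catalog.get(v)) for v in sample_variants]
for variant, ann in annotated:
    print(variant, ann)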
17. Annotations are Hard!
HGVS is a standard that is not standard
- Tries to serve different goals
- Many representations of the same variant
- Should not be used as IDs, but there are not many good alternatives

Transcripts
- Transcript set choice is extremely important, yet hard to curate with meaningful attributes

Public data curation
- ClinVar: multi-record lines
- NHLBI: MAF vs AAF, splitting “glob” fields
- 1kG: no genotype counts
- ExAC: multi-allelic splitting, left-align
- COSMIC: no Ref/Alt, only HGVS
- dbNSFP: abbreviations and aggregate scores

Versioning and issues
- ClinVar missing variants in VCF
- dbSNP patches without version changes
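Left-alignment, mentioned for ExAC above, is mechanical enough to sketch. This follows the widely used trim-and-extend normalization algorithm (as popularized by tools like vt); fetch_base and the toy reference stand in for a real reference genome lookup.

def normalize(pos, ref, alt, fetch_base):
    """Left-align and trim one variant; pos is 1-based,
    fetch_base(p) returns the reference base at p."""
    while True:
        changed = False
        # Trim a shared trailing base from both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele emptied, extend both to the left and shift pos.
        if not ref or not alt:
            pos -= 1
            b = fetch_base(pos)
            ref, alt = b + ref, b + alt
            changed = True
        if not changed:
            break
    # Trim shared leading bases, keeping at least one anchor base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

REFERENCE = "GGGCACACACAGGG"               # toy reference sequence
fetch = lambda p: REFERENCE[p - 1]         # 1-based lookup
print(normalize(7, "ACA", "A", fetch))     # -> (3, 'GCA', 'G'): shifted left through the CA repeat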
21. N-Glycanase Deficiency
http://www.ngly1.org/
Matthew Might and Matt Wilsey. “The shifting model in clinical diagnostics: how next-generation sequencing and families are altering the way rare diseases are discovered, studied, and treated.” Genetics in Medicine, March 2014.
22. Personalized Medicine
- Cancer is a disease of the genome
- “Molecularly targeted” drugs are effective and usually side-effect free
- The genetic testing required to direct cancer treatment is becoming affordable
24. TSF
- Uses SQLite as a container
- SQLite has great caching, multi-threading and read/write properties
- Specialized genomic index, plus lexicographical indexes (LevelDB to do string sorting)
- GZIP / BLOSC chunk compression
- Primitive, enum and list types
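A minimal sketch of the “SQLite as container” idea (the schema here is invented for illustration and is not TSF's actual layout): compressed column chunks live as blobs, while SQLite supplies the file format, cache, and transactions.

import sqlite3
import struct
import zlib

con = sqlite3.connect("toy_tsf.db")
con.execute("""CREATE TABLE IF NOT EXISTS chunks (
                   field TEXT, chunk_idx INTEGER, start INTEGER, stop INTEGER,
                   data BLOB, PRIMARY KEY (field, chunk_idx))""")

def write_column(field, values, chunk_size=1000):
    """Store a column as compressed fixed-width chunks inside SQLite."""
    for idx, i in enumerate(range(0, len(values), chunk_size)):
        part = values[i:i + chunk_size]
        blob = zlib.compress(struct.pack(f"{len(part)}i", *part))
        con.execute("INSERT OR REPLACE INTO chunks VALUES (?, ?, ?, ?, ?)",
                    (field, idx, i, i + len(part), blob))
    con.commit()

def read_chunk(field, idx):
    """Fetch and decode one chunk of a column."""
    (blob,) = con.execute("SELECT data FROM chunks WHERE field=? AND chunk_idx=?",
                          (field, idx)).fetchone()
    raw = zlib.decompress(blob)
    return struct.unpack(f"{len(raw) // 4}i", raw)

write_column("position", list(range(5000)))
print(read_chunk("position", 3)[:5])       # -> (3000, 3001, 3002, 3003, 3004)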
26. TSF-Backed Relational Data Store
- More efficient conditional queries
- Invisible joins (i.e. row_id => array offset)
- Smaller size on disk
- “NULL [NA, missing] values are part of the domain space, which avoids auxiliary bit masks at the expense of ‘losing’ a single value from the domain.”
- A SQL front-end allows use as a back-end to existing analytic and web stacks
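Two of these points fit in a toy illustration (not TSF's actual representation): row_id doubling as the array offset makes the join “invisible”, and reserving one value of the domain as the missing marker avoids a separate null bitmask.

import numpy as np

NA = np.iinfo(np.int32).min              # reserved domain value marks missing

# Columns stored as parallel arrays; row_id is simply the array offset,
# so "joining" a row to its column values is indexing, not a lookup.
position = np.array([861332, 905373, 948921], dtype=np.int32)
depth    = np.array([35, NA, 52], dtype=np.int32)   # row 1: no coverage call

row_id = 1
print(position[row_id], "missing" if depth[row_id] == NA else depth[row_id])

# Aggregations skip the sentinel without any auxiliary bit mask.
print(depth[depth != NA].mean())         # -> 43.5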
Editor's Notes
Experience comes from building secondary analysis pipelines for our services and RNA-seq work, as well as from supporting users of our downstream tools.
Our tools start after secondary analysis.
Powerful, commercial-grade software designed for local hardware
Largest and most up-to-date repository of Public Annotations
Attention to details in getting public data right
left-aligning, multi-allelic splitting, etc.
Advanced and powerful filtering through rich user interface
Multiple definable outputs and data export options
Data transformations:
VCF/gVCF merging
Intelligent handling of multi-allelic sites
Breaking MNVs to allelic primitives
Expression editor for creating custom variables
Mariposa - federated database over an economic model of resource trading, in which data distributed across multiple organizations could be integrated and queried from a single relational interface
Aurora -> focused on data management for streaming data, using a new data model and query language. Unlike relational systems, which "pull" data and process it a record at a time, in Aurora, data is "pushed", arriving asynchronously from external data sources (such as stock ticks, news feeds, or sensors.)
C-Store -> developed a parallel, shared-nothing column-oriented DBMS for data warehousing. By dividing and storing data in columns, C-Store is able to perform less I/O and get better compression ratios than conventional database systems that store data in rows.
H-Store is a distributed main-memory OLTP system designed to provide very high throughput on transaction processing workloads. First “NewSQL”. Horizontal partitioning (sharding)
SciDB -> array-focused scientific workflows - multidimensional data management and analytics common to scientific, geospatial, financial, and industrial applications.
Hive, on the other hand, works a bit differently. In a nutshell, Hive is a SQL-like data warehouse infrastructure built on HDFS (Hadoop Distributed File System). Instead of MPP, Hadoop uses a distributed processing model called MapReduce, which is also designed to process large data sets quickly.
According to Amazon (so this data point may be somewhat biased), running an old school data warehouse costs $19,000 – $25,000 per terabyte per year. Redshift, on the other hand, boasts that it costs only $1,000 per terabyte per year at its lowest pricing tier.
MapReduce is dead; it makes a horrible DBMS abstraction (all the HDFS-based abstractions work around it, and sometimes around HDFS).
MapReduce was designed to parse the scraped web.
It is not even used by Google to do that.
Small analytics (sum/group-by, etc.): optimized column stores are much faster than MapReduce.
Starter template (update to include primary and incidental record)
Add coverage statistics
Add OMIM and ExAC
Filtering
GenomeBrowse visualization
Lock filter chain
Save template
Re-run with new samples
Open a reports view
Show global config, discuss default templates etc
Select variants for reporting
Classification auto-fill
Change fields
Highly customizable, can include anything; we can customize it as a service or you can do it yourself; example: N-of-One.
BUT – all of this is embarrassingly parallel.
No integration of this data until my product, and by then the dataset size is down to single-computer-sized work.
Note that for Desktop sequencers, Secondary is often bundled in with the machine (MiSeq, PGM, Proton).
Primary analysis: you can forget it really exists; secondary is not there yet. Hence that is the focus of this talk.
Note this is DNA-seq focused, but you have similar steps for RNA
Mention ethnic matching of 1kG and NHLBI (I used European).
89,617,785 functional predictions in dbNSFP
Maybe browse in GenomeBrowse.