Democratizing Data-Intensive Science
Bill Howe
Associate Director, eScience Institute
Affiliate Associate Professor
Computer Science & Engineering
Today
• The UW eScience Institute
• Three Observations about Data-Intensive Science
6/23/2015 Bill Howe, UW 2/57
The Fourth Paradigm
1. Empirical + experimental
2. Theoretical
3. Computational
4. Data-Intensive
Jim Gray
“All across our campus, the process of discovery will
increasingly rely on researchers’ ability to extract
knowledge from vast amounts of data… In order to
remain at the forefront, UW must be a leader in
advancing these techniques and technologies, and in
making [them] accessible to researchers in the
broadest imaginable range of fields.”
2005-2008
In other words:
• Data-intensive science will be ubiquitous
• It’s about intellectual infrastructure and software infrastructure,
not only computational infrastructure
http://escience.washington.edu
A 5-year, US $37.8 million cross-institutional
collaboration to create a data science environment
2014
Data Science Kickoff Session:
137 posters from 30+ departments and units
Establish a virtuous cycle
• 6 working groups, each with
• 3-6 faculty from each institution
UW Data Science Education Efforts
Students Non-Students
CS/Informatics Non-Major
professionals researchers
undergrads grads undergrads grads
UWEO Data Science Certificate
MOOC Intro to Data Science
IGERT: Big Data PhD Track
New CS Courses
Bootcamps and workshops
Intro to Data Programming
Data Science Masters
Incubator: hands-on training
Key Activity: Democratize Access to Big Data
and Big Data Infrastructure
• SQLShare: High Variety Database-as-a-Service for science
• Myria: Scalable Analytics-as-a-Service, Federated Big Data
Environ.
• Other projects in visualization, scalable graph algorithms,
bibliometrics
An observation about
data-intensive science…
We (CS people) design systems for “people like us”
To a first approximation, no one in science is using
state of the art data systems
We should study technology delivery with the same
rigor as technology development
A Typical Data Science Workflow
1) Preparing to run a model
2) Running the model
3) Interpreting the results
Gathering, cleaning, integrating, restructuring,
transforming, loading, filtering, deleting, combining,
merging, verifying, extracting, shaping, massaging
“80% of the work”
-- Aaron Kimball
“The other 80% of the work”
Where do you store your data?
src: Conversations with Research Leaders (2008)
src: Faculty Technology Survey (2011)
Other: 5%
Department-managed data center: 6%
External (non-UW) data center: 12%
Server managed by research group: 27%
Department-managed server: 41%
External device (hard drive, thumb drive): 66%
My computer: 87%
Lewis et al 2011
How much data do you work with?
Wright 2013
Not everything
is petabytes!
How much time do you spend “handling
data” as opposed to “doing science”?
Mode answer: “90%”
Simple Example
###query length COG hit #1 e-value #1 identity #1 score #1 hit length #1 description #1
chr_4[480001-580000].287 4500
chr_4[560001-660000].1 3556
chr_9[400001-500000].503 4211 COG4547 2.00E-04 19 44.6 620 Cobalamin biosynthesis protein
chr_9[320001-420000].548 2833 COG5406 2.00E-04 38 43.9 1001 Nucleosome binding factor SPN
chr_27[320001-404298].20 3991 COG4547 5.00E-05 18 46.2 620 Cobalamin biosynthesis protein
chr_26[320001-420000].378 3963 COG5099 5.00E-05 17 46.2 777 RNA-binding protein of the Puf
chr_26[400001-441226].196 2949 COG5099 2.00E-04 17 43.9 777 RNA-binding protein of the Puf
chr_24[160001-260000].65 3542
chr_5[720001-820000].339 3141 COG5099 4.00E-09 20 59.3 777 RNA-binding protein of the Puf
chr_9[160001-260000].243 3002 COG5077 1.00E-25 26 114 1089 Ubiquitin carboxyl-terminal hyd
chr_12[720001-820000].86 2895 COG5032 2.00E-09 30 60.5 2105 Phosphatidylinositol kinase and
chr_12[800001-900000].109 1463 COG5032 1.00E-09 30 60.1 2105 Phosphatidylinositol kinase and
chr_11[1-100000].70 2886
chr_11[80001-180000].100 1523
ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
id query hit e_value identity_ score query_start query_end hit_start hit_end hit_length
1 FHJ7DRN01A0TND.1 COG0414 1.00E-08 28 51 1 74 180 257 285
2 FHJ7DRN01A1AD2.2 COG0092 3.00E-20 47 89.9 6 85 41 120 233
3 FHJ7DRN01A2HWZ.4 COG3889 0.0006 26 35.8 9 94 758 845 872
…
2853 FHJ7DRN02HXTBY.5 COG5077 7.00E-09 37 52.3 3 77 313 388 1089
2854 FHJ7DRN02HZO4J.2 COG0444 2.00E-31 67 127 1 73 135 207 316
…
3566 FHJ7DRN02FUJW3.1 COG5032 1.00E-09 32 54.7 1 75 1965 2038 2105
…
COGAnnotation_coastal_sample.txt
Where does the time go? (1)
From an email:
“[This was hard] due to the large amount of data (e.g. data indexes for data retrieval,
dissection into data blocks and processing steps, order in which steps are performed
to match memory/time requirements, file formats required by software used).
In addition we actually spend quite some time in iterations fixing problems with
certain features (e.g. capping ENCODE data), testing features and feature products
to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs
human-derived variants)
So roughly 50% of the project was testing and improving the model, 30% figuring out
how to do things (engineering) and 20% getting files and getting them into the right
format.
I guess in total [I spent] 6 months [on this project].”
At least 3 months on issues of scale,
file handling, and feature extraction.
Martin Kircher,
Genome Sciences
Why?
3k NSF postdocs in 2010
$50k / postdoc
at least 50% overhead
maybe $75M annually
at NSF alone?
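One way to read the back-of-envelope estimate above is sketched below. It assumes that "50% overhead" means the share of a postdoc's time that goes to data handling rather than science; the slide does not spell this out, so treat the reading as an assumption.

```python
# Back-of-envelope estimate using the figures on the slide.
# Assumption: "50% overhead" is the fraction of time lost to data handling.
postdocs = 3_000            # NSF postdocs in 2010
cost_per_postdoc = 50_000   # US dollars per postdoc per year
overhead_fraction = 0.5     # "at least 50% overhead"

annual_overhead_cost = postdocs * cost_per_postdoc * overhead_fraction
print(f"${annual_overhead_cost / 1e6:.0f}M per year")
```

Under that reading the product matches the slide's "maybe $75M annually".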
Where does the time go?
What we’re seeing…
• Researchers artificially scale down the problem to
fit their tools and skills
• …or they become dependent on an engineer to
get any work done
• …or researchers morph into engineers
• A better approach:
– Raise the level of abstraction
– Hide complexity without sacrificing performance
– Provide powerful tools; don’t oversimplify
Why are databases under-used?
• Not a perfect fit – it’s hard to design a “permanent”
database for a fast-moving research target
– A schema/ontology/standard is some shared consensus about a
model of the world
– Does not exist at the frontier of research, by definition!
• Relatively few queries per dataset – no opportunity to
amortize the cost of installation, design, and loading
• No evidence for conventional explanations
– scientific data is non-relational
– scientists won’t learn SQL
“As each need is satisfied, the
next higher level in the hierarchy
dominates conscious functioning.”
-- Maslow 43
Maslow’s Needs Hierarchy
A “Needs Hierarchy” of Science Data Management
storage
sharing
semantic integration
query
analytics
QUERY-AS-A-SERVICE
2010 - present
1) Upload data “as is”
Cloud-hosted, secure; no
need to install or design a
database; no pre-defined
schema; schema inference;
some integration
2) Write Queries
Right in your browser,
writing views on top of
views on top of views ...
SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC
3) Share the results
Make them public, tag them,
share with specific colleagues –
anyone with access can query
http://sqlshare.escience.washington.edu
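A minimal sketch of the three steps above, using Python's built-in sqlite3 module as a local stand-in for the hosted service. The table name tigrfam_surface comes from the slide; the sample rows, the "query" column, and the view name hit_counts are made up for illustration, and nothing here talks to SQLShare itself.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for an uploaded file.
RAW = "query,hit\nq1,TIGR00001\nq2,TIGR00002\nq3,TIGR00001\n"

conn = sqlite3.connect(":memory:")

# 1) "Upload as is": take column names from the header row,
#    with no hand-written schema or column types.
rows = list(csv.reader(io.StringIO(RAW)))
header, data = rows[0], rows[1:]
cols = ", ".join(f'"{name}"' for name in header)
conn.execute(f"CREATE TABLE tigrfam_surface ({cols})")
marks = ", ".join("?" for _ in header)
conn.executemany(f"INSERT INTO tigrfam_surface VALUES ({marks})", data)

# 2) Save a query as a view, so later queries can build on it.
conn.execute(
    "CREATE VIEW hit_counts AS "
    "SELECT hit, COUNT(*) AS cnt FROM tigrfam_surface GROUP BY hit"
)

# 3) Anyone with access to the view can query it like a table.
result = conn.execute(
    "SELECT hit, cnt FROM hit_counts ORDER BY cnt DESC, hit"
).fetchall()
print(result)
```

The view plays the role of a saved dataset: the final query never touches the raw table directly.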
Key features
• Relaxed Schemas
– No CREATE TABLE statements
– Infer a best-effort schema on upload; upload should always succeed
– Mismatches in number of columns, unnamed columns, mixed-type
columns are all tolerated
• Views as first-class citizens
– Everything is a “dataset”: No distinction between views and tables
– Views created as side-effects of saving queries
– Cleaning, integration,
• Full SQL
– Nested queries, window functions, theta joins, etc.
• SaaS
– In-browser analytics
– REST API, custom clients
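The relaxed-schema idea can be sketched in a few lines. This is an illustrative guess at what best-effort inference might look like, not SQLShare's actual algorithm. The function names, the int/float/text fallback order, and the sample rows are assumptions; the col0, col1, ... naming for unnamed columns follows the col0 that appears in the talk's own queries.

```python
def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    def fits(cast):
        for v in values:
            if v == "":
                continue  # treat empty cells as missing, not as a mismatch
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if fits(int):
        return "int"
    if fits(float):
        return "float"
    return "text"  # mixed-type columns fall back to text


def infer_schema(rows, header=None):
    """Best-effort schema: pad ragged rows, name unnamed columns colN."""
    width = max(len(r) for r in rows)
    header = list(header or [])
    names = [
        header[i] if i < len(header) and header[i] else f"col{i}"
        for i in range(width)
    ]
    padded = [list(r) + [""] * (width - len(r)) for r in rows]
    types = [infer_type([r[i] for r in padded]) for i in range(width)]
    return list(zip(names, types)), padded


# Hypothetical ragged input: differing column counts and a mixed-type column.
rows = [
    ["chr_4", "4500"],
    ["chr_9", "4211", "COG4547", "2.00E-04"],
    ["chr_11", "n/a", "COG5099"],
]
schema, padded = infer_schema(rows, header=["query", "length"])
print(schema)
```

The design choice is that nothing raises on messy input: short rows are padded and a column that mixes numbers with text is simply typed as text, so the upload can succeed and cleanup can happen later in queries.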
SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
, x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
, w.category as nc_category
, CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
THEN x.end_bp - x.start_bp + 1
WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
THEN x.end_bp - w.start_bp + 1
WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
THEN w.end_bp - x.start_bp + 1
END AS len_overlap
FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC
Non-programmers can write very complex queries
(rather than relying on staff programmers)
Example: Computing the overlaps of two sets of blast results
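The CASE expression in the query above enumerates the ways two regions can overlap. A common compact formulation, sketched here in Python with a hypothetical helper name, takes the later start and the earlier end. It is offered as an alternative way to think about the computation, not as the query author's method, and it also handles the case where one region fully contains the other.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of positions shared by two closed intervals (0 if disjoint)."""
    start = max(a_start, b_start)  # overlap begins at the later start
    end = min(a_end, b_end)        # and stops at the earlier end
    return max(0, end - start + 1)


partial = overlap_length(100, 200, 150, 300)    # intervals overlap at one end
contained = overlap_length(100, 200, 120, 180)  # second lies inside the first
disjoint = overlap_length(100, 200, 300, 400)   # no shared positions
print(partial, contained, disjoint)
```

In SQL the same idea would need a greatest/least style comparison per row, which is why hand-written queries often spell it out as CASE branches.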
We see thousands of
queries written by
non-programmers
Find all TIGRFam ids (proteins) that are missing from at least
one of three samples (relations)
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
EXCEPT
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
Howe, et al., CISE 2013
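The set logic of this query can be checked on toy data. The sketch below uses Python sets with made-up ids in place of the three relations: the union is every id seen anywhere, the intersection is every id present in all three, and the difference is what is missing from at least one. It mirrors the query under the standard SQL rule that INTERSECT binds more tightly than UNION and EXCEPT; not every engine follows that rule, so parenthesizing the two halves is the safer way to write it.

```python
# Hypothetical ids standing in for the col0 values of each relation.
refseq = {"TIGR00001", "TIGR00002", "TIGR00003"}
est = {"TIGR00001", "TIGR00003", "TIGR00004"}
combo = {"TIGR00001", "TIGR00002", "TIGR00004"}

seen_anywhere = refseq | est | combo      # UNION of the three samples
in_all_three = refseq & est & combo       # INTERSECT of the three samples
missing_somewhere = seen_anywhere - in_all_three  # EXCEPT
print(sorted(missing_somewhere))
```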
Steven Roberts
SQL as a lab notebook:
http://bit.ly/16Xj2JP
[Figure: two versions of a methylation workflow. Inputs: GFF of methylated CG locations, GFF of all CG locations, GFF of all genes, gene descriptions. First version: a chain of Join, Reorder columns, Count, Compute, Trim, and Excel steps that calculates # methylated CGs, # all CGs, and the methylation ratio, then links methylation with gene description; annotated “misstep: join w/ wrong fill”. Second version: calculate # methylated CGs, calculate # all CGs, calculate methylation ratio and link with gene description.]
Popular service for
Bioinformatics Workflows
Halperin, Howe, et al. SSDBM 2013
100-1000 queries,
10-100 tables
roughly equal
number of tables
and queries
Few tables,
20-60 queries
A SQL “learner”
“Data Processor” User Archetype
[Figure: queries and tables over time, Feb 2013 to Mar 2014: 1000 queries, 250 tables]
“Analyzer” Archetype
[Figure: queries and tables over time, Mar 2011 to Mar 2014: 900 queries, 45 tables]
Another observation about
data-intensive science
There is no one-size-fits-all solution
We need to make it easy to use multiple
systems transparently
• SIGMOD 2009: Vertica 100x < Hadoop (Grep, Aggregation, Join)
• VLDB 2010: HaLoop ~100x < Hadoop (PageRank, ShortestPath)
• SIGMOD 2010: Pregel (no comparisons)
• HotCloud 2010: Spark ~100x < Hadoop (logistic regression, ALS)
• ICDE 2011: SystemML ~100x < Hadoop
• ICDE 2011: Asterix ~100x < Hadoop (K-Means)
• VLDB 2012: GraphLab ~100x < Hadoop, GraphLab 5x > Pregel, GraphLab ~
MPI (Recommendation/ALS, CoSeq/GMM, NER)
• NSDI 2012: Spark 20x < Hadoop (logistic regression, PageRank)
• VLDB 2012: Asterix (no comparisons)
• SIGMOD 2013: Cumulon 5x < SystemML
• VLDB 2013: Giraph ~ GraphLab (Connected Components)
• SIGMOD 2014: SimSQL vs. Spark vs. GraphLab vs. Giraph (GMM,
bayesian regression, HMM, LDA, imputation)
• VLDB 2014: epiC ~ Impala, epiC ~ Shark, epiC 2x < Hadoop (Grep, Sort,
TPC-H Q3, PageRank)
Lots of work on raising the level of abstraction for big
data without sacrificing performance…
[Figure: big data systems by year, 2008 to 2014, with performance comparisons between them. Systems shown: Hadoop, Pregel (Malewicz), HaLoop (Bu), Spark (Zaharia), Vertica (Pavlo), SystemML (Ghoting), Hyracks (Borkar), GraphLab (Low), Cumulon (Huang), Giraph (Tian), Dremel (Melnik), SimSQL (Cai), epiC (Jiang), Impala (Cloudera), Shark (Xin), HIVE (Thusoo). Legend: ~100x faster; faster; comparable or inconclusive. Annotation: “The good old days”.]
• …to facilitate experiments to determine who performs
well at what
• …to insulate users from heterogeneity in data models,
platforms, algorithmic details
• …to allow multiple systems to co-exist and be used
optimally in production
We need big data middleware
Three components:
• MyriaWeb: Analytics-as-a-service: Editor, profiler,
data catalog, logging, all from your browser
• RACO: Optimizing middleware for big data systems –
write once, run anywhere: Hadoop, Spark, MyriaX,
straight C, RDBMS, PGAS, MPI, …
• MyriaExec: An iterative big data execution engine.
Complex analytics at scale, with the DNA of a
database system
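To make "write once, run anywhere" concrete, here is a toy sketch of the idea: one logical plan, two back ends. It is not RACO's API. The Scan, Select, and Project classes, both back-end functions, and the sample table are hypothetical, and a real optimizer would do far more than translate operators one by one.

```python
from dataclasses import dataclass


@dataclass
class Scan:
    table: str


@dataclass
class Select:
    child: object
    column: str
    value: object


@dataclass
class Project:
    child: object
    columns: tuple


def to_sql(plan):
    """Back end 1: compile the plan to a SQL string."""
    if isinstance(plan, Scan):
        return f"SELECT * FROM {plan.table}"
    if isinstance(plan, Select):
        inner = to_sql(plan.child)
        # repr() is a shortcut for quoting literals; not safe for untrusted input.
        return f"SELECT * FROM ({inner}) AS t WHERE {plan.column} = {plan.value!r}"
    if isinstance(plan, Project):
        cols = ", ".join(plan.columns)
        return f"SELECT {cols} FROM ({to_sql(plan.child)}) AS t"
    raise TypeError(f"unknown operator: {plan!r}")


def run_local(plan, tables):
    """Back end 2: interpret the same plan over in-memory lists of dicts."""
    if isinstance(plan, Scan):
        return list(tables[plan.table])
    if isinstance(plan, Select):
        rows = run_local(plan.child, tables)
        return [r for r in rows if r[plan.column] == plan.value]
    if isinstance(plan, Project):
        rows = run_local(plan.child, tables)
        return [{c: r[c] for c in plan.columns} for r in rows]
    raise TypeError(f"unknown operator: {plan!r}")


# The plan is written once and handed to either back end.
plan = Project(Select(Scan("hits"), "cog", "COG5099"), ("query",))
tables = {"hits": [
    {"query": "chr_26.378", "cog": "COG5099"},
    {"query": "chr_9.503", "cog": "COG4547"},
]}
sql_text = to_sql(plan)
local_rows = run_local(plan, tables)
print(sql_text)
print(local_rows)
```

Keeping the plan separate from its execution is what lets a middleware layer pick a back end per query, or run the same query on several systems to compare them.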
Big Data Middleware;
Analytics-as-a-Service
http://myria.cs.washington.edu
[Figure: Myria architecture. Multiple GUIs/apps; multiple languages (SQL, Datalog, MyriaL, SPARQL); a compiler for each of multiple big data systems (Hadoop, Spark, Serial C++, PGAS/HPC, MyriaX, RDBMS).]
Compiling the Myria algebra to bare metal PGAS programs
RADISH
ICDE 15
Brandon Myers
1% selection microbenchmark, 20GB
Avoid long code paths
Q2 SP2Bench, 100M triples, multiple self-joins
Communication optimization
Graph Patterns
• SP2Bench, 100 million triples
• Queries compiled to a PGAS C++ language layer, then
compiled again by a low-level PGAS compiler
• One of Myria’s supported back ends
• Comparison with Shark/Spark, which itself has been shown to
be 100X faster than Hadoop-based systems
• …plus PageRank, Naïve Bayes, and more
RADISH
ICDE 15
Performance / Productivity gap
Productivity
Hand-coded
Query
Third observation about
data-intensive science
Attention span, not cycles, is the limited resource
What is the rate-limiting step in data understanding?
[Figure: growth over time of the amount of data in the world and of processing power (Moore’s Law), compared with human cognitive capacity.]
Idea adapted from “Less is More” by Bill Buxton (2001)
slide src: Cecilia Aragon, UW HCDE
Only two performance thresholds matter
[Figure: science productivity vs. computational performance (months, weeks, days, hours, minutes, seconds, milliseconds), marking a feasibility threshold and an interactivity threshold; labels: HPC, Systems, Databases.]
Lowering barrier to entry
Exposing Performance Issues
Dominik Moritz
EuroSys 15
Source worker
Destination worker
Kanit "Ham" Wongsuphasawat
Voyager: Visualization Recommendation
InfoVis 15
Seung-Hee Bae
Scalable Graph Clustering
Version 1
Parallelize Best-known
Serial Algorithm
ICDM 2013
Version 2
Free 30% improvement
for any algorithm
TKDD 2014 SC 2015
Version 3
Distributed approx.
algorithm, 1.5B edges
Viziometrics: Analysis of Visualization
in the Scientific Literature
Proportion of
non-quantitative
figures in paper
Paper impact, grouped into 5% percentiles
Poshen Lee
http://escience.washington.edu
http://myria.cs.washington.edu
http://uwescience.github.io/sqlshare/

"ML in Production",Oleksandr BaganFwdays
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Recently uploaded (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

Democratizing Data-Intensive Science with Query-as-a-Service

  • 1. Democratizing Data-Intensive Science Bill Howe Associate Director, eScience Institute Affiliate Associate Professor Computer Science & Engineering
  • 2. Today • The UW eScience Institute • Three Observations about Data-Intensive Science 6/23/2015 Bill Howe, UW 2/57
  • 3. The Fourth Paradigm 1. Empirical + experimental 2. Theoretical 3. Computational 4. Data-Intensive Jim Gray
  • 4. “All across our campus, the process of discovery will increasingly rely on researchers’ ability to extract knowledge from vast amounts of data… In order to remain at the forefront, UW must be a leader in advancing these techniques and technologies, and in making [them] accessible to researchers in the broadest imaginable range of fields.” 2005-2008 In other words: • Data-intensive science will be ubiquitous • It’s about intellectual infrastructure and software infrastructure, not only computational infrastructure http://escience.washington.edu
  • 5. A 5-year, US $37.8 million cross-institutional collaboration to create a data science environment (2014)
  • 6. Data Science Kickoff Session: 137 posters from 30+ departments and units
  • 7. Establish a virtuous cycle • 6 working groups, each with • 3-6 faculty from each institution
  • 8. UW Data Science Education Efforts. Audiences: Students (CS/Informatics undergrads and grads; Non-Major undergrads and grads) and Non-Students (professionals; researchers). Programs: UWEO Data Science Certificate; MOOC Intro to Data Science; IGERT: Big Data PhD Track; New CS Courses; Bootcamps and workshops; Intro to Data Programming; Data Science Masters; Incubator: hands-on training
  • 9. Key Activity: Democratize Access to Big Data and Big Data Infrastructure • SQLShare: High Variety Database-as-a-Service for science • Myria: Scalable Analytics-as-a-Service, Federated Big Data Environ. • Other projects in visualization, scalable graph algorithms, bibliometrics
  • 10. An observation about data-intensive science… We (CS people) design systems for “people like us”. To a first approximation, no one in science is using state-of-the-art data systems. We should study technology delivery with the same rigor as technology development.
  • 11. A Typical Data Science Workflow: 1) Preparing to run a model 2) Running the model 3) Interpreting the results. Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging. “80% of the work” -- Aaron Kimball. “The other 80% of the work”
  • 12. Where do you store your data? src: Conversations with Research Leaders (2008); src: Faculty Technology Survey (2011). My computer 87%; External device (hard drive, thumb drive) 66%; Department-managed server 41%; Server managed by research group 27%; External (non-UW) data center 12%; Department-managed data center 6%; Other 5%. Lewis et al 2011
  • 13. How much data do you work with? Wright 2013 Not everything is petabytes!
  • 14. How much time do you spend “handling data” as opposed to “doing science”? Mode answer: “90%”
  • 15. Simple Example. Where does the time go? (1)

      ###query                   length  COG hit #1  e-value #1  identity #1  score #1  hit length #1  description #1
      chr_4[480001-580000].287   4500
      chr_4[560001-660000].1     3556
      chr_9[400001-500000].503   4211    COG4547     2.00E-04    19           44.6      620            Cobalamin biosynthesis protein
      chr_9[320001-420000].548   2833    COG5406     2.00E-04    38           43.9      1001           Nucleosome binding factor SPN
      chr_27[320001-404298].20   3991    COG4547     5.00E-05    18           46.2      620            Cobalamin biosynthesis protein
      chr_26[320001-420000].378  3963    COG5099     5.00E-05    17           46.2      777            RNA-binding protein of the Puf
      chr_26[400001-441226].196  2949    COG5099     2.00E-04    17           43.9      777            RNA-binding protein of the Puf
      chr_24[160001-260000].65   3542
      chr_5[720001-820000].339   3141    COG5099     4.00E-09    20           59.3      777            RNA-binding protein of the Puf
      chr_9[160001-260000].243   3002    COG5077     1.00E-25    26           114       1089           Ubiquitin carboxyl-terminal hyd
      chr_12[720001-820000].86   2895    COG5032     2.00E-09    30           60.5      2105           Phosphatidylinositol kinase and
      chr_12[800001-900000].109  1463    COG5032     1.00E-09    30           60.1      2105           Phosphatidylinositol kinase and
      chr_11[1-100000].70        2886
      chr_11[80001-180000].100   1523

      ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome

      id    query             hit      e_value   identity_  score  query_start  query_end  hit_start  hit_end  hit_length
      1     FHJ7DRN01A0TND.1  COG0414  1.00E-08  28         51     1            74         180        257      285
      2     FHJ7DRN01A1AD2.2  COG0092  3.00E-20  47         89.9   6            85         41         120      233
      3     FHJ7DRN01A2HWZ.4  COG3889  0.0006    26         35.8   9            94         758        845      872
      …
      2853  FHJ7DRN02HXTBY.5  COG5077  7.00E-09  37         52.3   3            77         313        388      1089
      2854  FHJ7DRN02HZO4J.2  COG0444  2.00E-31  67         127    1            73         135        207      316
      …
      3566  FHJ7DRN02FUJW3.1  COG5032  1.00E-09  32         54.7   1            75         1965       2038     2105
      …

      COGAnnotation_coastal_sample.txt
  • 16. Where does the time go? From an email: “[This was hard] due to the large amount of data (e.g. data indexes for data retrieval, dissection into data blocks and processing steps, order in which steps are performed to match memory/time requirements, file formats required by software used). In addition we actually spend quite some time in iterations fixing problems with certain features (e.g. capping ENCODE data), testing features and feature products to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs human-derived variants). So roughly 50% of the project was testing and improving the model, 30% figuring out how to do things (engineering) and 20% getting files and getting them into the right format. I guess in total [I spent] 6 months [on this project].” At least 3 months on issues of scale, file handling, and feature extraction. -- Martin Kircher, Genome Sciences. Why? 3k NSF postdocs in 2010; $50k / postdoc; at least 50% overhead; maybe $75M annually at NSF alone?
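
One reading that reproduces the slide's $75M figure is sketched below. It assumes “50% overhead” means the share of a postdoc's time that goes to data handling (3 of 6 months in the email above); that interpretation and the variable names are assumptions, not something the slide states.

```python
# Back-of-envelope estimate using the slide's figures.
postdocs = 3_000            # NSF postdocs in 2010 (from the slide)
cost_per_postdoc = 50_000   # USD per postdoc per year (from the slide)
overhead_fraction = 0.5     # ASSUMPTION: share of time spent handling data

annual_cost = postdocs * cost_per_postdoc * overhead_fraction
print(f"${annual_cost / 1e6:.0f}M per year")  # $75M per year
```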
  • 17. What we’re seeing… • Researchers artificially scale down the problem to fit their tools and skills • …or they become dependent on an engineer to get any work done • …or researchers morph into engineers • A better approach: – Raise the level of abstraction – Hide complexity without sacrificing performance – Provide powerful tools; don’t oversimplify
  • 18. Why are databases under-used? • Not a perfect fit – it’s hard to design a “permanent” database for a fast-moving research target – A schema/ontology/standard is some shared consensus about a model of the world – Does not exist at the frontier of research, by definition! • Relatively few queries per dataset – no opportunity to amortize the cost of installation, design, and loading • No evidence for conventional explanations – scientific data is non-relational – scientists won’t learn SQL
  • 19. Maslow’s Needs Hierarchy: “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow 43
  • 20. A “Needs Hierarchy” of Science Data Management: storage, sharing, semantic integration, query, analytics. “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow 43
  • 22. 1) Upload data “as is”: Cloud-hosted, secure; no need to install or design a database; no pre-defined schema; schema inference; some integration. 2) Write Queries: Right in your browser, writing views on top of views on top of views ... SELECT hit, COUNT(*) AS cnt FROM tigrfam_surface GROUP BY hit ORDER BY cnt DESC 3) Share the results: Make them public, tag them, share with specific colleagues – anyone with access can query. http://sqlshare.escience.washington.edu
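
The “views on top of views” workflow from step 2 can be sketched locally. This is a minimal illustration that uses Python's built-in `sqlite3` as a stand-in for the hosted service; the sample rows and the view names `hit_counts` and `frequent_hits` are hypothetical. Note that the count has to be aliased as `cnt` for `ORDER BY cnt` to resolve.

```python
import sqlite3

# In-memory SQLite stands in for the hosted database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tigrfam_surface (hit TEXT)")
conn.executemany(
    "INSERT INTO tigrfam_surface VALUES (?)",
    [("TIGR00001",), ("TIGR00002",), ("TIGR00001",),
     ("TIGR00003",), ("TIGR00001",), ("TIGR00002",)],  # made-up sample hits
)

# First view: the slide's query, saved under a name.
conn.execute(
    "CREATE VIEW hit_counts AS "
    "SELECT hit, COUNT(*) AS cnt FROM tigrfam_surface GROUP BY hit"
)
# Second view, defined on top of the first one.
conn.execute(
    "CREATE VIEW frequent_hits AS SELECT hit, cnt FROM hit_counts WHERE cnt >= 2"
)

rows = conn.execute("SELECT hit, cnt FROM hit_counts ORDER BY cnt DESC").fetchall()
frequent = conn.execute("SELECT hit, cnt FROM frequent_hits ORDER BY cnt DESC").fetchall()
print(rows)      # [('TIGR00001', 3), ('TIGR00002', 2), ('TIGR00003', 1)]
print(frequent)  # [('TIGR00001', 3), ('TIGR00002', 2)]
```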
  • 23. Key features • Relaxed Schemas – No CREATE TABLE statements – Infer a best-effort schema on upload; upload should always succeed – Mismatches in number of columns, unnamed columns, mixed-type columns are all tolerated • Views as first-class citizens – Everything is a “dataset”: No distinction between views and tables – Views created as side-effects of saving queries – Cleaning, integration, • Full SQL – Nested queries, window functions, theta joins, etc. • SaaS – In-browser analytics – REST API, custom clients
  • 24. http://sqlshare.escience.washington.edu
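
The “relaxed schema” idea on slide 23 can be illustrated with a small sketch. This is not SQLShare's actual inference code: the function names, the `colN` naming for unnamed columns, and the INTEGER/REAL/TEXT type ladder are assumptions chosen for illustration. The point is that ragged rows, unnamed columns, and mixed-type columns degrade gracefully instead of failing the upload.

```python
import csv
import io

def infer_type(values):
    """Narrowest of INTEGER, REAL, TEXT that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "TEXT"  # no evidence, so fall back to the most permissive type
    for type_name, cast in (("INTEGER", int), ("REAL", float)):
        try:
            for v in non_empty:
                cast(v)
            return type_name
        except ValueError:
            continue  # a value did not parse; try the next, wider type
    return "TEXT"

def infer_schema(text, has_header=True):
    """Best-effort (name, type) schema plus rows padded to a common width."""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    header = rows.pop(0) if has_header and rows else []
    width = max([len(header)] + [len(r) for r in rows])
    # Missing or blank column names become col0, col1, ...
    names = [
        header[i] if i < len(header) and header[i].strip() else f"col{i}"
        for i in range(width)
    ]
    padded = [r + [""] * (width - len(r)) for r in rows]  # tolerate short rows
    types = [infer_type([r[i] for r in padded]) for i in range(width)]
    return list(zip(names, types)), padded

sample = (
    "query,length,\n"
    "chr_4.287,4500,COG4547,2.00E-04\n"   # more columns than the header
    "chr_4.1,3556\n"                      # fewer columns than the header
    "chr_9.503,n/a,COG5406\n"             # mixed-type 'length' column
)
schema, padded = infer_schema(sample)
print(schema)
```

Here `length` ends up as TEXT because one value is `n/a`, and the unnamed third and fourth columns are kept as `col2` and `col3` rather than being rejected.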
  • 25. Non-programmers can write very complex queries (rather than relying on staff programmers). We see thousands of queries written by non-programmers. Example: Computing the overlaps of two sets of blast results: SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp , w.category as nc_category , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1 WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1 WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) THEN w.end_bp - x.start_bp + 1 END AS len_overlap FROM [koesterj@washington.edu].[hotspots_deserts.tab] x INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w ON x.chr = w.chr WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) ORDER BY x.strain, x.chr ASC, x.start_bp ASC
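
The interval-overlap logic behind the slide 25 query can be sketched in a few lines. This version swaps the slide's CASE expression for the common min/max formulation, which also covers the case where one interval lies strictly inside the other; the function names and sample coordinates are hypothetical.

```python
from collections import defaultdict

def overlap_length(a_start, a_end, b_start, b_end):
    """Overlap in base pairs of two closed intervals; 0 if they are disjoint."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)

def overlaps(snps, noncoding):
    """Join two lists of (chr, start_bp, end_bp) on chr and keep overlapping pairs."""
    by_chr = defaultdict(list)
    for chrom, start, end in noncoding:
        by_chr[chrom].append((start, end))
    result = []
    for chrom, s_start, s_end in snps:
        for n_start, n_end in by_chr[chrom]:
            length = overlap_length(s_start, s_end, n_start, n_end)
            if length > 0:
                result.append((chrom, s_start, s_end, n_start, n_end, length))
    return result

# Made-up regions for illustration.
snps = [("chr1", 100, 200), ("chr2", 50, 60)]
noncoding = [("chr1", 150, 300), ("chr1", 120, 130),
             ("chr1", 400, 500), ("chr3", 50, 60)]
pairs = overlaps(snps, noncoding)
print(pairs)
# [('chr1', 100, 200, 150, 300, 51), ('chr1', 100, 200, 120, 130, 11)]
```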
  • 26. Find all TIGRFam ids (proteins) that are missing from at least one of three samples (relations) SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs] UNION SELECT col0 FROM [est_hma_fasta_TGIRfam_refs] UNION SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs] EXCEPT SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs] INTERSECT SELECT col0 FROM [est_hma_fasta_TGIRfam_refs] INTERSECT SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
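
The set logic of the slide 26 query reads as “in any sample, minus in all samples”. The SQL depends on INTERSECT binding more tightly than UNION and EXCEPT, which is the case in standard SQL and SQL Server but not in every engine. A minimal sketch with Python sets makes the grouping explicit; the identifiers below are made up.

```python
# Hypothetical TIGRFam ids found in each of the three samples.
refseq = {"TIGR00001", "TIGR00002", "TIGR00003"}
est = {"TIGR00001", "TIGR00003", "TIGR00004"}
combo = {"TIGR00001", "TIGR00002"}

in_any = refseq | est | combo   # A UNION B UNION C
in_all = refseq & est & combo   # A INTERSECT B INTERSECT C
missing = in_any - in_all       # EXCEPT: absent from at least one sample

print(sorted(missing))  # ['TIGR00002', 'TIGR00003', 'TIGR00004']
```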
  • 27. Howe, et al., CISE 2013
  • 28. SQL as a lab notebook (Steven Roberts): http://bit.ly/16Xj2JP [Workflow diagram. Inputs: GFF of methylated CG locations; GFF of all genes; GFF of all CG locations; gene descriptions. Steps: calculate # methylated CGs; calculate # all CGs; calculate methylation ratio; link methylation with gene description. Low-level operations shown: Join, Reorder columns, Count, Compute, Trim, Excel. Annotated misstep: join w/ wrong fill. Label: Popular service for Bioinformatics Workflows]
  • 29. Halperin, Howe, et al. SSDBM 2013
  • 30. 100-1000 queries, 10-100 tables; roughly equal number of tables and queries; few tables, 20-60 queries
  • 31. A SQL “learner”
  • 32. “Data Processor” User Archetype: 1000 queries, 250 tables (Feb 2013 to Mar 2014)
  • 33. “Analyzer” Archetype: 900 queries, 45 tables (Mar 2011 to Mar 2014)
  • 34. Another observation about data-intensive science: There is no one-size-fits-all solution. We need to make it easy to use multiple systems transparently.
  • 35. Lots of work on raising the level of abstraction for big data without sacrificing performance… • SIGMOD 2009: Vertica 100x < Hadoop (Grep, Aggregation, Join) • VLDB 2010: HaLoop ~100x < Hadoop (PageRank, ShortestPath) • SIGMOD 2010: Pregel (no comparisons) • HotCloud 2010: Spark ~100x < Hadoop (logistic regression, ALS) • ICDE 2011: SystemML ~100x < Hadoop • ICDE 2011: Asterix ~100x < Hadoop (K-Means) • VLDB 2012: GraphLab ~100x < Hadoop, GraphLab 5x > Pregel, GraphLab ~ MPI (Recommendation/ALS, CoSeq/GMM, NER) • NSDI 2012: Spark 20x < Hadoop (logistic regression, PageRank) • VLDB 2012: Asterix (no comparisons) • SIGMOD 2013: Cumulon 5x < SystemML • VLDB 2013: Giraph ~ GraphLab (Connected Components) • SIGMOD 2014: SimSQL vs. Spark vs. GraphLab vs. Giraph (GMM, Bayesian regression, HMM, LDA, imputation) • VLDB 2014: epiC ~ Impala, epiC ~ Shark, epiC 2x < Hadoop (Grep, Sort, TPC-H Q3, PageRank)
  • 38. We need big data middleware • …to facilitate experiments to determine who performs well at what • …to insulate users from heterogeneity in data models, platforms, algorithmic details • …to allow multiple systems to co-exist and be used optimally in production
  • 39. Big Data Middleware; Analytics-as-a-Service. http://myria.cs.washington.edu Three components: • MyriaWeb: Analytics-as-a-service: Editor, profiler, data catalog, logging, all from your browser • RACO: Optimizing middleware for big data systems – write once, run anywhere: Hadoop, Spark, MyriaX, straight C, RDBMS, PGAS, MPI, … • MyriaExec: An iterative big data execution engine. Complex analytics at scale, with the DNA of a database system
  • 40. [Architecture diagram] multiple GUIs/apps; multiple languages: MyriaL, SQL, Datalog, SPARQL; one compiler per back end; multiple big data systems: Hadoop, Spark, MyriaX, RDBMS, PGAS/HPC, Serial C++
  • 41. Compiling the Myria algebra to bare metal PGAS programs. RADISH, ICDE 15. Brandon Myers
  • 42. 1% selection microbenchmark, 20GB. Avoid long code paths
  • 43. Q2 SP2Bench, 100M triples, multiple self-joins. Communication optimization
  • 44. Graph Patterns • SP2Bench, 100 million triples • Queries compiled to a PGAS C++ language layer, then compiled again by a low-level PGAS compiler • One of Myria’s supported back ends • Comparison with Shark/Spark, which itself has been shown to be 100X faster than Hadoop-based systems • …plus PageRank, Naïve Bayes, and more. RADISH, ICDE 15
  • 45. Performance / Productivity gap Productivity Hand-coded Query
  • 46. Third observation about data-intensive science: Attention span, not cycles, is the limited resource
  • 47. What is the rate-limiting step in data understanding? [Plots over time: processing power (Moore’s Law); amount of data in the world]
  • 48. What is the rate-limiting step in data understanding? [Plot over time: processing power (Moore’s Law); amount of data in the world; human cognitive capacity] Idea adapted from “Less is More” by Bill Buxton (2001). Slide src: Cecilia Aragon, UW HCDE
  • 52. Exposing Performance Issues. Dominik Moritz, EuroSys 15. [Figure axes: source worker, destination worker]
  • 54. Scalable Graph Clustering (Seung-Hee Bae). Version 1: Parallelize best-known serial algorithm. ICDM 2013. Version 2: Free 30% improvement for any algorithm. TKDD 2014. SC 2015. Version 3: Distributed approx. algorithm, 1.5B edges
  • 55. Viziometrics: Analysis of Visualization in the Scientific Literature (Poshen Lee). [Plot axes: proportion of non-quantitative figures in paper; paper impact, grouped into 5% percentiles]