2. Today
• The UW eScience Institute
• Three Observations about Data-Intensive Science
6/23/2015 Bill Howe, UW 2/57
3. The Fourth Paradigm
1. Empirical + experimental
2. Theoretical
3. Computational
4. Data-Intensive
Jim Gray
4. “All across our campus, the process of discovery will
increasingly rely on researchers’ ability to extract
knowledge from vast amounts of data… In order to
remain at the forefront, UW must be a leader in
advancing these techniques and technologies, and in
making [them] accessible to researchers in the
broadest imaginable range of fields.”
2005-2008
In other words:
• Data-intensive science will be ubiquitous
• It’s about intellectual infrastructure and software infrastructure,
not only computational infrastructure
http://escience.washington.edu
5. A 5-year, US $37.8 million cross-institutional
collaboration to create a data science environment
2014
6. Data Science Kickoff Session:
137 posters from 30+ departments and units
7. Establish a virtuous cycle
• 6 working groups, each with 3-6 faculty from each institution
8. UW Data Science Education Efforts
[Audience matrix: students (CS/Informatics and non-major, undergrads and grads) and non-students (professionals and researchers), mapped to the programs below]
UWEO Data Science Certificate
MOOC Intro to Data Science
IGERT: Big Data PhD Track
New CS Courses
Bootcamps and workshops
Intro to Data Programming
Data Science Masters
Incubator: hands-on training
9. Key Activity: Democratize Access to Big Data
and Big Data Infrastructure
• SQLShare: High Variety Database-as-a-Service for science
• Myria: Scalable Analytics-as-a-Service, federated big data environment
• Other projects in visualization, scalable graph algorithms,
bibliometrics
10. An observation about
data-intensive science…
We (CS people) design systems for “people like us”
To a first approximation, no one in science is using
state of the art data systems
We should study technology delivery with the same
rigor as technology development
11. A Typical Data Science Workflow
1) Preparing to run a model
2) Running the model
3) Interpreting the results
Gathering, cleaning, integrating, restructuring,
transforming, loading, filtering, deleting, combining,
merging, verifying, extracting, shaping, massaging
“80% of the work”
-- Aaron Kimball
“The other 80% of the work”
12. Where do you store your data?
src: Conversations with Research Leaders (2008); Faculty Technology Survey (2011); Lewis et al. 2011
• My computer: 87%
• External device (hard drive, thumb drive): 66%
• Department-managed server: 41%
• Server managed by research group: 27%
• External (non-UW) data center: 12%
• Department-managed data center: 6%
• Other: 5%
13. How much data do you work with?
Wright 2013
Not everything
is petabytes!
14. How much time do you spend “handling
data” as opposed to “doing science”?
Mode answer: “90%”
15. Simple Example
###query length COG hit #1 e-value #1 identity #1 score #1 hit length #1 description #1
chr_4[480001-580000].287 4500
chr_4[560001-660000].1 3556
chr_9[400001-500000].503 4211 COG4547 2.00E-04 19 44.6 620 Cobalamin biosynthesis protein
chr_9[320001-420000].548 2833 COG5406 2.00E-04 38 43.9 1001 Nucleosome binding factor SPN
chr_27[320001-404298].20 3991 COG4547 5.00E-05 18 46.2 620 Cobalamin biosynthesis protein
chr_26[320001-420000].378 3963 COG5099 5.00E-05 17 46.2 777 RNA-binding protein of the Puf
chr_26[400001-441226].196 2949 COG5099 2.00E-04 17 43.9 777 RNA-binding protein of the Puf
chr_24[160001-260000].65 3542
chr_5[720001-820000].339 3141 COG5099 4.00E-09 20 59.3 777 RNA-binding protein of the Puf
chr_9[160001-260000].243 3002 COG5077 1.00E-25 26 114 1089 Ubiquitin carboxyl-terminal hyd
chr_12[720001-820000].86 2895 COG5032 2.00E-09 30 60.5 2105 Phosphatidylinositol kinase and
chr_12[800001-900000].109 1463 COG5032 1.00E-09 30 60.1 2105 Phosphatidylinositol kinase and
chr_11[1-100000].70 2886
chr_11[80001-180000].100 1523
ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
id query hit e_value identity_ score query_start query_end hit_start hit_end hit_length
1 FHJ7DRN01A0TND.1 COG0414 1.00E-08 28 51 1 74 180 257 285
2 FHJ7DRN01A1AD2.2 COG0092 3.00E-20 47 89.9 6 85 41 120 233
3 FHJ7DRN01A2HWZ.4 COG3889 0.0006 26 35.8 9 94 758 845 872
…
2853 FHJ7DRN02HXTBY.5 COG5077 7.00E-09 37 52.3 3 77 313 388 1089
2854 FHJ7DRN02HZO4J.2 COG0444 2.00E-31 67 127 1 73 135 207 316
…
3566 FHJ7DRN02FUJW3.1 COG5032 1.00E-09 32 54.7 1 75 1965 2038 2105
…
COGAnnotation_coastal_sample.txt
Where does the time go? (1)
16. From an email:
“[This was hard] due to the large amount of data (e.g. data indexes for data retrieval,
dissection into data blocks and processing steps, order in which steps are performed
to match memory/time requirements, file formats required by software used).
In addition we actually spend quite some time in iterations fixing problems with
certain features (e.g. capping ENCODE data), testing features and feature products
to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs
human-derived variants)
So roughly 50% of the project was testing and improving the model, 30% figuring out
how to do things (engineering) and 20% getting files and getting them into the right
format.
I guess in total [I spent] 6 months [on this project].”
At least 3 months on issues of scale,
file handling, and feature extraction.
Martin Kircher, Genome Sciences
Why does this matter? 3k NSF postdocs in 2010, $50k per postdoc, at least 50% overhead: maybe $75M annually at NSF alone?
Where does the time go?
17. What we’re seeing…
• Researchers artificially scale down the problem to
fit their tools and skills
• …or they become dependent on an engineer to
get any work done
• …or researchers morph into engineers
• A better approach:
– Raise the level of abstraction
– Hide complexity without sacrificing performance
– Provide powerful tools; don’t oversimplify
18. Why are databases under-used?
• Not a perfect fit – it’s hard to design a “permanent”
database for a fast-moving research target
– A schema/ontology/standard is some shared consensus about a
model of the world
– Does not exist at the frontier of research, by definition!
• Relatively few queries per dataset – no opportunity to
amortize the cost of installation, design, and loading
• No evidence for conventional explanations
– scientific data is non-relational
– scientists won’t learn SQL
19. Maslow’s Needs Hierarchy
“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow, 1943
20. A “Needs Hierarchy” of Science Data Management
storage, sharing, semantic integration, query, analytics
22. 1) Upload data “as is”
Cloud-hosted, secure; no need to install or design a database; no pre-defined schema; schema inference; some integration
2) Write Queries
Right in your browser,
writing views on top of
views on top of views ...
SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC
3) Share the results
Make them public, tag them,
share with specific colleagues –
anyone with access can query
http://sqlshare.escience.washington.edu
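The step-2 query can be tried locally; here is a minimal sketch using Python's standard sqlite3 module, with an in-memory stand-in for the tigrfam_surface dataset (the sample rows are invented for illustration):

```python
import sqlite3

# In-memory stand-in for a SQLShare dataset; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tigrfam_surface (hit TEXT)")
conn.executemany("INSERT INTO tigrfam_surface VALUES (?)",
                 [("TIGR00001",), ("TIGR00002",), ("TIGR00001",)])

# Same shape as the slide's query: count occurrences per hit,
# most frequent first.
rows = conn.execute("""
    SELECT hit, COUNT(*) AS cnt
    FROM tigrfam_surface
    GROUP BY hit
    ORDER BY cnt DESC
""").fetchall()
print(rows)
```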
23. Key features
• Relaxed Schemas
– No CREATE TABLE statements
– Infer a best-effort schema on upload; upload should always succeed
– Mismatches in number of columns, unnamed columns, mixed-type
columns are all tolerated
• Views as first-class citizens
– Everything is a “dataset”: No distinction between views and tables
– Views created as side-effects of saving queries
– Cleaning and integration steps are captured as views
• Full SQL
– Nested queries, window functions, theta joins, etc.
• SaaS
– In-browser analytics
– REST API, custom clients
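The relaxed-schema upload can be sketched in Python. This is an illustrative best-effort inference, not SQLShare's actual implementation: columns that parse cleanly get a numeric type, while mixed-type and unnamed columns are tolerated rather than rejected:

```python
def infer_type(values):
    """Best-effort type for one column: INTEGER, REAL, or TEXT fallback."""
    def kind(v):
        for typ, name in ((int, "INTEGER"), (float, "REAL")):
            try:
                typ(v)
                return name
            except ValueError:
                pass
        return "TEXT"
    kinds = {kind(v) for v in values if v != ""}
    if kinds <= {"INTEGER"}:
        return "INTEGER"
    if kinds <= {"INTEGER", "REAL"}:
        return "REAL"
    return "TEXT"  # mixed-type columns are tolerated, not rejected

def infer_schema(header, rows):
    """Pad ragged rows and name missing columns; upload always succeeds."""
    width = max([len(header)] + [len(r) for r in rows])
    names = [header[i] if i < len(header) and header[i] else f"col{i}"
             for i in range(width)]
    cols = [[r[i] if i < len(r) else "" for r in rows] for i in range(width)]
    return list(zip(names, [infer_type(c) for c in cols]))
```

For example, a file with an unnamed middle column and a malformed e-value still yields a usable schema: `infer_schema(["query", "", "e_value"], [["chr_4", "4500", "2e-4"], ["chr_9", "4211", "bad"]])` names the blank column `col1` and falls back to TEXT for the mixed column.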
25. SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
, x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
, w.category as nc_category
, CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
THEN x.end_bp - x.start_bp + 1
WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
THEN x.end_bp - w.start_bp + 1
WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
THEN w.end_bp - x.start_bp + 1
END AS len_overlap
FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC
Non-programmers can write very complex queries (rather than relying on staff programmers)
Example: Computing the overlaps of two sets of BLAST results
We see thousands of queries written by non-programmers
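The CASE expression above computes how many base pairs two inclusive intervals share. A Python transcription of the same branch logic (the function name is ours; coordinates are inclusive on both ends, as in the query):

```python
def overlap_len(x_start, x_end, w_start, w_end):
    """Overlap length of inclusive intervals [x_start, x_end] and
    [w_start, w_end], mirroring the query's CASE branches."""
    if x_start >= w_start and x_end <= w_end:   # x entirely inside w
        return x_end - x_start + 1
    if x_start <= w_start <= x_end:             # x covers w's start
        # (as in the query, the case where w lies inside x also lands here)
        return x_end - w_start + 1
    if x_start <= w_end <= x_end:               # x covers w's end
        return w_end - x_start + 1
    return 0  # no overlap; the query's WHERE clause filters these out
```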
26. Find all TIGRFam ids (proteins) that are missing from at least
one of three samples (relations)
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
EXCEPT
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
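In set terms, the query computes (A ∪ B ∪ C) minus (A ∩ B ∩ C): ids that appear somewhere but not everywhere. Because INTERSECT binds tighter than UNION and EXCEPT in standard SQL, the query parses as that expression. A Python check of the identity on invented sample ids:

```python
# Invented ids standing in for the three TIGRFam result sets.
refseq = {"TIGR001", "TIGR002", "TIGR003"}
est = {"TIGR001", "TIGR003"}
combo = {"TIGR001", "TIGR002", "TIGR004"}

# (A UNION B UNION C) EXCEPT (A INTERSECT B INTERSECT C)
missing_somewhere = (refseq | est | combo) - (refseq & est & combo)
print(sorted(missing_somewhere))
```

Only TIGR001 appears in all three samples, so every other id is reported as missing from at least one.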
28. SQL as a lab notebook: http://bit.ly/16Xj2JP
Steven Roberts
[Workflow diagrams: starting from GFF files of methylated CG locations, all CG locations, and all genes, plus gene descriptions, the analysis counts methylated CGs and all CGs, computes the methylation ratio, and links it with gene descriptions. The Excel version required repeated joins, column reorderings, trims, and a misstep (a join with the wrong fill); the SQLShare version expresses the same steps as a chain of queries.]
Popular service for
Bioinformatics Workflows
34. Another observation about
data-intensive science
There is no one-size-fits-all solution
We need to make it easy to use multiple
systems transparently
38. We need big data middleware
• …to facilitate experiments to determine who performs well at what
• …to insulate users from heterogeneity in data models, platforms, algorithmic details
• …to allow multiple systems to co-exist and be used optimally in production
39. Big Data Middleware; Analytics-as-a-Service
Three components:
• MyriaWeb: Analytics-as-a-service: editor, profiler, data catalog, logging, all from your browser
• RACO: Optimizing middleware for big data systems – write once, run anywhere: Hadoop, Spark, MyriaX, straight C, RDBMS, PGAS, MPI, …
• MyriaExec: An iterative big data execution engine. Complex analytics at scale, with the DNA of a database system
http://myria.cs.washington.edu
41. Compiling the Myria algebra to bare-metal PGAS programs
RADISH [ICDE 2015], Brandon Myers
42. 1% selection microbenchmark, 20 GB
Avoid long code paths
43. Q2 SP2Bench, 100M triples, multiple self-joins
Communication optimization
44. Graph Patterns
• SP2Bench, 100 million triples
• Queries compiled to a PGAS C++ language layer, then
compiled again by a low-level PGAS compiler
• One of Myria’s supported back ends
• Comparison with Shark/Spark, which itself has been shown to
be 100X faster than Hadoop-based systems
• …plus PageRank, Naïve Bayes, and more
RADISH [ICDE 2015]
48. What is the rate-limiting step in data understanding?
[Chart, processing power vs. time: processing power (Moore’s Law) and the amount of data in the world grow over time, while human cognitive capacity stays roughly flat.]
Idea adapted from “Less is More” by Bill Buxton (2001); slide src: Cecilia Aragon, UW HCDE
54. Scalable Graph Clustering
Seung-Hee Bae
• Version 1: Parallelize the best-known serial algorithm [ICDM 2013]
• Version 2: Free 30% improvement for any algorithm [TKDD 2014]
• Version 3: Distributed approximate algorithm, 1.5B edges [SC 2015]
55. Viziometrics: Analysis of Visualization in the Scientific Literature
Poshen Lee
[Chart: proportion of non-quantitative figures per paper vs. paper impact, grouped into 5% percentiles]