Expressiveness, Simplicity, and Users Craig Chambers Google
A Brief Bio MIT: 82-86 Argus, with Barbara Liskov, Bill Weihl, Mark Day Stanford: 86-91 Self, with David Ungar, Urs Hölzle, … U. of Washington: 91-07 Cecil, MultiJava, ArchJava; Vortex, DyC, Rhodium, ... Jeff Dean, Dave Grove, Jonathan Aldrich, Todd Millstein, Sorin Lerner, … Google: 07- Flume, …
Some Questions What makes an idea successful? Which ideas are adopted most? Which ideas have the most impact?
Outline Some past projects Self language, Self compiler Cecil language, Vortex compiler A current project Flume: data-parallel programming system
Self Language [Ungar & Smith 87] Purified essence of Smalltalk-like languages all data are objects no classes all actions are messages field accesses, control structures Core ideas are very simple widely cited and understood
Self v2 [Chambers, Ungar, Chang 91] Added encapsulation and privacy Added prioritized multiple inheritance supported both ordered and unordered multiple inheritance Sophisticated, or complicated? Unified, or kitchen sink? Not adopted; dropped from Self v3
Self Compiler [Chambers, Ungar 89-91] Dynamic optimizer (an early JIT compiler) Customization: specialize code for each receiver class Class/type dataflow analysis; lots of inlining Lazy compilation of uncommon code paths 89: customization + simple analysis: effective 90: + complicated analysis: more effective but slow 91: + lazy compilation: still more effective, and fast [Hölzle, … 92-94]: + dynamic type feedback: zowie! Simple analysis + type feedback widely adopted
Cecil Language [Chambers, Leavens, Millstein, Litvinov 92-99] Pure objects, pure messages Multimethods, static typechecking encapsulation modules, modular typechecking constraint-based polymorphic type system integrates F-bounded polymorphism and "where" clauses later: MultiJava, EML [Lee], Diesel, … Work on multimethods, "open classes" is well-known Multimethods not widely available
Vortex Compiler [Chambers, Dean, Grove, Lerner, … 94-01] Whole-program optimizer, for Cecil, Java, … Class hierarchy analysis Profile-guided class/type feedback Dataflow analysis, code specialization Interprocedural static class/type analysis Fast context-insensitive [Defouw], context-sensitive Incremental recompilation; composable dataflow analyses Project well-known CHA: my most cited paper; a very simple idea More-sophisticated work less widely adopted
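The "very simple idea" behind class hierarchy analysis (CHA) can be illustrated with a small sketch. This is an illustration under assumptions, not Vortex code: the `Hierarchy` class, its method names, and the Shape/Circle/Square example are all hypothetical. It shows the core test: a message sent to a receiver statically known to be class C (or a subclass) can be bound at compile time when only one method implementation is reachable from C and its subclasses.

```java
import java.util.*;

// Hypothetical sketch of class hierarchy analysis (CHA); not Vortex code.
final class Hierarchy {
    private final Map<String, String> parent = new HashMap<>();          // class -> superclass
    private final Map<String, List<String>> children = new HashMap<>();  // class -> direct subclasses
    private final Map<String, Set<String>> declared = new HashMap<>();   // class -> methods it defines

    void addClass(String name, String superName, String... methods) {
        if (superName != null) {
            parent.put(name, superName);
            children.computeIfAbsent(superName, k -> new ArrayList<>()).add(name);
        }
        declared.put(name, new HashSet<>(Arrays.asList(methods)));
    }

    // The class whose definition of `method` runs for a receiver of exact class `cls`.
    private String implementationFor(String cls, String method) {
        for (String c = cls; c != null; c = parent.get(c)) {
            if (declared.getOrDefault(c, Set.of()).contains(method)) return c;
        }
        return null; // message not understood
    }

    // All implementations that could run for a receiver statically known to be
    // `staticClass` or any of its subclasses.
    Set<String> possibleTargets(String staticClass, String method) {
        Set<String> targets = new TreeSet<>();
        Deque<String> work = new ArrayDeque<>(List.of(staticClass));
        while (!work.isEmpty()) {
            String c = work.pop();
            String impl = implementationFor(c, method);
            if (impl != null) targets.add(impl);
            work.addAll(children.getOrDefault(c, List.of()));
        }
        return targets;
    }

    // A send can be statically bound (and then inlined) when exactly one target exists.
    boolean canBindStatically(String staticClass, String method) {
        return possibleTargets(staticClass, method).size() == 1;
    }

    // Example hierarchy: Shape defines area and name; Circle and Square override area.
    static Hierarchy example() {
        Hierarchy h = new Hierarchy();
        h.addClass("Shape", null, "area", "name");
        h.addClass("Circle", "Shape", "area");
        h.addClass("Square", "Shape", "area");
        return h;
    }
}
```

In the example hierarchy, a send of `name` to a Shape has a single possible target, so it needs no dynamic dispatch, while a send of `area` to a Shape has three. The analysis needs the whole program's class hierarchy, which fits the whole-program setting of the slide.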
Some Other Work DyC [Grant, Philipose, Mock, Eggers 96-00] Dynamic compilation for C ArchJava, AliasJava, … [Aldrich, Notkin 01-04 …] PL support for software architecture Cobalt, Rhodium [Lerner, Millstein 02-05 …] Provably correct compiler optimizations
Trends Simpler ideas easier to adopt Sophisticated ideas need a simple story to be impactful Ideal: "deceptively simple" Unification != Swiss Army Knife Language papers have had more citations; compiler work has had more practical impact The combination can work well
A Current Project: Flume [Chambers, Raniwala, Perry, ... 10] Make data-parallel MapReduce-like pipelines easy to write yet efficient to run
Data-Parallel Programming Analyze & transform large, homogeneous data sets, processing separate elements in parallel Web pages Click logs Purchase records Geographical data sets Census data … Ideal: "embarrassingly parallel" analysis of petabytes of data
Challenges Parallel distributed programming is hard To do: Assign machines Distribute program binaries Partition input data across machines Synchronize jobs, communicate data when needed Monitor jobs Deal with faults in programs, machines, network, … Tune: stragglers, work stealing, … What if user is a domain expert, not a systems/PL expert?
MapReduce [Dean & Ghemawat, 04]
           purchases               queries
  map      item -> co-item         term -> hour+city
  shuffle  item -> all co-items    term -> (hour+city)*
  reduce   item -> recommend       term -> what's hot, when
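The purchases example above can be sketched as three plain functions. This is an in-memory illustration only: the class `MiniMapReduce`, its method names, and the choice of "most frequent co-item" as the recommendation are assumptions, and nothing here is distributed or fault-tolerant.

```java
import java.util.*;
import java.util.stream.*;

// In-memory illustration of the map / shuffle / reduce phases; all names are hypothetical.
final class MiniMapReduce {
    // map: one purchase (a basket of items) -> (item, co-item) pairs
    static List<Map.Entry<String, String>> map(List<String> basket) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String item : basket)
            for (String other : basket)
                if (!item.equals(other)) out.add(Map.entry(item, other));
        return out;
    }

    // shuffle: group all emitted values by key (item -> all co-items)
    static Map<String, List<String>> shuffle(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // reduce: all co-items of one item -> the most frequent co-item
    // (assumed recommendation rule; ties go to the alphabetically first co-item)
    static String reduce(List<String> coItems) {
        Map<String, Long> counts = coItems.stream()
            .collect(Collectors.groupingBy(s -> s, TreeMap::new, Collectors.counting()));
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    // Runs the three phases in sequence over all purchases.
    static Map<String, String> recommend(List<List<String>> purchases) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (List<String> basket : purchases) pairs.addAll(map(basket));
        Map<String, String> result = new TreeMap<>();
        shuffle(pairs).forEach((item, coItems) -> result.put(item, reduce(coItems)));
        return result;
    }
}
```

The user writes only `map` and `reduce`; the shuffle is what the framework supplies, along with the machine assignment, monitoring, and fault handling listed on the Challenges slide.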
MapReduce Greatly eases writing fault-tolerant data-parallel programs Handles many tedious and/or tricky details Has excellent (batch) performance Offers a simple programming model Lots of knobs for tuning Pipelines of MapReduces? Additional details to handle temp files pipeline control Programming model becomes low-level
Flume Ease task of writing data-parallel pipelines Offer high-level data-parallel abstractions, as a Java or C++ library Classes for (possibly huge) immutable collections Methods for data-parallel operations Easily composed to form pipelines Entire pipeline in a single program Automatically optimize and execute pipeline, e.g., via a series of MapReduces Manage lower-level details automatically
Flume Classes and Methods Core data-parallel collection classes: PCollection<T>,  PTable<K,V> Core data-parallel methods: parallelDo(DoFn) groupByKey() combineValues(CombineFn) flatten(...) read(Source), writeTo(Sink), … Derive other methods from these primitives: join(...), count(),  top(CompareFn,N), ...
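How a derived method such as count() can be built from the primitives may be easier to see in a small sketch. The code below is an eager, in-memory stand-in that uses plain Java lists and maps, not the Flume classes. The class `FlumeSketch` and its signatures are assumptions made for illustration; the real library operates on PCollection and PTable and runs lazily and in parallel.

```java
import java.util.*;
import java.util.function.*;

// Eager, in-memory stand-ins for three primitives; hypothetical, not the Flume API.
final class FlumeSketch {
    // parallelDo: apply fn to every element; fn may emit zero or more outputs.
    static <T, R> List<R> parallelDo(List<T> in, Function<T, List<R>> fn) {
        List<R> out = new ArrayList<>();
        for (T x : in) out.addAll(fn.apply(x));
        return out;
    }

    // groupByKey: gather all values that share a key.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> in) {
        Map<K, List<V>> groups = new LinkedHashMap<>();
        for (Map.Entry<K, V> kv : in)
            groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        return groups;
    }

    // combineValues: fold each key's values with an associative combiner.
    static <K, V> Map<K, V> combineValues(Map<K, List<V>> in, BinaryOperator<V> combiner) {
        Map<K, V> out = new LinkedHashMap<>();
        // Groups produced by groupByKey are never empty, so get() is safe here.
        in.forEach((k, vs) -> out.put(k, vs.stream().reduce(combiner).get()));
        return out;
    }

    // Derived operation: count() = parallelDo (x -> (x, 1)) + groupByKey + combineValues (sum).
    static <T> Map<T, Long> count(List<T> in) {
        List<Map.Entry<T, Long>> ones = parallelDo(in, x -> List.of(Map.entry(x, 1L)));
        return combineValues(groupByKey(ones), Long::sum);
    }
}
```

Because count() is written only in terms of the primitives, an optimizer that understands the primitives needs no special knowledge of it.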
Example: TopWords
    PCollection<String> lines = read(TextIO.source("/gfs/corpus/*.txt"));
    PCollection<String> words = lines.parallelDo(new ExtractWordsFn());
    PTable<String, Long> wordCounts = words.count();
    PCollection<Pair<String, Long>> topWords = wordCounts.top(new OrderCountsFn(), 1000);
    PCollection<String> formattedOutput = topWords.parallelDo(new FormatCountFn());
    formattedOutput.writeTo(TextIO.sink("cnts.txt"));
    FlumeJava.run();
Example: TopWords
    read(TextIO.source("/gfs/corpus/*.txt"))
        .parallelDo(new ExtractWordsFn())
        .count()
        .top(new OrderCountsFn(), 1000)
        .parallelDo(new FormatCountFn())
        .writeTo(TextIO.sink("cnts.txt"));
    FlumeJava.run();
Execution Graph Data-parallel primitives (e.g., parallelDo) are “lazy” Don’t actually run right away, but wait until demanded Calls to primitives build an execution graph Nodes are operations to be performed Edges are PCollections that will hold the results An unevaluated result PCollection is a “future” Points to the graph that computes it Derived operations (e.g., count, user code) call lazy primitives and so get inlined away Evaluation is “demanded” by FlumeJava.run() Optimizes, then executes
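The deferred behavior described above can be sketched in a few lines. This is a hypothetical illustration, not the FlumeJava implementation: the class `LazyCollection` and the methods `plan()` and `isMaterialized()` are invented here, and the per-node `run()` stands in for the single `FlumeJava.run()` call, with no optimization step in between.

```java
import java.util.*;
import java.util.function.*;

// Hypothetical sketch of deferred evaluation; not the FlumeJava implementation.
final class LazyCollection<T> {
    private final String op;                  // operation that produces this collection (a graph node)
    private final LazyCollection<?> input;    // edge to the upstream collection; null for a source
    private final Supplier<List<T>> compute;  // deferred work
    private List<T> materialized;             // filled in only when the result is demanded

    private LazyCollection(String op, LazyCollection<?> input, Supplier<List<T>> compute) {
        this.op = op;
        this.input = input;
        this.compute = compute;
    }

    // Source node: nothing is read until run() is called.
    static <T> LazyCollection<T> read(List<T> source) {
        return new LazyCollection<>("read", null, () -> source);
    }

    // Adds a pDo node to the graph and returns a "future" for its result.
    <R> LazyCollection<R> parallelDo(Function<T, R> fn) {
        return new LazyCollection<>("pDo", this, () -> {
            List<R> out = new ArrayList<>();
            for (T e : this.run()) out.add(fn.apply(e));
            return out;
        });
    }

    boolean isMaterialized() { return materialized != null; }

    // Describes the graph without executing any of it.
    String plan() { return input == null ? op : input.plan() + " -> " + op; }

    // Demands the result: evaluates this node and everything upstream, at most once.
    List<T> run() {
        if (materialized == null) materialized = compute.get();
        return materialized;
    }
}
```

Because the whole graph exists before anything executes, a real system has the chance to rewrite it first, which is what the optimizer slides describe.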
Execution Graph [figure: execution graph for TopWords. Node labels, in order: read, pDo, pDo, gbk, cv, pDo, gbk, pDo, pDo, write. Call labels alongside: read(TextIO.source("/…/*.txt")), parallelDo(new ExtractWordsFn()), count(), top(new OrderCountsFn(), 1000), parallelDo(new FormatCountFn()), writeTo(TextIO.sink("cnts.txt"))]
Optimizer Fuse trees of parallelDo operations into one Producer-consumer, co-consumers ("siblings") Eliminate now-unused intermediate PCollections Form MapReduces pDo + gbk + cv + pDo => MapShuffleCombineReduce (MSCR) General: multi-mapper, multi-reducer, multi-output [figure: trees of pDo nodes being fused]
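Producer-consumer fusion of parallelDo operations amounts to function composition. The sketch below is a simplified illustration with hypothetical names (`Fusion`, `unfused`, `fused`) and one-output-per-input functions; it is not the FlumeJava optimizer, which works on the execution graph and also handles sibling fusion and multi-output operations.

```java
import java.util.*;
import java.util.function.*;

// Hypothetical illustration of producer-consumer fusion; not the FlumeJava optimizer.
final class Fusion {
    // Unfused: two passes, with an intermediate collection between them.
    static <A, B, C> List<C> unfused(List<A> in, Function<A, B> f, Function<B, C> g) {
        List<B> intermediate = new ArrayList<>();
        for (A a : in) intermediate.add(f.apply(a));
        List<C> out = new ArrayList<>();
        for (B b : intermediate) out.add(g.apply(b));
        return out;
    }

    // Fused: the two functions are composed, so one pass suffices and the
    // intermediate collection is never created.
    static <A, B, C> List<C> fused(List<A> in, Function<A, B> f, Function<B, C> g) {
        Function<A, C> composed = f.andThen(g);
        List<C> out = new ArrayList<>();
        for (A a : in) out.add(composed.apply(a));
        return out;
    }
}
```

Both versions return the same result. In a pipeline that runs as MapReduces, the saving is larger than one list: an unfused intermediate collection may mean an extra pass over the data or an extra temp file.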
Final Pipeline [figure: the TopWords execution graph before and after fusion; 8 operations become 2 operations (two mscr nodes)]
Executor Runs each optimized MSCR If small data, runs locally, sequentially develop and test in normal IDE If large data, runs remotely, in parallel Handles creating, deleting temp files Supports fast re-execution of incomplete runs Caches, reuses partial pipeline results
Another Example: SiteData [figure: execution graph of pDo, gbk, cv and join() nodes, labeled with the user functions GetPScoreFn, GetVerticalFn, GetDocInfoFn, PickBestFn, MakeDocTraitsFn]
Another Example: SiteData [figure: the SiteData graph before and after optimization; 11 ops become 2 ops (two mscr nodes)]
Experience FlumeJava released to Google users in May 2009 Now: hundreds of pipelines run by hundreds of users every month Real pipelines process megabytes <=> petabytes Users find FlumeJava a lot easier than MapReduce Advanced users can exert control over optimizer and executor if/when necessary But when things go wrong, lower abstraction levels intrude
How Well Does It Work? How does FlumeJava compare in speed to: an equally modular Java MapReduce pipeline? a hand-optimized Java MapReduce pipeline? a hand-optimized Sawzall pipeline? Sawzall: language for logs processing How big are pipelines in practice? How much does the optimizer help?
Performance
Optimizer Impact
Current and Future Work FlumeC++ just released to Google users Auto-tuner Profile executions, choose good settings for tuning MapReduces Other execution substrates than MapReduce Continuous/streaming execution? Dynamic code generation and optimization?
A More Advanced Approach Apply advanced PL ideas to the data-parallel domain A custom language tuned to this domain A sophisticated static optimizer and code generator An integrated parallel run-time system
Lumberjack A language designed for data-parallel programming An implicitly parallel model All collections potentially PCollections All loops potentially parallel Functional Mostly side-effect free Concise lambdas Advanced type system to minimize verbosity
Static Optimizer Decide which collections are PCollections, which loops are parallel loops Interprocedural context-sensitive analysis OO type analysis side-effect analysis inlining dead assignment elimination …
Parallel Run-Time System Similar to Flume’s run-time system Schedules MapReduces Manages temp files Handles faults
Result: Not Successful A new language is a hard sell to most developers Language details obscure key new concepts Hard to be proficient in yet another language with yet another syntax Libraries? Increases risk to their projects Optimizer constrained by limits of static analysis
Response: FlumeJava Replace custom language with Java + Flume library More verbose syntactically
All standard libraries & coding idioms preserved
Much less risk
Easy to try out, easy to like, easy to adopt

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 

Expressiveness, Simplicity and Users

  • 1. Expressiveness, Simplicity, and Users Craig Chambers Google
  • 2. A Brief Bio MIT: 82-86 Argus, with Barbara Liskov, Bill Weihl, Mark Day Stanford: 86-91 Self, with David Ungar, Urs Hölzle, … U. of Washington: 91-07 Cecil, MultiJava, ArchJava; Vortex, DyC, Rhodium, ... Jeff Dean, Dave Grove, Jonathan Aldrich, Todd Millstein, Sorin Lerner, … Google: 07- Flume, …
  • 3. Some Questions What makes an idea successful? Which ideas are adopted most? Which ideas have the most impact?
  • 4. Outline Some past projects Self language, Self compiler Cecil language, Vortex compiler A current project Flume: data-parallel programming system
  • 5. Self Language [Ungar & Smith 87] Purified essence of Smalltalk-like languages all data are objects no classes all actions are messages field accesses, control structures Core ideas are very simple widely cited and understood
  • 6. Self v2 [Chambers, Ungar, Chang 91] Added encapsulation and privacy Added prioritized multiple inheritance supported both ordered and unordered mult. inh. Sophisticated, or complicated? Unified, or kitchen sink? Not adopted; dropped from Self v3
  • 7. Self Compiler [Chambers, Ungar 89-91] Dynamic optimizer (an early JIT compiler) Customization: specialize code for each receiver class Class/type dataflow analysis; lots of inlining Lazy compilation of uncommon code paths 89: customization + simple analysis: effective 90: + complicated analysis: more effective but slow 91: + lazy compilation: still more effective, and fast [Hölzle, … 92-94]: + dynamic type feedback: zowie! Simple analysis + type feedback widely adopted
  • 8. Cecil Language [Chambers, Leavens, Millstein, Litvinov 92-99] Pure objects, pure messages Multimethods, static typechecking encapsulation modules, modular typechecking constraint-based polymorphic type system integrates F-bounded poly. and “where” clauses later: MultiJava, EML [Lee], Diesel, … Work on multimethods, “open classes” is well-known Multimethods not widely available
  • 9. Vortex Compiler [Chambers, Dean, Grove, Lerner, … 94-01] Whole-program optimizer, for Cecil, Java, … Class hierarchy analysis Profile-guided class/type feedback Dataflow analysis, code specialization Interprocedural static class/type analysis Fast context-insensitive [Defouw], context-sensitive Incremental recompilation; composable dataflow analyses Project well-known CHA: my most cited paper; a very simple idea More-sophisticated work less widely adopted
  • 10. Some Other Work DyC [Grant, Philipose, Mock, Eggers 96-00] Dynamic compilation for C ArchJava, AliasJava, … [Aldrich, Notkin 01-04 …] PL support for software architecture Cobalt, Rhodium [Lerner, Millstein 02-05 …] Provably correct compiler optimizations
  • 11. Trends Simpler ideas easier to adopt Sophisticated ideas need a simple story to be impactful Ideal: “deceptively simple” Unification != Swiss Army Knife Language papers have had more citations; compiler work has had more practical impact The combination can work well
  • 12. A Current Project: Flume [Chambers, Raniwala, Perry, ... 10] Make data-parallel MapReduce-like pipelines easy to write yet efficient to run
  • 13. Data-Parallel Programming Analyze & transform large, homogeneous data sets, processing separate elements in parallel Web pages Click logs Purchase records Geographical data sets Census data … Ideal: “embarrassingly parallel” analysis of petabytes of data
  • 14. Challenges Parallel distributed programming is hard To do: Assign machines Distribute program binaries Partition input data across machines Synchronize jobs, communicate data when needed Monitor jobs Deal with faults in programs, machines, network, … Tune: stragglers, work stealing, … What if user is a domain expert, not a systems/PL expert?
  • 15. MapReduce [Dean & Ghemawat, 04] Two example inputs: purchases and queries. map: item -> co-item (purchases); term -> hour+city (queries). shuffle: item -> all co-items; term -> (hour+city)*. reduce: item -> recommend; term -> what’s hot, when
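The three phases on the slide above can be sketched in a few lines of plain Java. This is only an in-memory illustration of the map, shuffle, and reduce steps, not the MapReduce library itself; the class `MiniMapReduce`, its `run` method, and the `termCounts` example are hypothetical names introduced here, and the term-counting job is a simplified stand-in for the slide's "term -> what's hot" query example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical, single-process illustration of the map / shuffle / reduce phases.
class MiniMapReduce {
    static <I, K extends Comparable<K>, V, R> Map<K, R> run(
            List<I> inputs,
            Function<I, List<Map.Entry<K, V>>> mapper,
            Function<List<V>, R> reducer) {
        // Map + shuffle: every emitted (key, value) pair is grouped under its key.
        Map<K, List<V>> shuffled = new TreeMap<>();
        for (I input : inputs) {
            for (Map.Entry<K, V> kv : mapper.apply(input)) {
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
            }
        }
        // Reduce: each key's grouped values are folded into a single result.
        Map<K, R> out = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : shuffled.entrySet()) {
            out.put(group.getKey(), reducer.apply(group.getValue()));
        }
        return out;
    }

    // Assumed example job: count how often each term occurs across queries.
    static Map<String, Integer> termCounts(List<String> queries) {
        return MiniMapReduce.<String, String, Integer, Integer>run(
            queries,
            q -> {
                List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
                for (String term : q.split(" ")) {
                    pairs.add(Map.entry(term, 1)); // map: term -> 1 (Java 9+)
                }
                return pairs;
            },
            ones -> ones.size()); // reduce: number of occurrences
    }
}
```

A real MapReduce distributes each phase across machines and handles faults; the sketch only shows how data flows between the phases.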
  • 16. MapReduce Greatly eases writing fault-tolerant data-parallel programs Handles many tedious and/or tricky details Has excellent (batch) performance Offers a simple programming model Lots of knobs for tuning Pipelines of MapReduces? Additional details to handle temp files pipeline control Programming model becomes low-level
  • 17. Flume Ease task of writing data-parallel pipelines Offer high-level data-parallel abstractions, as a Java or C++ library Classes for (possibly huge) immutable collections Methods for data-parallel operations Easily composed to form pipelines Entire pipeline in a single program Automatically optimize and execute pipeline, e.g., via a series of MapReduces Manage lower-level details automatically
  • 18. Flume Classes and Methods Core data-parallel collection classes: PCollection<T>, PTable<K,V> Core data-parallel methods: parallelDo(DoFn) groupByKey() combineValues(CombineFn) flatten(...) read(Source), writeTo(Sink), … Derive other methods from these primitives: join(...), count(), top(CompareFn,N), ...
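To illustrate how a derived method such as count() might be built from the primitives listed above, here is a small in-memory sketch. It does not use the FlumeJava API: the class `DerivedOps` and its static methods are hypothetical stand-ins that operate on ordinary `List` and `Map` values, run sequentially, and only mirror the names of the primitives.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical in-memory stand-ins for the primitives; not the FlumeJava classes.
class DerivedOps {
    // Stand-in for parallelDo: apply fn to every element (sequentially here).
    static <T, R> List<R> parallelDo(List<T> in, Function<T, R> fn) {
        List<R> out = new ArrayList<>();
        for (T t : in) out.add(fn.apply(t));
        return out;
    }

    // Stand-in for groupByKey: collect all values that share a key.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> groups = new LinkedHashMap<>();
        for (Map.Entry<K, V> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // Stand-in for combineValues: fold each key's values with an associative op.
    static <K, V> Map<K, V> combineValues(Map<K, List<V>> groups, BinaryOperator<V> op) {
        Map<K, V> out = new LinkedHashMap<>();
        for (Map.Entry<K, List<V>> g : groups.entrySet()) {
            V acc = null; // every group has at least one value, so acc ends non-null
            for (V v : g.getValue()) acc = (acc == null) ? v : op.apply(acc, v);
            out.put(g.getKey(), acc);
        }
        return out;
    }

    // count() derived only from the three stand-ins above.
    static <T> Map<T, Long> count(List<T> in) {
        List<Map.Entry<T, Long>> ones = parallelDo(in, t -> Map.entry(t, 1L));
        return combineValues(groupByKey(ones), Long::sum);
    }
}
```

Because count() is expressed entirely in terms of the primitives, anything that optimizes the primitives also optimizes the derived operation.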
  • 19. Example: TopWords PCollection<String> lines = read(TextIO.source("/gfs/corpus/*.txt")); PCollection<String> words = lines.parallelDo(new ExtractWordsFn()); PTable<String, Long> wordCounts = words.count(); PCollection<Pair<String, Long>> topWords = wordCounts.top(new OrderCountsFn(), 1000); PCollection<String> formattedOutput = topWords.parallelDo(new FormatCountFn()); formattedOutput.writeTo(TextIO.sink("cnts.txt")); FlumeJava.run();
  • 20. Example: TopWords read(TextIO.source("/gfs/corpus/*.txt")) .parallelDo(new ExtractWordsFn()) .count() .top(new OrderCountsFn(), 1000) .parallelDo(new FormatCountFn()) .writeTo(TextIO.sink("cnts.txt")); FlumeJava.run();
  • 21. Execution Graph Data-parallel primitives (e.g., parallelDo) are “lazy” Don’t actually run right away, but wait until demanded Calls to primitives build an execution graph Nodes are operations to be performed Edges are PCollections that will hold the results An unevaluated result PCollection is a “future” Points to the graph that computes it Derived operations (e.g., count, user code) call lazy primitives and so get inlined away Evaluation is “demanded” by FlumeJava.run() Optimizes, then executes
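The lazy behaviour described above can be sketched with a tiny deferred collection. This is an illustration under assumptions, not FlumeJava's implementation: `LazyCollection`, `of`, and `materialize` are hypothetical names, evaluation is sequential, and demand is triggered per collection rather than by a global FlumeJava.run().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch of a "future" collection that records how to compute itself.
class LazyCollection<T> {
    private final Supplier<List<T>> compute; // edge back to the producing operation
    private List<T> cached;                  // filled in the first time it is demanded

    private LazyCollection(Supplier<List<T>> compute) {
        this.compute = compute;
    }

    // Source node: wraps data that is already available.
    static <T> LazyCollection<T> of(List<T> data) {
        return new LazyCollection<T>(() -> data);
    }

    // Lazy primitive: builds a new graph node, runs nothing yet.
    <R> LazyCollection<R> parallelDo(Function<T, R> fn) {
        return new LazyCollection<R>(() -> {
            List<R> out = new ArrayList<>();
            for (T t : this.materialize()) out.add(fn.apply(t));
            return out;
        });
    }

    // Demand the result: evaluates upstream nodes on first use, then caches.
    List<T> materialize() {
        if (cached == null) {
            cached = compute.get();
        }
        return cached;
    }
}
```

Calling parallelDo any number of times only grows the graph; the user functions run when materialize() is called on some downstream collection.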
  • 22. Execution Graph [Figure: execution graph for TopWords. Calls shown: read(TextIO.source("/…/*.txt")), parallelDo(new ExtractWordsFn()), count(), top(new OrderCountsFn(), 1000), parallelDo(new FormatCountFn()), writeTo(TextIO.sink("cnts.txt")). Nodes in order: read, pDo, pDo, gbk, cv, pDo, gbk, pDo, pDo, write]
  • 23. Optimizer Fuse trees of parallelDo operations into one Producer-consumer, co-consumers (“siblings”) Eliminate now-unused intermediate PCollections Form MapReduces pDo + gbk + cv + pDo → MapShuffleCombineReduce (MSCR) General: multi-mapper, multi-reducer, multi-output
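Producer-consumer fusion can be pictured as function composition. The sketch below is a simplification under stated assumptions: it uses one-to-one `Function` values where a real DoFn may emit zero or more outputs, and `FusionSketch`, `twoPasses`, and `fused` are hypothetical names rather than parts of the Flume optimizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of fusing two chained parallelDo steps into one.
class FusionSketch {
    // Unfused: two passes, with an intermediate collection between them.
    static <A, B, C> List<C> twoPasses(List<A> in, Function<A, B> f, Function<B, C> g) {
        List<B> intermediate = new ArrayList<>();
        for (A a : in) intermediate.add(f.apply(a));
        List<C> out = new ArrayList<>();
        for (B b : intermediate) out.add(g.apply(b));
        return out;
    }

    // Fused: one pass over the input; the intermediate collection is never built.
    static <A, B, C> List<C> fused(List<A> in, Function<A, B> f, Function<B, C> g) {
        Function<A, C> fg = f.andThen(g); // producer followed by consumer
        List<C> out = new ArrayList<>();
        for (A a : in) out.add(fg.apply(a));
        return out;
    }
}
```

Both methods return the same results; the fused version simply avoids materializing the collection between the two steps, which is the saving the slide refers to as eliminating now-unused intermediate PCollections.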
  • 24. Final Pipeline [Figure: the TopWords execution graph with its nodes grouped into two mscr operations. Labels: Fusion; 8 operations → 2 operations]
  • 25. Executor Runs each optimized MSCR If small data, runs locally, sequentially develop and test in normal IDE If large data, runs remotely, in parallel Handles creating, deleting temp files Supports fast re-execution of incomplete runs Caches, reuses partial pipeline results
  • 26. Another Example: SiteData [Figure: execution graph of pDo, gbk, and cv nodes, including a join(). Labeled functions: GetPScoreFn, GetVerticalFn, GetDocInfoFn, PickBestFn, MakeDocTraitsFn]
  • 27. Another Example: SiteData [Figure: the SiteData graph with its nodes grouped into two mscr operations. Labels: 11 ops → 2 ops]
  • 28. Experience FlumeJava released to Google users in May 2009 Now: hundreds of pipelines run by hundreds of users every month Real pipelines process megabytes <=> petabytes Users find FlumeJava a lot easier than MapReduce Advanced users can exert control over optimizer and executor if/when necessary But when things go wrong, lower abstraction levels intrude
  • 29. How Well Does It Work? How does FlumeJava compare in speed to: an equally modular Java MapReduce pipeline? a hand-optimized Java MapReduce pipeline? a hand-optimized Sawzall pipeline? Sawzall: language for logs processing How big are pipelines in practice? How much does the optimizer help?
  • 32. Current and Future Work FlumeC++ just released to Google users Auto-tuner Profile executions, choose good settings for tuning MapReduces Other execution substrates than MapReduce Continuous/streaming execution? Dynamic code generation and optimization?
  • 33. A More Advanced Approach Apply advanced PL ideas to the data-parallel domain A custom language tuned to this domain A sophisticated static optimizer and code generator An integrated parallel run-time system
  • 34. Lumberjack A language designed for data-parallel programming An implicitly parallel model All collections potentially PCollections All loops potentially parallel Functional Mostly side-effect free Concise lambdas Advanced type system to minimize verbosity
  • 35. Static Optimizer Decide which collections are PCollections, which loops are parallel loops Interprocedural context-sensitive analysis OO type analysis side-effect analysis inlining dead assignment elimination …
  • 36. Parallel Run-Time System Similar to Flume’s run-time system Schedules MapReduces Manages temp files Handles faults
  • 37. Result: Not Successful A new language is a hard sell to most developers Language details obscure key new concepts Hard to be proficient in yet another language with yet another syntax Libraries? Increases risk to their projects Optimizer constrained by limits of static analysis
  • 39. All standard libraries & coding idioms preserved
  • 41. Easy to try out, easy to like, easy to adopt
  • 42. Dynamic optimizer less constrained than static optimizer
  • 45. Conclusions Simpler ideas easier to adopt By researchers and by users Sophisticated ideas still needed, to support simple interfaces Doing things dynamically instead of statically can be liberating