Use of standards and related issues in predictive analytics

•

3 gefällt mir•2,013 views

My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html

Technologie

Use of standards and related
issues in predictive analytics
KDD 2016, SF 2016-08-16
Paco Nathan, @pacoid 
Dir, Learning Group @ O’Reilly Media

PMML referenced by 86 publications in Safari, 2001-2016 
https://www.safaribooksonline.com/search/?query=PMML

Pattern: PMML for Cascading and Hadoop 
P Nathan, G Kathalagiri (2013-08-11) 
https://goo.gl/jk7829

Customer
Orders
Classify
Scored
Orders
GroupBy
token
Count
PMML
Model
M R
Failure
Traps
Assert
Confusion
Matrix
Pattern – score a model, using pre-defined Cascading app
cascading.org/projects/pattern

evaluationoptimizationrepresentationcirca 2010
ETL into
cluster/cloud
data
data
visualize,
reporting
Data
Prep
Features
Learners,
Parameters
Unsupervised
Learning
Explore
train set
test set
models
Evaluate
Optimize
Scoring
production
data
use
cases
data pipelines
actionable results
decisions, feedback
bar developers
foo algorithms
Algorithms and developer-centric template thinking
only go so far in real-world workflows…
Results shown in blue, hard problems highlighted in red
Generalized Workflow for ML Use Cases in Big Data

Portable Format for Analytics (PFA)
PFA updates the standards w.r.t. more contemporary issues of
system architectures used for predictive analytics: distributed
processing, in-memory computing, serialization, etc.
http://dmg.org/pfa/docs/motivation/
• much more support for distributed systems
• Avro data types
• forward-looking toward more streaming applications
• fits well with higher layers of abstraction, success of
DSLs, etc.

Tuning Spark Streaming for Throughput
Gerard Maas, Virdata (2014-12-22)
“One Size Fits All” Doesn’t Anymore 
This common architectural pattern requires interchange…

bits.blogs.nytimes.com/2013/06/19/g-e-makes-the-machine-
and-then-uses-sensors-to-listen-to-it/
IoT alters “velocity” and “volume” dramatically 
This growing category of use cases requires interchange…

Lessons from the success of Apache Spark…
interchange is necessary for the ecosystem
major use cases tend to build their own ML libraries – despite a case
where a majority of committers tend to support a common vision and
encourage use of a canonical library (MLLib with DataFrames)
when a successful business grows over time, challenges arise by
definition: managing separated teams, mergers and acquisitions,
increased audits, regulations, etc.
therefore, lack of interchange for analytics represents a serious
technical debt and potential liability

Tungsten Execution
PythonSQL R Streaming
DataFrame
Advanced
Analytics
Physical Execution:
CPU Efficient Data Structures
Keep data closure to CPU cache
Tungsten
Lessons from the success of Apache Spark…
direct use of “compilers” becomes atypical as abstraction layers
become smarter for deferred optimization

What to suggest for existing standards?
microservices: how to compose models + parameters
from multiple/distinct services
support for API definitions in Swaggar http://swagger.io/
consider the benefits of Parquet, e.g., how pushdown
predicates enable better optimization of workflows

What to suggest for existing standards?
additional standards emerging for other aspects of
workflow definition:
Jupyter http://jupyter.org/ 
 
create and share documents that contain live code,
equations, visualizations and explanatory text —  
a network protocol suite, at heart, for distributed REPL
environments, often along with containerization
see usage in Oriole http://oreilly.com/oriole/index.html 
Dat http://dat-data.com/
shares versioned data through a decentralized network

What to suggest for existing standards?
other lingering issues:
• data lineage / provenance
• metadata drift
• public dialog and law: 
https://public.resource.org/about/

presenter:
Just Enough Math
O’Reilly (2014)
justenoughmath.com
monthly newsletter for updates,  
events, conf summaries, etc.:
liber118.com/pxn/

Empfohlen

GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan

Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan

Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan

Microservices, containers, and machine learningPaco Nathan

Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan

Data Science in 2016: Moving UpPaco Nathan

Data Science in Future TensePaco Nathan

How Apache Spark fits into the Big Data landscapePaco Nathan

Empfohlen

GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan

Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan

Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan

Microservices, containers, and machine learningPaco Nathan

Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan

Data Science in 2016: Moving UpPaco Nathan

Data Science in Future TensePaco Nathan

How Apache Spark fits into the Big Data landscapePaco Nathan

Graph Analytics in SparkPaco Nathan

How Apache Spark fits in the Big Data landscapePaco Nathan

A New Year in Data Science: ML UnpausedPaco Nathan

Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan

Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan

Data Science with SparkKrishna Sankar

Gephi, Graphx, and GiraphDoug Needham

Graph Analytics for big dataSigmoid

Architecture in action 01Krishna Sankar

Big Graph Analytics on Neo4j with Apache SparkKenny Bastani

QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan

Strata EU 2014: Spark Streaming Case StudiesPaco Nathan

GraphLab Conference 2014 Keynote - Carlos GuestrinTuri, Inc.

Big Data Analytics with Storm, Spark and GraphLabImpetus Technologies

Spark streamingNoam Shaish

Multiplatform Spark solution for Graph datasources by Javier DominguezBig Data Spain

Microservices, Containers, and Machine LearningPaco Nathan

Agile data science with scalaAndy Petrella

Conference 2014: Rajat Arya - Deployment with GraphLab Create Turi, Inc.

Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseMo Patel

Data Science Reinvents Learning?Paco Nathan

GraphX: Graph analytics for insights about developer communitiesPaco Nathan

Weitere ähnliche Inhalte

Was ist angesagt?

Graph Analytics in SparkPaco Nathan

How Apache Spark fits in the Big Data landscapePaco Nathan

A New Year in Data Science: ML UnpausedPaco Nathan

Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan

Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan

Data Science with SparkKrishna Sankar

Gephi, Graphx, and GiraphDoug Needham

Graph Analytics for big dataSigmoid

Architecture in action 01Krishna Sankar

Big Graph Analytics on Neo4j with Apache SparkKenny Bastani

QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan

Strata EU 2014: Spark Streaming Case StudiesPaco Nathan

GraphLab Conference 2014 Keynote - Carlos GuestrinTuri, Inc.

Big Data Analytics with Storm, Spark and GraphLabImpetus Technologies

Spark streamingNoam Shaish

Multiplatform Spark solution for Graph datasources by Javier DominguezBig Data Spain

Microservices, Containers, and Machine LearningPaco Nathan

Agile data science with scalaAndy Petrella

Conference 2014: Rajat Arya - Deployment with GraphLab Create Turi, Inc.

Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseMo Patel

Was ist angesagt? (20)

Graph Analytics in Spark

How Apache Spark fits in the Big Data landscape

A New Year in Data Science: ML Unpaused

Databricks Meetup @ Los Angeles Apache Spark User Group

Tiny Batches, in the wine: Shiny New Bits in Spark Streaming

Data Science with Spark

Gephi, Graphx, and Giraph

Graph Analytics for big data

Architecture in action 01

Big Graph Analytics on Neo4j with Apache Spark

QCon São Paulo: Real-Time Analytics with Spark Streaming

Strata EU 2014: Spark Streaming Case Studies

GraphLab Conference 2014 Keynote - Carlos Guestrin

Big Data Analytics with Storm, Spark and GraphLab

Spark streaming

Multiplatform Spark solution for Graph datasources by Javier Dominguez

Microservices, Containers, and Machine Learning

Agile data science with scala

Conference 2014: Rajat Arya - Deployment with GraphLab Create

Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Andere mochten auch

Data Science Reinvents Learning?Paco Nathan

GraphX: Graph analytics for insights about developer communitiesPaco Nathan

SF Python Meetup: TextRank in PythonPaco Nathan

What's new with Apache Spark?Paco Nathan

PMML - Predictive Model Markup Languageaguazzel

ACM Bay Area Data Mining Workshop: Pattern, PMML, HadoopPaco Nathan

Predictive analytics from a to zalpinedatalabs

If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...Dell World

On the representation and reuse of machine learning (ML) modelsVillu Ruusmann

#MesosCon 2014: Spark on MesosPaco Nathan

OSCON 2014: Data Workflows for Machine LearningPaco Nathan

Big Data is changing abruptly, and where it is likely headingPaco Nathan

Future of data science as a professionJose Quesada

Big data & data science challenges and opportunitiesJose Quesada

Andere mochten auch (14)

Data Science Reinvents Learning?

GraphX: Graph analytics for insights about developer communities

SF Python Meetup: TextRank in Python

What's new with Apache Spark?

PMML - Predictive Model Markup Language

ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Predictive analytics from a to z

If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...

On the representation and reuse of machine learning (ML) models

#MesosCon 2014: Spark on Mesos

OSCON 2014: Data Workflows for Machine Learning

Big Data is changing abruptly, and where it is likely heading

Future of data science as a profession

Big data & data science challenges and opportunities

Ähnlich wie Use of standards and related issues in predictive analytics

DevOps for DataScienceStepan Pushkarev

03_aiops-1.pptxFarazulHoda2

Machine learning at scale challenges and solutionsStavros Kontopoulos

EPAM ML/AI Accelerator - ODAHUDmitrii Suslov

Deploying Data Science Engines to ProductionMostafa Majidpour

ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoDB Database

Paige Roberts: Shortcut MLOps with In-Database Machine LearningEdunomica

Pattern -A scoring engineShivanna Madalabhavi

Spark and machine learning in microservices architectureStepan Pushkarev

Architecting an Open Source AI Platform 2018 editionDavid Talby

How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks

Apache Hadoop Summit 2016: The Future of Apache Hadoop an Enterprise Architec...PwC

The Future of Apache Hadoop an Enterprise Architecture ViewDataWorks Summit/Hadoop Summit

GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...James Anderson

Media_Entertainment_VeriticalsPeyman Mohajerian

Introducing new AIOps innovations in Oracle 19c - San Jose AICUGSandesh Rao

Data Science at Scale on MPP databases - Use Cases & Open Source ToolsEsther Vasiete

Sparkflows.iosparkflows

AzureML Welcome to the future of Predictive Analytics Ruben Pertusa Lopez

Scalable AutoML for Time Series Forecasting using RayDatabricks

Ähnlich wie Use of standards and related issues in predictive analytics (20)

DevOps for DataScience

03_aiops-1.pptx

Machine learning at scale challenges and solutions

EPAM ML/AI Accelerator - ODAHU

Deploying Data Science Engines to Production

ArangoML Pipeline Cloud - Managed Machine Learning Metadata

Paige Roberts: Shortcut MLOps with In-Database Machine Learning

Pattern -A scoring engine

Spark and machine learning in microservices architecture

Architecting an Open Source AI Platform 2018 edition

How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....

Apache Hadoop Summit 2016: The Future of Apache Hadoop an Enterprise Architec...

The Future of Apache Hadoop an Enterprise Architecture View

GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...

Media_Entertainment_Veriticals

Introducing new AIOps innovations in Oracle 19c - San Jose AICUG

Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Sparkflows.io

AzureML Welcome to the future of Predictive Analytics

Scalable AutoML for Time Series Forecasting using Ray

Mehr von Paco Nathan

Human in the loop: a design pattern for managing teams working with MLPaco Nathan

Human-in-the-loop: a design pattern for managing teams that leverage MLPaco Nathan

Human-in-a-loop: a design pattern for managing teams which leverage MLPaco Nathan

Humans in a loop: Jupyter notebooks as a front-end for AIPaco Nathan

Humans in the loop: AI in open source and industryPaco Nathan

Computable ContentPaco Nathan

Computable Content: Lessons LearnedPaco Nathan

Brief Intro to Apache Spark @ Stanford ICMEPaco Nathan

How Apache Spark fits into the Big Data landscapePaco Nathan

Mehr von Paco Nathan (9)

Human in the loop: a design pattern for managing teams working with ML

Human-in-the-loop: a design pattern for managing teams that leverage ML

Human-in-a-loop: a design pattern for managing teams which leverage ML

Humans in a loop: Jupyter notebooks as a front-end for AI

Humans in the loop: AI in open source and industry

Computable Content

Computable Content: Lessons Learned

Brief Intro to Apache Spark @ Stanford ICME

How Apache Spark fits into the Big Data landscape

Kürzlich hochgeladen

WordPress Websites for Engineers: Elevate Your Brandgvaughan

DevEX - reference for building teams, processes, and platformsSergiu Bodiu

How to write a Business Continuity PlanDatabarracks

Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson

"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays

TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc

Take control of your SAP testing with UiPath Test SuiteDianaGray10

DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy

Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi

DMCC Future of Trade Web3 - Special EditionDubai Multi Commodity Centre

DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada

Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro

Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB

Advanced Computer Architecture – An IntroductionDilum Bandara

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada

SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal

Gen AI in Business - Global Trends Report 2024.pdfAddepto

Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed

Kürzlich hochgeladen (20)

WordPress Websites for Engineers: Elevate Your Brand

DevEX - reference for building teams, processes, and platforms

How to write a Business Continuity Plan

Are Multi-Cloud and Serverless Good or Bad?

"Debugging python applications inside k8s environment", Andrii Soldatenko

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy

Take control of your SAP testing with UiPath Test Suite

DevoxxFR 2024 Reproducible Builds with Apache Maven

Vertex AI Gemini Prompt Engineering Tips

DMCC Future of Trade Web3 - Special Edition

DSPy a system for AI to Write Prompts and Do Fine Tuning

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024

Unraveling Multimodality with Large Language Models.pdf

Developer Data Modeling Mistakes: From Postgres to NoSQL

Advanced Computer Architecture – An Introduction

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024

SAP Build Work Zone - Overview L2-L3.pptx

Gen AI in Business - Global Trends Report 2024.pdf

Scanning the Internet for External Cloud Exposures via SSL Certs

Use of standards and related issues in predictive analytics

1. Use of standards and related issues in predictive analytics KDD 2016, SF 2016-08-16 Paco Nathan, @pacoid  Dir, Learning Group @ O’Reilly Media

2. PMML referenced by 86 publications in Safari, 2001-2016  https://www.safaribooksonline.com/search/?query=PMML

3. Pattern: PMML for Cascading and Hadoop  P Nathan, G Kathalagiri (2013-08-11)  https://goo.gl/jk7829

4. Customer Orders Classify Scored Orders GroupBy token Count PMML Model M R Failure Traps Assert Confusion Matrix Pattern – score a model, using pre-defined Cascading app cascading.org/projects/pattern

5. evaluationoptimizationrepresentationcirca 2010 ETL into cluster/cloud data data visualize, reporting Data Prep Features Learners, Parameters Unsupervised Learning Explore train set test set models Evaluate Optimize Scoring production data use cases data pipelines actionable results decisions, feedback bar developers foo algorithms Algorithms and developer-centric template thinking only go so far in real-world workflows… Results shown in blue, hard problems highlighted in red Generalized Workflow for ML Use Cases in Big Data

6. Portable Format for Analytics (PFA) PFA updates the standards w.r.t. more contemporary issues of system architectures used for predictive analytics: distributed processing, in-memory computing, serialization, etc. http://dmg.org/pfa/docs/motivation/ • much more support for distributed systems • Avro data types • forward-looking toward more streaming applications • fits well with higher layers of abstraction, success of DSLs, etc.

7. Tuning Spark Streaming for Throughput Gerard Maas, Virdata (2014-12-22) “One Size Fits All” Doesn’t Anymore  This common architectural pattern requires interchange…

8. bits.blogs.nytimes.com/2013/06/19/g-e-makes-the-machine- and-then-uses-sensors-to-listen-to-it/ IoT alters “velocity” and “volume” dramatically  This growing category of use cases requires interchange…

9. Lessons from the success of Apache Spark… interchange is necessary for the ecosystem major use cases tend to build their own ML libraries – despite a case where a majority of committers tend to support a common vision and encourage use of a canonical library (MLLib with DataFrames) when a successful business grows over time, challenges arise by definition: managing separated teams, mergers and acquisitions, increased audits, regulations, etc. therefore, lack of interchange for analytics represents a serious technical debt and potential liability

10. Tungsten Execution PythonSQL R Streaming DataFrame Advanced Analytics Physical Execution: CPU Efficient Data Structures Keep data closure to CPU cache Tungsten Lessons from the success of Apache Spark… direct use of “compilers” becomes atypical as abstraction layers become smarter for deferred optimization

11. What to suggest for existing standards? microservices: how to compose models + parameters from multiple/distinct services support for API definitions in Swaggar http://swagger.io/ consider the benefits of Parquet, e.g., how pushdown predicates enable better optimization of workflows

12. What to suggest for existing standards? additional standards emerging for other aspects of workflow definition: Jupyter http://jupyter.org/    create and share documents that contain live code, equations, visualizations and explanatory text —   a network protocol suite, at heart, for distributed REPL environments, often along with containerization see usage in Oriole http://oreilly.com/oriole/index.html  Dat http://dat-data.com/ shares versioned data through a decentralized network

13. What to suggest for existing standards? other lingering issues: • data lineage / provenance • metadata drift • public dialog and law:  https://public.resource.org/about/

14. presenter: Just Enough Math O’Reilly (2014) justenoughmath.com monthly newsletter for updates,   events, conf summaries, etc.: liber118.com/pxn/