My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
6. Portable Format for Analytics (PFA)
PFA updates the standards w.r.t. more contemporary issues of
system architectures used for predictive analytics: distributed
processing, in-memory computing, serialization, etc.
http://dmg.org/pfa/docs/motivation/
• much more support for distributed systems
• Avro data types
• forward-looking toward more streaming applications
• fits well with higher layers of abstraction, success of
DSLs, etc.
7. Tuning Spark Streaming for Throughput
Gerard Maas, Virdata (2014-12-22)
“One Size Fits All” Doesn’t Anymore
This common architectural pattern requires interchange…
9. Lessons from the success of Apache Spark…
interchange is necessary for the ecosystem
major use cases tend to build their own ML libraries – despite a case
where a majority of committers tend to support a common vision and
encourage use of a canonical library (MLLib with DataFrames)
when a successful business grows over time, challenges arise by
definition: managing separated teams, mergers and acquisitions,
increased audits, regulations, etc.
therefore, lack of interchange for analytics represents a serious
technical debt and potential liability
10. Tungsten Execution
PythonSQL R Streaming
DataFrame
Advanced
Analytics
Physical Execution:
CPU Efficient Data Structures
Keep data closure to CPU cache
Tungsten
Lessons from the success of Apache Spark…
direct use of “compilers” becomes atypical as abstraction layers
become smarter for deferred optimization
11. What to suggest for existing standards?
microservices: how to compose models + parameters
from multiple/distinct services
support for API definitions in Swaggar http://swagger.io/
consider the benefits of Parquet, e.g., how pushdown
predicates enable better optimization of workflows
12. What to suggest for existing standards?
additional standards emerging for other aspects of
workflow definition:
Jupyter http://jupyter.org/
create and share documents that contain live code,
equations, visualizations and explanatory text —
a network protocol suite, at heart, for distributed REPL
environments, often along with containerization
see usage in Oriole http://oreilly.com/oriole/index.html
Dat http://dat-data.com/
shares versioned data through a decentralized network
13. What to suggest for existing standards?
other lingering issues:
• data lineage / provenance
• metadata drift
• public dialog and law:
https://public.resource.org/about/