How does the Modern Data Stack enable collaboration for data teams? Collaboration works like a flywheel that harnesses the collective energy of a data team and directs it towards new opportunities and innovation. Outstanding achievements emerge when teams collaborate to integrate and leverage their strengths towards a common goal. We'll walk through some of the approaches that successful teams at Amazon, AWS, and Netflix employ to succeed on these fronts. We'll also walk through what we call the Data Collaboration Stack, from DataOps to MLOps.
2. |
Cloud-based technologies centered on data to empower users to explore and use data…
1 Ideal Modern Data Stack
Building Blocks & Best-of-Breed Approach
… BI
[Reverse] E(L)TL
Workspace
No-code
Catalog & Governance
Modeling
Warehouse, Lake, & Mesh...
Spreadsheet …
Feature Metrics
4. |
4 Most painful issues when interacting with data, by order of priority
Data Quality Issues
Difficulty accessing data and insufficient quantity
Explainability
Lack of ETL Automation / Data Warehousing Issues
Convincing Stakeholders
Reproducibility
Insufficient Hardware
Unsure of best approach or technique to use…?
Need to be able to iterate quickly
5. |
5 What the Data Collaboration Stack addresses:
Data Quality Issues
Difficulty accessing data and insufficient quantity
Explainability
Lack of ETL Automation / Data Warehousing Issues
Convincing Stakeholders
Reproducibility
Insufficient Hardware
Unsure of best approach or technique to use…?
Need to be able to iterate quickly
6. |
4 Head Full of Fresh Ops: Smooth out the Data Workflows and Processes
7. |
4 A Simplified Data Science Workflow
Feature Engineering
Preparation
Selection
Modeling
Data Cleaning and Labeling
Data Collection
Optimization
Ensembling
Validation
Improvement
Monitoring
Deployment
Productionization
Code is merely 5-10% of any machine learning solution.
8. |
4 Addressing the Skill Gap for Data Science through Collaboration
Analytics & Visualization
Statistics & Mathematics
Computer Science
Domain Expertise
Machine Learning
Analyst, Data Scientist, Engineer, Researcher, PM/Business
9. |
4 A Typical Data Science Project
https://arxiv.org/pdf/2001.06684.pdf
10. |
4 Collaboration at Amazon Core AI (Amazon Artificial Intelligence Group)
● Price Elasticities
● Economic Impact of Abusive Behavior
● Deep learning to describe products and services
● Debiasing techniques
● Demand across geography to minimize transportation costs
● “Image” scanning with Optical Character Recognition (OCR)
● Multi-armed bandit algorithm to improve predicted revenue
● Methods for inventory management
Data Engineers
Product
Customer
Decision Makers
Partners
Legal
Economists
Data Scientists
Data Workflow
11. |
4 Data-Centric AI (Garbage I/O) and Bayesian Networks require Collaboration
Provides
● A visualization of the model's structure, which can motivate the design of new models.
● Insights into the presence and absence of relationships between random variables.
● A way to structure complex probability calculations.
Requires
● What are the random variables in the problem?
● What are the conditional relationships between the variables?
● What are the probability distributions for each variable?
Subject-Matter Experts (SMEs) are integral to the development process.
4 Collaboration is required at every single step
Data (science) teams are extremely collaborative and work with a variety of stakeholders and tools
13. |
5 Examples: How Does Collaboration Take Place?
Data Scientist: Having members of the same team work simultaneously on the same notebook document
Finance Analyst: Having versioned reports that others can reuse
Data Engineer: Having job run status that is shareable and communicated to many stakeholders
Data Scientist: Having a notion of ownership around artifacts (data, code, and models)
Data Scientist: Having the ability to rapidly clone and reproduce experiments
ML Engineer: Having the ability to search, browse, and organize code, data, and models
15. |
4 From Data To Wisdom
Any Data Workflow…
Gather → Clean → Transform → Explore → Represent → Prescribe → Present → Decide
Data → Information → Knowledge → Insight → Wisdom
16. |
4 Collaborative Data Workflows
Collaboration in…
● Data Engineering
● Data Analytics
● Data Science
● Data Visualization
Data Workflow: Gather → Clean → Transform → Explore → Represent → Prescribe → Present → Decide
17. |
4 Collaborative Data Ecosystem
Team A, Team B, Team C, Team D, Team E, Team F
Maintainers, Producers, Consumers
@kafonek
18. |
4 Pierre’s Collaborative Modern Data Stack
● Discover Data
● Share Across
● Secure Governance
● Control Workflows
● Personalized Views
Eliminate Data Silos
Infrastructure
Storage, Access, & Transformation
Management, Governance, & Observability
Explore, Analyze, & Publish
19. |
5 Collaboration as Simple Rules
Key Elements
● Import & Export
● Search & Navigation
● Annotation (e.g. Comment, Tagging…)
● User Segmentation
● Support (at least) asynchronous teamwork
● Content Management & Sharing (e.g. Version Control, Change…)
20. |
5 Which Modern Data Stack is it?
Infrastructure
Storage, Access, & Transformation
Management, Governance, & Observability
Explore, Analyze, & Publish
22. |
5 Want to read about Data Collaboration…
“Companies that are in control of their own data generation are those who can get the quickest benefit out of that data collaboration” - Blake Burch, CEO at Shipyard
“tools empowering data collaboration would come in handy.” - Eti Gwirtz, VP Product at GigaSpaces
“readiness to experiment and engaging with multiple stakeholders across the organization with specific roles but ones that need collaboration” - Akhilesh Ayer, EVP and Global Head at WNS Triange
“It isn’t so much a matter of which industries stand to gain from data collaboration, but that most businesses can optimize their performance and accuracy by embracing data collaboration” - James Shalhoub, CEO at Finn
“Organizations can solve these challenges by improving cross-functional collaboration between team leaders and their data team to make insights accessible to the broader team while also shining a light on the most important metrics to analyze” - Ryan G. Smith, CEO at LeafLink
“For companies with one data person, the collaboration is happening with non-data people, so more of the data collaboration would likely be around communications of the insights and actions that need to be taken. Whereas in a technical organization, data collaboration may mean that team members are sharing a GitHub account and sharing code, as well as putting the code through a review process. The data professionals in these two instances have very different challenges to face” - Emad Hasan, CEO at Retina
DataOps: Data + DevOps - A set of practices to improve the quality and reduce the cycle time of data analytics. The main tasks in DataOps include data tagging, data testing, data pipeline orchestration, data versioning, and data monitoring.
MLOps: ML + DevOps - A set of practices to design, build, and manage reproducible, testable, and sustainable ML-powered software.
AIOps: AI + DevOps
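Of the DataOps tasks listed above, data testing is the easiest to illustrate in a few lines. The sketch below assumes a hypothetical batch of order records with `order_id` and `amount` fields; the field names and the three rules (present, unique, non-negative) are illustrative assumptions, not a prescribed standard or a specific tool's API.

```python
def run_data_tests(rows):
    """Return a list of (row_index, message) for every failed check.

    `rows` is a list of dicts; the `order_id` and `amount` fields are
    hypothetical and stand in for whatever schema a team agrees on.
    """
    failures = []
    seen_ids = set()
    for index, row in enumerate(rows):
        order_id = row.get("order_id")
        amount = row.get("amount")

        # Completeness and uniqueness of the key column.
        if order_id is None:
            failures.append((index, "order_id is missing"))
        elif order_id in seen_ids:
            failures.append((index, "order_id is duplicated"))
        else:
            seen_ids.add(order_id)

        # Validity of the measure column.
        if not isinstance(amount, (int, float)) or amount < 0:
            failures.append((index, "amount must be a non-negative number"))
    return failures


if __name__ == "__main__":
    batch = [
        {"order_id": 1, "amount": 10.0},
        {"order_id": 1, "amount": 5.0},
        {"order_id": None, "amount": -3},
    ]
    for index, message in run_data_tests(batch):
        print(f"row {index}: {message}")
```

Returning the failures instead of raising on the first one keeps the result shareable: the same list can feed a pipeline gate, a monitoring dashboard, or a message to the data's owner.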
Keeping SMEs who actually understand how to label and curate your data in the loop allows data scientists to inject domain expertise directly into the model. Once done, this expert knowledge can be codified and deployed for programmatic supervision.
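One way to picture codified expert knowledge is a set of small labeling functions, each encoding a single SME rule, whose votes are combined into a label. The sketch below is an assumption-laden illustration: the support-ticket scenario, the keywords, the `refund`/`other` labels, and the function names are all hypothetical, and a plain majority vote stands in for more sophisticated ways of combining rules.

```python
from collections import Counter

ABSTAIN = None  # a rule returns this when it has no opinion


# Each function encodes one hypothetical SME rule for support tickets.
def lf_mentions_refund(text):
    return "refund" if "refund" in text.lower() else ABSTAIN


def lf_mentions_chargeback(text):
    return "refund" if "chargeback" in text.lower() else ABSTAIN


def lf_mentions_password(text):
    return "other" if "password" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [
    lf_mentions_refund,
    lf_mentions_chargeback,
    lf_mentions_password,
]


def weak_label(text):
    """Combine rule votes by majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting rules: leave for human review
    return ranked[0][0]


if __name__ == "__main__":
    for ticket in ["I want a refund", "Reset my password", "Where is my parcel?"]:
        print(ticket, "->", weak_label(ticket))
```

Tickets on which the rules abstain or disagree are natural candidates to send back to the SMEs, which keeps them in the loop as the rule set grows.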
70% of respondents to a recent Harvard Business Review survey acknowledged they were not very effective at data sharing. [1] Organizations that share data externally with their partners generate three times more measurable economic benefits than their counterparts that do not. [2]