Clean Your Data Swamp
By Migrating off Hadoop
Speaker
Ron Guerrero
Senior Solutions Architect
Agenda
● Why modernize?
● Planning your migration off of Hadoop
● Top migration topics
Why migrate off of Hadoop and
onto Databricks?
History of Hadoop
● Created 2005
● Open source distributed processing and storage platform running on commodity hardware
● Originally consisted of HDFS and MapReduce, but now incorporates numerous open source projects (Hive, HBase, Spark)
● On-prem and on the cloud
Today Hadoop is very hard
COMPLEX → Slow Innovation
● Many tools: need to understand multiple technologies.
● Real-time and batch ingestion to build AI models requires integrating many components.
FIXED → Cost Prohibitive
● 24/7 clusters.
● Fixed capacity: CPU + RAM + Disk.
● Costly to upgrade.
MAINTENANCE INTENSIVE → Low Productivity
● The Hadoop ecosystem is complex, hard to manage, and prone to failures.
Enterprises Need a Modern
Data Analytics Architecture
CRITICAL REQUIREMENTS
Cost-effective scale and performance in the cloud
Easy to manage and highly reliable for diverse data
Predictive and real-time insights to drive innovation
Lakehouse Platform
● Workloads: Data Engineering; BI & SQL Analytics; Real-time Data Applications; Data Science & Machine Learning
● Data Management & Governance
● Open Data Lake: structured, semi-structured, unstructured, and streaming data
SIMPLE · OPEN · COLLABORATIVE
Planning your migration off of
Hadoop and onto Databricks
Migration Planning
● Internal Questions
● Assessment
● Technical Planning
● Enablement and Evaluation
● Migration Execution
Migration Planning
Internal Questions
● why?
● who?
● desired start and end dates
● internal stakeholders
● cloud strategy
Migration Planning
Assessment
● Environment inventory
○ compute, data, tooling
● Use case prioritization
● Workload analysis
● Existing TCO
● Projected TCO
● Migration timelines
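As a rough illustration of the existing-versus-projected TCO comparison, the sketch below contrasts an always-on cluster with compute that runs only while jobs need it. The node count, hourly rate, and usage hours are hypothetical, and the model covers compute only (no storage, licensing, or staffing), so treat it as a starting point rather than a costing method.

```python
def annual_compute_cost(nodes: int, hourly_rate: float, hours_per_day: float) -> float:
    """Yearly compute cost for a cluster that runs `hours_per_day` hours every day."""
    return nodes * hourly_rate * hours_per_day * 365


# Hypothetical inputs, for illustration only.
always_on = annual_compute_cost(nodes=20, hourly_rate=1.5, hours_per_day=24)
on_demand = annual_compute_cost(nodes=20, hourly_rate=1.5, hours_per_day=6)

# Fraction of the always-on compute bill avoided by running only when needed.
savings_ratio = 1 - on_demand / always_on
print(always_on, on_demand, savings_ratio)
```

A real assessment would replace these inputs with measured utilization from the workload analysis.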
Migration Planning
Technical Planning
● Target state architecture
● Data migration
● Workload migration
○ Lift and shift, transformative, hybrid
● Data governance approach
● Automated deployment
● Monitoring and Operations
Migration Planning
Enablement and Evaluation
● Workshops, technical deep dives
● Training
● Proof of technology / MVP
○ Validate assumptions and designs
Migration Planning
Migration Execution
● Environment Deployment
● Iterate over use cases
○ Data Migration
○ Workload Migration
○ Dual Production Deployment - Old and New
○ Validation
○ Cut-over and Decommission of Hadoop
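During the dual-production phase, the validation step usually comes down to showing that the old and new systems hold the same data. The sketch below is one minimal way to do that, assuming rows can be read from both sides as tuples: it compares a row count plus an order-independent checksum. The function name and sample rows are hypothetical; at scale the same idea would be expressed as aggregate queries on each platform.

```python
import hashlib
from typing import Iterable, Tuple


def table_fingerprint(rows: Iterable[Tuple]) -> Tuple[int, int]:
    """Return (row_count, checksum); the checksum does not depend on row order."""
    count = 0
    checksum = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        # Summing per-row hashes modulo 2**64 makes the result order-independent.
        checksum = (checksum + int.from_bytes(digest[:8], "big")) % (1 << 64)
        count += 1
    return count, checksum


# Hypothetical extracts from the Hadoop table and the migrated table.
hadoop_rows = [(1, "alice", 10.0), (2, "bob", 20.5)]
cloud_rows = [(2, "bob", 20.5), (1, "alice", 10.0)]  # same data, different order

tables_match = table_fingerprint(hadoop_rows) == table_fingerprint(cloud_rows)
print(tables_match)
```

A match on count and checksum is strong evidence, not proof; spot checks on business-critical columns are still worthwhile before cut-over.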
Top Migration Topics
Key Areas of Migration
1. Administration
2. Data Migration
3. Data Processing
4. Security & Governance
5. SQL and BI Layer
Administration
Hadoop Ecosystem to Databricks Concepts
[Diagram: Hadoop cluster, Node 1 to Node N. Each node has 2x12c = 24c of compute and local disks (disk1 to diskN) backing HDFS. The cores are carved up among YARN, Impala, HBase, MR mappers, and Spark drivers and workers (executors). Shared services: Hive Metastore, Hive Server, Impala (load balancer), HBase API, and Sentry (table metadata + HDFS ACLs), with JDBC/ODBC access.]
Node makeup
▪ Local disks
▪ Cores/Memory carved to services
▪ Submitted jobs compete for resources
▪ Services constrained to accommodate
resources
Metadata and Security
▪ Sentry table metadata permissions combined
with syncing HDFS ACLs OR
▪ Apache Ranger, policy based access control
Endpoints
▪ Direct Access to HDFS / Copied dataset
▪ Hive (on MR or Spark) accepts incoming
connections
▪ Impala for interactive queries
▪ HBase APIs as required
Hadoop Ecosystem to Databricks Concepts
[Diagram: the Hadoop cluster from the previous slide mapped to Databricks. Hive Metastore → managed Hive Metastore. Hive Server and Impala (load balancer) → Databricks SQL Endpoint on a high-concurrency SQL Analytics cluster, reached over JDBC/ODBC. HBase API → CosmosDB / DynamoDB / Keyspaces. HDFS → object storage. Sentry/Ranger (table metadata + HDFS ACLs) → table ACLs and object storage ACLs. Shared node compute → separate Databricks clusters for Spark ETL (batch/streaming), SQL Analytics, and ML Runtime, each with a Spark driver and Spark workers (executors) running Delta Engine; ephemeral clusters for all-purpose or jobs use.]
Hadoop Ecosystem to Databricks Concepts
[Diagram: Databricks side only. Managed Hive Metastore; Databricks SQL Endpoint on a high-concurrency SQL Analytics cluster; Databricks clusters for Spark ETL (batch/streaming), SQL Analytics, and ML Runtime, each with a Spark driver and Spark workers (executors) running Delta Engine; table ACLs; ephemeral or long-running clusters for all-purpose or jobs use; JDBC/ODBC access.]
Node makeup
▪ Each node (VM) maps to a single Spark driver or worker
▪ Cluster of nodes completely isolated from other
jobs/compute
▪ De-coupled compute and storage
Metadata and Security
▪ Managed Hive metastore (other options
available)
▪ Table ACLs (Databricks) and Object Storage
permissions
Endpoints
▪ SQL endpoint for both advanced analytics and
simple SQL analytics
▪ Code access to data - Notebooks
▪ HBase → maps to Azure CosmosDB, AWS
DynamoDB/Keyspaces (non-Databricks
solution)
Demo - Administration
Data Migration
Data Migration
- On-premises block storage.
- Fixed disk capacity.
- Health checks to validate data integrity.
- As data volumes grow, must add more nodes to the cluster and rebalance data.
MIGRATE →
- Fully managed cloud object storage.
- Unlimited capacity.
- No maintenance, no health checks, no rebalancing.
- 99.99% availability, 99.9999999% durability.
- Use native cloud services to migrate data.
- Leverage partner solutions:
Data Migration
Build a Data Lake in cloud storage with Delta Lake
● Open source and uses Parquet file format.
● Performance: Data indexing → Faster queries.
● Reliability: ACID Transactions → Guaranteed data integrity.
● Scalability: Handle petabyte-scale tables with billions of partitions and files with ease.
● Enhanced Spark SQL: UPDATE, MERGE, and DELETE commands.
● Unify batch and stream processing → No more Lambda architecture.
● Schema Enforcement: Specify schema on write.
● Schema Evolution: Automatically change schemas on the fly.
● Audit History: Full audit trail of the changes.
● Time Travel: Restore data from past versions.
● 100% Compatible with Apache Spark API.
Start with Dual ingestion
● Add a feed to cloud storage
● Enable new use cases with new data
● Introduces options for backup
How to migrate data
● Leverage existing Data Delivery tools to point to cloud storage
● Introduce simplified flows to land data into cloud storage
How to migrate data
● Push the data
○ DistCp
○ 3rd Party Tooling
○ In-house frameworks
○ Cloud native: AWS Snowmobile, Azure Data Box, Google Transfer Appliance
○ Typically easier to approve (security)
● Pull the data
○ Spark Streaming
○ Spark Batch
■ File Ingest
■ JDBC
○ 3rd Party Tooling
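For the push approach with DistCp, a small wrapper can keep the copy command consistent across datasets. The sketch below only builds the command line; it assumes the Hadoop cluster already has a cloud storage connector and credentials configured, and the NameNode address, bucket name, and map count are hypothetical.

```python
import shlex
from typing import List


def build_distcp_command(source: str, target: str, max_maps: int = 20) -> List[str]:
    """Build a `hadoop distcp` invocation that copies `source` to `target`.

    -update copies only files that are missing or differ at the target;
    -m caps the number of simultaneous map tasks doing the copy.
    """
    return ["hadoop", "distcp", "-update", "-m", str(max_maps), source, target]


# Hypothetical source and destination, for illustration only.
cmd = build_distcp_command(
    "hdfs://namenode:8020/data/sales",
    "s3a://example-bucket/data/sales",
)
print(shlex.join(cmd))
```

On an edge node the list could be passed to `subprocess.run(cmd, check=True)`; rerunning with `-update` makes incremental catch-up copies cheaper than a full recopy.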
How to migrate data - Pull approach
● Set up connectivity to the on-premises environment
○ AWS Direct Connect
○ Azure ExpressRoute / VPN Gateway
○ This may be needed for some use cases
● Kerberized Hadoop Environments
○ Databricks cluster initialization scripts
■ Kerberos client setup
■ krb5.conf, keytab
■ kinit
● Shared External Metastore
○ Databricks and Hadoop can share a metastore
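The Kerberos steps above could be captured in a cluster initialization script along the following lines. This is a sketch under several assumptions: the cluster image is Ubuntu-based with `apt-get` available, and the `krb5.conf` and keytab have already been placed at the hypothetical paths shown. The helper name, paths, and principal are all illustrative.

```python
import shlex


def render_kerberos_init_script(krb5_conf_src: str, keytab_path: str, principal: str) -> str:
    """Render a bash init script that sets up a Kerberos client and obtains a ticket."""
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        "# Assumption: Ubuntu-based cluster image where apt-get is available.",
        "export DEBIAN_FRONTEND=noninteractive",
        "apt-get install -y krb5-user",
        "# Install the realm configuration.",
        f"cp {shlex.quote(krb5_conf_src)} /etc/krb5.conf",
        "# Obtain a ticket non-interactively from the keytab.",
        f"kinit -kt {shlex.quote(keytab_path)} {shlex.quote(principal)}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical locations and principal, for illustration only.
script = render_kerberos_init_script(
    "/dbfs/kerberos/krb5.conf",
    "/dbfs/kerberos/etl.keytab",
    "etl_user@EXAMPLE.COM",
)
print(script)
```

Keytabs are credentials, so in practice they would be stored and distributed through a secrets mechanism rather than a shared path.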
Demo - Databricks Pull
Data Processing
Technology Mapping
Migrating Spark Jobs
● Spark versions
● RDD to DataFrames
● Changes to submission
● Hard-coded references to the Hadoop environment
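One common form of hard-coded Hadoop reference is an `hdfs://` URI embedded in job code or configuration. The sketch below shows one way such paths might be rewritten to a cloud storage root; the function name, NameNode address, and bucket are hypothetical, and it assumes the directory layout under the root is preserved during data migration.

```python
from urllib.parse import urlparse


def to_cloud_path(path: str, cloud_root: str) -> str:
    """Map an hdfs:// URI onto `cloud_root`, keeping the directory layout.

    Paths that do not use the hdfs scheme are returned unchanged.
    """
    parsed = urlparse(path)
    if parsed.scheme != "hdfs":
        return path
    return cloud_root.rstrip("/") + parsed.path


# Hypothetical NameNode and bucket, for illustration only.
migrated = to_cloud_path(
    "hdfs://namenode:8020/warehouse/sales/2021",
    "s3a://example-bucket/lake",
)
print(migrated)
```

Moving such locations into job parameters or configuration at the same time avoids repeating the exercise at the next environment change.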
Converting non-Spark workloads
● MapReduce
● Sqoop
● Flume
● NiFi considerations
Migrating HiveQL
● Hive queries have high compatibility
● Minor changes in DDL
● SerDes and UDFs
Migrating Workflow Orchestration
● Create equivalents in Airflow, Azure Data Factory, or another orchestrator
● Databricks REST APIs allow integration with any scheduler
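To make the scheduler integration concrete, the sketch below builds the HTTP request an external scheduler could send to trigger an existing Databricks job through the Jobs API `run-now` endpoint. It assumes API version 2.1 and a personal access token; the workspace URL, token placeholder, and job ID are hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_run_now_request(workspace_url: str, token: str, job_id: int) -> urllib.request.Request:
    """Build a POST request that asks Databricks to start a run of `job_id`."""
    body = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + "/api/2.1/jobs/run-now",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical workspace, token, and job ID, for illustration only.
request = build_run_now_request("https://example.cloud.databricks.com", "<token>", 123)
print(request.get_method(), request.full_url)
```

A scheduler task would send it with `urllib.request.urlopen(request)` and record the run identifier from the JSON response for later status polling.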
Automated Tooling
● MLens
○ PySpark
○ HiveQL
○ Oozie to Airflow, Azure Data Factory (roadmap)
Security and Governance
Security and Governance
Authentication
- Single Sign On (SSO) with a SAML 2.0-supported corporate directory.
Authorization
- Access Control Lists (ACLs) for Databricks RBAC.
- Table ACLs; dynamic views for column/row permissions.
- Leverage cloud native security: IAM Federation and AAD passthrough.
- Integration with Ranger and Immuta for more advanced RBAC and ABAC.
Metadata Management
- Integration with 3rd party services (AWS Glue).
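Dynamic views for column permissions can be sketched as a view that masks a sensitive column unless the caller belongs to a privileged group, using the Databricks SQL function `is_member`. The helper below only generates the SQL text; the view, table, column, and group names are hypothetical, and the names are interpolated directly, so they are assumed to come from trusted configuration.

```python
from typing import List


def dynamic_view_sql(view: str, source: str, public_columns: List[str],
                     sensitive_column: str, privileged_group: str) -> str:
    """Generate a CREATE VIEW statement that masks one column for non-members."""
    masked = (
        f"CASE WHEN is_member('{privileged_group}') THEN {sensitive_column} "
        f"ELSE 'REDACTED' END AS {sensitive_column}"
    )
    select_list = ",\n  ".join(public_columns + [masked])
    return f"CREATE OR REPLACE VIEW {view} AS\nSELECT\n  {select_list}\nFROM {source}"


# Hypothetical names, for illustration only.
sql = dynamic_view_sql(
    "sales.orders_masked", "sales.orders",
    ["order_id", "region"], "customer_email", "finance",
)
print(sql)
```

Row-level permissions follow the same pattern, with the `is_member` check placed in a `WHERE` clause instead of the select list.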
Privacera
Migrating Security Policies from
Hadoop to Databricks
Enabling enterprises to responsibly use their data in the cloud
Powered by Apache Ranger
HADOOP ECOSYSTEM
● 100s and 1000s of tables in
Apache Hive
● 100s of policies in Apache
Ranger
● Variety of policies. Resource
Based, Tag Based, Masking, Row
Level Filters, etc.
● Policies for Users and Groups
from AD/LDAP
PRIVACERA AND
DATABRICKS
[Diagram: Hive MetaStore → MetaStore, carrying dataset, schema, and policies]
SEAMLESS MIGRATION
INSTANTLY TRANSFER
YEARS OF EFFORT
INSTANTLY IMPLEMENT THE SAME
POLICIES IN DATABRICKS AS ON-PREM
● Richer, deeper, and more robust Access Control
● Row/Column level access control in SQL
● Dynamic and Static data de-identification
● File level access control for Dataframes, object level access
● Read/Write operations supported
Privacera Value Add - Enhancing Databricks Authorization
Object Store (S3/ADLS) | Privacera + Databricks
S3 - Bucket Level | Y
S3 - Object Level | Y
ADLS | Y
Spark SQL and R | Privacera + Databricks
Table | Y
Column | Y
Column Masking | Y
Row Level Filtering | Y
Tag Based Policies | Y
Attribute Based Policies | Y
Centralized Auditing | Y
[Diagram: Privacera architecture. Databricks cluster (SQL/Python): Spark driver with Ranger plugin and Spark executors, serving Spark SQL and/or Spark read/write. Privacera Cloud: Ranger Policy Manager, Privacera Portal, Privacera Audit Server, Privacera Discovery, Privacera Approval Workflow, and Privacera Anomaly Detection and Alerting. Audit destinations shown: DB, Solr, Apache Kafka, Splunk, CloudWatch, SIEM. Integrations: AD/LDAP and 3rd party catalog. Users: business user and admin user.]
SQL and BI
What about the SQL Community?
Hadoop
● HUE
○ Data browsing
○ SQL Editor
○ Visualizations
● Interactive SQL
○ Impala
○ Hive LLAP
Databricks
● SQL Analytics Workspace
○ Data Browser
○ SQL Editor
○ Visualizations
● Interactive SQL
○ Spark optimizations - Adaptive Query Execution
○ Advanced Caching
○ Project Photon
○ Scaling cluster of clusters
SQL & BI Layer
Optimized SQL and BI
Performance
- Fast queries with Delta Engine on Delta Lake.
- Support for high concurrency with auto-scaling clusters.
BI Integrations
- Optimized JDBC/ODBC drivers.
- Compatible with any BI client and tool that supports Spark.
Tuned
- Optimized and tuned for BI and SQL out of the box.
Vision
Give SQL users a home in Databricks
Provide SQL workbench, light
dashboarding, and alerting capabilities
Great BI experience on the data lake
Enable companies to effectively leverage
the data lake from any BI tool without
having to move the data around.
Easy to use & price-performant
Minimal setup & configuration. Data lake
price performance.
SQL-native user interface for
analysts
▪ Familiar SQL Editor
▪ Auto Complete
▪ Built in visualizations
▪ Data Browser
▪ Automatic Alerts
○ Trigger based upon values
○ Email or Slack integration
▪ Dashboards
○ Simply convert queries to dashboards
○ Share with Access Control
Built-in connectors for existing
BI tools
Other BI & SQL clients
that support
▪ Supports your favorite tool
▪ Connectors for top BI & SQL clients
▪ Simple connection setup
▪ Optimized performance
▪ OAuth & Single Sign On
▪ Quick and easy authentication
experience. No need to deal with
access tokens.
▪ Power BI Available now
▪ Others coming soon
Performance
Delta Metadata Performance
Improved read performance for cold queries on Delta
tables. Provides interactive metadata performance
regardless of # of Delta tables in a query or table sizes.
New ODBC / JDBC Drivers
Wire protocol re-engineered to provide lower latencies
& higher data transfer speeds:
▪ Lower latency / less overhead (~¼ sec) with reduced
round trips per request
▪ Higher transfer rate (up to 50%) using Apache Arrow
▪ Optimized metadata performance for ODBC/JDBC
APIs (up to 10x for metadata retrieval operations)
Photon - Delta Engine
[Preview]
New MPP engine built from scratch in C++.
Vectorized to exploit data level parallelism and
instruction-level parallelism. Optimized for
modern structured and semi-structured
workloads.
Summary
It all starts with a plan
● Databricks and our partner community can help you
○ Assess
○ Plan
○ Validate
○ Execute
Considerations for your migration to
Databricks
● Administration
● Data Migration
● Data Processing
● Security & Governance
● SQL and BI Layer
Next Steps
Next Steps
● You will receive a follow up email from our teams
● Let us help you with your Hadoop Migration Journey
Follow up materials - Useful links
Databricks Reference Architecture
Databricks Azure Reference Architecture
Databricks AWS Reference Architecture
Demo

 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentDatabricks
 


5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

  • 1. Clean Your Data Swamp By Migration off Hadoop
  • 3. Agenda ● Why modernize? ● Planning your migration off of Hadoop ● Top migration topics
  • 4. Why migrate off of Hadoop and onto Databricks?
  • 5. History of Hadoop ● Created 2005 ● Open source distributed processing and storage platform running on commodity hardware ● Originally consisted of HDFS and MapReduce, but now incorporates numerous open source projects (Hive, HBase, Spark) ● On-prem and in the cloud
  • 6. Today Hadoop is very hard COMPLEX ● Many tools: Need to understand multiple technologies. ● Real-time and batch ingestion to build AI models requires integrating many components. Slow Innovation FIXED ● 24/7 clusters. ● Fixed capacity: CPU + RAM + Disk. ● Costly to upgrade. Cost Prohibitive MAINTENANCE INTENSIVE ● The Hadoop ecosystem is complex, hard to manage, and prone to failures. Low Productivity
  • 7. Enterprises Need a Modern Data Analytics Architecture CRITICAL REQUIREMENTS Cost-effective scale and performance in the cloud Easy to manage and highly reliable for diverse data Predictive and real-time insights to drive innovation
  • 8. Structured Semi-structured Unstructured Streaming Lakehouse Platform Data Engineering BI & SQL Analytics Real-time Data Applications Data Science & Machine Learning Data Management & Governance Open Data Lake SIMPLE OPEN COLLABORATIVE
  • 9. Planning your migration off of Hadoop and onto Databricks
  • 10. Migration Planning ● Internal Questions ● Assessment ● Technical Planning ● Enablement and Evaluation ● Migration Execution
  • 11. Migration Planning Internal Questions ● why? ● who? ● desired start and end dates ● internal stakeholders ● cloud strategy
  • 12. Migration Planning Assessment ● Environment inventory ○ compute, data, tooling ● Use case prioritization ● Workload analysis ● Existing TCO ● Projected TCO ● Migration timelines
  • 13. Migration Planning Technical Planning ● Target state architecture ● Data migration ● Workload migration ○ Lift and shift, transformative, hybrid ● Data governance approach ● Automated deployment ● Monitoring and Operations
  • 14. Migration Planning Enablement and Evaluation ● Workshops, technical deep dives ● Training ● Proof of technology / MVP ○ Validate assumptions and designs
  • 15. Migration Planning Migration Execution ● Environment Deployment ● Iterate over use cases ○ Data Migration ○ Workload Migration ○ Dual Production Deployment - Old and New ○ Validation ○ Cut-over and Decommission of Hadoop
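The validation step during dual production deployment can be sketched as a comparison of order-independent fingerprints of the old and new outputs. This is a minimal illustration, not the speaker's method: the function names and sample rows are hypothetical, and it assumes both systems can export the same rows with identical value types.

```python
import hashlib
from typing import Iterable, Sequence, Tuple


def table_fingerprint(rows: Iterable[Sequence[object]]) -> Tuple[int, int]:
    """Return (row_count, checksum) where the checksum ignores row order."""
    count = 0
    checksum = 0
    for row in rows:
        # repr() keeps None, "" and 0 distinct; \x1f separates the fields.
        encoded = "\x1f".join(repr(value) for value in row).encode("utf-8")
        digest = hashlib.sha256(encoded).digest()
        # Adding 64-bit digests modulo 2**64 is commutative, so row order
        # does not matter, and duplicate rows still change the result.
        checksum = (checksum + int.from_bytes(digest[:8], "big")) % (1 << 64)
        count += 1
    return count, checksum


def outputs_match(old_rows, new_rows) -> bool:
    """True when both outputs have the same row count and checksum."""
    return table_fingerprint(old_rows) == table_fingerprint(new_rows)


# Hypothetical outputs of the same job on the old and new platform.
hadoop_output = [(1, "a", 10.0), (2, "b", None)]
databricks_output = [(2, "b", None), (1, "a", 10.0)]
print(outputs_match(hadoop_output, databricks_output))
```

A mismatch only says that the outputs differ, not where; type differences such as `1` versus `1.0` also count as a difference, so values may need normalizing first.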
  • 17. Key Areas of Migration 1. Administration 2. Data Migration 3. Data Processing 4. Security & Governance 5. SQL and BI Layer
  • 19. Hadoop Ecosystem to Databricks Concepts: Hadoop [Diagram: Node 1, Node 2 ... Node N, each with local HDFS disks (disk1 ... disk N) and 2x12c = 24 cores of compute divided among YARN, Impala, HBase, MR mappers, and Spark workers (executors) or the Spark driver; shared services: Hive Metastore, Hive Server, Impala (load balancer), HBase API, JDBC/ODBC, and Sentry (table metadata + HDFS ACLs) OR Ranger (policy-based access control)] Node makeup ▪ Local disks ▪ Cores/memory carved to services ▪ Submitted jobs compete for resources ▪ Services constrained to accommodate resources Metadata and Security ▪ Sentry table metadata permissions combined with syncing HDFS ACLs OR ▪ Apache Ranger, policy-based access control Endpoints ▪ Direct access to HDFS / copied dataset ▪ Hive (on MR or Spark) accepts incoming connections ▪ Impala for interactive queries ▪ HBase APIs as required
  • 20. Hadoop Ecosystem to Databricks Concepts [Diagram: the Hadoop cluster from the previous slide (Nodes 1 to N, Hive Metastore, Hive Server, Impala (load balancer), HBase API, Sentry/Ranger table metadata + HDFS ACLs, JDBC/ODBC) shown alongside the Databricks components: Hive Metastore (managed), Databricks SQL Endpoint, High Concurrency cluster for SQL Analytics, CosmosDB / DynamoDB / Keyspaces, Object Storage, Table ACLs, Object Storage ACLs, JDBC/ODBC, and ephemeral Databricks clusters for all-purpose or jobs use (Spark ETL batch/streaming, SQL Analytics, ML Runtime), each made of a Spark driver and Spark workers (executors) with Delta Engine]
  • 21. Hadoop Ecosystem to Databricks Concepts: Databricks [Diagram: Hive Metastore (managed), Databricks SQL Endpoint, High Concurrency cluster for SQL Analytics, Table ACLs, Object Storage with Object Storage ACLs, CosmosDB / DynamoDB / Keyspaces, JDBC/ODBC, and ephemeral or long-running Databricks clusters for all-purpose or jobs use (Spark ETL batch/streaming, SQL Analytics, ML Runtime), each made of a Spark driver and Spark workers (executors) with Delta Engine] Node makeup ▪ Each node (VM) maps to a single Spark driver/worker ▪ Cluster of nodes completely isolated from other jobs/compute ▪ Decoupled compute and storage Metadata and Security ▪ Managed Hive metastore (other options available) ▪ Table ACLs (Databricks) and object storage permissions Endpoints ▪ SQL endpoint for both advanced analytics and simple SQL analytics ▪ Code access to data - Notebooks ▪ HBase → maps to Azure CosmosDB, AWS DynamoDB/Keyspaces (non-Databricks solution)
  • 24. Data Migration - On-premises block storage. - Fixed disk capacity. - Health checks to validate data integrity. - As data volumes grow, must add more nodes to cluster and rebalance data. MIGRATE - Fully managed cloud object storage. - Unlimited capacity. - No maintenance, no health checks, no rebalancing. - 99.99% availability, 99.9999999% durability. - Use native cloud services to migrate data. - Leverage partner solutions:
  • 25. Data Migration Build a Data Lake in cloud storage with Delta Lake ● Open source and uses Parquet file format. ● Performance: Data indexing → Faster queries. ● Reliability: ACID Transactions → Guaranteed data integrity. ● Scalability: Handle petabyte-scale tables with billions of partitions and files with ease. ● Enhanced Spark SQL: UPDATE, MERGE, and DELETE commands. ● Unify Batch and Stream processing → No more Lambda architecture. ● Schema Enforcement: Specify schema on write. ● Schema Evolution: Automatically change schemas on the fly. ● Audit History: Full audit trail of the changes. ● Time Travel: Restore data from past versions. ● 100% compatible with the Apache Spark API.
  • 26. Start with Dual ingestion ● Add a feed to cloud storage ● Enable new use cases with new data ● Introduces options for backup
  • 27. How to migrate data ● Leverage existing Data Delivery tools to point to cloud storage ● Introduce simplified flows to land data into cloud storage
  • 28. How to migrate data ● Push the data ○ DistCp ○ 3rd Party Tooling ○ In-house frameworks ○ Cloud Native - Snowmobile, Azure Data Box, Google Transfer Appliance ○ Typically easier to approve (security) ● Pull the data ○ Spark Streaming ○ Spark Batch ■ File Ingest ■ JDBC ○ 3rd Party Tooling
  • 29. How to migrate data - Pull approach ● Set up connectivity to On Premises ○ AWS Direct Connect ○ Azure ExpressRoute / VPN Gateway ○ This may be needed for some use cases ● Kerberized Hadoop Environments ○ Databricks clusters initialization scripts ■ Kerberos client setup ■ krb5.conf, keytab ■ kinit() ● Shared External Metastore ○ Databricks and Hadoop can share a metastore
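For the push option with DistCp listed above, a small helper can assemble the command line before it is handed to a scheduler or shell. This is a sketch under stated assumptions: the HDFS nameservice, bucket name, and helper name are hypothetical, and only the standard `-update` and `-m` DistCp options are used.

```python
import shlex
from typing import List


def build_distcp_command(source: str, target: str,
                         max_maps: int = 20, update: bool = True) -> List[str]:
    """Assemble a `hadoop distcp` invocation as an argument list."""
    if max_maps < 1:
        raise ValueError("max_maps must be at least 1")
    command = ["hadoop", "distcp"]
    if update:
        # -update copies only files that are missing or differ at the target,
        # which makes repeated incremental runs cheaper.
        command.append("-update")
    # -m caps the number of simultaneous copy tasks (map tasks).
    command += ["-m", str(max_maps), source, target]
    return command


# Hypothetical source and target locations.
cmd = build_distcp_command(
    "hdfs://nameservice1/warehouse/sales",
    "s3a://example-landing-bucket/warehouse/sales",
)
print(shlex.join(cmd))
```

Returning a list rather than a string lets the caller pass it to `subprocess.run` on an edge node without shell quoting issues.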
  • 33. Migrating Spark Jobs ● Spark versions ● RDD to DataFrames ● Changes to submission ● Hard-coded references to the Hadoop environment
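Hard-coded references to the Hadoop environment can often be located and rewritten mechanically. The sketch below, with hypothetical names and paths, lists lines containing `hdfs://` URIs and swaps the scheme and authority for a cloud storage root; it assumes the directory layout is kept the same in object storage.

```python
import re
from typing import List, Tuple

# Matches hdfs://<authority><path>; the authority may be empty (hdfs:///path).
HDFS_URI = re.compile(r"hdfs://[^/\s'\"]*(?P<path>/[^\s'\"]*)?")


def find_hdfs_references(source_code: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for lines that mention an HDFS URI."""
    return [
        (number, line.strip())
        for number, line in enumerate(source_code.splitlines(), start=1)
        if HDFS_URI.search(line)
    ]


def rewrite_hdfs_paths(source_code: str, cloud_root: str) -> str:
    """Replace the hdfs://<authority> prefix with the given cloud root."""
    root = cloud_root.rstrip("/")

    def replace(match: "re.Match[str]") -> str:
        return root + (match.group("path") or "")

    return HDFS_URI.sub(replace, source_code)


# Hypothetical job snippet with a hard-coded NameNode address.
job = 'df = spark.read.parquet("hdfs://namenode:8020/data/events")\ndf.count()'
print(find_hdfs_references(job))
print(rewrite_hdfs_paths(job, "s3a://example-bucket/"))
```

A longer-term fix is to move such locations into job parameters or configuration so the same code runs against either environment during dual production.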
  • 34. Converting non-Spark workloads ● MapReduce ● Sqoop ● Flume ● NiFi Considerations
  • 35. Migrating HiveQL ● Hive queries have high compatibility ● Minor changes in DDL ● SerDes and UDFs
  • 36. Migrating Workflow Orchestration ● Create equivalents in Airflow, Azure Data Factory, or another scheduler ● Databricks REST APIs allow integration with any scheduler
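As an illustration of scheduler integration through the REST APIs, the sketch below prepares a call to the Databricks Jobs `run-now` endpoint using only the standard library. The workspace URL, token, job ID, and parameter names are placeholders, and the `2.1` API version in the path is an assumption about the target workspace; the request is built but not sent.

```python
import json
import urllib.request
from typing import Dict, Optional


def build_run_now_request(workspace_url: str, token: str, job_id: int,
                          notebook_params: Optional[Dict[str, str]] = None
                          ) -> urllib.request.Request:
    """Prepare (but do not send) a POST that triggers an existing job."""
    payload: Dict[str, object] = {"job_id": job_id}
    if notebook_params:
        payload["notebook_params"] = notebook_params
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + "/api/2.1/jobs/run-now",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder workspace, token, and job ID.
request = build_run_now_request(
    "https://example.cloud.databricks.com",
    "dapi-EXAMPLE-TOKEN",
    job_id=123,
    notebook_params={"run_date": "2021-06-01"},
)
print(request.get_method(), request.full_url)
```

A scheduler task would pass the request to `urllib.request.urlopen` and read the run identifier from the JSON response; in practice the token would come from the scheduler's secret store rather than from code.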
  • 37. Automated Tooling ● MLens ○ PySpark ○ HiveQL ○ Oozie to Airflow, Azure Data Factory (roadmap)
  • 39. Security and Governance: Authentication, Authorization, Metadata Management - Single Sign On (SSO) with a SAML 2.0 supported corporate directory. - Access Control Lists (ACLs) for Databricks RBAC. - Table ACLs - Dynamic Views for column/row permissions - Leverage cloud native security: IAM Federation and AAD passthrough. - Integration with Ranger and Immuta for more advanced RBAC and ABAC. - Integration with 3rd party services. AWS Glue
  • 41. Migrating Security Policies from Hadoop to Databricks Enabling enterprises to responsibly use their data in the cloud Powered by Apache Ranger
  • 42. HADOOP ECOSYSTEM ● 100s and 1000s of tables in Apache Hive ● 100s of policies in Apache Ranger ● Variety of policies. Resource Based, Tag Based, Masking, Row Level Filters, etc. ● Policies for Users and Groups from AD/LDAP
  • 43. PRIVACERA AND DATABRICKS Hive MetaStore MetaStore Dataset Schema Policies
  • 44. SEAMLESS MIGRATION INSTANTLY TRANSFER YEARS OF EFFORT INSTANTLY IMPLEMENT THE SAME POLICIES IN DATABRICKS AS ON-PREM
  • 45. Privacera Value Add - Enhancing Databricks Authorization ● Richer, deeper, and more robust access control ● Row/column level access control in SQL ● Dynamic and static data de-identification ● File level access control for DataFrames, object level access ● Read/write operations supported. Object Store (S3/ADLS), supported with Privacera + Databricks: S3 bucket level: Y; S3 object level: Y; ADLS: Y. Spark SQL and R, supported with Privacera + Databricks: Table: Y; Column: Y; Column masking: Y; Row level filtering: Y; Tag based policies: Y; Attribute based policies: Y; Centralized auditing: Y
  • 46. [Architecture diagram: Databricks SQL/Python cluster (Spark driver with Ranger plugin, Spark executors) performing Spark SQL and/or Spark read/write; Privacera Cloud components: Ranger Policy Manager, Privacera Portal, Privacera Audit Server, DB, Solr, Apache Kafka, Privacera Anomaly Detection and Alerting, Privacera Discovery, Privacera Approval Workflow; external systems: Splunk, CloudWatch, SIEM, AD/LDAP, 3rd party catalog; actors: business user, admin user]
  • 49. What about the SQL Community? Hadoop: ● HUE ○ Data browsing ○ SQL Editor ○ Visualizations ● Interactive SQL ○ Impala ○ Hive LLAP Databricks: ● SQL Analytics Workspace ○ Data Browser ○ SQL Editor ○ Visualizations ● Interactive SQL ○ Spark optimizations - Adaptive Query Execution ○ Advanced Caching ○ Project Photon ○ Scaling cluster of clusters
  • 50. SQL & BI Layer: Optimized SQL and BI Performance, BI Integrations, Tuned - Fast queries with Delta Engine. - Support for high concurrency with auto-scaling clusters. - Optimized JDBC/ODBC drivers. - Optimized and tuned for BI and SQL out of the box. Compatible with any BI client and tool that supports Spark.
  • 51. Vision Give SQL users a home in Databricks Provide SQL workbench, light dashboarding, and alerting capabilities Great BI experience on the data lake Enable companies to effectively leverage the data lake from any BI tool without having to move the data around. Easy to use & price-performant Minimal setup & configuration. Data lake price performance.
  • 52. SQL-native user interface for analysts ▪ Familiar SQL Editor ▪ Auto Complete ▪ Built in visualizations ▪ Data Browser ▪ Automatic Alerts ▪ Trigger based upon values ▪ Email or Slack integration ▪ Dashboards ▪ Simply convert queries to dashboards ▪ Share with Access Control
  • 53. Built-in connectors for existing BI tools Other BI & SQL clients that support ▪ Supports your favorite tool ▪ Connectors for top BI & SQL clients ▪ Simple connection setup ▪ Optimized performance ▪ OAuth & Single Sign On ▪ Quick and easy authentication experience. No need to deal with access tokens. ▪ Power BI Available now ▪ Others coming soon
  • 54. Performance Delta Metadata Performance Improved read performance for cold queries on Delta tables. Provides interactive metadata performance regardless of # of Delta tables in a query or table sizes. New ODBC / JDBC Drivers Wire protocol re-engineered to provide lower latencies & higher data transfer speeds: ▪ Lower latency / less overhead (~¼ sec) with reduced round trips per request ▪ Higher transfer rate (up to 50%) using Apache Arrow ▪ Optimized metadata performance for ODBC/JDBC APIs (up to 10x for metadata retrieval operations) Photon - Delta Engine [Preview] New MPP engine built from scratch in C++. Vectorized to exploit data level parallelism and instruction-level parallelism. Optimized for modern structured and semi-structured workloads.
  • 56. It all starts with a plan ● Databricks and our partner community can help you ○ Assess ○ Plan ○ Validate ○ Execute
  • 57. Considerations for your migration to Databricks ● Administration ● Data Migration ● Data Processing ● Security & Governance ● SQL and BI Layer
  • 59. Next Steps ● You will receive a follow up email from our teams ● Let us help you with your Hadoop Migration Journey
  • 60. Follow up materials - Useful links
  • 63. Databricks AWS Reference Architecture
  • 64. Demo