Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s New, What’s Coming … and What’s Missing Right Now
1. Mark Rittman, Independent Analyst, MJR Analytics
DATA INTEGRATION AND DATA WAREHOUSING
FOR CLOUD, BIG DATA AND IOT:
WHAT’S NEW, WHAT’S COMING … AND WHAT’S MISSING RIGHT NOW
BIG DATA WORLD, LONDON
London, March 2017
2. •Oracle ACE Director, Independent Analyst
•Past ODTUG Exec Board Member + Oracle Scene Editor
•Author of two books on Oracle BI
•Co-founder & CTO of Rittman Mead
•15+ Years in Oracle BI, DW, ETL + now Big Data
•Host of the Drill to Detail Podcast (www.drilltodetail.com)
•Based in Brighton & work in London, UK
About The Presenter
2
4. •Data warehouses provided a unified view of the business
•Single place to store key data and metrics
•Joined-up view of the business
•Aggregates and conformed dimensions
•ETL routines to load, cleanse and conform data
•BI tools for simple, guided access to information
•Tabular data access using SQL-generating tools
•Drill paths, hierarchies, facts, attributes
•Fast access to pre-computed aggregates
•Packaged BI for fast-start ERP analytics
4
[Diagram: source systems (Core ERP Platform, Retail, Banking, Call Center, E-Commerce and CRM, running on Oracle, MongoDB, Sybase, IBM DB2 and MS SQL Server) feed a Data Warehouse built from an ODS / Foundation Layer and an Access & Performance Layer, accessed through Business Intelligence Tools]
4
Data Warehousing Back in the Mid-2000s
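The load, cleanse and conform steps listed above can be sketched in a few lines of Python; the source records, field names and the conformed region lookup are illustrative assumptions, not taken from the deck.

```python
# Minimal ETL sketch: load raw records, cleanse them, then conform a
# dimension value so every source system agrees on one region code.
# All field names and lookup values here are illustrative assumptions.

RAW_ORDERS = [
    {"order_id": "1001", "region": " uk ", "amount": "250.00"},
    {"order_id": "1002", "region": "United Kingdom", "amount": "99.50"},
    {"order_id": "1003", "region": "DE", "amount": None},  # missing amount
]

# Conformed dimension: map the many source spellings onto one code
REGION_CONFORM = {"uk": "GB", "united kingdom": "GB", "de": "DE"}

def cleanse_and_conform(record):
    """Trim and normalise fields, then map region onto the conformed code."""
    region_raw = (record["region"] or "").strip().lower()
    return {
        "order_id": record["order_id"],
        "region": REGION_CONFORM.get(region_raw, "UNKNOWN"),
        "amount": float(record["amount"] or 0.0),
    }

warehouse_rows = [cleanse_and_conform(r) for r in RAW_ORDERS]
```

Real ETL tools add error handling, surrogate keys and incremental loads, but the shape is the same: every source spelling ends up as the same conformed dimension value.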
5. How Traditional RDBMS Data Warehousing Scaled-Up
5
Shared-Everything Architectures (e.g. Oracle RAC, Exadata)
Shared-Nothing Architectures (e.g. Teradata, Netezza)
6. •Google needed to store and query their vast amount of server log files
•And wanted to do so using cheap, commodity hardware
•Google File System and MapReduce designed together for this use
Around the Same Time…
6
•GFS optimised for the particular task at hand: computing PageRank for sites
•Streaming reads for PageRank calcs, block writes for
crawler whole-site dumps
•Master node only holds metadata
•Stops client/master I/O being bottleneck, also acts as
traffic controller for clients
•Simple design, optimised for specific Google Need
•MapReduce focused on simple computations on
abstraction framework
•Select & filter (MAP) and reduce (aggregate) functions, easy to distribute across a cluster
•MapReduce abstracted cluster compute, HDFS
abstracted cluster storage
•Projects that inspired Apache Hadoop + HDFS
Google File System + MapReduce Key Innovations
7
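The map (select & filter) and reduce (aggregate) split described above can be illustrated with the classic word-count example, here in plain Python rather than the Java MapReduce API; the sort-and-group step stands in for the framework's shuffle phase.

```python
from itertools import groupby
from operator import itemgetter

# Plain-Python illustration of the MapReduce split: a map function
# emits (key, value) pairs, the framework shuffles them (here: sort
# and group by key), and a reduce function aggregates each group.

def map_fn(line):
    """MAP: select & filter - emit (word, 1) for each word in a line."""
    for word in line.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    """REDUCE: aggregate all values seen for one key."""
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for line in lines for pair in map_fn(line)]
shuffled = groupby(sorted(mapped), key=itemgetter(0))   # shuffle phase
counts = dict(reduce_fn(w, (c for _, c in grp)) for w, grp in shuffled)
```

Because map and reduce are pure functions over key/value pairs, the framework can run thousands of copies of each across a cluster without the programmer thinking about distribution.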
8. •A cheap, easily expandable way of storing (non-relational) data
•Gave us a way of scaling beyond TB-size without paying $$$
•First use-cases were offline storage, active archive of data
Hadoop’s Original Appeal to Data Warehouse Owners
8
9. •Driven by pace of business, and user demands for more agility and control
•Traditional IT-governed data loading not always appropriate
•Not all data needed to be modelled right-away
•Not all data suited storing in tabular form
•New ways of analyzing data beyond SQL
•Graph analysis
•Machine learning
Data Warehousing and ETL Needed Some Agility
9
10. •Hadoop started by being synonymous with MapReduce and Java coding
•But YARN (Yet Another Resource Negotiator) broke this dependency
•Hadoop now just handles resource management
•Multiple different query engines can run against data in-place
•General-purpose (e.g. MapReduce)
•Graph processing
•Machine Learning
•Real-Time Processing
Hadoop 2.0 - Enabling Multiple Query Engines
10
12. •Storing data in the format it arrived in, then applying a schema at query time
•Suits data that may be analysed in different ways by different tools
•In addition, some datatypes may have schema embedded in the file format
•Key benefit: fast-arriving data of unknown value can get to users earlier
•Made possible by tools such as Apache Hive + SerDes,
Apache Drill and self-describing file formats, HDFS storage
Advent of Schema-on-Read, and Data Lakes
11
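Schema-on-read can be sketched without any Hadoop machinery: events land as raw JSON lines and are stored untouched, and each consumer projects its own schema at query time. The event shape and field names below are illustrative assumptions.

```python
import json

# Schema-on-read sketch: events are stored exactly as they arrived;
# each consuming tool applies its own schema when it queries, instead
# of the data being modelled once at load time. Field names assumed.

raw_events = [
    '{"ts": "2017-03-01T10:00:00", "user": "alice", "page": "/home", "ms": 120}',
    '{"ts": "2017-03-01T10:00:05", "user": "bob", "page": "/search"}',
]

def read_with_schema(lines, schema):
    """Project each raw record onto a schema at query time,
    defaulting missing fields to None rather than rejecting rows."""
    for line in lines:
        rec = json.loads(line)
        yield {field: rec.get(field) for field in schema}

# Two tools, two schemas, one copy of the raw data
clickstream = list(read_with_schema(raw_events, ["user", "page"]))
latency     = list(read_with_schema(raw_events, ["ts", "ms"]))
```

This is essentially what a Hive SerDe or a self-describing file format does: the raw bytes stay put, and the table structure is a view imposed when the query runs.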
12. •Data now landed in Hadoop clusters, NoSQL databases and Cloud Storage
•Flexible data storage platform with cheap storage, flexible schema support + compute
•Solves the problem of how to store new types of data + choose best time/way to process it
•Hadoop/NoSQL increasingly used for all store/transform/query tasks
Data Warehousing Circa 2010 : The “Data Lake”
12
[Diagram: the "Data Lake" reference architecture. Data streams reach a Hadoop-platform Data Reservoir and Data Factory via file-based, stream-based and ETL-based integration. Sources include operational data (transactions, customer master data) and unstructured data (voice + chat transcripts). Raw Customer Data is stored in the original format (usually files) such as SS7, ASN.1, JSON etc; Mapped Customer Data holds the data sets produced by mapping and transforming the raw data. Discovery & Development Labs provide a safe and secure discovery and development environment with data sets, samples, models and programs; on the data access side, Business Intelligence Tools and Marketing / Sales Applications consume models, machine learning output and segments]
14. •On-premise Hadoop, even with simple resilient clustering, will hit limits
•Clusters can reach 5000+ nodes, and need to scale up for demand peaks etc
•Scale limits are encountered way beyond those for DWs…
•… but future is elastically-scaled, query and compute-as-a-service
On-Premise Big Data Analytics Hits Its Limits
14
Oracle Big Data Cloud Compute Edition
Free $300 developer credit at:
https://cloud.oracle.com/en_US/tryit
15. •New generation of big data platform services from Google, Amazon, Oracle
•Combines three key innovations from earlier technologies:
•Organising of data into tables and columns (from RDBMS DWs)
•Massively-scalable and distributed storage and query (from Big Data)
•Elastically-scalable Platform-as-a-Service (from Cloud)
Elastically-Scalable Data Warehouse-as-a-Service
15
17. •And things come full-circle … analytics
typically requires tabular data
•Google BigQuery is based on the Dremel massively-parallel query engine
•But stores data in columnar format and provides a SQL interface
•Solves the problem of providing DW-like
functionality at scale, as-a-service
•This is the future … ;-)
BigQuery : Big Data Meets Data Warehousing
17
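The reason a DW-as-a-service engine stores data columnar can be shown with a toy example: a typical analytic query aggregates one column over many rows, and a column store scans only that column rather than every field of every row. The table contents below are illustrative.

```python
# Why columnar storage suits DW-style queries: SUM over one column
# reads a single contiguous array, never touching the other fields
# of each row. Table contents here are illustrative only.

# Row store: one record per row, all fields stored together
row_store = [
    {"cust": "a", "country": "UK", "amount": 10.0},
    {"cust": "b", "country": "DE", "amount": 20.0},
    {"cust": "c", "country": "UK", "amount": 30.0},
]

# Column store: one array per column, same logical table
col_store = {
    "cust":    ["a", "b", "c"],
    "country": ["UK", "DE", "UK"],
    "amount":  [10.0, 20.0, 30.0],
}

# SELECT SUM(amount): same answer, very different I/O pattern
total_rows = sum(r["amount"] for r in row_store)   # touches every field
total_cols = sum(col_store["amount"])              # touches one column
```

At BigQuery scale the difference is billions of rows of skipped I/O per query, plus much better compression within each column.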
23. •By definition there's lots of data in a big data system ... so how do you find the data you
want?
•Google's own internal solution - GOODS ("Google Dataset Search")
•Uses crawler to discover new datasets
•ML classification routines to infer domain
•Data provenance and lineage
•Indexes and catalogs 26bn datasets
•Other users, vendors also have solutions
•Oracle Big Data Discovery
•Datameer
•Platfora
•Cloudera Navigator
Google GOODS - Catalog + Search At Google-Scale
23
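The crawl/classify/index pattern behind a catalog like GOODS can be sketched at toy scale: discover datasets, infer a crude domain from their schemas, and build a searchable index. Everything here (dataset names, columns, the rule-based classifier standing in for GOODS's ML routines) is an illustrative assumption.

```python
# Toy sketch of a dataset catalog: crawl known datasets, infer a
# domain from column names (a stand-in for ML classification), and
# index the results for search. All names and rules are assumptions.

DATASETS = {
    "web_clicks_2017": ["ts", "user", "page"],
    "orders_q1":       ["order_id", "cust", "amount"],
    "orders_q2":       ["order_id", "cust", "amount"],
}

def infer_domain(columns):
    """Stand-in for the ML classifier: rule-based domain inference."""
    if "amount" in columns:
        return "sales"
    if "page" in columns:
        return "clickstream"
    return "unknown"

# Build the catalog: dataset name -> metadata (schema + inferred domain)
catalog = {
    name: {"columns": cols, "domain": infer_domain(cols)}
    for name, cols in DATASETS.items()
}

def search(domain):
    """Find datasets whose inferred domain matches the query."""
    return sorted(n for n, meta in catalog.items() if meta["domain"] == domain)
```

The real systems add provenance, lineage and popularity signals to rank results, but the core loop is the same: discover, classify, index, search.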
25. •Came out of the data science movement, as a way to "show workings"
•A set of reproducible steps that tell a story about the data
•as well as being a better command-line environment for data analysis
•One example is Jupyter, the evolution of the IPython notebook
•supports PySpark, Pandas etc
•See also Apache Zeppelin
Web-Based Data Analysis Notebooks
25