Data Warehouses and
Multi-Dimensional
Data Analysis
Raimonds Simanovskis
@rsim
Vampires live here
500 km long beach (310.686 miles)
Other vampires live here
Sales app example
class Customer < ActiveRecord::Base
  has_many :orders
end

class Order < ActiveRecord::Base
  belongs_to :customer
  has_many :order_items
end

class OrderItem < ActiveRecord::Base
  belongs_to :order
  belongs_to :product
end

class Product < ActiveRecord::Base
  belongs_to :product_class
  has_many :order_items
end

class ProductClass < ActiveRecord::Base
  has_many :products
end
Database schema
One day the CEO asks a question…
What were the total sales amounts in California in Q1 2014 by product families?
Total sales amount …
OrderItem.sum("amount")
… in California …
OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  sum("order_items.amount")
… in Q1 2014 …
OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  where("extract(year from orders.order_date) = ?", 2014).
  where("extract(quarter from orders.order_date) = ?", 1).
  sum("order_items.amount")
… by product families
OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  where("extract(year from orders.order_date) = ?", 2014).
  where("extract(quarter from orders.order_date) = ?", 1).
  joins(:product => :product_class).
  group("product_classes.product_family").
  sum("order_items.amount")
Generated SQL
OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  where("extract(year from orders.order_date) = ?", 2014).
  where("extract(quarter from orders.order_date) = ?", 1).
  joins(:product => :product_class).
  group("product_classes.product_family").
  sum("order_items.amount")

SELECT SUM(order_items.amount) AS sum_order_items_amount,
  product_classes.product_family AS product_classes_product_family
FROM "order_items"
  INNER JOIN "orders" ON "orders"."id" = "order_items"."order_id"
  INNER JOIN "customers" ON "customers"."id" = "orders"."customer_id"
  INNER JOIN "products" ON "products"."id" = "order_items"."product_id"
  INNER JOIN "product_classes" ON "product_classes"."id" = "products"."product_class_id"
WHERE "customers"."country" = 'USA'
  AND "customers"."state_province" = 'CA'
  AND (extract(YEAR FROM orders.order_date) = 2014)
  AND (extract(quarter FROM orders.order_date) = 1)
GROUP BY product_classes.product_family
… and also sales cost?
OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  where("extract(year from orders.order_date) = ?", 2014).
  where("extract(quarter from orders.order_date) = ?", 1).
  joins(:product => :product_class).
  group("product_classes.product_family").
  select("product_classes.product_family," +
    "SUM(order_items.amount) AS sales_amount," +
    "SUM(order_items.cost) AS sales_cost").
  map { |i| i.attributes.compact }
… and unique customers count?
OrderItem.joins(:order => :customer).
  where("customers.country" => "USA", "customers.state_province" => "CA").
  where("extract(year from orders.order_date) = ?", 2014).
  where("extract(quarter from orders.order_date) = ?", 1).
  joins(:product => :product_class).
  group("product_classes.product_family").
  select("product_classes.product_family," +
    "SUM(order_items.amount) AS sales_amount," +
    "SUM(order_items.cost) AS sales_cost," +
    "COUNT(DISTINCT customers.id) AS customers_count").
  map { |i| i.attributes.compact }
Is it clear?
#@%$^&
OrderItem.joins(:order => :customer).
  where("customers.country" => "USA",
    "customers.state_province" => "CA").
  where("extract(year from orders.order_date) = ?", 2014).
  where("extract(quarter from orders.order_date) = ?", 1).
  joins(:product => :product_class).
  group("product_classes.product_family").
  select("product_classes.product_family," +
    "SUM(order_items.amount) AS sales_amount," +
    "SUM(order_items.cost) AS sales_cost," +
    "COUNT(DISTINCT customers.id) AS customers_count").
  map { |i| i.attributes.compact }
Performance slows down on larger data volumes
$ rails console
>> OrderItem.count
   (677.0ms) SELECT COUNT(*) FROM "order_items"
=> 6218022
>> Order.count
   (126.0ms) SELECT COUNT(*) FROM "orders"
=> 642362
>> OrderItem.joins(:order => :customer).
     joins(:product => :product_class).
     group("product_classes.product_family").
     select("product_classes.product_family," +
       "SUM(order_items.amount) AS sales_amount," +
       "SUM(order_items.cost) AS sales_cost," +
       "COUNT(DISTINCT customers.id) AS customers_count").
     map { |i| i.attributes.compact }
   OrderItem Load (25437.0ms) ...
6 million rows, 25 seconds
You should use NoSQL!
Dimensional Modeling
Deliver data that’s understandable to the business users
Deliver fast query performance
Dimensional Modeling
What were the total sales amounts (fact or measure)
in California (Customer / Region dimension)
in Q1 2014 (Time dimension)
by product families (Product dimension)?
Data Warehouse
“Star schema” with fact and dimension tables
“Snowflake schema”
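To make the two shapes concrete, here is a minimal in-memory sketch in plain Ruby. The fact rows point at dimension rows by id (the star), and products point at a separate product class table (the snowflake part). All ids, rows, and amounts are invented sample data; only the column names follow the sales example in these slides.

```ruby
# Illustrative in-memory star schema; all rows below are made-up sample data.
d_customers = {
  1 => { country: "USA", state_province: "CA" },
  2 => { country: "USA", state_province: "WA" }
}
d_time = {
  20140115 => { year: 2014, quarter: 1 },
  20140620 => { year: 2014, quarter: 2 }
}
# Snowflake part: products reference a separate product class table.
d_product_classes = {
  10 => { product_family: "Food" },
  20 => { product_family: "Drink" }
}
d_products = {
  100 => { product_class_id: 10 },
  200 => { product_class_id: 20 }
}
# Fact table: foreign keys to each dimension plus a numeric measure.
f_sales = [
  { customer_id: 1, product_id: 100, time_id: 20140115, sales_amount: 50.0 },
  { customer_id: 1, product_id: 200, time_id: 20140115, sales_amount: 20.0 },
  { customer_id: 1, product_id: 100, time_id: 20140620, sales_amount: 99.0 },
  { customer_id: 2, product_id: 100, time_id: 20140115, sales_amount: 70.0 }
]

# Total sales amount in California in Q1 2014 by product family.
result = Hash.new(0.0)
f_sales.each do |fact|
  customer = d_customers.fetch(fact[:customer_id])
  time     = d_time.fetch(fact[:time_id])
  next unless customer[:country] == "USA" && customer[:state_province] == "CA"
  next unless time[:year] == 2014 && time[:quarter] == 1

  product = d_products.fetch(fact[:product_id])
  family  = d_product_classes.fetch(product[:product_class_id])[:product_family]
  result[family] += fact[:sales_amount]
end

p result
```

Every filter and grouping attribute is a plain column on a dimension row, so the question reads as "filter by dimensions, sum a measure".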
Data Warehouse Models
class Dwh::SalesFact < Dwh::Fact
  belongs_to :customer, class_name: "Dwh::CustomerDimension"
  belongs_to :product, class_name: "Dwh::ProductDimension"
  belongs_to :time, class_name: "Dwh::TimeDimension"
end

class Dwh::CustomerDimension < Dwh::Dimension
  has_many :sales_facts, class_name: "Dwh::SalesFact",
    foreign_key: "customer_id"
end

class Dwh::ProductDimension < Dwh::Dimension
  has_many :sales_facts, class_name: "Dwh::SalesFact", foreign_key: "product_id"
  belongs_to :product_class, class_name: "Dwh::ProductClassDimension"
end

class Dwh::ProductClassDimension < Dwh::Dimension
  has_many :products, class_name: "Dwh::ProductDimension", foreign_key: "product_class_id"
end

class Dwh::TimeDimension < Dwh::Dimension
  has_many :sales_facts, class_name: "Dwh::SalesFact",
    foreign_key: "time_id"
end
Load Dimension
class Dwh::CustomerDimension < Dwh::Dimension
  # ...
  def self.truncate!
    connection.execute "TRUNCATE TABLE #{table_name}"
  end

  def self.load!
    truncate!
    column_names = %w(id full_name city state_province country
      birth_date gender created_at updated_at)
    connection.insert %[
      INSERT INTO #{table_name} (#{column_names.join(',')})
      SELECT #{column_names.join(',')}
      FROM #{::Customer.table_name}
    ]
  end
end
Generate Time Dimension
class Dwh::TimeDimension < Dwh::Dimension
  def self.load!
    connection.select_values(%[
      SELECT DISTINCT order_date FROM #{Order.table_name}
      WHERE order_date NOT IN
        (SELECT date_value FROM #{table_name})
    ]).each do |date|
      year, month, day = date.year, date.month, date.day
      quarter = ((month-1)/3)+1
      quarter_name = "Q#{quarter} #{year}"
      month_name = date.strftime("%b %Y")
      day_name = date.strftime("%b %d %Y")
      sql = send :sanitize_sql_array, [
        %[
          INSERT INTO #{table_name}
          (id, date_value, year, quarter, month, day,
          year_name, quarter_name, month_name, day_name)
          VALUES
          (?, ?, ?, ?, ?, ?,
          ?, ?, ?, ?)
        ],
        date_to_id(date), date, year, quarter, month, day,
        year.to_s, quarter_name, month_name, day_name
      ]
      connection.insert sql
    end
  end
end
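The attribute derivation can be tried without a database. The sketch below computes one time dimension row for a given date in plain Ruby. The slides do not show `date_to_id`, so the version here is an assumption: it returns a YYYYMMDD integer, which would line up with the `to_char(o.order_date, 'YYYYMMDD')` cast used for `time_id` when the facts are loaded. `time_dimension_row` is a hypothetical helper name.

```ruby
require 'date'

# Assumption: date_to_id yields a YYYYMMDD integer, matching the
# to_char(order_date, 'YYYYMMDD') cast used in the fact load.
def date_to_id(date)
  date.strftime("%Y%m%d").to_i
end

# Hypothetical helper: the column values for one row of the time dimension.
def time_dimension_row(date)
  quarter = ((date.month - 1) / 3) + 1 # months 1-3 => 1, 4-6 => 2, ...
  {
    id: date_to_id(date),
    date_value: date,
    year: date.year, quarter: quarter, month: date.month, day: date.day,
    year_name: date.year.to_s,
    quarter_name: "Q#{quarter} #{date.year}",
    month_name: date.strftime("%b %Y"),
    day_name: date.strftime("%b %d %Y")
  }
end

row = time_dimension_row(Date.new(2014, 8, 1))
puts row[:id]           # 20140801
puts row[:quarter_name] # Q3 2014
puts row[:day_name]     # Aug 01 2014
```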
Load Facts
class Dwh::SalesFact < Dwh::Fact
  def self.load!
    truncate!
    connection.insert %[
      INSERT INTO #{table_name}
        (customer_id, product_id, time_id,
        sales_quantity, sales_amount, sales_cost)
      SELECT
        o.customer_id, oi.product_id,
        CAST(to_char(o.order_date, 'YYYYMMDD') AS INTEGER),
        oi.quantity, oi.amount, oi.cost
      FROM
        #{OrderItem.table_name} oi
        INNER JOIN #{Order.table_name} o ON o.id = oi.order_id
    ]
  end
end
What were the total sales amounts in California in Q1 2014 by product families?
Dwh::SalesFact.
  joins(:customer).joins(:product => :product_class).joins(:time).
  where("d_customers.country" => "USA",
    "d_customers.state_province" => "CA").
  where("d_time.year" => 2014, "d_time.quarter" => 1).
  group("d_product_classes.product_family").
  sum("sales_amount")
Two-Dimensional Table
Rows, Columns, Cell
Multi-Dimensional Data Model
Data cube: Dimension, Dimension, Dimension, Measures
Multi-Dimensional Data Model
Sales cube
Dimensions: Time, Product, Customer
Measures:
Sales quantity
Sales amount
Sales cost
Customers count
Dimension Hierarchies
Levels, with example members:
All: All Customers
Country: USA, Canada
State: WA, CA, OR
City: San Francisco, Los Angeles
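A hierarchy lets the same measure be aggregated at any level. The following plain-Ruby sketch rolls a sales amount up from City to State to Country to All. The rows and amounts are invented sample data, and `roll_up` is a hypothetical helper, not part of any library mentioned here.

```ruby
# Made-up sample rows: each one carries its full path through the hierarchy.
sales = [
  { country: "USA",    state: "CA", city: "San Francisco", amount: 100.0 },
  { country: "USA",    state: "CA", city: "Los Angeles",   amount: 250.0 },
  { country: "USA",    state: "WA", city: "Seattle",       amount: 80.0 },
  { country: "Canada", state: "BC", city: "Vancouver",     amount: 40.0 }
]

# Aggregates the amount at the given level. The key is the path from the
# top level down to that level, so a state is always qualified by its country.
def roll_up(rows, level, levels: %i[country state city])
  depth = levels.index(level) + 1
  rows.each_with_object(Hash.new(0.0)) do |row, totals|
    path = levels.first(depth).map { |name| row[name] }
    totals[path] += row[:amount]
  end
end

by_country = roll_up(sales, :country)
by_state   = roll_up(sales, :state)
all_total  = sales.sum { |row| row[:amount] } # the "All" level

p by_country
p by_state
p all_total
```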
Time Dimension
Default hierarchy
All: All Times
Year: 2014, 2015
Quarter: Q1, Q2, Q3, Q4
Month: JUL, AUG, SEP
Day: AUG 01, AUG 02
Weekly hierarchy
All: All Times
Year: 2014, 2015
Week: W1, W2, W3, W4
Day: JAN 17, JAN 18, JAN 19
OLAP Technologies
On-Line Analytical Processing
Mondrian http://community.pentaho.com/projects/mondrian/
mondrian-olap gem https://github.com/rsim/mondrian-olap
mondrian-olap schema definition
Mondrian::OLAP::Schema.define do
  cube 'Sales' do
    table 'f_sales', schema: 'dwh'
    dimension 'Customer', foreign_key: 'customer_id' do
      hierarchy all_member_name: 'All Customers', primary_key: 'id' do
        table 'd_customers', schema: 'dwh'
        level 'Country', column: 'country'
        level 'State Province', column: 'state_province'
        level 'City', column: 'city'
        level 'Name', column: 'full_name'
      end
    end
    dimension 'Product', foreign_key: 'product_id' do
      hierarchy all_member_name: 'All Products', primary_key: 'id', primary_key_table: 'd_products' do
        join left_key: 'product_class_id', right_key: 'id' do
          table 'd_products', schema: 'dwh'
          table 'd_product_classes', schema: 'dwh'
        end
        level 'Product Family', table: 'd_product_classes', column: 'product_family'
        level 'Product Department', table: 'd_product_classes', column: 'product_department'
        level 'Product Category', table: 'd_product_classes', column: 'product_category'
        level 'Product Subcategory', table: 'd_product_classes', column: 'product_subcategory'
        level 'Brand Name', table: 'd_products', column: 'brand_name'
        level 'Product Name', table: 'd_products', column: 'product_name'
      end
    end
    dimension 'Time', foreign_key: 'time_id', type: 'TimeDimension' do
      hierarchy all_member_name: 'All Time', primary_key: 'id' do
        table 'd_time', schema: 'dwh'
        level 'Year', column: 'year', type: 'Numeric', name_column: 'year_name', level_type: 'TimeYears'
        level 'Quarter', column: 'quarter', type: 'Numeric', name_column: 'quarter_name', level_type: 'TimeQuarters'
        level 'Month', column: 'month', type: 'Numeric', name_column: 'month_name', level_type: 'TimeMonths'
        level 'Day', column: 'day', type: 'Numeric', name_column: 'day_name', level_type: 'TimeDays'
      end
    end
    measure 'Sales Quantity', column: 'sales_quantity', aggregator: 'sum'
    measure 'Sales Amount', column: 'sales_amount', aggregator: 'sum'
    measure 'Sales Cost', column: 'sales_cost', aggregator: 'sum'
    measure 'Customers Count', column: 'customer_id', aggregator: 'distinct-count'
  end
end
What were the total sales amounts in California in Q1 2014 by product families?
olap.from("Sales").
  columns("[Measures].[Sales Amount]").
  rows("[Product].[Product Family].Members").
  where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")
MDX Query Language
olap.from("Sales").
  columns("[Measures].[Sales Amount]").
  rows("[Product].[Product Family].Members").
  where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")

SELECT {[Measures].[Sales Amount]} ON COLUMNS,
  [Product].[Product Family].Members ON ROWS
FROM [Sales]
WHERE ([Customer].[USA].[CA], [Time].[Quarter].[Q1 2014])
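To show how the builder calls line up with the MDX text, here is a small plain-Ruby sketch that assembles the same statement from its parts: a set of measures on the COLUMNS axis, a member set on the ROWS axis, the cube in FROM, and a slicer tuple in WHERE. `mdx_for` is a hypothetical helper written for this illustration; it is not how the mondrian-olap gem builds queries internally.

```ruby
# Hypothetical helper for illustration only; not the mondrian-olap implementation.
def mdx_for(cube:, columns:, rows:, where: [])
  mdx = +"SELECT {#{columns.join(', ')}} ON COLUMNS,\n"
  mdx << "#{rows} ON ROWS\n"
  mdx << "FROM [#{cube}]"
  # The WHERE clause is a tuple of members that slices the whole result.
  mdx << "\nWHERE (#{where.join(', ')})" unless where.empty?
  mdx
end

mdx = mdx_for(
  cube: "Sales",
  columns: ["[Measures].[Sales Amount]"],
  rows: "[Product].[Product Family].Members",
  where: ["[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]"]
)
puts mdx
```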
Results Caching
SELECT {[Measures].[Sales Amount], [Measures].[Sales Cost],
[Measures].[Customers Count]} ON COLUMNS,
[Product].[Product Family].Members ON ROWS
FROM [Sales] (21713.0ms)
SELECT {[Measures].[Sales Amount], [Measures].[Sales Cost],
[Measures].[Customers Count]} ON COLUMNS,
[Product].[Product Family].Members ON ROWS
FROM [Sales] (10.0ms)
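The drop from about 21.7 seconds to 10 milliseconds comes from answering the repeated query from cache. The idea can be sketched in plain Ruby as memoization keyed by the query text. `QueryCache` is a hypothetical stand-in; Mondrian's own caching lives inside the engine and is more sophisticated than this.

```ruby
# Hypothetical stand-in for a result cache, keyed by the query text.
class QueryCache
  def initialize
    @results = {}
  end

  # Runs the block only on a cache miss and remembers its result.
  def fetch(query)
    @results.fetch(query) { @results[query] = yield }
  end
end

cache = QueryCache.new
executions = 0
query = "SELECT {[Measures].[Sales Amount]} ON COLUMNS FROM [Sales]"

answers = Array.new(2) do
  cache.fetch(query) do
    executions += 1 # stands in for the expensive aggregation
    :aggregated_result
  end
end

puts executions # 1
```

A cache like this needs invalidating whenever the warehouse is reloaded, otherwise it would serve stale totals.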
Additional Attribute Dimension
dimension 'Gender', foreign_key: 'customer_id' do
  hierarchy all_member_name: 'All Genders', primary_key: 'id' do
    table 'd_customers', schema: 'dwh'
    level 'Gender', column: 'gender' do
      name_expression do
        sql "CASE d_customers.gender
             WHEN 'F' THEN 'Female'
             WHEN 'M' THEN 'Male'
             END"
      end
    end
  end
end

olap.from("Sales").
  columns("[Measures].[Sales Amount]").
  rows("[Gender].[Gender].Members")
Dynamic Attribute Dimension
dimension 'Age interval', foreign_key: 'customer_id' do
  hierarchy all_member_name: 'All Age', primary_key: 'id' do
    table 'd_customers', schema: 'dwh'
    level 'Age interval' do
      key_expression do
        sql %[
          CASE
          WHEN age(d_customers.birth_date) < interval '20 years'
            THEN '< 20 years'
          WHEN age(d_customers.birth_date) < interval '30 years'
            THEN '20-30 years'
          WHEN age(d_customers.birth_date) < interval '40 years'
            THEN '30-40 years'
          WHEN age(d_customers.birth_date) < interval '50 years'
            THEN '40-50 years'
          ELSE '50+ years'
          END
        ]
      end
    end
  end
end
[Age interval].[< 20 years]
[Age interval].[20-30 years]
[Age interval].[30-40 years]
[Age interval].[40-50 years]
[Age interval].[50+ years]
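The same bucketing can be written in plain Ruby, which makes it easy to see why the dimension is dynamic: the bucket depends on today's date, so a customer moves to the next interval as they age. This is a sketch that assumes whole-year ages are enough. `age_in_years` and `age_interval` are hypothetical helpers, and `age_in_years` only approximates PostgreSQL's `age()`.

```ruby
require 'date'

# Whole years between birth_date and today.
def age_in_years(birth_date, today = Date.today)
  years = today.year - birth_date.year
  # Subtract one year if this year's birthday has not happened yet.
  birthday_pending = ([today.month, today.day] <=> [birth_date.month, birth_date.day]) < 0
  birthday_pending ? years - 1 : years
end

# Mirrors the CASE expression of the 'Age interval' level.
def age_interval(birth_date, today = Date.today)
  age = age_in_years(birth_date, today)
  case
  when age < 20 then "< 20 years"
  when age < 30 then "20-30 years"
  when age < 40 then "30-40 years"
  when age < 50 then "40-50 years"
  else "50+ years"
  end
end

puts age_interval(Date.new(1975, 6, 1), Date.new(2015, 6, 1)) # 40-50 years
```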
Calculation Formulas
calculated_member 'Profit', dimension: 'Measures', format_string: '#,##0.00',
  formula: '[Measures].[Sales Amount] - [Measures].[Sales Cost]'
calculated_member 'Margin %', dimension: 'Measures', format_string: '#,##0.00%',
  formula: '[Measures].[Profit] / [Measures].[Sales Amount]'

olap.from("Sales").
  columns("[Measures].[Profit]", "[Measures].[Margin %]").
  rows("[Product].[Product Family].Members").
  where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")
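The two formulas are evaluated per cell, after the base measures have been aggregated for that cell. As a worked example in plain Ruby, with invented amounts for two product families:

```ruby
# Made-up aggregated measure values per product family.
cells = {
  "Food"  => { sales_amount: 1000.0, sales_cost: 400.0 },
  "Drink" => { sales_amount: 500.0,  sales_cost: 300.0 }
}

report = cells.transform_values do |m|
  profit = m[:sales_amount] - m[:sales_cost]
  # Guard against a zero sales amount in this sketch.
  margin = m[:sales_amount].zero? ? nil : profit / m[:sales_amount]
  { profit: profit, margin: margin }
end

report.each do |family, m|
  margin_text = m[:margin] ? format("%.2f%%", m[:margin] * 100) : "n/a"
  puts "#{family}: profit #{format('%.2f', m[:profit])}, margin #{margin_text}"
end
```

Because the margin is computed from the aggregated profit and amount, it stays correct at every level of every hierarchy; averaging per-row margins would not.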
Enables Ad-hoc Queries by Users
ETL process
Extract, Transform, Load: from the sources (Database, REST API) into the Data Warehouse (Measures, Dimension1, Dimension2, Dimension3, Dimension4)
Ruby Tools for ETL
Kiba http://www.kiba-etl.org/
ETL https://github.com/square/ETL
Kiba example
# declare a ruby method here, for quick reusable logic
def parse_french_date(date)
  Date.strptime(date, '%d/%m/%Y')
end

# or better, include a ruby file which loads reusable assets
# eg: commonly used sources / destinations / transforms, under unit-test
require_relative 'common'

# declare a source where to take data from (you implement it - see notes below)
source MyCsvSource, 'input.csv'

# declare a row transform to process a given field
transform do |row|
  row[:birth_date] = parse_french_date(row[:birth_date])
  # return to keep in the pipeline
  row
end

# declare another row transform, dismissing rows conditionally by returning nil
transform do |row|
  row[:birth_date].year < 2000 ? row : nil
end

# declare a row transform as a class, which can be tested properly
transform ComplianceCheckTransform, eula: 2015
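To see what the two block transforms do to a row, the pipeline can be imitated in plain Ruby without the gem. This is only a sketch of the row flow described in the comments above: `InMemorySource` and the sample rows are hypothetical, and the loop is a stand-in for what Kiba does when it runs the job, not Kiba's own code.

```ruby
require 'date'

def parse_french_date(date)
  Date.strptime(date, '%d/%m/%Y')
end

# Hypothetical stand-in source: an object whose #each yields one row at a time.
class InMemorySource
  def initialize(rows)
    @rows = rows
  end

  def each(&block)
    @rows.each(&block)
  end
end

# The two block transforms from the example, written as lambdas.
transforms = [
  ->(row) { row.merge(birth_date: parse_french_date(row[:birth_date])) },
  ->(row) { row[:birth_date].year < 2000 ? row : nil }
]

source = InMemorySource.new([
  { name: "Alice", birth_date: "14/07/1989" },
  { name: "Bob",   birth_date: "01/02/2003" }
])

kept = []
source.each do |row|
  # Apply transforms in order; a nil result drops the row from the pipeline.
  result = transforms.reduce(row) { |r, t| r.nil? ? nil : t.call(r) }
  kept << result if result
end

p kept
```

Keeping each transform as a small callable is what makes the class-based form (`ComplianceCheckTransform`) easy to unit-test in isolation.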
Multithreaded ETL
https://github.com/ruby-concurrency/concurrent-ruby
Data source -> Extract ThreadPool -> Extracted data -> Transform ThreadPool -> Transformed data -> Load ThreadPool
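The staged layout can be sketched with Ruby's built-in `Thread` and `Queue`, used here as a stand-in for concurrent-ruby thread pools so that the example has no gem dependencies. Each stage has its own workers and passes rows to the next stage through a queue. `start_stage`, the `:done` marker, and the sample rows are all assumptions made for this illustration.

```ruby
# Starts pool_size worker threads that pop items from input, apply the block,
# and push the result to output. Each :done marker stops one worker.
def start_stage(pool_size, input, output, &work)
  Array.new(pool_size) do
    Thread.new do
      while (item = input.pop) != :done
        output << work.call(item)
      end
    end
  end
end

extracted   = Queue.new
transformed = Queue.new
loaded      = Queue.new

transformers = start_stage(2, extracted, transformed) do |row|
  row.merge(amount_cents: (row[:amount] * 100).round)
end
loaders = start_stage(2, transformed, loaded) do |row|
  row[:id] # pretend to write the row and report its id
end

# Extract: feed made-up rows, then one :done marker per transform worker.
(1..10).each { |i| extracted << { id: i, amount: i * 1.5 } }
2.times { extracted << :done }
transformers.each(&:join)

# Only after all transformers finish is it safe to stop the loaders.
2.times { transformed << :done }
loaders.each(&:join)

loaded_ids = []
loaded_ids << loaded.pop until loaded.empty?
puts loaded_ids.sort.inspect
```

Rows can arrive at the load stage in any order, so this layout suits loads where ordering does not matter.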
Pro-tip: Use
Single
threaded
ETL
class Dwh::TimeDimension < Dwh::Dimension
  def self.load!
    logger.silence do
      connection.select_values(%[
        SELECT DISTINCT order_date FROM #{Order.table_name}
        WHERE order_date NOT IN (SELECT date_value FROM #{table_name})
      ]).each do |date|
        insert_date(date)
      end
    end
  end

  def self.insert_date(date)
    year, month, day = date.year, date.month, date.day
    quarter = ((month-1)/3)+1
    quarter_name = "Q#{quarter} #{year}"
    month_name = date.strftime("%b %Y")
    day_name = date.strftime("%b %d %Y")
    sql = send :sanitize_sql_array, [
      %[
        INSERT INTO #{table_name}
        (id, date_value, year, quarter, month, day,
        year_name, quarter_name, month_name, day_name)
        VALUES
        (?, ?, ?, ?, ?, ?,
        ?, ?, ?, ?)
      ],
      date_to_id(date), date, year, quarter, month, day,
      year.to_s, quarter_name, month_name, day_name
    ]
    connection.insert sql
  end
end
ETL with Thread Pool
require 'concurrent/executors'

class Dwh::TimeDimension < Dwh::Dimension
  def self.parallel_load!(pool_size = 4)
    logger.silence do
      insert_date_pool = Concurrent::FixedThreadPool.new(pool_size)
      connection.select_values(%[
        SELECT DISTINCT order_date FROM #{Order.table_name}
        WHERE order_date NOT IN (SELECT date_value FROM #{table_name})
      ]).each do |date|
        insert_date_pool.post(date) do |date|
          connection_pool.with_connection do
            insert_date(date)
          end
        end
      end
      insert_date_pool.shutdown
      insert_date_pool.wait_for_termination
    end
  end
end
Benchmark!
Dwh::TimeDimension.load! (5236.0ms)
Dwh::TimeDimension.parallel_load!(2) (3450.0ms)
Dwh::TimeDimension.parallel_load!(4) (2142.0ms) <- optimal size in this case
Dwh::TimeDimension.parallel_load!(6) (2361.0ms)
Dwh::TimeDimension.parallel_load!(8) (2826.0ms)
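The timings above translate into speedups as follows. The numbers are the ones reported in the benchmark; the script is just the arithmetic.

```ruby
# Timings in milliseconds as reported in the benchmark.
single_threaded_ms = 5236.0
parallel_ms = { 2 => 3450.0, 4 => 2142.0, 6 => 2361.0, 8 => 2826.0 }

# Speedup relative to the single-threaded load, rounded to two decimals.
speedups = parallel_ms.transform_values { |ms| (single_threaded_ms / ms).round(2) }
best_pool_size, best_ms = parallel_ms.min_by { |_size, ms| ms }

speedups.each { |size, s| puts "pool size #{size}: #{s}x faster" }
puts "fastest: pool size #{best_pool_size} (#{best_ms}ms)"
```

Four threads give roughly a 2.4x speedup here, and adding more threads makes the load slower again, so the pool size is worth measuring rather than guessing.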
Java Mission Control
Traditional vs Analytical Relational Databases
Traditional: optimized for transaction processing
Analytical: optimized for analytical queries
Row-based Storage
Columnar Storage
http://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html
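The difference between the two layouts can be shown with plain Ruby data structures. This is a conceptual sketch with invented rows; real databases add compression, disk blocks, and much more on top.

```ruby
# Row-based layout: each record keeps all of its columns together.
rows = [
  { customer_id: 1, product_id: 100, sales_amount: 50.0, sales_cost: 30.0 },
  { customer_id: 1, product_id: 200, sales_amount: 20.0, sales_cost: 12.0 },
  { customer_id: 2, product_id: 100, sales_amount: 70.0, sales_cost: 45.0 }
]

# Columnar layout: one array per column, holding the same data.
columns = rows.first.keys.map { |name| [name, rows.map { |r| r[name] }] }.to_h

# Summing one measure: the row layout visits every record (and so every
# column stored with it), the columnar layout reads a single array.
row_total    = rows.sum { |r| r[:sales_amount] }
column_total = columns[:sales_amount].sum

puts row_total
puts column_total
```

Analytical queries typically touch a few columns of many rows, which is the access pattern the columnar layout favors; transactional workloads read and write whole records, which favors the row layout.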
Analytical Query Performance
SELECT d_product_classes.product_family,
  SUM(f_sales.sales_amount) AS sales_amount,
  SUM(f_sales.sales_cost) AS sales_cost,
  COUNT(DISTINCT f_sales.customer_id) AS customers_count
FROM "dwh"."f_sales"
  INNER JOIN "dwh"."d_products" ON "dwh"."d_products"."id" =
    "dwh"."f_sales"."product_id"
  INNER JOIN "dwh"."d_product_classes" ON "dwh"."d_product_classes"."id" =
    "dwh"."d_products"."product_class_id"
GROUP BY d_product_classes.product_family
always ~18 seconds
first ~9 seconds
next ~1.5 seconds
6 million rows
When to use what?
Fact table size | Traditional transactional databases | Analytical columnar databases
< 1M rows | OK | No big win
1-10M rows | Complex queries slower | OK
10-100M rows | Slow | OK
> 100M rows | Very slow | OK with tuning
What did we cover?
Problems with analytical queries
Dimensional modeling
Star schemas
Mondrian OLAP and MDX
ETL – Extract, Transform, Load
Analytical columnar databases
Questions?
raimonds.simanovskis@gmail.com
@rsim github.com/rsim
https://github.com/rsim/sales_app_demo

Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 

Kürzlich hochgeladen (20)

毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 

Data Warehouses and Multi-Dimensional Data Analysis

  • 1. Data Warehouses and Multi-Dimensional Data Analysis Raimonds Simanovskis @rsim
  • 2.
  • 3.
  • 4.
  • 7.
  • 8. Data Warehouses and Multi-Dimensional Data Analysis Raimonds Simanovskis @rsim
  • 9. Sales app example
      class Customer < ActiveRecord::Base
        has_many :orders
      end
      class Order < ActiveRecord::Base
        belongs_to :customer
        has_many :order_items
      end
      class OrderItem < ActiveRecord::Base
        belongs_to :order
        belongs_to :product
      end
      class Product < ActiveRecord::Base
        belongs_to :product_class
        has_many :order_items
      end
      class ProductClass < ActiveRecord::Base
        has_many :products
      end
  • 11. One day the CEO asks a question… What were the total sales amounts in California in Q1 2014 by product families?
  • 12. Total sales amount …
      OrderItem.sum("amount")
  • 13. … in California …
      OrderItem.joins(:order => :customer).
        where("customers.country" => "USA", "customers.state_province" => "CA").
        sum("order_items.amount")
  • 14. … in Q1 2014 …
      OrderItem.joins(:order => :customer).
        where("customers.country" => "USA", "customers.state_province" => "CA").
        where("extract(year from orders.order_date) = ?", 2014).
        where("extract(quarter from orders.order_date) = ?", 1).
        sum("order_items.amount")
  • 15. … by product families
      OrderItem.joins(:order => :customer).
        where("customers.country" => "USA", "customers.state_province" => "CA").
        where("extract(year from orders.order_date) = ?", 2014).
        where("extract(quarter from orders.order_date) = ?", 1).
        joins(:product => :product_class).
        group("product_classes.product_family").
        sum("order_items.amount")
  • 16. Generated SQL
      OrderItem.joins(:order => :customer).
        where("customers.country" => "USA", "customers.state_province" => "CA").
        where("extract(year from orders.order_date) = ?", 2014).
        where("extract(quarter from orders.order_date) = ?", 1).
        joins(:product => :product_class).
        group("product_classes.product_family").
        sum("order_items.amount")

      SELECT SUM(order_items.amount) AS sum_order_items_amount,
             product_classes.product_family AS product_classes_product_family
      FROM "order_items"
        INNER JOIN "orders" ON "orders"."id" = "order_items"."order_id"
        INNER JOIN "customers" ON "customers"."id" = "orders"."customer_id"
        INNER JOIN "products" ON "products"."id" = "order_items"."product_id"
        INNER JOIN "product_classes" ON "product_classes"."id" = "products"."product_class_id"
      WHERE "customers"."country" = 'USA'
        AND "customers"."state_province" = 'CA'
        AND (extract(YEAR FROM orders.order_date) = 2014)
        AND (extract(quarter FROM orders.order_date) = 1)
      GROUP BY product_classes.product_family
  • 17. … and also sales cost?
      OrderItem.joins(:order => :customer).
        where("customers.country" => "USA", "customers.state_province" => "CA").
        where("extract(year from orders.order_date) = ?", 2014).
        where("extract(quarter from orders.order_date) = ?", 1).
        joins(:product => :product_class).
        group("product_classes.product_family").
        select("product_classes.product_family," +
               "SUM(order_items.amount) AS sales_amount," +
               "SUM(order_items.cost) AS sales_cost").
        map { |i| i.attributes.compact }
  • 18. … and unique customers count?
      OrderItem.joins(:order => :customer).
        where("customers.country" => "USA", "customers.state_province" => "CA").
        where("extract(year from orders.order_date) = ?", 2014).
        where("extract(quarter from orders.order_date) = ?", 1).
        joins(:product => :product_class).
        group("product_classes.product_family").
        select("product_classes.product_family," +
               "SUM(order_items.amount) AS sales_amount," +
               "SUM(order_items.cost) AS sales_cost," +
               "COUNT(DISTINCT customers.id) AS customers_count").
        map { |i| i.attributes.compact }
  • 19. Is it clear? #@%$^&
      OrderItem.joins(:order => :customer).
        where("customers.country" => "USA", "customers.state_province" => "CA").
        where("extract(year from orders.order_date) = ?", 2014).
        where("extract(quarter from orders.order_date) = ?", 1).
        joins(:product => :product_class).
        group("product_classes.product_family").
        select("product_classes.product_family," +
               "SUM(order_items.amount) AS sales_amount," +
               "SUM(order_items.cost) AS sales_cost," +
               "COUNT(DISTINCT customers.id) AS customers_count").
        map { |i| i.attributes.compact }
  • 20. Performance slows down on larger data volumes (6 million rows, 25 seconds)
      $ rails console
      >> OrderItem.count
         (677.0ms) SELECT COUNT(*) FROM "order_items"
      => 6218022
      >> Order.count
         (126.0ms) SELECT COUNT(*) FROM "orders"
      => 642362
      >> OrderItem.joins(:order => :customer).
           joins(:product => :product_class).
           group("product_classes.product_family").
           select("product_classes.product_family," +
                  "SUM(order_items.amount) AS sales_amount," +
                  "SUM(order_items.cost) AS sales_cost," +
                  "COUNT(DISTINCT customers.id) AS customers_count").
           map { |i| i.attributes.compact }
         OrderItem Load (25437.0ms) ...
  • 23. Dimensional Modeling
      Deliver data that's understandable to the business users
      Deliver fast query performance
  • 24. Dimensional Modeling: What were the total sales amounts in California in Q1 2014 by product families?
      total sales amounts: fact or measure
      in California: Customer / Region dimension
      in Q1 2014: Time dimension
      by product families: Product dimension
  • 25. Data Warehouse “Star schema” with fact and dimension tables
  • 27. Data Warehouse Models
      class Dwh::SalesFact < Dwh::Fact
        belongs_to :customer, class_name: "Dwh::CustomerDimension"
        belongs_to :product, class_name: "Dwh::ProductDimension"
        belongs_to :time, class_name: "Dwh::TimeDimension"
      end
      class Dwh::CustomerDimension < Dwh::Dimension
        has_many :sales_facts, class_name: "Dwh::SalesFact", foreign_key: "customer_id"
      end
      class Dwh::ProductDimension < Dwh::Dimension
        has_many :sales_facts, class_name: "Dwh::SalesFact", foreign_key: "product_id"
        belongs_to :product_class, class_name: "Dwh::ProductClassDimension"
      end
      class Dwh::ProductClassDimension < Dwh::Dimension
        has_many :products, class_name: "Dwh::ProductDimension", foreign_key: "product_class_id"
      end
      class Dwh::TimeDimension < Dwh::Dimension
        has_many :sales_facts, class_name: "Dwh::SalesFact", foreign_key: "time_id"
      end
  • 28. Load Dimension
      class Dwh::CustomerDimension < Dwh::Dimension
        # ...
        def self.truncate!
          connection.execute "TRUNCATE TABLE #{table_name}"
        end

        def self.load!
          truncate!
          column_names = %w(id full_name city state_province country birth_date gender created_at updated_at)
          connection.insert %[
            INSERT INTO #{table_name} (#{column_names.join(',')})
            SELECT #{column_names.join(',')} FROM #{::Customer.table_name}
          ]
        end
      end
  • 29. Generate Time Dimension
      class Dwh::TimeDimension < Dwh::Dimension
        def self.load!
          connection.select_values(%[
            SELECT DISTINCT order_date FROM #{Order.table_name}
            WHERE order_date NOT IN (SELECT date_value FROM #{table_name})
          ]).each do |date|
            year, month, day = date.year, date.month, date.day
            quarter = ((month-1)/3)+1
            quarter_name = "Q#{quarter} #{year}"
            month_name = date.strftime("%b %Y")
            day_name = date.strftime("%b %d %Y")
            sql = send :sanitize_sql_array, [
              %[
                INSERT INTO #{table_name} (id, date_value, year, quarter, month, day,
                  year_name, quarter_name, month_name, day_name)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
              ],
              date_to_id(date), date, year, quarter, month, day,
              year.to_s, quarter_name, month_name, day_name
            ]
            connection.insert sql
          end
        end
      end
  • 30. Load Facts
      class Dwh::SalesFact < Dwh::Fact
        def self.load!
          truncate!
          connection.insert %[
            INSERT INTO #{table_name} (customer_id, product_id, time_id,
              sales_quantity, sales_amount, sales_cost)
            SELECT o.customer_id, oi.product_id,
              CAST(to_char(o.order_date, 'YYYYMMDD') AS INTEGER),
              oi.quantity, oi.amount, oi.cost
            FROM #{OrderItem.table_name} oi
              INNER JOIN #{Order.table_name} o ON o.id = oi.order_id
          ]
        end
      end
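
The time dimension loader and the fact loader both need the same mapping from a date to a time dimension key. The slides call `date_to_id` without showing it, so the version below is an assumption: it returns a `YYYYMMDD` integer to match the `CAST(to_char(o.order_date, 'YYYYMMDD') AS INTEGER)` expression in the fact load. `time_dimension_row` is a hypothetical helper that computes the same attribute values as the slide code, without touching a database.

```ruby
require 'date'

# Assumed implementation: the slides use date_to_id but do not define it.
# A YYYYMMDD integer keeps it consistent with the time_id computed in SQL.
def date_to_id(date)
  date.strftime('%Y%m%d').to_i
end

# Hypothetical helper: compute the column values for one time dimension row.
def time_dimension_row(date)
  quarter = ((date.month - 1) / 3) + 1
  {
    id: date_to_id(date),
    date_value: date,
    year: date.year,
    quarter: quarter,
    month: date.month,
    day: date.day,
    year_name: date.year.to_s,
    quarter_name: "Q#{quarter} #{date.year}",
    month_name: date.strftime('%b %Y'),
    day_name: date.strftime('%b %d %Y')
  }
end

row = time_dimension_row(Date.new(2014, 2, 15))
puts row[:id]           # 20140215
puts row[:quarter_name] # Q1 2014
```

Keeping the key derivable from the date alone means the fact load can compute `time_id` in SQL without joining to the time dimension table.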
  • 31. What were the total sales amounts in California in Q1 2014 by product families?
      Dwh::SalesFact.
        joins(:customer).joins(:product => :product_class).joins(:time).
        where("d_customers.country" => "USA", "d_customers.state_province" => "CA").
        where("d_time.year" => 2014, "d_time.quarter" => 1).
        group("d_product_classes.product_family").
        sum("sales_amount")
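
To make the shape of a star schema query concrete, here is a minimal in-memory sketch in plain Ruby. It is not the ActiveRecord code from the slides: the constants, the sample rows and the `sales_by_product_family` helper are all made up for illustration. The fact rows only carry foreign keys and measures, while every filter and grouping attribute is looked up in a dimension.

```ruby
# Toy star schema: dimension lookups keyed by id, plus one fact array.
# All data below is invented sample data.
CUSTOMERS = {
  1 => { country: 'USA', state_province: 'CA' },
  2 => { country: 'USA', state_province: 'WA' }
}
PRODUCT_CLASSES = {
  10 => { product_family: 'Food' },
  11 => { product_family: 'Drink' }
}
PRODUCTS = {
  100 => { product_class_id: 10 },
  101 => { product_class_id: 11 }
}
TIMES = {
  20140115 => { year: 2014, quarter: 1 },
  20140520 => { year: 2014, quarter: 2 }
}
SALES_FACTS = [
  { customer_id: 1, product_id: 100, time_id: 20140115, sales_amount: 30.0 },
  { customer_id: 1, product_id: 101, time_id: 20140115, sales_amount: 12.5 },
  { customer_id: 1, product_id: 100, time_id: 20140520, sales_amount: 99.0 },
  { customer_id: 2, product_id: 100, time_id: 20140115, sales_amount: 40.0 },
  { customer_id: 1, product_id: 100, time_id: 20140115, sales_amount: 7.5 }
]

# Filter facts through the customer and time dimensions,
# then group by an attribute of the product class dimension.
def sales_by_product_family(facts, country:, state:, year:, quarter:)
  selected = facts.select do |fact|
    customer = CUSTOMERS.fetch(fact[:customer_id])
    time = TIMES.fetch(fact[:time_id])
    customer[:country] == country && customer[:state_province] == state &&
      time[:year] == year && time[:quarter] == quarter
  end
  grouped = selected.group_by do |fact|
    product = PRODUCTS.fetch(fact[:product_id])
    PRODUCT_CLASSES.fetch(product[:product_class_id])[:product_family]
  end
  grouped.transform_values { |rows| rows.sum { |fact| fact[:sales_amount] } }
end

result = sales_by_product_family(SALES_FACTS, country: 'USA', state: 'CA',
                                 year: 2014, quarter: 1)
p result
```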
  • 34. Multi-Dimensional Data Model
      Sales cube
      Dimensions: Time, Product, Customer
      Measures: Sales quantity, Sales amount, Sales cost, Customers count
  • 35. Dimension Hierarchies
      Levels: All > Country > State > City
      All: All Customers
      Country: USA, Canada
      State: WA, CA, OR
      City: San Francisco, Los Angeles
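
A hierarchy lets the same measure be aggregated at any level. The sketch below shows that roll-up in plain Ruby, assuming each row carries its full member path (Country, State, City). The `roll_up` helper and the sample rows are hypothetical, and some members (Seattle, BC, Vancouver) are invented to fill out the example.

```ruby
# Each row holds the member path through the levels Country > State > City
# and one measure value. Sample data is invented for illustration.
ROWS = [
  { path: ['USA', 'CA', 'San Francisco'], sales_amount: 100 },
  { path: ['USA', 'CA', 'Los Angeles'],   sales_amount: 50 },
  { path: ['USA', 'WA', 'Seattle'],       sales_amount: 25 },
  { path: ['Canada', 'BC', 'Vancouver'],  sales_amount: 10 }
]

# Sum the measure at a given depth of the hierarchy:
# 0 = All, 1 = Country, 2 = State, 3 = City.
def roll_up(rows, depth)
  rows.each_with_object(Hash.new(0)) do |row, totals|
    member = ['All Customers'] + row[:path].first(depth)
    totals[member] += row[:sales_amount]
  end
end

p roll_up(ROWS, 0) # one total for the All level
p roll_up(ROWS, 1) # totals per country
p roll_up(ROWS, 2) # totals per state
```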
  • 36. Time Dimension
      Default hierarchy
        Levels: All > Year > Quarter > Month > Day
        All: All Times
        Year: 2014, 2015
        Quarter: Q1, Q2, Q3, Q4
        Month: JUL, AUG, SEP
        Day: AUG 01, AUG 02
      Weekly hierarchy
        Levels: All > Year > Week > Day
        All: All Times
        Year: 2014, 2015
        Week: W1, W2, W3, W4
        Day: JAN 17, JAN 18, JAN 19
  • 37. OLAP Technologies (On-Line Analytical Processing)
      Mondrian: http://community.pentaho.com/projects/mondrian/
      mondrian-olap gem: https://github.com/rsim/mondrian-olap
  • 38. mondrian-olap schema definition
      Mondrian::OLAP::Schema.define do
        cube 'Sales' do
          table 'f_sales', schema: 'dwh'
          dimension 'Customer', foreign_key: 'customer_id' do
            hierarchy all_member_name: 'All Customers', primary_key: 'id' do
              table 'd_customers', schema: 'dwh'
              level 'Country', column: 'country'
              level 'State Province', column: 'state_province'
              level 'City', column: 'city'
              level 'Name', column: 'full_name'
            end
          end
          dimension 'Product', foreign_key: 'product_id' do
            hierarchy all_member_name: 'All Products', primary_key: 'id', primary_key_table: 'd_products' do
              join left_key: 'product_class_id', right_key: 'id' do
                table 'd_products', schema: 'dwh'
                table 'd_product_classes', schema: 'dwh'
              end
              level 'Product Family', table: 'd_product_classes', column: 'product_family'
              level 'Product Department', table: 'd_product_classes', column: 'product_department'
              level 'Product Category', table: 'd_product_classes', column: 'product_category'
              level 'Product Subcategory', table: 'd_product_classes', column: 'product_subcategory'
              level 'Brand Name', table: 'd_products', column: 'brand_name'
              level 'Product Name', table: 'd_products', column: 'product_name'
            end
          end
          dimension 'Time', foreign_key: 'time_id', type: 'TimeDimension' do
            hierarchy all_member_name: 'All Time', primary_key: 'id' do
              table 'd_time', schema: 'dwh'
              level 'Year', column: 'year', type: 'Numeric', name_column: 'year_name', level_type: 'TimeYears'
              level 'Quarter', column: 'quarter', type: 'Numeric', name_column: 'quarter_name', level_type: 'TimeQuarters'
              level 'Month', column: 'month', type: 'Numeric', name_column: 'month_name', level_type: 'TimeMonths'
              level 'Day', column: 'day', type: 'Numeric', name_column: 'day_name', level_type: 'TimeDays'
            end
          end
          measure 'Sales Quantity', column: 'sales_quantity', aggregator: 'sum'
          measure 'Sales Amount', column: 'sales_amount', aggregator: 'sum'
          measure 'Sales Cost', column: 'sales_cost', aggregator: 'sum'
          measure 'Customers Count', column: 'customer_id', aggregator: 'distinct-count'
        end
      end
  • 39. What were the total sales amounts in California in Q1 2014 by product families?
      olap.from("Sales").
        columns("[Measures].[Sales Amount]").
        rows("[Product].[Product Family].Members").
        where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")
  • 40. MDX Query Language
      olap.from("Sales").
        columns("[Measures].[Sales Amount]").
        rows("[Product].[Product Family].Members").
        where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")

      SELECT {[Measures].[Sales Amount]} ON COLUMNS,
             [Product].[Product Family].Members ON ROWS
      FROM [Sales]
      WHERE ([Customer].[USA].[CA], [Time].[Quarter].[Q1 2014])
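
The mapping from the chained Ruby calls to MDX text can be illustrated with a tiny string builder. This is a hypothetical sketch, not the mondrian-olap implementation: the `MdxQuery` class and its methods are invented here, and it only covers the single query shape shown on the slide (one set on rows, a tuple in the WHERE clause).

```ruby
# Hypothetical minimal builder that assembles an MDX statement as a string.
# It does not execute anything and is not part of mondrian-olap.
class MdxQuery
  def initialize(cube)
    @cube = cube
    @columns = []
    @rows = nil
    @where = []
  end

  # Members placed on the COLUMNS axis, wrapped in a set.
  def columns(*members)
    @columns = members
    self
  end

  # A single set expression for the ROWS axis.
  def rows(set_expression)
    @rows = set_expression
    self
  end

  # Slicer members, combined into one tuple.
  def where(*members)
    @where = members
    self
  end

  def to_mdx
    mdx = "SELECT {#{@columns.join(', ')}} ON COLUMNS"
    mdx += ", #{@rows} ON ROWS" if @rows
    mdx += " FROM [#{@cube}]"
    mdx += " WHERE (#{@where.join(', ')})" unless @where.empty?
    mdx
  end
end

query = MdxQuery.new('Sales').
  columns('[Measures].[Sales Amount]').
  rows('[Product].[Product Family].Members').
  where('[Customer].[USA].[CA]', '[Time].[Quarter].[Q1 2014]')
puts query.to_mdx
```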
  • 41. Results Caching
      First execution (21713.0ms):
      SELECT {[Measures].[Sales Amount], [Measures].[Sales Cost], [Measures].[Customers Count]} ON COLUMNS,
             [Product].[Product Family].Members ON ROWS
      FROM [Sales]

      Second execution of the same query (10.0ms):
      SELECT {[Measures].[Sales Amount], [Measures].[Sales Cost], [Measures].[Customers Count]} ON COLUMNS,
             [Product].[Product Family].Members ON ROWS
      FROM [Sales]
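
The effect of results caching (a slow first run, a fast repeat) can be sketched with a simple memoizing wrapper. This only illustrates the idea and is not how Mondrian caches internally: the `QueryCache` class is hypothetical and it stores whole results keyed by normalized query text.

```ruby
# Hypothetical whole-result cache keyed by query text.
# The block passed to fetch stands in for the expensive query execution.
class QueryCache
  attr_reader :misses

  def initialize
    @results = {}
    @misses = 0
  end

  def fetch(query)
    key = query.strip.gsub(/\s+/, ' ') # ignore whitespace differences
    return @results[key] if @results.key?(key)

    @misses += 1
    @results[key] = yield
  end
end

cache = QueryCache.new
executions = 0
2.times do
  cache.fetch('SELECT {[Measures].[Sales Amount]} ON COLUMNS FROM [Sales]') do
    executions += 1 # the expensive work runs only on a cache miss
    [[1234.5]]
  end
end
puts executions
```

A cache like this has to be cleared whenever the fact or dimension tables are reloaded, otherwise repeated queries return stale totals.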
  • 42. Additional Attribute Dimension
      dimension 'Gender', foreign_key: 'customer_id' do
        hierarchy all_member_name: 'All Genders', primary_key: 'id' do
          table 'd_customers', schema: 'dwh'
          level 'Gender', column: 'gender' do
            name_expression do
              sql "CASE d_customers.gender WHEN 'F' THEN 'Female' WHEN 'M' THEN 'Male' END"
            end
          end
        end
      end

      olap.from("Sales").
        columns("[Measures].[Sales Amount]").
        rows("[Gender].[Gender].Members")
  • 43. Dynamic Attribute Dimension
      dimension 'Age interval', foreign_key: 'customer_id' do
        hierarchy all_member_name: 'All Age', primary_key: 'id' do
          table 'd_customers', schema: 'dwh'
          level 'Age interval' do
            key_expression do
              sql %[
                CASE
                WHEN age(d_customers.birth_date) < interval '20 years' THEN '< 20 years'
                WHEN age(d_customers.birth_date) < interval '30 years' THEN '20-30 years'
                WHEN age(d_customers.birth_date) < interval '40 years' THEN '30-40 years'
                WHEN age(d_customers.birth_date) < interval '50 years' THEN '40-50 years'
                ELSE '50+ years'
                END
              ]
            end
          end
        end
      end

      Resulting members:
      [Age interval].[< 20 years]
      [Age interval].[20-30 years]
      [Age interval].[30-40 years]
      [Age interval].[40-50 years]
      [Age interval].[50+ years]
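
The bucketing done by the SQL `CASE` expression can be mirrored in plain Ruby, which is handy for checking the interval boundaries. This is a sketch under assumptions: `age_in_years` and `age_interval` are hypothetical helpers, and age is counted in completed years as of a reference date passed in explicitly.

```ruby
require 'date'

# Completed years between birth_date and today.
def age_in_years(birth_date, today)
  years = today.year - birth_date.year
  birthday_pending = ([today.month, today.day] <=> [birth_date.month, birth_date.day]) < 0
  birthday_pending ? years - 1 : years
end

# Same buckets and labels as the CASE expression in the schema.
def age_interval(birth_date, today = Date.today)
  age = age_in_years(birth_date, today)
  if age < 20 then '< 20 years'
  elsif age < 30 then '20-30 years'
  elsif age < 40 then '30-40 years'
  elsif age < 50 then '40-50 years'
  else '50+ years'
  end
end

puts age_interval(Date.new(1990, 3, 1), Date.new(2014, 3, 31))
```

Because the bucket depends on the current date, a customer moves to the next interval over time without any reload, which is what makes the dimension dynamic.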
  • 44. Calculation Formulas
      calculated_member 'Profit', dimension: 'Measures',
        format_string: '#,##0.00',
        formula: '[Measures].[Sales Amount] - [Measures].[Sales Cost]'
      calculated_member 'Margin %', dimension: 'Measures',
        format_string: '#,##0.00%',
        formula: '[Measures].[Profit] / [Measures].[Sales Amount]'

      olap.from("Sales").
        columns("[Measures].[Profit]", "[Measures].[Margin %]").
        rows("[Product].[Product Family].Members").
        where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")
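
As a worked example of the two formulas, the same arithmetic in plain Ruby looks like this. The helper names are hypothetical, the numbers are invented, and returning `nil` for a zero sales amount is a choice made for this sketch rather than a statement about how the OLAP engine handles division by zero.

```ruby
# Profit = Sales Amount - Sales Cost
def profit(sales_amount, sales_cost)
  sales_amount - sales_cost
end

# Margin = Profit / Sales Amount, as a ratio (0.25 means 25%).
def margin(sales_amount, sales_cost)
  return nil if sales_amount.zero? # sketch-only choice for the undefined case

  profit(sales_amount, sales_cost) / sales_amount.to_f
end

# Render a ratio as a percentage with two decimals.
def format_percent(ratio)
  format('%.2f%%', ratio * 100)
end

puts profit(200, 150)                 # 50
puts format_percent(margin(200, 150)) # 25.00%
```

Note that a margin has to be computed from the aggregated amount and cost at each level; averaging the margins of lower-level members would give a different and misleading number.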
  • 47. Ruby Tools for ETL
      Kiba: http://www.kiba-etl.org/
      ETL: https://github.com/square/ETL
  • 48. Kiba example
      # declare a ruby method here, for quick reusable logic
      def parse_french_date(date)
        Date.strptime(date, '%d/%m/%Y')
      end

      # or better, include a ruby file which loads reusable assets
      # eg: commonly used sources / destinations / transforms, under unit-test
      require_relative 'common'

      # declare a source where to take data from (you implement it - see notes below)
      source MyCsvSource, 'input.csv'

      # declare a row transform to process a given field
      transform do |row|
        row[:birth_date] = parse_french_date(row[:birth_date])
        # return to keep in the pipeline
        row
      end

      # declare another row transform, dismissing rows conditionally by returning nil
      transform do |row|
        row[:birth_date].year < 2000 ? row : nil
      end

      # declare a row transform as a class, which can be tested properly
      transform ComplianceCheckTransform, eula: 2015
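
The pipeline semantics used in the Kiba example (each transform receives a row and returns either a row or `nil` to dismiss it) can be sketched in a few lines of plain Ruby. This is not Kiba's implementation: `run_pipeline`, the lambdas and the sample rows are hypothetical, and an in-memory array stands in for the CSV source and the destination.

```ruby
require 'date'

# Run every row through the transforms in order.
# A transform returning nil drops the row from the pipeline.
def run_pipeline(source_rows, transforms)
  source_rows.each_with_object([]) do |row, kept|
    transforms.each do |transform|
      row = transform.call(row)
      break if row.nil?
    end
    kept << row unless row.nil?
  end
end

# Parse a dd/mm/yyyy string into a Date, as parse_french_date does on the slide.
parse_birth_date = lambda do |row|
  row.merge(birth_date: Date.strptime(row[:birth_date], '%d/%m/%Y'))
end

# Keep only people born before 2000.
born_before_2000 = ->(row) { row[:birth_date].year < 2000 ? row : nil }

rows = [
  { name: 'Anne', birth_date: '17/01/1985' },
  { name: 'Bob',  birth_date: '03/09/2004' }
]
output = run_pipeline(rows, [parse_birth_date, born_before_2000])
p output.map { |row| row[:name] }
```

Keeping each transform as a small callable object is what makes it possible to unit-test them individually, which is the point the slide makes about class-based transforms.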
  • 50. Single threaded ETL
      class Dwh::TimeDimension < Dwh::Dimension
        def self.load!
          logger.silence do
            connection.select_values(%[
              SELECT DISTINCT order_date FROM #{Order.table_name}
              WHERE order_date NOT IN (SELECT date_value FROM #{table_name})
            ]).each do |date|
              insert_date(date)
            end
          end
        end

        def self.insert_date(date)
          year, month, day = date.year, date.month, date.day
          quarter = ((month-1)/3)+1
          quarter_name = "Q#{quarter} #{year}"
          month_name = date.strftime("%b %Y")
          day_name = date.strftime("%b %d %Y")
          sql = send :sanitize_sql_array, [
            %[
              INSERT INTO #{table_name} (id, date_value, year, quarter, month, day,
                year_name, quarter_name, month_name, day_name)
              VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            ],
            date_to_id(date), date, year, quarter, month, day,
            year.to_s, quarter_name, month_name, day_name
          ]
          connection.insert sql
        end
      end
  • 51. ETL with Thread Pool
      require 'concurrent/executors'

      class Dwh::TimeDimension < Dwh::Dimension
        def self.parallel_load!(pool_size = 4)
          logger.silence do
            insert_date_pool = Concurrent::FixedThreadPool.new(pool_size)
            connection.select_values(%[
              SELECT DISTINCT order_date FROM #{Order.table_name}
              WHERE order_date NOT IN (SELECT date_value FROM #{table_name})
            ]).each do |date|
              insert_date_pool.post(date) do |date|
                connection_pool.with_connection do
                  insert_date(date)
                end
              end
            end
            insert_date_pool.shutdown
            insert_date_pool.wait_for_termination
          end
        end
      end
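
The same fan-out pattern can be tried without a database or the concurrent-ruby gem. The sketch below swaps in Ruby's standard `Thread` and `Queue` classes for `Concurrent::FixedThreadPool`; `parallel_each` is a hypothetical helper, and the block stands in for the `insert_date` call. Items must not be `nil` or `false`, because a `nil` from `Queue#pop` is used to detect the closed, drained queue.

```ruby
# Process items with a fixed number of worker threads (stdlib only).
def parallel_each(items, pool_size = 4, &block)
  queue = Queue.new
  items.each { |item| queue << item }
  queue.close # after close, pop returns nil once the queue is empty

  workers = Array.new(pool_size) do
    Thread.new do
      while (item = queue.pop)
        block.call(item)
      end
    end
  end
  workers.each(&:join) # wait until every worker has drained the queue
  nil
end

loaded = []
lock = Mutex.new
parallel_each((1..20).to_a, 4) do |number|
  lock.synchronize { loaded << number } # shared state needs a lock
end
puts loaded.size
```

As in the slide code, each worker should use its own database connection. Whether more threads help depends on the Ruby implementation and on how much of the time is spent waiting on the database.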
  • 52. Benchmark!
      Dwh::TimeDimension.load!              (5236.0ms)
      Dwh::TimeDimension.parallel_load!(2)  (3450.0ms)
      Dwh::TimeDimension.parallel_load!(4)  (2142.0ms)  optimal size in this case
      Dwh::TimeDimension.parallel_load!(6)  (2361.0ms)
      Dwh::TimeDimension.parallel_load!(8)  (2826.0ms)
      Java Mission Control
  • 53. Traditional vs Analytical Relational Databases
      Traditional: optimized for transaction processing
      Analytical: optimized for analytical queries
  • 56. Analytical Query Performance
      SELECT d_product_classes.product_family,
             SUM(f_sales.sales_amount) AS sales_amount,
             SUM(f_sales.sales_cost) AS sales_cost,
             COUNT(DISTINCT f_sales.customer_id) AS customers_count
      FROM "dwh"."f_sales"
        INNER JOIN "dwh"."d_products"
          ON "dwh"."d_products"."id" = "dwh"."f_sales"."product_id"
        INNER JOIN "dwh"."d_product_classes"
          ON "dwh"."d_product_classes"."id" = "dwh"."d_products"."product_class_id"
      GROUP BY d_product_classes.product_family

      6 million rows
      always ~18 seconds
      first ~9 seconds, next ~1.5 seconds
  • 57. When to use what?
      Fact table size    Traditional transactional databases    Analytical columnar databases
      < 1M rows          OK                                     No big win
      1-10M rows         Complex queries slower                 OK
      10-100M rows       Slow                                   OK
      > 100M rows        Very slow                              OK with tuning
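
One way to picture why a columnar layout suits analytical queries is to compare the two layouts in memory. The sketch below is only an analogy in plain Ruby, with invented sample data and a hypothetical `to_columns` helper. An aggregate over one measure has to walk every full row in the row layout, but only one array in the column layout.

```ruby
# Row-oriented layout: one record per fact, all attributes together.
ROW_STORE = [
  { product_family: 'Food',  sales_amount: 30.0, sales_cost: 20.0 },
  { product_family: 'Drink', sales_amount: 12.5, sales_cost: 8.0 },
  { product_family: 'Food',  sales_amount: 7.5,  sales_cost: 5.0 }
]

# Column-oriented layout: one array per attribute, aligned by position.
def to_columns(rows)
  rows.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |row, columns|
    row.each { |name, value| columns[name] << value }
  end
end

COLUMN_STORE = to_columns(ROW_STORE)

# Row layout: every record is visited even though one field is needed.
row_total = ROW_STORE.sum { |row| row[:sales_amount] }

# Column layout: only the sales_amount array is read.
column_total = COLUMN_STORE[:sales_amount].sum

puts row_total == column_total
```

The trade-off runs the other way for transactional work: inserting or updating a single record touches one place in the row layout but every column array in the columnar one.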
  • 58. What did we cover?
      Problems with analytical queries
      Dimensional modeling
      Star schemas
      Mondrian OLAP and MDX
      ETL (Extract, Transform, Load)
      Analytical columnar databases