SlideShare ist ein Scribd-Unternehmen logo
1 von 86
Downloaden Sie, um offline zu lesen
Using Clustering Of Electricity Consumers
To Produce More Accurate Predictions
#1 R<-Slovakia meetup
Peter Laurinec
22.3.2017
FIIT STU
R <- Slovakia
0
Why are we here? Rrrr
CRAN - about 10300 packages -
https://cran.r-project.org/web/packages/.
Popularity - http://redmonk.com/sogrady/:
1
Why prediction of electricity consumption?
Important for:
• Distribution (utility)
companies. Producers of
electricity. Overload.
• Deregularization of market
• Buy and sell of electricity.
• Source of energy - wind and
photo-voltaic power-plant.
Hardly predictable.
• Active consumers. producer
+ consumer ⇒ prosumer.
• Optimal settings of bill.
2
Smart grid
• smart meter
• demand response
• smart cities
Good for:
• Sustainability - Blackouts...
• Green energy
• Saving energy
• Dynamic tariffs ⇒ saving
money
3
Slovak data
• Number of consumers are 21502. From that 11281 are OK. Enterprises.
• Time interval of measurements: 01.07.2011 – 14.05.2015. But mainly from
the 01.07.2013. 96 measurements per day.
Irish data:
• Number of consumers are 6435. From that 3639 are residences.
• Time interval: 14.7.2009 – 31.12.2010. 48 measurements per day.
4
Smart meter data
OOM_ID DIAGRAM_ID TIME LOAD TYPE_OF_OOM DATE ZIP
1: 11 202004 −45 4.598 O 01/01/2014 4013
2: 11 202004 195 4.087 O 01/01/2014 4013
3: 11 202004 −30 5.108 O 01/01/2014 4013
4: 11 202004 345 4.598 O 01/01/2014 4013
5: 11 202004 825 2.554 O 01/01/2014 4013
6: 11 202004 870 2.554 O 01/01/2014 4013
41312836: 20970 14922842 90 18.783 O 14/02/2015 4011
41312837: 20970 14922842 75 20.581 O 14/02/2015 4011
41312838: 20970 14922842 60 18.583 O 14/02/2015 4011
41312839: 20970 14922842 45 18.983 O 14/02/2015 4011
41312840: 20970 14922842 30 17.384 O 14/02/2015 4011
41312841: 20970 14922842 15 18.583 O 14/02/2015 4011 5
Random consumer from Kosice vol. 1
6
Random consumer from Kosice vol. 2
7
Aggregate load from Zilina
8
Random consumer from Ireland
9
Aggregate load from Ireland
10
R <- ”Forecasting”
10
STLF - Short Term Load Forecasting
• Prediction (forecast) for 1 day ahead (96 resp. 48
measurements). Short term forecasting.
• Strong time dependence. Seasonalities (daily, weekly,
yearly). Holidays. Weather.
• Accuracy of predictions (in %) - MAPE (Mean Absolute
Percentage Error).
MAPE = 100 ×
1
n
n∑
t=1
|xt − ˆxt|
xt
11
Methods for prediction of time series
Time series analysis
ARIMA, Holt-Winters exponential smoothing, decomposition TS...
Linear regression
Multiple linear regression, robust LR, GAM (Generalized Additive
Model).
Machine learning
• Neural nets
• Support Vector Regression
• Regression trees and forests
Ensemble learning
Linear combination of predictions.
12
Weekly profile
13
Time series analysis methods
ARIMA, exponential smoothing and co.
• Not best for multiple seasonal time series.
• Solution: Creation of separate models for different days
in week.
• Solution: Set higher period - 7 ∗ period. We need at least 3
periods in training set.
• Solution: Double-seasonal Holt-Winters (dshw, tbats).
• Solution: SARIMA. ARIMAX - ARIMA plus extra predictors by FFT
(auto.arima(ts, xreg)).
14
Fourier transform
fourier in forecast
load_msts <- msts(load,
seasonal.periods = c(freq, freq * 7))
fourier(load_msts, K = c(6, 6*2))
15
Time series analysis
• Solution: Decomposition of time series to three parts - seasonal, trend,
remainder (noise).
• Many different methods. STL decomposition (seasonal decomposition
of time series by loess).
• Moving Average (decompose). Exponential smoothing...
16
Decomposition - STL
stl, ggseas with period 48
17
Decomposition - STL vol.2
stl, ggseas with period 48 ∗ 7
18
Decomposition - EXP
tbats in forecast
19
Temporal Hierarchical Forecasting
thief - http://robjhyndman.com/hyndsight/thief/
20
Regression methods
• MLR (OLS), RLM 1 (M-estimate), SVR 2
• Dummy (binary) variables3
Load D1 D2 D3 D4 D5 D6 D7 D8 D9 … W1 W2 …
1: 0.402 1 0 0 0 0 0 0 0 0 1 0
2: 0.503 0 1 0 0 0 0 0 0 0 1 0
3: 0.338 0 0 1 0 0 0 0 0 0 1 0
4: 0.337 0 0 0 1 0 0 0 0 0 1 0
5: 0.340 0 0 0 0 1 0 0 0 0 1 0
6: 0.340 0 0 0 0 0 1 0 0 0 1 0
7: 0.340 0 0 0 0 0 0 1 0 0 1 0
8: 0.338 0 0 0 0 0 0 0 1 0 1 0
9: 0.339 0 0 0 0 0 0 0 0 1 1 0
...
...
...
...
1
rlm - MASS
2
ksvm(..., eps, C) - kernlab
3
model.matrix(∼ 0 + as.factor(Day) + as.factor(Week))
21
Linear Regression
Interactions! MLR and GAM 4.
4
https://petolau.github.io/
22
Trees and Forests
• Dummy variables aren’t appropriate.
• CART 5, Extremely Randomized Trees 6, Bagging 7.
Daily and Weekly seasonal vector:
Day = (1, 2, . . . , freq, 1, 2, . . . , freq, . . . )
Week = (1, 1, . . . , 1, 2, 2, . . . , 2, . . . , 7, 1, 1, . . . ),
where freq is a period (48 resp. 96).
5
rpart
6
extraTrees
7
RWeka
23
rpart
rpart.plot::rpart.plot
24
Forests and ctree
• Ctree (party), Extreme Gradient Boosting (xgboost), Random
Forest.
Daily and weekly seasonal vector in form of sine and cosine:
Day = (1, 2, . . . , freq, 1, 2, . . . , freq, . . . )
Week = (1, 1, . . . , 1, 2, 2, . . . , 2, . . . , 7, 1, 1, . . . )
sin(2π Day
freq ) + 1
2
resp.
cos(2π Day
freq ) + 1
2
sin(2π Week
7 ) + 1
2
resp.
cos(2π Week
7 ) + 1
2
,
where freq is a period (48 resp. 96).
• lag, seasonal part from decomposition
25
Neural nets
• Feedforward, recurrent, multilayer perceptron (MLP), deep.
• Lagged load. Denoised (detrended) load.
• Model for different days - similar day approach.
• Sine and cosine
• Fourier
Figure 1: Day Figure 2: Week
26
Ensemble learning
• We can’t determine which model is best. Overfitting.
• Wisdom of crowds - set of models. Model weighting ⇒ linear
combination.
• Adaptation of weights of prediction methods based on
prediction error - median weighting.
et
j = median(| xt
− ˆx
t
j |)
wt+1
j = wt
j
median(et
)
et
j
Training set – sliding window.
27
Ensemble learning
• Heterogeneous ensemble model.
• Improvement of 8 − 12%.
• Weighting of methods – optimization task.
Genetic algorithm, PSO and others.
• Improvement of 5, 5%. Possible overfit.
28
Ensemble learning with clustering
ctree + rpart trained on different data sets (sampling with
replacement). ”Weak learners”.
Pre-process - PCA → K-means. + unsupervised.
29
RSlovakia #1 meetup
R <- ”Cluster Analysis”
30
RSlovakia #1 meetup
Creation of more predictable groups of consumers
Cluster Analysis! Unsupervised learning...
32
Cluster analysis in smart grids
Good for:
33
Cluster analysis in smart grids
Good for:
• Creation of consumer profiles, which can help by
recommendation to customers to save electricity.
33
Cluster analysis in smart grids
Good for:
• Creation of consumer profiles, which can help by
recommendation to customers to save electricity.
• Detection of anomalies.
33
Cluster analysis in smart grids
Good for:
• Creation of consumer profiles, which can help by
recommendation to customers to save electricity.
• Detection of anomalies.
• Neater monitoring of the smart grid (as a whole).
33
Cluster analysis in smart grids
Good for:
• Creation of consumer profiles, which can help by
recommendation to customers to save electricity.
• Detection of anomalies.
• Neater monitoring of the smart grid (as a whole).
• Emergency analysis.
33
Cluster analysis in smart grids
Good for:
• Creation of consumer profiles, which can help by
recommendation to customers to save electricity.
• Detection of anomalies.
• Neater monitoring of the smart grid (as a whole).
• Emergency analysis.
• Generation of more realistic synthetic data.
33
Cluster analysis in smart grids
Good for:
• Creation of consumer profiles, which can help by
recommendation to customers to save electricity.
• Detection of anomalies.
• Neater monitoring of the smart grid (as a whole).
• Emergency analysis.
• Generation of more realistic synthetic data.
• Last but not least, improve prediction methods to
produce more accurate predictions.
33
Approach
Aggregation with clustering
34
Approach
Aggregation with clustering
1. Set of time series of electricity consumption
34
Approach
Aggregation with clustering
1. Set of time series of electricity consumption
2. Normalization (z-score)
34
Approach
Aggregation with clustering
1. Set of time series of electricity consumption
2. Normalization (z-score)
3. Computation of representations of time series
34
Approach
Aggregation with clustering
1. Set of time series of electricity consumption
2. Normalization (z-score)
3. Computation of representations of time series
4. Determination of optimal number of clusters K
34
Approach
Aggregation with clustering
1. Set of time series of electricity consumption
2. Normalization (z-score)
3. Computation of representations of time series
4. Determination of optimal number of clusters K
5. Clustering of representations
34
Approach
Aggregation with clustering
1. Set of time series of electricity consumption
2. Normalization (z-score)
3. Computation of representations of time series
4. Determination of optimal number of clusters K
5. Clustering of representations
6. Summation of K time series by found clusters
34
Approach
Aggregation with clustering
1. Set of time series of electricity consumption
2. Normalization (z-score)
3. Computation of representations of time series
4. Determination of optimal number of clusters K
5. Clustering of representations
6. Summation of K time series by found clusters
7. Training of K forecast models and the following forecast
34
Approach
Aggregation with clustering
1. Set of time series of electricity consumption
2. Normalization (z-score)
3. Computation of representations of time series
4. Determination of optimal number of clusters K
5. Clustering of representations
6. Summation of K time series by found clusters
7. Training of K forecast models and the following forecast
8. Summation of forecasts and evaluation
34
Pre-processing
Normalization of time series - consumers.
Best choice is to use z-score:
x − µ
σ
Alternative:
x
max(x)
35
Representations of time series
36
Representations of time series
Why time series representations?
1. Reduce memory load.
2. Accelerate subsequent machine learning algorithms.
3. Implicitly remove noise from the data.
4. Emphasize the essential characteristics of the data.
37
Methods of representations of time series
PAA - Piecewise Aggregate Approximation. Non data adaptive.
n -> d. ˆX = (ˆx1, . . . ,ˆxd).
ˆxi =
d
n
(n/d)i
∑
j=(n/d)(i−1)+1
xj.
Not only average. Median, standard deviation, maximum...
DWT, DFT...
38
Back to models - model based methods
• Representations based on statistical model.
• Extraction of regression coefficients ⇒ creation of daily
profiles.
• Creation of representation which is long as frequency of
time series.
yi = β1xi1 + β2xi2 + · · · + βfreqxifreq + βweek1xiweek1 + · · · + βweek7xiweek7 + εi,
where i = 1, . . . , n
New representation: ˆβ = (ˆβ1, . . . , ˆβfreq, ˆβweek1, . . . , ˆβweek7).
Possible applicable methods:
MLR, Robust Linear Model, L1-regression, Generalized Additive
Model (GAM)...
39
Model based methods
• Triple Holt-Winters Exponential Smoothing.
Last seasonal coefficients as representation.
1. Smoothing factors can be set automatically or manually.
• Mean and median daily profile.
40
Comparison
41
Comparison
42
Clipping - bit level representation
Data dictated.
ˆxt =
{
1 if xt > µ
0 otherwise
43
Clipping - RLE
RLE - Run Length Encoding. Sliding window - one day.
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
6 1 12 7 2 1 1 10 2 6 20 7 2 2 1 1 5 1 7 1 1 4 1 15 3 13 1 2 1 3 1 2 1 1 3 1 5 1 6 1 6 4 5 1 3 3 3 1 2 1 2
44
Clipping - final representation
1 7 1 10 6 7 2 1 1 1 1 3 1 1 1 1 1 1 1 4 1 3 1 1
6 12 2 1 2 20 2 1 5 7 0 4 15 13 2 3 2 1 3 5 6 6 5 3
Feature extraction: average, maximum, -average, -maximum …
45
Figure 3: Slovakia - factories Figure 4: Ireland - residential
Figure 5: Ireland - median daily profile
K-means
Euclidean distance. Centroids.
K-means++ - careful seeding of centroids.
K-medoids, DBSCAN... 48
Finding optimal K
Manually or automatically set number of clusters. By intern
validity measure. Davies-Bouldin index...
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
123 90 139 477 11 468 263 184 94 136 765 224 195 55 81 82 160 90 2
49
RSlovakia #1 meetup
RSlovakia #1 meetup
RSlovakia #1 meetup
RSlovakia #1 meetup
RSlovakia #1 meetup
Results
Static data set
One clustering → prediction with sliding window.
Irish. GAM, RLM, PAA - 5.23% → okolo 4.22% MAPE.
Slovak. Clip10, LM, DWT - 4.15% → okolo 3.97% MAPE.
STL + EXP, STL + ARIMA, SVR, Bagging. Comparison of 17
representations.
Batch processing
Batch clustering → prediction with sliding window (14 days).
Irish. RLM, Median, GAM. Comparison of 10 prediction methods.
Significant improvement in DSHW, STL + ARIMA, STL + ETS, SVR,
RandomForest, Bagging and MLP.
Best result: Bagging with GAM 3.68% against 3.82% MAPE.
55
R <- Tips and Tricks
55
Packages
List of all packages by name + short description.
(https://cran.r-project.org/web/packages/available_
packages_by_name.html)
56
Task Views
List of specific tasks (domains) - description of available packages and
functions. (https://cran.r-project.org/web/views/)
57
Matrix Calculus
MRO 8
. The Intel Math Kernel Library (MKL).
ˆβ = (XT
X)−1
XT
y
8
https://mran.microsoft.com/documents/rro/multithread/
58
Fast! SVD and PCA
Alternative to prcomp:
library(rARPACK)
pc <- function(x, k) {
# First center data
xc <- scale(x, center = TRUE, scale = TRUE)
# Partial SVD
decomp <- svds(xc, k, nu = 0, nv = k)
return(xc %*% decomp$v)
}
pc_mat <- pc(your_matrix, 2)
59
data.table
Nice syntax9:
DT[i, j, by]
R i j by
SQL where select or update group by
9
https:
//github.com/Rdatatable/data.table/wiki/Getting-started
60
data.table
Advantages
• Memory efficient
• Fast
• Less code and functions
Reading large amount of .csv:
files <- list.files(pattern = "*.csv")
DT <- rbindlist(lapply(files, function(x)
cbind(fread(x), gsub(".csv", "", x))))
61
data.table
Easy subsetting
data.frame style:
DF[DF$col1 > 4 & DF$col2 == "R", "col3"]
data.table style:
DT[col1 > 4 & col2 == "R", .(col3)]
Expressions
.N, .SD
DT[, .N, by = .(INDUSTRY, SUB_INDUSTRY)]
DT[, lapply(.SD, sumNA)]
62
data.table
Changing DT by reference10
:=, set(), setkey(), setnames()
DF$R <- rep("love", Inf)
DT[, R := rep("love", Inf)]
for(i in from:to)
set(DT, row, column, new value)
10
shallow vs deep copy
63
Vectorization and apply
library(parallel)
lmCoef <- function(X, Y) {
beta <- solve(t(X) %*% X) %*% t(X) %*% as.vector(Y)
as.vector(beta)
}
cl <- makeCluster(detectCores())
clusterExport(cl, varlist = c("lmCoef", "modelMatrix",
"loadMatrix"), envir = environment())
lm_mat <- parSapply(cl, 1:nrow(loadMatrix), function(i)
lmCoef(modelMatrix, as.vector(loadMatrix[i,])))
stopCluster(cl)
64
Spark + R
sparklyr - Perform SQL queries through the sparklyr dplyr
interface. MLlib + extensions.
sparklyr http://spark.rstudio.com/index.html
SparkR https://spark.apache.org/docs/2.0.1/sparkr.html
Comparison http://stackoverflow.com/questions/39494484/
sparkr-vs-sparklyr
HPC https://cran.r-project.org/web/views/
HighPerformanceComputing.html
65
Graphics
ggplot2, googleVis, ggVis, dygraphs, plotly, animation, shiny...
66
Summary
• More accurate predictions are important.
• Clustering of electricity consumers have lot of applications.
• Prediction methods - different methods needs different
approaches.
• Tips for more effective computing (programming) in R.
Blog: https://petolau.github.io/
Publications:
https://www.researchgate.net/profile/Peter_Laurinec
Contact: laurinec.peter@gmail.com
67

Weitere ähnliche Inhalte

Andere mochten auch

Visual Design with Data
Visual Design with DataVisual Design with Data
Visual Design with DataSeth Familian
 
The Safety And Sanity Of Entrepreneurship
The Safety And Sanity Of EntrepreneurshipThe Safety And Sanity Of Entrepreneurship
The Safety And Sanity Of EntrepreneurshipTaylorPearson.me
 
Learning About At-Risk Veterans Using 
Online Network Surveys
Learning About At-Risk Veterans Using 
Online Network SurveysLearning About At-Risk Veterans Using 
Online Network Surveys
Learning About At-Risk Veterans Using 
Online Network SurveysSean Taylor
 
Restitution de l'étude - le musée de demain - GECE - 2017
Restitution de l'étude  - le musée de demain  - GECE - 2017Restitution de l'étude  - le musée de demain  - GECE - 2017
Restitution de l'étude - le musée de demain - GECE - 2017Institut de sondages
 
Outlier analysis for Temporal Datasets
Outlier analysis for Temporal DatasetsOutlier analysis for Temporal Datasets
Outlier analysis for Temporal DatasetsQuantUniversity
 
STP in Residential complexes
STP in Residential complexesSTP in Residential complexes
STP in Residential complexesADDA
 
Bioo Scientific - Reduced Bias Small RNA Library Prep with Gel-Free or Low-In...
Bioo Scientific - Reduced Bias Small RNA Library Prep with Gel-Free or Low-In...Bioo Scientific - Reduced Bias Small RNA Library Prep with Gel-Free or Low-In...
Bioo Scientific - Reduced Bias Small RNA Library Prep with Gel-Free or Low-In...Bioo Scientific
 
R in finance: Introduction to R and Its Applications in Finance
R in finance: Introduction to R and Its Applications in FinanceR in finance: Introduction to R and Its Applications in Finance
R in finance: Introduction to R and Its Applications in FinanceLiang C. Zhang (張良丞)
 
Analisis La estrategia del Caracol
Analisis La estrategia del Caracol Analisis La estrategia del Caracol
Analisis La estrategia del Caracol allyson17h
 
Intro, nouns & pronouns
Intro, nouns & pronounsIntro, nouns & pronouns
Intro, nouns & pronounsSandy Anthony
 
Nouns pronouns
Nouns pronounsNouns pronouns
Nouns pronounsjkmurton
 
Putting the Magic in Data Science
Putting the Magic in Data SciencePutting the Magic in Data Science
Putting the Magic in Data ScienceSean Taylor
 

Andere mochten auch (18)

Customer Journey Analytics and Big Data
Customer Journey Analytics and Big DataCustomer Journey Analytics and Big Data
Customer Journey Analytics and Big Data
 
Bill Aulet GEC2016 keynote speech March 16 2016 Medellin Colombia
Bill Aulet GEC2016 keynote speech March 16 2016 Medellin ColombiaBill Aulet GEC2016 keynote speech March 16 2016 Medellin Colombia
Bill Aulet GEC2016 keynote speech March 16 2016 Medellin Colombia
 
Visual Design with Data
Visual Design with DataVisual Design with Data
Visual Design with Data
 
The Safety And Sanity Of Entrepreneurship
The Safety And Sanity Of EntrepreneurshipThe Safety And Sanity Of Entrepreneurship
The Safety And Sanity Of Entrepreneurship
 
Learning About At-Risk Veterans Using 
Online Network Surveys
Learning About At-Risk Veterans Using 
Online Network SurveysLearning About At-Risk Veterans Using 
Online Network Surveys
Learning About At-Risk Veterans Using 
Online Network Surveys
 
Rug hogan-10-03-2012
Rug hogan-10-03-2012Rug hogan-10-03-2012
Rug hogan-10-03-2012
 
Jsm big-data
Jsm big-dataJsm big-data
Jsm big-data
 
Restitution de l'étude - le musée de demain - GECE - 2017
Restitution de l'étude  - le musée de demain  - GECE - 2017Restitution de l'étude  - le musée de demain  - GECE - 2017
Restitution de l'étude - le musée de demain - GECE - 2017
 
Presentación
PresentaciónPresentación
Presentación
 
Outlier analysis for Temporal Datasets
Outlier analysis for Temporal DatasetsOutlier analysis for Temporal Datasets
Outlier analysis for Temporal Datasets
 
STP in Residential complexes
STP in Residential complexesSTP in Residential complexes
STP in Residential complexes
 
Bioo Scientific - Reduced Bias Small RNA Library Prep with Gel-Free or Low-In...
Bioo Scientific - Reduced Bias Small RNA Library Prep with Gel-Free or Low-In...Bioo Scientific - Reduced Bias Small RNA Library Prep with Gel-Free or Low-In...
Bioo Scientific - Reduced Bias Small RNA Library Prep with Gel-Free or Low-In...
 
MPLS Layer 3 VPN
MPLS Layer 3 VPN MPLS Layer 3 VPN
MPLS Layer 3 VPN
 
R in finance: Introduction to R and Its Applications in Finance
R in finance: Introduction to R and Its Applications in FinanceR in finance: Introduction to R and Its Applications in Finance
R in finance: Introduction to R and Its Applications in Finance
 
Analisis La estrategia del Caracol
Analisis La estrategia del Caracol Analisis La estrategia del Caracol
Analisis La estrategia del Caracol
 
Intro, nouns & pronouns
Intro, nouns & pronounsIntro, nouns & pronouns
Intro, nouns & pronouns
 
Nouns pronouns
Nouns pronounsNouns pronouns
Nouns pronouns
 
Putting the Magic in Data Science
Putting the Magic in Data SciencePutting the Magic in Data Science
Putting the Magic in Data Science
 

Ähnlich wie RSlovakia #1 meetup

New Clustering-based Forecasting Method for Disaggregated End-consumer Electr...
New Clustering-based Forecasting Method for Disaggregated End-consumer Electr...New Clustering-based Forecasting Method for Disaggregated End-consumer Electr...
New Clustering-based Forecasting Method for Disaggregated End-consumer Electr...Peter Laurinec
 
Is Unsupervised Ensemble Learning Useful for Aggregated or Clustered Load For...
Is Unsupervised Ensemble Learning Useful for Aggregated or Clustered Load For...Is Unsupervised Ensemble Learning Useful for Aggregated or Clustered Load For...
Is Unsupervised Ensemble Learning Useful for Aggregated or Clustered Load For...Peter Laurinec
 
Graph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraGraph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraJason Riedy
 
OpenCL applications in genomics
OpenCL applications in genomicsOpenCL applications in genomics
OpenCL applications in genomicsUSC
 
ES_SAA_OG_PF_ECCTD_Pos
ES_SAA_OG_PF_ECCTD_PosES_SAA_OG_PF_ECCTD_Pos
ES_SAA_OG_PF_ECCTD_PosSyed Asad Alam
 
Virus, Vaccines, Genes and Quantum - 2020-06-18
Virus, Vaccines, Genes and Quantum - 2020-06-18Virus, Vaccines, Genes and Quantum - 2020-06-18
Virus, Vaccines, Genes and Quantum - 2020-06-18Aritra Sarkar
 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsNithyananthSengottai
 
SAIP2015 presentation_v5
SAIP2015 presentation_v5SAIP2015 presentation_v5
SAIP2015 presentation_v5Thembelani Gina
 
Wereszczynski Molecular Dynamics
Wereszczynski Molecular DynamicsWereszczynski Molecular Dynamics
Wereszczynski Molecular DynamicsSciCompIIT
 
R Packages for Time-Varying Networks and Extremal Dependence
R Packages for Time-Varying Networks and Extremal DependenceR Packages for Time-Varying Networks and Extremal Dependence
R Packages for Time-Varying Networks and Extremal DependenceWork-Bench
 
Simplicial closure and higher-order link prediction --- SIAMNS18
Simplicial closure and higher-order link prediction --- SIAMNS18Simplicial closure and higher-order link prediction --- SIAMNS18
Simplicial closure and higher-order link prediction --- SIAMNS18Austin Benson
 
Thesis (presentation)
Thesis (presentation)Thesis (presentation)
Thesis (presentation)nlt2390
 
Advanced Statistics And Probability (MSC 615
Advanced Statistics And Probability (MSC 615Advanced Statistics And Probability (MSC 615
Advanced Statistics And Probability (MSC 615Maria Perkins
 
Timing-pulse measurement and detector calibration for the OsteoQuant®.
Timing-pulse measurement and detector calibration for the OsteoQuant®.Timing-pulse measurement and detector calibration for the OsteoQuant®.
Timing-pulse measurement and detector calibration for the OsteoQuant®.Binu Enchakalody
 
Multiple Resonant Multiconductor Transmission line Resonator Design using Cir...
Multiple Resonant Multiconductor Transmission line Resonator Design using Cir...Multiple Resonant Multiconductor Transmission line Resonator Design using Cir...
Multiple Resonant Multiconductor Transmission line Resonator Design using Cir...Sasidhar Tadanki
 
Power System Oscillations Detection Estimation & Control
Power System Oscillations Detection Estimation & ControlPower System Oscillations Detection Estimation & Control
Power System Oscillations Detection Estimation & ControlPower System Operation
 
Predictive Modelling
Predictive ModellingPredictive Modelling
Predictive ModellingRajiv Advani
 

Ähnlich wie RSlovakia #1 meetup (20)

New Clustering-based Forecasting Method for Disaggregated End-consumer Electr...
New Clustering-based Forecasting Method for Disaggregated End-consumer Electr...New Clustering-based Forecasting Method for Disaggregated End-consumer Electr...
New Clustering-based Forecasting Method for Disaggregated End-consumer Electr...
 
Is Unsupervised Ensemble Learning Useful for Aggregated or Clustered Load For...
Is Unsupervised Ensemble Learning Useful for Aggregated or Clustered Load For...Is Unsupervised Ensemble Learning Useful for Aggregated or Clustered Load For...
Is Unsupervised Ensemble Learning Useful for Aggregated or Clustered Load For...
 
Graph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraGraph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear Algebra
 
OpenCL applications in genomics
OpenCL applications in genomicsOpenCL applications in genomics
OpenCL applications in genomics
 
ES_SAA_OG_PF_ECCTD_Pos
ES_SAA_OG_PF_ECCTD_PosES_SAA_OG_PF_ECCTD_Pos
ES_SAA_OG_PF_ECCTD_Pos
 
Virus, Vaccines, Genes and Quantum - 2020-06-18
Virus, Vaccines, Genes and Quantum - 2020-06-18Virus, Vaccines, Genes and Quantum - 2020-06-18
Virus, Vaccines, Genes and Quantum - 2020-06-18
 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering concepts
 
SAIP2015 presentation_v5
SAIP2015 presentation_v5SAIP2015 presentation_v5
SAIP2015 presentation_v5
 
Wereszczynski Molecular Dynamics
Wereszczynski Molecular DynamicsWereszczynski Molecular Dynamics
Wereszczynski Molecular Dynamics
 
R Packages for Time-Varying Networks and Extremal Dependence
R Packages for Time-Varying Networks and Extremal DependenceR Packages for Time-Varying Networks and Extremal Dependence
R Packages for Time-Varying Networks and Extremal Dependence
 
Simplicial closure and higher-order link prediction --- SIAMNS18
Simplicial closure and higher-order link prediction --- SIAMNS18Simplicial closure and higher-order link prediction --- SIAMNS18
Simplicial closure and higher-order link prediction --- SIAMNS18
 
Thesis (presentation)
Thesis (presentation)Thesis (presentation)
Thesis (presentation)
 
[Slides]L2.pptx
[Slides]L2.pptx[Slides]L2.pptx
[Slides]L2.pptx
 
Advanced Statistics And Probability (MSC 615
Advanced Statistics And Probability (MSC 615Advanced Statistics And Probability (MSC 615
Advanced Statistics And Probability (MSC 615
 
Timing-pulse measurement and detector calibration for the OsteoQuant®.
Timing-pulse measurement and detector calibration for the OsteoQuant®.Timing-pulse measurement and detector calibration for the OsteoQuant®.
Timing-pulse measurement and detector calibration for the OsteoQuant®.
 
Multiple Resonant Multiconductor Transmission line Resonator Design using Cir...
Multiple Resonant Multiconductor Transmission line Resonator Design using Cir...Multiple Resonant Multiconductor Transmission line Resonator Design using Cir...
Multiple Resonant Multiconductor Transmission line Resonator Design using Cir...
 
Power System Oscillations Detection Estimation & Control
Power System Oscillations Detection Estimation & ControlPower System Oscillations Detection Estimation & Control
Power System Oscillations Detection Estimation & Control
 
community detection
community detectioncommunity detection
community detection
 
Predictive Modelling
Predictive ModellingPredictive Modelling
Predictive Modelling
 
MMseqs NGS 2014
MMseqs NGS 2014MMseqs NGS 2014
MMseqs NGS 2014
 

Kürzlich hochgeladen

Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxVenkatasubramani13
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerPavel Šabatka
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionajayrajaganeshkayala
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Guido X Jansen
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024Becky Burwell
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?sonikadigital1
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructuresonikadigital1
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)Data & Analytics Magazin
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxDwiAyuSitiHartinah
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.JasonViviers2
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best PracticesDataArchiva
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introductionsanjaymuralee1
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityAggregage
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptaigil2
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationGiorgio Carbone
 

Kürzlich hochgeladen (17)

Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual intervention
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introduction
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .ppt
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 

RSlovakia #1 meetup

  • 1. Using Clustering Of Electricity Consumers To Produce More Accurate Predictions #1 R<-Slovakia meetup Peter Laurinec 22.3.2017 FIIT STU
  • 3. Why are we here? Rrrr CRAN - about 10300 packages - https://cran.r-project.org/web/packages/. Popularity - http://redmonk.com/sogrady/: 1
  • 4. Why prediction of electricity consumption? Important for: • Distribution (utility) companies. Producers of electricity. Overload. • Deregularization of market • Buy and sell of electricity. • Source of energy - wind and photo-voltaic power-plant. Hardly predictable. • Active consumers. producer + consumer ⇒ prosumer. • Optimal settings of bill. 2
  • 5. Smart grid • smart meter • demand response • smart cities Good for: • Sustainability - Blackouts... • Green energy • Saving energy • Dynamic tariffs ⇒ saving money 3
  • 6. Slovak data • Number of consumers are 21502. From that 11281 are OK. Enterprises. • Time interval of measurements: 01.07.2011 – 14.05.2015. But mainly from the 01.07.2013. 96 measurements per day. Irish data: • Number of consumers are 6435. From that 3639 are residences. • Time interval: 14.7.2009 – 31.12.2010. 48 measurements per day. 4
  • 7. Smart meter data OOM_ID DIAGRAM_ID TIME LOAD TYPE_OF_OOM DATE ZIP 1: 11 202004 −45 4.598 O 01/01/2014 4013 2: 11 202004 195 4.087 O 01/01/2014 4013 3: 11 202004 −30 5.108 O 01/01/2014 4013 4: 11 202004 345 4.598 O 01/01/2014 4013 5: 11 202004 825 2.554 O 01/01/2014 4013 6: 11 202004 870 2.554 O 01/01/2014 4013 41312836: 20970 14922842 90 18.783 O 14/02/2015 4011 41312837: 20970 14922842 75 20.581 O 14/02/2015 4011 41312838: 20970 14922842 60 18.583 O 14/02/2015 4011 41312839: 20970 14922842 45 18.983 O 14/02/2015 4011 41312840: 20970 14922842 30 17.384 O 14/02/2015 4011 41312841: 20970 14922842 15 18.583 O 14/02/2015 4011 5
  • 8. Random consumer from Kosice vol. 1 6
  • 9. Random consumer from Kosice vol. 2 7
  • 11. Random consumer from Ireland 9
  • 12. Aggregate load from Ireland 10
  • 14. STLF - Short Term Load Forecasting • Prediction (forecast) for 1 day ahead (96 resp. 48 measurements). Short term forecasting. • Strong time dependence. Seasonalities (daily, weekly, yearly). Holidays. Weather. • Accuracy of predictions (in %) - MAPE (Mean Absolute Percentage Error). MAPE = 100 × 1 n n∑ t=1 |xt − ˆxt| xt 11
  • 15. Methods for prediction of time series Time series analysis ARIMA, Holt-Winters exponential smoothing, decomposition TS... Linear regression Multiple linear regression, robust LR, GAM (Generalized Additive Model). Machine learning • Neural nets • Support Vector Regression • Regression trees and forests Ensemble learning Linear combination of predictions. 12
  • 17. Time series analysis methods ARIMA, exponential smoothing and co. • Not best for multiple seasonal time series. • Solution: Creation of separate models for different days in week. • Solution: Set higher period - 7 ∗ period. We need at least 3 periods in training set. • Solution: Double-seasonal Holt-Winters (dshw, tbats). • Solution: SARIMA. ARIMAX - ARIMA plus extra predictors by FFT (auto.arima(ts, xreg)). 14
  • 18. Fourier transform fourier in forecast load_msts <- msts(load, seasonal.periods = c(freq, freq * 7)) fourier(load_msts, K = c(6, 6*2)) 15
  • 19. Time series analysis • Solution: Decomposition of time series to three parts - seasonal, trend, remainder (noise). • Many different methods. STL decomposition (seasonal decomposition of time series by loess). • Moving Average (decompose). Exponential smoothing... 16
  • 20. Decomposition - STL stl, ggseas with period 48 17
  • 21. Decomposition - STL vol.2 stl, ggseas with period 48 ∗ 7 18
  • 22. Decomposition - EXP tbats in forecast 19
  • 23. Temporal Hierarchical Forecasting thief - http://robjhyndman.com/hyndsight/thief/ 20
  • 24. Regression methods • MLR (OLS), RLM 1 (M-estimate), SVR 2 • Dummy (binary) variables3 Load D1 D2 D3 D4 D5 D6 D7 D8 D9 … W1 W2 … 1: 0.402 1 0 0 0 0 0 0 0 0 1 0 2: 0.503 0 1 0 0 0 0 0 0 0 1 0 3: 0.338 0 0 1 0 0 0 0 0 0 1 0 4: 0.337 0 0 0 1 0 0 0 0 0 1 0 5: 0.340 0 0 0 0 1 0 0 0 0 1 0 6: 0.340 0 0 0 0 0 1 0 0 0 1 0 7: 0.340 0 0 0 0 0 0 1 0 0 1 0 8: 0.338 0 0 0 0 0 0 0 1 0 1 0 9: 0.339 0 0 0 0 0 0 0 0 1 1 0 ... ... ... ... 1 rlm - MASS 2 ksvm(..., eps, C) - kernlab 3 model.matrix(∼ 0 + as.factor(Day) + as.factor(Week)) 21
  • 25. Linear Regression Interactions! MLR and GAM 4. 4 https://petolau.github.io/ 22
  • 26. Trees and Forests • Dummy variables aren’t appropriate. • CART 5, Extremely Randomized Trees 6, Bagging 7. Daily and Weekly seasonal vector: Day = (1, 2, . . . , freq, 1, 2, . . . , freq, . . . ) Week = (1, 1, . . . , 1, 2, 2, . . . , 2, . . . , 7, 1, 1, . . . ), where freq is a period (48 resp. 96). 5 rpart 6 extraTrees 7 RWeka 23
  • 28. Forests and ctree • Ctree (party), Extreme Gradient Boosting (xgboost), Random Forest. Daily and weekly seasonal vector in form of sine and cosine: Day = (1, 2, . . . , freq, 1, 2, . . . , freq, . . . ) Week = (1, 1, . . . , 1, 2, 2, . . . , 2, . . . , 7, 1, 1, . . . ) sin(2π Day freq ) + 1 2 resp. cos(2π Day freq ) + 1 2 sin(2π Week 7 ) + 1 2 resp. cos(2π Week 7 ) + 1 2 , where freq is a period (48 resp. 96). • lag, seasonal part from decomposition 25
  • 29. Neural nets • Feedforward, recurrent, multilayer perceptron (MLP), deep. • Lagged load. Denoised (detrended) load. • Model for different days - similar day approach. • Sine and cosine • Fourier Figure 1: Day Figure 2: Week 26
  • 30. Ensemble learning • We can’t determine which model is best. Overfitting. • Wisdom of crowds - set of models. Model weighting ⇒ linear combination. • Adaptation of weights of prediction methods based on prediction error - median weighting. et j = median(| xt − ˆx t j |) wt+1 j = wt j median(et ) et j Training set – sliding window. 27
  • 31. Ensemble learning • Heterogeneous ensemble model. • Improvement of 8 − 12%. • Weighting of methods – optimization task. Genetic algorithm, PSO and others. • Improvement of 5, 5%. Possible overfit. 28
  • 32. Ensemble learning with clustering ctree + rpart trained on different data sets (sampling with replacement). ”Weak learners”. Pre-process - PCA → K-means. + unsupervised. 29
  • 34. R <- ”Cluster Analysis” 30
  • 36. Creation of more predictable groups of consumers Cluster Analysis! Unsupervised learning... 32
  • 37. Cluster analysis in smart grids Good for: 33
  • 38. Cluster analysis in smart grids Good for: • Creation of consumer profiles, which can help by recommendation to customers to save electricity. 33
  • 39. Cluster analysis in smart grids Good for: • Creation of consumer profiles, which can help by recommendation to customers to save electricity. • Detection of anomalies. 33
  • 40. Cluster analysis in smart grids Good for: • Creation of consumer profiles, which can help by recommendation to customers to save electricity. • Detection of anomalies. • Neater monitoring of the smart grid (as a whole). 33
  • 41. Cluster analysis in smart grids Good for: • Creation of consumer profiles, which can help by recommendation to customers to save electricity. • Detection of anomalies. • Neater monitoring of the smart grid (as a whole). • Emergency analysis. 33
  • 42. Cluster analysis in smart grids Good for: • Creation of consumer profiles, which can help by recommendation to customers to save electricity. • Detection of anomalies. • Neater monitoring of the smart grid (as a whole). • Emergency analysis. • Generation of more realistic synthetic data. 33
  • 43. Cluster analysis in smart grids Good for: • Creation of consumer profiles, which can help by recommendation to customers to save electricity. • Detection of anomalies. • Neater monitoring of the smart grid (as a whole). • Emergency analysis. • Generation of more realistic synthetic data. • Last but not least, improve prediction methods to produce more accurate predictions. 33
  • 45. Approach Aggregation with clustering 1. Set of time series of electricity consumption 34
  • 46. Approach Aggregation with clustering 1. Set of time series of electricity consumption 2. Normalization (z-score) 34
  • 47. Approach Aggregation with clustering 1. Set of time series of electricity consumption 2. Normalization (z-score) 3. Computation of representations of time series 34
  • 48. Approach Aggregation with clustering 1. Set of time series of electricity consumption 2. Normalization (z-score) 3. Computation of representations of time series 4. Determination of optimal number of clusters K 34
  • 49. Approach Aggregation with clustering 1. Set of time series of electricity consumption 2. Normalization (z-score) 3. Computation of representations of time series 4. Determination of optimal number of clusters K 5. Clustering of representations 34
  • 50. Approach Aggregation with clustering 1. Set of time series of electricity consumption 2. Normalization (z-score) 3. Computation of representations of time series 4. Determination of optimal number of clusters K 5. Clustering of representations 6. Summation of K time series by found clusters 34
  • 51. Approach Aggregation with clustering 1. Set of time series of electricity consumption 2. Normalization (z-score) 3. Computation of representations of time series 4. Determination of optimal number of clusters K 5. Clustering of representations 6. Summation of K time series by found clusters 7. Training of K forecast models and the following forecast 34
  • 52. Approach Aggregation with clustering 1. Set of time series of electricity consumption 2. Normalization (z-score) 3. Computation of representations of time series 4. Determination of optimal number of clusters K 5. Clustering of representations 6. Summation of K time series by found clusters 7. Training of K forecast models and the following forecast 8. Summation of forecasts and evaluation 34
  • 53. Pre-processing Normalization of time series - consumers. Best choice is to use z-score: x − µ σ Alternative: x max(x) 35
  • 55. Representations of time series Why time series representations? 1. Reduce memory load. 2. Accelerate subsequent machine learning algorithms. 3. Implicitly remove noise from the data. 4. Emphasize the essential characteristics of the data. 37
  • 56. Methods of representations of time series PAA - Piecewise Aggregate Approximation. Non data adaptive. n -> d. ˆX = (ˆx1, . . . ,ˆxd). ˆxi = d n (n/d)i ∑ j=(n/d)(i−1)+1 xj. Not only average. Median, standard deviation, maximum... DWT, DFT... 38
  • 57. Back to models - model based methods • Representations based on statistical model. • Extraction of regression coefficients ⇒ creation of daily profiles. • Creation of representation which is long as frequency of time series. yi = β1xi1 + β2xi2 + · · · + βfreqxifreq + βweek1xiweek1 + · · · + βweek7xiweek7 + εi, where i = 1, . . . , n New representation: ˆβ = (ˆβ1, . . . , ˆβfreq, ˆβweek1, . . . , ˆβweek7). Possible applicable methods: MLR, Robust Linear Model, L1-regression, Generalized Additive Model (GAM)... 39
  • 58. Model based methods • Triple Holt-Winters Exponential Smoothing. Last seasonal coefficients as representation. 1. Smoothing factors can be set automatically or manually. • Mean and median daily profile. 40
  • 61. Clipping - bit level representation Data dictated. ˆxt = { 1 if xt > µ 0 otherwise 43
  • 62. Clipping - RLE RLE - Run Length Encoding. Sliding window - one day. 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 6 1 12 7 2 1 1 10 2 6 20 7 2 2 1 1 5 1 7 1 1 4 1 15 3 13 1 2 1 3 1 2 1 1 3 1 5 1 6 1 6 4 5 1 3 3 3 1 2 1 2 44
  • 63. Clipping - final representation 1 7 1 10 6 7 2 1 1 1 1 3 1 1 1 1 1 1 1 4 1 3 1 1 6 12 2 1 2 20 2 1 5 7 0 4 15 13 2 3 2 1 3 5 6 6 5 3 Feature extraction: average, maximum, -average, -maximum … 45
  • 64. Figure 3: Slovakia - factories Figure 4: Ireland - residential
  • 65. Figure 5: Ireland - median daily profile
  • 66. K-means Euclidean distance. Centroids. K-means++ - careful seeding of centroids. K-medoids, DBSCAN... 48
  • 67. Finding optimal K Manually or automatically set number of clusters. By intern validity measure. Davies-Bouldin index... 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 123 90 139 477 11 468 263 184 94 136 765 224 195 55 81 82 160 90 2 49
  • 73. Results Static data set One clustering → prediction with sliding window. Irish. GAM, RLM, PAA - 5.23% → okolo 4.22% MAPE. Slovak. Clip10, LM, DWT - 4.15% → okolo 3.97% MAPE. STL + EXP, STL + ARIMA, SVR, Bagging. Comparison of 17 representations. Batch processing Batch clustering → prediction with sliding window (14 days). Irish. RLM, Median, GAM. Comparison of 10 prediction methods. Significant improvement in DSHW, STL + ARIMA, STL + ETS, SVR, RandomForest, Bagging and MLP. Best result: Bagging with GAM 3.68% against 3.82% MAPE. 55
  • 74. R <- Tips and Tricks 55
  • 75. Packages List of all packages by name + short description. (https://cran.r-project.org/web/packages/available_ packages_by_name.html) 56
  • 76. Task Views List of specific tasks (domains) - description of available packages and functions. (https://cran.r-project.org/web/views/) 57
  • 77. Matrix Calculus MRO 8 . The Intel Math Kernel Library (MKL). ˆβ = (XT X)−1 XT y 8 https://mran.microsoft.com/documents/rro/multithread/ 58
  • 78. Fast! SVD and PCA Alternative to prcomp: library(rARPACK) pc <- function(x, k) { # First center data xc <- scale(x, center = TRUE, scale = TRUE) # Partial SVD decomp <- svds(xc, k, nu = 0, nv = k) return(xc %*% decomp$v) } pc_mat <- pc(your_matrix, 2) 59
  • 79. data.table Nice syntax9: DT[i, j, by] R i j by SQL where select or update group by 9 https: //github.com/Rdatatable/data.table/wiki/Getting-started 60
  • 80. data.table Advantages • Memory efficient • Fast • Less code and functions Reading large amount of .csv: files <- list.files(pattern = "*.csv") DT <- rbindlist(lapply(files, function(x) cbind(fread(x), gsub(".csv", "", x)))) 61
  • 81. data.table Easy subsetting data.frame style: DF[DF$col1 > 4 & DF$col2 == "R", "col3"] data.table style: DT[col1 > 4 & col2 == "R", .(col3)] Expressions .N, .SD DT[, .N, by = .(INDUSTRY, SUB_INDUSTRY)] DT[, lapply(.SD, sumNA)] 62
  • 82. data.table Changing DT by reference10 :=, set(), setkey(), setnames() DF$R <- rep("love", Inf) DT[, R := rep("love", Inf)] for(i in from:to) set(DT, row, column, new value) 10 shallow vs deep copy 63
  • 83. Vectorization and apply library(parallel) lmCoef <- function(X, Y) { beta <- solve(t(X) %*% X) %*% t(X) %*% as.vector(Y) as.vector(beta) } cl <- makeCluster(detectCores()) clusterExport(cl, varlist = c("lmCoef", "modelMatrix", "loadMatrix"), envir = environment()) lm_mat <- parSapply(cl, 1:nrow(loadMatrix), function(i) lmCoef(modelMatrix, as.vector(loadMatrix[i,]))) stopCluster(cl) 64
  • 84. Spark + R sparklyr - Perform SQL queries through the sparklyr dplyr interface. MLlib + extensions. sparklyr http://spark.rstudio.com/index.html SparkR https://spark.apache.org/docs/2.0.1/sparkr.html Comparison http://stackoverflow.com/questions/39494484/ sparkr-vs-sparklyr HPC https://cran.r-project.org/web/views/ HighPerformanceComputing.html 65
  • 85. Graphics ggplot2, googleVis, ggVis, dygraphs, plotly, animation, shiny... 66
  • 86. Summary • More accurate predictions are important. • Clustering of electricity consumers have lot of applications. • Prediction methods - different methods needs different approaches. • Tips for more effective computing (programming) in R. Blog: https://petolau.github.io/ Publications: https://www.researchgate.net/profile/Peter_Laurinec Contact: laurinec.peter@gmail.com 67