Microsoft R - ScaleR Overview

© Copyright 2015 Hitachi Consulting
Microsoft R
ScaleR Overview with a Quick Tutorial
Khalid M. Salama, Ph.D.
Business Insights & Analytics
Hitachi Consulting UK
We Make it Happen. Better.

Outline
 Experimental Data Science vs Operational Machine Learning
 Microsoft R Server
 Overview on ScaleR
 How to: Setup Environment
 How to: Get Data
 How to: Process & Transform
 How to: Summarize, Analyse, and Visualize
 How to: Learn & Predict
 How to: Deploy and Consume (mrsdeploy)
 Overview on MicrosoftML package functionality

Experimental Data Science vs Operational Machine Learning

Data Science Activities
Experimentation vs. Operationalization
Data Analysis & Experimentation
[Diagram labels: Collect Data, Blend, Visualize, Prepare; Exploratory Data Analysis, Report of Visuals & Findings, Decision!; ML Experiment (Algorithm Selection, Parameter Tuning, Training & Testing), Learning, Dataset, Model]
 Interactive
 Easy to perform
 Rich Visualizations

Data Science Activities
Experimentation vs. Operationalization
Operational ML Pipelines
[Diagram labels: Automated ML Pipeline (Data Ingestion, Data Processing, Model Training, Scoring); Model: Train, Export, Deploy; Predict: Batch, Real-time; Web APIs; Online Apps]
 Pipelined (ETL Integration)
 Scalable
 Apps Integration

Microsoft R Server

Microsoft R Server
R in Microsoft World
Microsoft R Open (MRO)
 Based on the latest open-source R (3.2.2); built, tested, and distributed by Microsoft
 More efficient and multi-threaded computation
 Enhanced by Intel Math Kernel Library (MKL) to speed up linear algebra functions
 Compatible with all R-related software

Microsoft R Server
Comparison
Property | CRAN | MRO | MRS
Data size | In-memory | In-memory | In-memory & disk
Efficiency | Single-threaded | Multi-threaded | Multi-threaded, parallel processing on 1:N servers
Support | Community | Community | Community + Commercial
Functionality | 7500+ innovative analytic packages | 7500+ innovative analytic packages | 7500+ innovative packages + commercial parallel high-speed functions
Licence | Open Source | Open Source | Commercial licence

Microsoft R Server
Components & Compute Contexts
[Diagram labels: Microsoft R Server; CRAN & MS R Open; ScaleR; DistributedR; ConnectR; MicrosoftML package; Operationalization (mrsdeploy); RStudio | RTVS; MS R Client; Scale & Deploy; Different Compute Contexts]
 Installed on Windows or Linux
 ScaleR – Optimized for parallel execution on Big Data, to eliminate memory limitations.
 ConnectR – Provides access to local file systems, HDFS, Hive, SQL Server, Teradata, etc.
 DistributedR – Adaptable parallel execution framework that enables running on different (distributed) compute contexts.
 Operationalization (mrsdeploy) – Deploy the model as a Web API.
https://msdn.microsoft.com/en-us/microsoft-r/microsoft-r-getting-started

Microsoft R
ScaleR Summary Map
Setup
1- Get Information
 Revo.home()
 Revo.version
 rxGetComputeContext()
 rxGetFileSystem()
 rxOptions()
2- Set Properties
 rxSetComputeContext(): RxLocalSeq, RxLocalParallel, RxInSqlServer, RxInTeradata, RxHadoopMR, RxSpark
 rxSetFileSystem(): RxNativeFileSystem, RxHdfsFileSystem
 rxOptions(option = value)
Import Data
1- Reference to a Data Source
 RxTextData()
 RxSqlServerData()
 RxOdbcData()
 RxTeradata()
 RxSasData()
 RxSpssData()
 RxHiveData()
 RxParquetData()
2- Import Data to XDF
 rxImport()
3- Reference XDF
 RxXdfData()
4- View Data Information
 rxGetInfo()
Process & Transform
rxDataStep()
 inData (ref to data source)
 outFile (xdf)
 overwrite (the outFile if it exists)
 varsToKeep (column selection)
 rowSelection (filter)
 transformObjects (objects needed in your process)
 transformPackages (packages needed in your process)
 transformFunc (function with your processing logic)
rxMerge()
 inData1
 inData2
 outFile
 matchVars
 type (match type)
Others
 rxSplit()
 rxSort()
 rxFactors()
Summarize
 rxSummary(), rxQuantile(), rxCrossTabs(), rxCube() (formula, data)
 rxMarginals(), as.xtabs() (crossTabs)
Analyse
 rxCovCor(), rxCor(), rxSSCP() (formula, data)
 rxChiSquaredTest(), rxFisherTest(), rxKendallCor(), rxRiskRatio(), rxOddsRatio() (xtab)
Visualize
 rxHistogram()
 rxLinePlot()
 rxRocCurve()
Learn & Predict
Classification (formula, data)
 rxDTree()
 rxBTrees()
 rxDForest()
 rxNaiveBayes()
 rxLogit()
Regression (formula, data)
 rxLinMod()
 rxGlm()
 rxDTree()
 rxBTrees()
Clustering (formula, data)
 rxKmeans()
Predict
 rxPredict(model, data)
 rxRoc()
Deploy
mrsdeploy
 remoteLogin()
 listServices()
 getService()
 publishService()
 api$consume()
Microsoft R – ScaleR
Get Information
Revo.version – query the version of the current ScaleR installation.
Revo.home() – get the path of the currently used R. Make sure it is Microsoft R (Client or Server), not open-source R.
rxGetComputeContext() – get the current compute context. You can set the current compute context to many different options, as shown next.
rxGetFileSystem() – get the default file system used. You can change the currently used file system from "native" to "hdfs", as shown next.
rxOptions() – list all the ScaleR configuration options and their current values. You can get the value of a specific option using rxGetOption("optionName").

Set Information
rxSetComputeContext(computeContext) – set the compute context. The following are the available options; each is a computeContext object (each needs different parameters to construct):
 RxLocalSeq()
 RxLocalParallel()
 RxInSqlServer()
 RxInTeradata()
 RxHadoopMR()
 RxSpark()
rxSetFileSystem(fileSystem) – the fileSystem object can be one of the following two options:
 RxNativeFileSystem()
 RxHdfsFileSystem()
rxOptions(option = value) – used to set an option. Note that these are global default values: they are used when an operation does not specify its own value, and you can override them in each operation.
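
The get/set calls above can be combined into a short session. This is a minimal sketch, assuming a machine where the RevoScaleR package is installed (Microsoft R Client or Server); the guard lets the script run as a no-op elsewhere. The choice of RxLocalParallel() and of the reportProgress option is only for illustration.

```r
# Sketch only: runs the ScaleR calls when RevoScaleR is available, otherwise skips them.
has_scaler <- requireNamespace("RevoScaleR", quietly = TRUE)

if (has_scaler) {
  library(RevoScaleR)

  # Remember the current compute context so it can be restored afterwards.
  previous_context <- rxGetComputeContext()

  # Switch to the local parallel compute context (illustrative choice).
  rxSetComputeContext(RxLocalParallel())
  print(rxGetComputeContext())

  # Set a global default option, then read it back.
  rxOptions(reportProgress = 0)
  print(rxGetOption("reportProgress"))

  # Restore the original compute context.
  rxSetComputeContext(previous_context)
} else {
  message("RevoScaleR is not installed; skipping the compute-context example.")
}
```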

Get Data
1. Reference a Data Source – The following functions reference various data sources:
 RxTextData()
 RxOdbcData()
 RxSqlServerData()
 RxTeradata()
 RxSasData()
 RxSpssData()
 RxHiveData()
 RxParquetData()
2. Import the data to an eXternal Data Frame (xdf) – Note that you can query the data in the data source, but you need to import it to xdf to be able to process it in your compute context.
rxImport( inData = dataSource, outFile = xdfFile.xdf )
 overwrite = Boolean flag: whether to replace an existing xdf file
 append = use "rows" to append to the same .xdf file
3. Read the imported xdf data
RxXdfData( file = xdfFile.xdf )
 createCompositeSet = set to TRUE if you point to a directory that contains multiple .xdf files, to treat them as one dataset.

Reference a Data Source
file_path = file.path(data_directory, "iris.csv")
txtDataSource = RxTextData(file = file_path)
OR
connection_string = "Driver=SQL Server; Server=.; Database=dbdemo; Trusted_Connection = True;"
sql_query = "SELECT * FROM iris;"
sqlDataSource = RxSqlServerData(connectionString = connection_string, sqlQuery = sql_query)
Note: this is only a reference to the data source; nothing is done with the data until you query it, e.g. head(txtDataSource)

Import to xdf
xdf_file_path = file.path(data_directory, "iris.xdf")
iris_xdata = rxImport( inData = txtDataSource, outFile = xdf_file_path,
    overwrite = TRUE, append = "none" )
 inData = any "Rx" data source, or a file path
 outFile = file to store the .xdf dataset
 overwrite = Boolean flag: whether to replace an existing xdf file
 append = use "rows" to append to the same .xdf file
This creates the iris.xdf file in your file system and returns iris_xdata, a reference for working with the dataset.
You can read the .xdf file later:
iris_xdata = RxXdfData( file = xdf_file_path )
class(iris_xdata)

Describing xdf
rxGetInfo( data = iris_xdata, getVarInfo = TRUE, numRows = 2)
rxSummary(formula = ~., data = iris_xdata)

Read a subset of xdf to a data frame
iris_subset = rxReadXdf(file = iris_xdata, startRow = 10, numRows = 5)
 iris_subset = in-memory data frame
 file = xdf data source (or path to an .xdf file)
 startRow = first row to retrieve
 numRows = number of rows to retrieve
Sometimes it is useful to read a (small) subset of the xdf into a data frame, to test a processing function on it before applying it to the big data (xdf).

Process & Transform
Remember that your compute context can be a distributed processing cluster: HPC, Spark, Hadoop, etc.
In that case, each node of the compute cluster processes a subset of your xdf, which is itself split across HDFS.
Your data processing operation needs to take this into account, i.e., all the needed objects and packages must be available on the local node that processes each data portion.
The rxDataStep() function is used to process and transform an xdf dataset, and can be used to perform the following:
 Filter rows
 Select columns
 Add computed columns
 Convert column types (e.g. discretize to factors)
 Update existing columns (handling missing values, scaling & normalizing, etc.)
rxDataStep(…)
 inData = xdf to process
 outFile = can be the same as the input xdf. If omitted, the function returns a data frame
 overwrite = set to TRUE if inData = outFile
 rowSelection = (col1 > 50) & …
 varsToKeep = character vector of columns to select
 transformFunc = a function that has the processing logic
 transformObjects = list of objects used in the function
 transformPackages = character vector of packages used in the function

Process & Transform
Extract means and stdvs (will be used to normalize some columns)
rxsummary = rxSummary(~.,iris_xdata)
str(rxsummary$sDataFrame)
means = rxsummary$sDataFrame$Mean
stdvs = rxsummary$sDataFrame$StdDev
Extract quantiles for Sepal.Length (will be used to discretize it)
cut_points = rxQuantile(varName = "Sepal.Length", data = iris_xdata)
cut_points

Process & Transform
Create data processing function
process_data = function(data_frame){
    # discretize (include.lowest = TRUE keeps the minimum value in the first bin)
    data_frame$Sepal.Length_Disc = cut(data_frame$Sepal.Length, breaks = cut_points, include.lowest = TRUE)
    # normalize
    data_frame$Petal.Length_norm = (data_frame$Petal.Length - means[3])/stdvs[3]
    data_frame$Petal.Width_norm = (data_frame$Petal.Width - means[4])/stdvs[4]
    return(data_frame)
}
Note the following:
 The function expects a data frame, which will be a subset of the xdf dataset running on a compute node
 cut_points, means, and stdvs are variables that will be available to the scope of this function when passed
via the rxDataStep() function
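
Because process_data() takes an ordinary data frame, it can be tried on a small in-memory sample before it is handed to rxDataStep(). The following is a sketch under stated assumptions rather than the deck's own code: it uses R's built-in iris data frame in place of the xdf, base R mean(), sd(), and quantile() as stand-ins for the rxSummary() and rxQuantile() results, and indexes the statistics by name instead of by position.

```r
# Stand-ins for the values that rxSummary() / rxQuantile() would supply
# (assumption: computed here with base R on the built-in iris data frame).
numeric_columns <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
means <- sapply(iris[numeric_columns], mean)
stdvs <- sapply(iris[numeric_columns], sd)
cut_points <- quantile(iris$Sepal.Length, probs = seq(0, 1, 0.25))

# Same logic as the transformation function above, with statistics looked up by name.
process_data <- function(data_frame) {
  # discretize Sepal.Length into quartile bins (lowest value included)
  data_frame$Sepal.Length_Disc <- cut(data_frame$Sepal.Length,
                                      breaks = cut_points,
                                      include.lowest = TRUE)
  # normalize the petal measurements
  data_frame$Petal.Length_norm <-
    (data_frame$Petal.Length - means[["Petal.Length"]]) / stdvs[["Petal.Length"]]
  data_frame$Petal.Width_norm <-
    (data_frame$Petal.Width - means[["Petal.Width"]]) / stdvs[["Petal.Width"]]
  data_frame
}

# A small sample (five rows per species) plays the role of one data chunk.
sample_rows <- iris[c(1:5, 51:55, 101:105), ]
result <- process_data(sample_rows)
str(result)
```

If the function behaves as expected on the sample, the same function can then be passed as transformFunc, with the statistics supplied through transformObjects.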

Process & Transform
Execute the process_data function on the iris_xdata
rxDataStep(inData = iris_xdata, outFile = iris_xdata, overwrite = TRUE,
    rowSelection = !is.na(Species),
    transformFunc = process_data,
    transformObjects = list(
        "cut_points" = cut_points,
        "means" = means,
        "stdvs" = stdvs
    )
)

Summarize & Analyse
Understand variable dependencies & correlations
formula = ~ Species+Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
rxCovCor(formula, data = iris_xdata, type = "Cor")
 “Cor” = correlation
 “Cov” = covariance
 “SSCP” = sum of squares / cross-products

Summarize & Analyse
Summarize data (generate sums, means, and counts)
using cross tabs
formula = Sepal.Width ~ Sepal.Length_Disc:Species
ctabs = rxCrossTabs(formula, data = iris_xdata, means = TRUE)
ctabs$sums
ctabs$means
ctabs$counts

Summarize & Analyse
Summarize cross tab results
summary(ctabs, output = "means")
Get Margins
rxMarginals(ctabs, output = "sums")
Perform Statistical Dependency test
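
The slide names a dependency test without showing one. As an in-memory illustration only, the sketch below runs base R's chisq.test() on a contingency table of discretized Sepal.Length against Species, built from R's built-in iris data frame. This is an assumption-laden stand-in: on an xdf, the ScaleR summary map lists rxChiSquaredTest() for cross-tab results, which is not exercised here.

```r
# Discretize Sepal.Length into quartile bins (base R stand-in for the
# Sepal.Length_Disc column created in the Process & Transform step).
cut_points <- quantile(iris$Sepal.Length, probs = seq(0, 1, 0.25))
sepal_length_disc <- cut(iris$Sepal.Length, breaks = cut_points, include.lowest = TRUE)

# Contingency table of counts: Sepal.Length bin by Species.
counts <- table(sepal_length_disc, iris$Species)
print(counts)

# Chi-squared test of independence between the two factors.
dependency_test <- chisq.test(counts)
print(dependency_test)
```

A small p-value indicates that the Sepal.Length bin and the Species are not independent.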

Summarize & Analyse
Summarize using rxCube (to produce a long-format table)
formula = Petal.Width ~ F(Petal.Length)
rxCube(formula, data = iris_xdata)
 F(variable) converts the variable into a factor,
on the fly, using the distinct rounded values
of this variable

Visualize
rxHistogram(~Sepal.Length|Species, data = iris_xdata)

Learn & Predict
Classification Algorithms
 rxDTree() – Decision Trees for classification and regression. Can be converted to rpart tree models
 rxBTrees() – Gradient Boosted Trees
 rxDForest() – Random Forests
 rxNaiveBayes()
 rxLogit() – Logistic Regression Models
Regression Algorithms
 rxLinMod() – Linear Regression Models
 rxGlm() – Generalized Linear Models
 rxDTree()
 rxBTrees()
Clustering Algorithm
 rxKmeans()
All the algorithms accept the following parameters
 formula: response ~ input1+input2:input3
 data: learning set
 Other parameters depending on the algorithm

Learn & Predict – Decision Trees Example
rxDTree() is used to train classification trees (the target variable is categorical) and regression trees (the target variable is numeric).
The output is similar to an rpart tree model. The key parameters are:
 formula: response ~ input1+input2:input3
 data: training set
 xVal: number of cross-validation folds for pruning
 maxDepth: maximum number of tree levels (to control complexity)
 minBucket: minimum number of examples that must be in a leaf node (to control complexity)
formula = Species ~ Sepal.Length + Sepal.Width +
    Petal.Length + Petal.Width
models.dtree = rxDTree(formula, data = iris_xdata)
models.dtree
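
Since the text compares the output to an rpart model, the in-memory counterpart may help readers who want to see the complexity controls in action without ScaleR. This is a sketch, assuming the rpart package (shipped with standard R distributions) and the built-in iris data frame; mapping xVal, maxDepth, and minBucket to rpart.control()'s xval, maxdepth, and minbucket is an illustrative correspondence, and the specific values are arbitrary.

```r
library(rpart)

tree_formula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

# Complexity controls: cross-validation folds, maximum depth, minimum leaf size.
controls <- rpart.control(xval = 5, maxdepth = 3, minbucket = 5)

# Train an in-memory classification tree on the built-in iris data frame.
tree <- rpart(tree_formula, data = iris, method = "class", control = controls)
print(tree)

# Predict the class for each row and compute the training accuracy.
predicted <- predict(tree, newdata = iris, type = "class")
accuracy <- mean(predicted == iris$Species)
print(accuracy)
```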

# get predictions, in the form of probabilities
predictions = rxPredict(models.dtree, data = iris_xdata,
    type = c("prob"))
# select only the columns of actual and predicted values (as a data frame)
predictions = rxDataStep(predictions,
    varsToKeep = c("Species",
        "setosa_Pred",
        "versicolor_Pred", "virginica_Pred"),
    transforms = list(
        setosa_actual = as.numeric(Species == 'setosa'),
        versicolor_actual = as.numeric(Species == 'versicolor'),
        virginica_actual = as.numeric(Species == 'virginica')
    )
)
# display the prediction results
rxGetInfo(predictions, getVarInfo = TRUE, numRows = 5)
# plot the ROC curve (with respect to versicolor predictions)
rxRocCurve(actualVarName = "versicolor_actual",
    predVarNames = c("versicolor_Pred"),
    data = predictions)

# compute accuracy
predictions = rxPredict(models.dtree, data = iris_xdata,
    type = c("class"))
predictions = rxReadXdf(predictions,
    varsToKeep = c("Species", "Species_Pred"))
# fraction of rows where the predicted class matches the actual class
accuracy = sum(predictions$Species == predictions$Species_Pred) / nrow(predictions)
print(accuracy)
# use Revo Tree View to show the tree
tree = RevoTreeView::createTreeView(models.dtree)
plot(tree)
# convert to an rpart tree model
rpart_tree = as.rpart(models.dtree)
class(rpart_tree)
# export to PMML format
library(pmml)
pmml(rpart_tree)

Parallel Processing on Partitioned Data
In some cases, instead of building one “Big” model using all your “Big” data, you build “many” models using “small” subsets of the data.
For example, building many time-series models, one for each product line, for demand forecasting, or several regression models, one for each geographic area, for fraud detection.
This is also called a mixture of local models.
In this case, your data is partitioned into (smaller) subsets by a certain criterion, and then local models are built, one for each data subset.
Such a process can be performed in parallel using the rxExecBy() function, which takes the following parameters:
 inData = xdf dataset to be partitioned
 keys = character vector of the names of the dataset columns by which the data will be partitioned. These columns should be of type factor
 func = the function that will be applied to each data partition (i.e., learning a local model)
 rxExecBy() returns a list containing the constructed model of each partition
[Diagram: the Dataset is partitioned into Subset 1, Subset 2, and Subset 3; a Learn step on each subset produces Local Model 1, Local Model 2, and Local Model 3 (Parallel Learning)]

Parallel Processing on Partitioned Data
For example, using the iris dataset, let's build a regression model that estimates Sepal.Length based on Sepal.Width, for each Species type.
In other words, we will partition the iris dataset into 3 subsets, one for each Species type (setosa, versicolor, virginica), and build a local model for each partition, in parallel.
xdf = RxTextData(file = file.path(data_directory, "iris.csv"))
buildLocalModels = function(keys, data){
    # import the partition, then fit the local model on it
    local_xdf = rxImport(inData = data)
    local_model = rxLinMod(formula = Sepal.Length ~ Sepal.Width, data = local_xdf)
    return(local_model)
}
local_models = rxExecBy(inData = xdf, keys = c("Species"),
    func = buildLocalModels)
local_models[[1]]$result
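
The same partition-then-fit pattern can be sketched in memory with base R, which may help when checking the per-partition function before running it through rxExecBy(). This is an illustrative analogue only, assuming the built-in iris data frame: split() plays the role of the keys-based partitioning, lapply() applies the function to each partition sequentially rather than in parallel, and lm() stands in for rxLinMod().

```r
# Partition the data frame by the key column (one subset per Species).
partitions <- split(iris, iris$Species)

# Function applied to each partition: fit a local linear model.
build_local_model <- function(partition) {
  lm(Sepal.Length ~ Sepal.Width, data = partition)
}

# One local model per partition, returned as a named list.
local_models <- lapply(partitions, build_local_model)

# Compare the Sepal.Width slope across the local models.
slopes <- sapply(local_models, function(model) coef(model)[["Sepal.Width"]])
print(slopes)
```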

Microsoft R – mrsdeploy
Deploy & Consume
In order to deploy an R model as a web API, you need to configure an MS R Server for operationalization by running the R-Server-Admin-Util, as described at this link: https://msdn.microsoft.com/en-us/microsoft-r/operationalize/about

Microsoft R – mrsdeploy
Deploy & Consume
library(mrsdeploy)
# generate data
x = 1:100
y = 2*x + rnorm(n = length(x), mean = 0, sd = 5)
# build a linear model
reg_model = lm(y ~ x)
# create a prediction function: takes input, and uses the lm to estimate the output
estimate_output = function(input){
    newdata = data.frame(x = input)
    estimates = predict(reg_model, newdata = newdata, type = "response")
    return(estimates)
}
# connect to the R Server to deploy into
remoteLogin("http://localhost:12800", username = "admin", password = <password>)
# paste0() avoids a space inside the service name
serviceName <- paste0("estimate_output_", round(as.numeric(Sys.time()), 0))
# publish the prediction function
api = publishService( serviceName, code = estimate_output,
    model = reg_model, # model to be used in the function
    inputs = list(input = "numeric"),
    outputs = list(output = "numeric"),
    v = "v1.0.0")
# query the published API
api
# list the deployed APIs
mrsdeploy::listServices()
# consume the API
result = api$estimate_output(120)
result$output("output")
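
Before publishing, it can be worth checking the prediction function locally, since it is ordinary R code. The sketch below is a local check only and does not touch mrsdeploy or an R Server; the fixed seed and the unname() call are additions made here for a reproducible, plain numeric result.

```r
# Reproducible synthetic data: y is roughly 2 * x plus noise.
set.seed(42)
x <- 1:100
y <- 2 * x + rnorm(n = length(x), mean = 0, sd = 5)

# Fit the linear model that the service would wrap.
reg_model <- lm(y ~ x)

# Prediction function: takes a numeric input and returns the model's estimate.
estimate_output <- function(input) {
  newdata <- data.frame(x = input)
  unname(predict(reg_model, newdata = newdata))
}

# Call the function locally with the same input used against the API.
estimate <- estimate_output(120)
print(estimate)
```

The estimate should land close to 240, since the data were generated with a slope of 2.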

Microsoft R – MicrosoftML
MicrosoftML Overview
Machine Learning Algorithms
 rxFastLinear() – binary classification & regression
 rxOneClassSvm() – anomaly detection (unsupervised)
 rxFastTrees() – classification & regression
 rxFastForest() – classification & regression
 rxNeuralNet() – classification & regression
 rxLogisticRegression() – classification (binary & multi-class)
 rxEnsemble() – combine a number of models of various kinds
Text Processing
 featurizeText() – TF, IDF, TF-IDF
 getSentiment() – using a pretrained model
Image Processing
 featurizeImage() – using a pretrained model
 loadImage()
 resizeImage()
 extractPixels() – extracts the pixel values from an image
Other Processing
 selectFeatures() – using minCount or mutualInfo
 categorical() – converts a categorical variable to indicator columns
 categoricalHash() – converts a categorical variable to indicator columns using hashing (used with variables that have many values)
https://msdn.microsoft.com/en-us/microsoft-r/microsoftml-get-started

My Background
Applying Computational Intelligence in Data Mining
 Honorary Research Fellow, School of Computing , University of Kent.
 Ph.D. Computer Science, University of Kent, Canterbury, UK.
 28+ published journal and conference papers in the fields of AI and ML
https://www.researchgate.net/profile/Khalid_Salama https://www.linkedin.com/in/khalid-salama-24403144/
https://github.com/khalid-m-salama/sqlbits-2017

Thanks!
