Microsoft R enables enterprise-wide, scalable experimental data science and operational machine learning by providing a collection of servers and tools that extend the capabilities of open-source R. In these slides, we give a quick introduction to the Microsoft R Server architecture and a comprehensive overview of ScaleR, the core library of Microsoft R, which enables parallel execution and uses external data frames (XDFs). It is a tutorial-like presentation covering how to: 1) set up the environment, 2) read data, 3) process & transform, 4) analyse, summarize, visualize, 5) learn & predict, and finally 6) deploy and consume (using mrsdeploy).
Hello everyone, and welcome to the last day of SQLBits…
My name is Khalid Salama. I work at Hitachi Consulting, in the Business Insights & Analytics practice, focusing on designing and delivering Data & Analytics Solutions.
In this session, I would like to explore with you the various Microsoft technologies that can help you operationalize your Machine Learning pipelines and enable
scalable data science.
Well, to be fair, it’s more of an engineering session than a data science one. However, I think it is an important topic to discuss because
data science is often perceived as an experimental, isolated activity…
Yet in many contemporary applications, especially with the rise of digital transformation and IoT, your data science products need to be incorporated into your operational systems,
and your ML pipelines need to be an integral part of your ETL process.
So, we will try to touch on the various Microsoft options for performing both experimental data science and operational ML.
So without further ado, we have a lot of ground to cover…
I’ll start with a very quick intro to data science; I assume everybody here has “a” background in data science.
Then I’ll give some insights into the difference between exploratory data science and operational ML.
After that, we are going to delve into the MS technologies for Advanced Analytics and show several demos….
And finally, I will conclude with some general remarks.
Now let’s take a look at the activities of any data science process,
to try to distinguish between experimental data science and operational machine learning.
It starts with an exploratory data analysis phase…
After being presented with an analytics problem, you start with collecting the relevant data and importing it to your environment…
Then you blend this data by performing some generic data engineering tasks, such as merging, joining, aggregating, and so on….
After that, you apply some machine learning-specific data preparation tasks, also known as feature engineering, including feature construction, extraction, and selection, as well as feature tuning, like scaling, handling missing values & outliers, and so on.
The output of this phase is a learning dataset that will be used in your ML experimentation phase.
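To make the feature tuning step a bit more concrete, here is a minimal Python sketch of two of the tasks just mentioned: filling missing values and scaling. It is only an illustration under simple assumptions (a single numeric column, mean imputation, z-score scaling); the function name `impute_and_scale` and the sample `ages` column are hypothetical and are not part of Microsoft R or ScaleR.

```python
from statistics import mean, pstdev


def impute_and_scale(values):
    """Fill missing entries (None) with the column mean, then z-score scale.

    Assumes at least one observed (non-None) value in the column.
    """
    observed = [v for v in values if v is not None]
    mu = mean(observed)
    # Mean imputation keeps the column mean unchanged.
    filled = [mu if v is None else v for v in values]
    sigma = pstdev(filled)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in filled]
    return [(v - mu) / sigma for v in filled]


# Hypothetical column with one missing value.
ages = [20, None, 40, 60]
scaled_ages = impute_and_scale(ages)
```

The imputed entry lands exactly on the mean, so it scales to zero, and the scaled column as a whole is centred on zero.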
In this phase, you perform iterative steps of training & testing to select the algorithm & parameters that
produce the model that best captures the hidden patterns in your data…
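As a rough sketch of that train-and-test loop, the Python snippet below tries a few candidate parameter values on a held-out test set and keeps the one with the lowest error. Everything here is an assumption made for illustration: the toy data (y = 2x), the tiny nearest-neighbour regressor, and the names `knn_predict`, `holdout_error`, and `select_k`. A real experiment would use a proper ML library and cross-validation.

```python
def knn_predict(train, x, k):
    """Predict y at x as the average target of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def holdout_error(train, test, k):
    """Mean squared error of the k-NN predictor on the held-out test set."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in test) / len(test)


def select_k(train, test, candidates):
    """Evaluate every candidate k and return the best one plus all scores."""
    scores = {k: holdout_error(train, test, k) for k in candidates}
    best = min(scores, key=scores.get)
    return best, scores


# Toy learning dataset: every 4th point is held out for testing.
data = [(float(x), 2.0 * x) for x in range(20)]
train_set = [p for i, p in enumerate(data) if i % 4 != 0]
test_set = [p for i, p in enumerate(data) if i % 4 == 0]

best_k, scores = select_k(train_set, test_set, candidates=[1, 3, 5])
```

The same loop shape applies whether you are iterating over one parameter, several parameters, or different algorithms altogether.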
The final output of this whole experimentation phase is a report of findings, along with comprehensive visuals. This can be in the form of a Markdown file, for example using Jupyter notebooks, that tells the end-to-end data analysis story and supports reproducibility.
These results may lead to a specific decision or recommendation.
In some scenarios, these results are the ultimate output of the data science activity
However, in many other scenarios, where you need repeated and real-time intelligence, such as targeted advertising and recommender systems, you need to productionize the models produced from the previous data science process, and integrate them with your operational systems to perform online predictions and recommendations.
In that case, the whole ML pipeline, including data ingestion, processing, model training and/or scoring, needs to be a repeatable, automated process.
The process should produce a model that exposes a Web API, so that it can be integrated with your operational apps and consumed in real time.
Microsoft R Server
Probably the most important analytics product for Microsoft at the moment….
If you are an R developer, you will probably know that open-source R has scalability limitations, because it is single-threaded and in-memory only…
You need to use commercial R libraries to make your program multi-threaded, to process your data partly in-memory and partly on-disk, so that you can handle data sizes bigger than your workstation’s memory, and to run your R app on a cluster for distributed computing and scaling your data processing…
Well, Microsoft has acquired a company that builds such libraries, called Revolution Analytics, and included their open-source libraries in Microsoft R Open (MRO), and their commercial ones in Microsoft R Server (MRS).
Besides, Microsoft R Open has an enhanced Math Kernel Library, for more efficient mathematical computations,
and it is compatible with all R-related software.
So in this comparison, you can see that Microsoft R Server processes the data in-memory as well as on-disk (using external data frames, or XDFs, which we will see shortly).
It is multi-threaded, and supports distributed computing to scale for big data processing.
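To give a feel for what “partly in-memory and partly on-disk” means, here is a small Python sketch of chunked processing: the file is read a few rows at a time, and only running totals are kept in memory. This is a conceptual illustration only, not how XDF files or ScaleR are implemented; the CSV file, the `value` column, and the function names are all hypothetical.

```python
import csv
import os
import tempfile


def iter_chunks(path, column, chunk_size):
    """Yield lists of at most chunk_size floats read from one CSV column."""
    with open(path, newline="") as f:
        chunk = []
        for row in csv.DictReader(f):
            chunk.append(float(row[column]))
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk


def chunked_summary(path, column, chunk_size=1000):
    """Compute count, mean and population variance one chunk at a time."""
    n, total, total_sq = 0, 0.0, 0.0
    for chunk in iter_chunks(path, column, chunk_size):
        # Only the running totals survive between chunks.
        n += len(chunk)
        total += sum(chunk)
        total_sq += sum(v * v for v in chunk)
    mean = total / n
    return {"n": n, "mean": mean, "variance": total_sq / n - mean ** 2}


def write_demo_csv(n_rows):
    """Write a throwaway CSV with a single 'value' column holding 0..n_rows-1."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["value"])
        for i in range(n_rows):
            writer.writerow([i])
    return path


demo_path = write_demo_csv(10)
summary = chunked_summary(demo_path, "value", chunk_size=3)
os.remove(demo_path)
```

Because each chunk only contributes to a few running totals, memory use depends on the chunk size rather than the file size, and independent chunks could in principle be handed to different threads or machines.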
Let’s take a closer look at the main components of MS R Server.
ScaleR – The core library in MS R, optimized for parallel execution; it uses external data frames to overcome the memory limitation
ConnectR – provides access to various data sources, including distributed file systems and relational databases
DistributedR – allows your R application to run in different execution contexts, including distributed ones
So you can write your application once, and with a few lines of code, you can configure your application to run in a different execution context in order to scale it
MS R Server Operationalization – allows you to deploy your R models, on a configured R Server, as Web APIs (similar to what we have seen in Azure ML) using the mrsdeploy library
Let’s have a look at some sample MS R code