R at Microsoft

• Introduction to R
• Applications of R at Microsoft
• R Products at Microsoft
• What’s coming for R at Microsoft
• Q&A

April 6, 2015
“This acquisition will help customers use advanced analytics within Microsoft data platforms.”

• Most widely used data analysis software
• Most powerful statistical programming language
• Create beautiful and unique data visualizations
• Thriving open-source community
• Fills the talent gap
www.revolutionanalytics.com/what-is-r

• 1993: Research project in Auckland, NZ
• 1995: Released as open-source software
• 1997: R core group formed
• 2000: R 1.0.0 released
• 2003: R Foundation formed in Austria
• 2004: First international user conference
• 2007: Revolution Analytics founded
• 2009: New York Times article on R
• 2013: Revolution R Open released
• 2015: Microsoft acquires Revolution Analytics
Photo credit: Robert Gentleman

R Usage Growth: Rexer Data Miner Survey, 2007-2013
Language Popularity: IEEE Spectrum Top Programming Languages, July 2014 (R ranked #9)
blog.revolutionanalytics.com/popularity

New York Times, June 25 2009
(3 hours after Michael Jackson’s death)

Traditional BI: What happened? Why did it happen?
Advanced Analytics: What will happen? How can we make it happen?

• System monitoring & alerting
• Capacity Planning

• TrueSkill Matchmaking System
• Player Churn
• Game design
• In-game purchase optimization
• Fraud detection
• Player communities

• Enhanced Open Source R distribution
• Compatible with all R-related software
• Multi-threaded for performance
• Focus on reproducibility
• Open source (GPLv2 license)
• Available for Windows, Mac OS X, Ubuntu, Red Hat and OpenSUSE
• Download from mran.revolutionanalytics.com

• Built on latest R engine
• 100% compatible with
• Designed to work with RStudio

• Multithreaded library replaces standard BLAS/LAPACK algorithms
• High-performance algorithms
• Sequential → Parallel
• No need to change any R code
• Included with RRO binary distributions
More at Revolutions blog
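As a rough illustration of "no need to change any R code", the sketch below times ordinary base-R linear algebra. R hands these operations to whatever BLAS/LAPACK library the distribution was built with, so the same script can be run unchanged on different R builds and the timings compared. The matrix sizes and variable names are arbitrary choices made for this example, not figures from the talk.

```r
# Plain base-R linear algebra; nothing here is specific to RRO.
set.seed(42)
n <- 1000   # rows (illustrative size)
p <- 200    # columns (illustrative size)
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

# crossprod() and %*% are delegated to BLAS; solve() uses LAPACK.
elapsed <- system.time({
  xtx  <- crossprod(x)    # t(x) %*% x, a p-by-p matrix
  inv  <- solve(xtx)      # inverse of the cross-product matrix
  proj <- x %*% inv       # an n-by-p matrix product
})["elapsed"]

cat("Elapsed seconds:", elapsed, "\n")
```

Running the identical script under a single-threaded and a multithreaded build is one simple way to see whether the linked math library makes a difference on a given machine.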

Adapted from http://xkcd.com/234/
CC BY-NC 2.5

• Static CRAN mirror
• Daily CRAN snapshots: mran.revolutionanalytics.com/snapshot
• Easily write and share scripts synced to a specific snapshot
[Diagram: at midnight UTC the checkpoint server copies the CRAN mirror (http://cran.revolutionanalytics.com/) into daily snapshots (http://mran.revolutionanalytics.com/snapshot/); the checkpoint package points a script at one snapshot]
library(checkpoint)
checkpoint("2014-09-17")

• Easy to use: add 2 lines to the top of each script
• For the package author:
• For a script collaborator:
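To make the idea of a script "synced to a specific snapshot" more concrete, here is a minimal base-R sketch that points a session's package repository at a dated snapshot address. This is not how the checkpoint package is implemented. The URL pattern (snapshot base address followed by the date) is an assumption built from the addresses shown in this deck, and the variable names are illustrative only.

```r
# Assumption: a dated snapshot lives at <snapshot base>/<YYYY-MM-DD>.
snapshot_date <- "2014-09-17"
snapshot_base <- "http://mran.revolutionanalytics.com/snapshot/"

# Fail early if the date is not a valid calendar date.
stopifnot(!is.na(as.Date(snapshot_date, format = "%Y-%m-%d")))

snapshot_url <- paste0(snapshot_base, snapshot_date)

# Point the session's CRAN repository at the snapshot (no network access yet).
old_repos <- getOption("repos")
options(repos = c(CRAN = snapshot_url))
pinned <- getOption("repos")
print(pinned)

# install.packages() would now resolve packages against `snapshot_url`.
# Restore the previous setting so the rest of the session is unaffected.
options(repos = old_repos)
```

The point of pinning a date is that every collaborator who runs the script resolves packages against the same repository state, whatever day they run it.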

• Download Revolution R Open
• Learn about R and RRO
• Daily CRAN snapshots
• Explore Packages
• Explore Task Views

• Toolkits for data scientists and numerical analysts to create custom parallel and distributed algorithms
• Mainly useful for “embarrassingly parallel” problems, where parallel components work with small amounts of data
• Big-data predictive analytics is mostly not embarrassingly parallel
Details at projects.revolutionanalytics.com
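An "embarrassingly parallel" problem is one where each piece of work can be done without talking to any other piece. The sketch below uses the `parallel` package that ships with base R, not the toolkits referred to above, and the task itself (`chunk_summary` applied to four small chunks) is a hypothetical stand-in chosen only to keep the example short.

```r
library(parallel)

# Hypothetical per-chunk task: depends only on its own chunk of data.
chunk_summary <- function(chunk) mean(chunk^2)

# Split a small data set into four independent chunks.
chunks <- split(1:1000, rep(1:4, each = 250))

# Run the chunks on two local worker processes.
cl <- makeCluster(2)
parallel_result <- parLapply(cl, chunks, chunk_summary)
stopCluster(cl)

# The sequential version gives the same answers, one chunk at a time.
sequential_result <- lapply(chunks, chunk_summary)

print(unlist(parallel_result))
```

Because no worker needs another worker's data, the chunks can be handed out in any order. Model fitting on big data usually lacks this property, which is the contrast the last bullet draws.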

… is the only big data, big analytics platform based on open source R, the de facto statistical computing language for modern analytics

Data Step
• Data import: Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user-defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Cluster Analysis
• K-Means
Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes
Variable Selection
• Stepwise Regression
Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation
Combination
• PEMA-R API
• rxDataStep
• rxExec
Slide callouts: New in v7.3; Coming in v7.4

• ETL
• Marketing channel data
• Behavioral variables
• Promotional data
• Overlay data
• Exploratory data analysis
• Time-to-event models
• GAM survival models
• Scoring for inference
• Scoring for prediction
• 5 billion scores per day per retailer
CUSTOM DATA FORMAT
CUSTOM VARIABLES (PMML)

• Exposing the expertise of data scientists as APIs
• Bringing the utility of data science to applications
• Addressing the Data Science talent gap

Azure: Huge infrastructure scale
19 regions online, with huge datacenter capacity around the world, and growing
• 100+ datacenters
• One of the top 3 networks in the world (coverage, speed, connections)
• 2x as many regions as AWS and 6x as many as Google
• G Series: largest VM available in the market (32 cores, 448 GB RAM, SSD…)
Regions (map legend: Operational, Announced)
• Central US: Iowa
• West US: California
• North Europe: Ireland
• East US: Virginia
• East US 2: Virginia
• US Gov: Virginia
• North Central US: Illinois
• US Gov: Iowa
• South Central US: Texas
• Brazil South: Sao Paulo
• West Europe: Netherlands
• China North *: Beijing
• China South *: Shanghai
• Japan East: Saitama
• Japan West: Osaka
• India West: TBD
• India East: TBD
• East Asia: Hong Kong
• SE Asia: Singapore
• Australia West: Melbourne
• Australia East: Sydney
* Operated by 21Vianet

http://blog.revolutionanalytics.com/2015/06/r-build-keynote.html/

WHAT’S COMING FOR R AT MICROSOFT

SQL Server 2016: Built-in in-database analytics
Data Scientist: Interact directly with data (built in to SQL Server)
Data Developer/DBA: Manage data and analytics together
Example Solutions
• Fraud detection
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance
[Diagram: Relational Data, Analytic Library, T-SQL Interface, Extensibility, R Integration; new R scripts from the Microsoft Azure Machine Learning Marketplace]

[Chart: minutes vs. rows, comparing R on a server pulling data via SQL with R on a server invoking RRE ScaleR inside the EDW]

Thank you
Download Revolution R Open:
mran.revolutionanalytics.com
More at:
blog.revolutionanalytics.com
David Smith
R Community Lead
Revolution Analytics
@revodavid
davidsmi@microsoft.com

More at deployr.revolutionanalytics.com
