5. • Most widely used data analysis software
• Most powerful statistical programming language
• Create beautiful and unique data visualizations
• Thriving open-source community
• Fills the talent gap
www.revolutionanalytics.com/what-is-r
6.
7. • 1993: Research project in Auckland, NZ
• 1995: Released as open-source software
• 1997: R core group formed
• 2000: R 1.0.0 released
• 2003: R Foundation formed in Austria
• 2004: First international user conference
• 2007: Revolution Analytics founded
• 2009: New York Times article on R
• 2013: Revolution R Open released
• 2015: Microsoft acquires Revolution Analytics
Photo credit: Robert Gentleman
15. • Enhanced Open Source R distribution
• Compatible with all R-related software
• Multi-threaded for performance
• Focus on reproducibility
• Open source (GPLv2 license)
• Available for Windows, Mac OS X, Ubuntu,
Red Hat and OpenSUSE
• Download from
mran.revolutionanalytics.com
16. • Built on latest R engine
• 100% compatible with
• Designed to work with RStudio
17. • Multithreaded library replaces standard BLAS/LAPACK algorithms
• High-performance algorithms
• Sequential → Parallel
• No need to change any R code
• Included with RRO binary distributions
More at Revolutions blog
24. • Toolkits for data scientists and numerical analysts to create custom
parallel and distributed algorithms
• Mainly useful for “embarrassingly parallel” problems, where
parallel components work with small amounts of data
• Big Data Predictive Analytics mostly not embarrassingly parallel
Details at projects.revolutionanalytics.com
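As a rough illustration of the "embarrassingly parallel" pattern described above, here is a minimal Python sketch using only the standard library. It is not the Revolution Analytics toolkit; the function names `count_hits` and `estimate_pi` are hypothetical. Each task works on its own small, independent batch of random samples, and the only coordination is summing the results at the end.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_hits(seed: int, n_samples: int) -> int:
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)  # each task gets its own seeded generator
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_tasks: int, n_samples: int, executor_cls=ThreadPoolExecutor) -> float:
    """Run n_tasks independent simulations and combine them into one estimate of pi."""
    seeds = range(n_tasks)
    with executor_cls(max_workers=n_tasks) as pool:
        # Tasks share no data; they only return a single count each.
        hits = list(pool.map(count_hits, seeds, [n_samples] * n_tasks))
    return 4.0 * sum(hits) / (n_tasks * n_samples)


if __name__ == "__main__":
    print(estimate_pi(n_tasks=4, n_samples=20_000))
```

A thread pool is used here only to keep the sketch portable. For CPU-bound work in Python, passing `concurrent.futures.ProcessPoolExecutor` as `executor_cls` would be the more likely choice.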
25. is….
the only big data big analytics platform
based on open source R
the de facto statistical computing language for
modern analytics
26.
27. Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Cluster Analysis
• K-Means
Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes
Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation
Variable Selection
• Stepwise Regression
Combination
• PEMA-R API
• rxDataStep
• rxExec
Legend: New in v7.3; Coming in v7.4
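Several of the descriptive statistics listed above (mean, variance, standard deviation) can be computed over data too large for memory by summarizing one chunk at a time and merging the summaries. The Python sketch below illustrates that general chunk-and-combine idea using the standard pairwise merge formula for variance. It is an assumption-laden illustration, not the product's implementation, and `Moments`, `summarize_chunk`, `combine`, and `summarize` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass
class Moments:
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    @property
    def variance(self) -> float:
        """Sample variance; undefined (NaN) for fewer than two observations."""
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def summarize_chunk(chunk: Sequence[float]) -> Moments:
    """Summarize one in-memory chunk of data."""
    n = len(chunk)
    if n == 0:
        return Moments()
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return Moments(n, mean, m2)


def combine(a: Moments, b: Moments) -> Moments:
    """Merge two summaries without revisiting the underlying data."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Moments(n, mean, m2)


def summarize(chunks: Iterable[Sequence[float]]) -> Moments:
    """Fold over chunks; only one chunk needs to be in memory at a time."""
    total = Moments()
    for chunk in chunks:
        total = combine(total, summarize_chunk(chunk))
    return total


if __name__ == "__main__":
    result = summarize([[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]])
    print(result.n, result.mean, result.variance)
```

Because `combine` needs only the two summaries, per-chunk summaries could also be produced by separate workers and merged afterwards.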
28.
29. • ETL
• Marketing channel data
• Behavioral variables
• Promotional data
• Overlay data
• Exploratory data analysis
• Time-to-event models
• GAM survival models
• Scoring for inference
• Scoring for prediction
• 5 billion scores per day
per retailer
CUSTOM DATA
FORMAT
CUSTOM VARIABLES
(PMML)
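To put the "5 billion scores per day per retailer" figure in perspective, this short Python calculation converts it to a per-second rate. It assumes the load is spread evenly across the day, which the slide does not state.

```python
SCORES_PER_DAY = 5_000_000_000  # figure quoted on the slide, per retailer
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: scoring load is uniform over the day.
scores_per_second = SCORES_PER_DAY / SECONDS_PER_DAY
print(f"{scores_per_second:,.0f} scores per second")
```

Under that assumption, the sustained rate works out to roughly 58 thousand scores per second.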
31. • Exposing the expertise of data scientists as APIs
• Bringing the utility of data science to applications
• Addressing the Data Science talent gap
32. Azure: Huge infrastructure scale
19 Regions ONLINE…huge datacenter capacity around the world…and we’re growing
100+ datacenters
One of the top 3 networks in the world (coverage, speed, connections)
2x AWS's and 6x Google's number of offered regions
G Series – Largest VM available in the market – 32 cores, 448 GB RAM, SSD…
Regions (map legend: Operational, Announced)
• Central US: Iowa
• West US: California
• North Europe: Ireland
• East US: Virginia
• East US 2: Virginia
• US Gov: Virginia
• North Central US: Illinois
• US Gov: Iowa
• South Central US: Texas
• Brazil South: Sao Paulo
• West Europe: Netherlands
• China North *: Beijing
• China South *: Shanghai
• Japan East: Saitama
• Japan West: Osaka
• India West: TBD
• India East: TBD
• East Asia: Hong Kong
• SE Asia: Singapore
• Australia West: Melbourne
• Australia East: Sydney
* Operated by 21Vianet
39. Data Scientist
Interact directly with data
Built-in to SQL Server
Data Developer/DBA
Manage data and
analytics together
SQL Server 2016
Built-in in-database analytics
Example Solutions
• Fraud detection
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance
Relational Data
Analytic Library
T-SQL Interface
Extensibility
R Integration
Microsoft Azure
Machine Learning Marketplace
New R scripts
42. Thank you
Download Revolution R Open:
mran.revolutionanalytics.com
More at:
blog.revolutionanalytics.com
David Smith
R Community Lead
Revolution Analytics
@revodavid
davidsmi@microsoft.com
Xbox: http://blog.revolutionanalytics.com/2014/05/microsoft-uses-r-for-xbox-matchmaking.html
Other gaming: http://blog.revolutionanalytics.com/2013/06/how-big-data-and-statistical-modeling-are-changing-video-games.html
Infinite scale inexpensively
Tons of data from which you actually have to get value
Customers that have a very high expectation of service and connection – Pier 1 great example
Influx of new talent to fill a very big gap, which McKinsey says is 300 thousand in the US alone
But the market this new talent is entering is still filled with barriers
Enterprise readiness
Performance architecture
Big Data analytics
Data source integration
Development tools
Deployment tools
Demographics: consumer, product, market
Actions: web clicks, email clicks, mobile app usage, call center logs, social, search …
Outcomes: impressions, touches, orders (retail, online, mobile)
Strategic allocation
Outcome is “buying” instead of “dying”
Over the last few years we’ve truly delivered a huge infrastructure to enable us to grow our services at scale around the globe. Whether it’s our flagship facilities in Quincy, Washington or Boydton, Virginia, or some of the newly announced facilities in Shanghai, Australia and Brazil, it really is key for us to make smart investments around the world to deliver services in a resilient and reliable fashion.
A lot of people ask, what goes into site selection at Microsoft and how do we decide where to place our datacenter investments? There are over thirty-five factors in our site selection criteria. But really, the top elements are around proximity to customers and energy and fiber infrastructure, ensuring that we have the capacity and the growth platforms to be able to grow our services.
Another key element is about skilled workforce. We need to ensure that we have the right people to run and operate our datacenters on a day to day basis.
Work done in conjunction with a major Teradata user and household name in Silicon Valley.
Chart shows results of moving R algorithm execution inside Teradata EDW – achieving combined benefits from scaling computation and slashing data movement.