[1] Microsoft is developing Apache Hadoop-based services for Windows Azure and on-premises use called "Project Isotope" to allow businesses to leverage Hadoop across platforms. [2] These services include hosted elastic Hadoop on Azure, an on-premises Hadoop solution for Windows, and tools to integrate Hadoop with Microsoft products. [3] Microsoft aims to make Hadoop easy to use at scale on premises or in the cloud with full support and integration across the Microsoft ecosystem.
1. APACHE HADOOP
ON AZURE AND WINDOWS
MICROSOFT’S APACHE HADOOP-BASED SERVICES FOR AZURE AND ENTERPRISE
ELASTIC MAPREDUCE FOR AZURE AND ENTERPRISE PRIVATE CLOUDS
Brad Sarsfield
Engineering Architect
Microsoft Big Data | Hadoop
March 2012 | revision 1.02
2. ISOTOPE BRIDGES BI TO COLLABORATION TO CLOUD
“The next frontier is all about uniting the power of the cloud
with the power of data to gain insights that simply weren’t
possible even just a few years ago”
Ted Kummert, CVP Business Platforms
SQL PASS, October 2011
4. 15 out of 17 sectors in the US have more data stored per company than the US Library of Congress
140,000-190,000 more deep analytical talent positions and 1.5 million more data savvy managers in the US alone
50-60% increase in the number of Hadoop developers within organizations already using Hadoop within a year
€250 billion potential annual value to Europe’s public sector
$300 billion potential annual value to US healthcare
ECONOMIC CONTEXT AND EXEMPLAR
Special Report: The CEO’s Guide to Hadoop
Learn how large corporations are coping with the increasing flow of
unstructured data by using a free software program called Hadoop
http://www.businessweek.com/technology/special-reports/ceo-guide-to-hadoop.html
5. THE 4Vs OF BIG DATA: VOLUME, VELOCITY, VARIABILITY, AND VARIETY
Isotope is designed to enable solution building with all key dimensions in mind
Deep integration and coordination with existing Microsoft enterprise, cloud, and BI tools
6. Cassandra Hadoop BackType MR/GFS SimpleDB
Hive Oozie Hadoop Bigtable Dynamo
Scribe PigLatin Pig HBase Dremel EC2/EMR/S3
Hadoop … Cassandra … …
Internal [ Dryad | Cosmos] and External [ Isotope | Azure | Excel | BI | SQL DW | LTH ]
VIBRANT ECOSYSTEM IN ENTERPRISE AND CLOUD WITH MICROSOFT
Scalable machine learning and data mining [Mahout]
Statistical modeling and analysis [R]
Coordination and workflow [Oozie, Cascading]
Data integration and transformation [SQOOP, Flume]
Social network analytics and petascale graph learning [Pegasus]
Real-time stream analytics and business intelligence merged with petascale computation [HStreaming]
Scale-out caching and storage [Cassandra, HBase, Riak, Redis, Couchbase, S3]
Cloud-oriented data warehousing, pattern discovery, and transformation [Hive, Pig]
7. ENTER ISOTOPE
Isotope is the internal codename for Microsoft’s suite of products to support Hadoop in Windows and Azure
8. [Diagram. Un- and semi-structured sources: sensors, crawlers, devices, bots, apps. Structured sources: EIS, ERP, CRM, LOB. Platform: Hadoop, SQL Data Warehousing, SQL Analysis, SQL Reporting. Delivered to business users through interactive reports with Crescent, Excel with PowerPivot, and embedded BI apps.]
OUR DIFFERENTIATORS FOR CLOUD AND ENTERPRISE
Self-service business intelligence at any scale on premise or cloud
Complete integration of information assets from log files to collaboration artifacts to enterprise data stores
Familiar and integrated tools for analytics, insight, exploration, modeling, and strategic decision making
Transparent, federated identity and security management for all big data services
High availability data protection and recovery services for enterprises through cloud
Enterprise-grade support for all services, frameworks, and tools
9. HADOOP
[Azure and Enterprise]
Java OM Streaming OM HiveQL PigLatin .NET/C#/F# (T)SQL
OCEAN OF DATA
NOSQL [unstructured, semi-structured, structured] ETL
HDFS
A SEAMLESS OCEAN OF INFORMATION PROCESSING AND ANALYTICS
EIS / ERP RDBMS File System OData [RSS] Azure Storage
10. PROJECT ISOTOPE OFFERINGS
• Bi-directional connectors between Hadoop and SQL and PDW
• ODBC driver for Hadoop
• Hive plug-in for Excel
• Hosted elastic Hadoop service on Azure
• Microsoft’s Apache Hadoop-based solution for Windows Azure
• Microsoft’s Apache Hadoop-based solution for Windows Server
• JavaScript support for Hadoop, with web-based interactive environment
• Contributions back to the open source community via the Apache Foundation
11. HIVE PLUG-IN FOR EXCEL
• Connect Excel directly to Hive
• Browse Hive objects – tables, columns, etc.
• Construct and issue queries
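As a rough illustration of the three bullets above, the sketch below builds the kind of HiveQL text such a plug-in might send: `SHOW TABLES` and `DESCRIBE` for browsing, and a `SELECT` for querying. The helper names, the `weblogs` table, and its columns are hypothetical; the code only assembles strings and does not connect to Hive or reflect the plug-in's actual implementation.

```javascript
// Hypothetical helpers that only build HiveQL text; nothing here talks to Hive.

// Accept plain identifiers only, so the sketch never has to deal with quoting.
function ident(name) {
    if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(name)) {
        throw new Error("unsupported identifier: " + name);
    }
    return name;
}

// Browse Hive objects: list the tables in the current database.
function listTables() {
    return "SHOW TABLES";
}

// Browse Hive objects: list the columns of one table.
function describeTable(table) {
    return "DESCRIBE " + ident(table);
}

// Construct a query: select some columns with a row limit.
function selectColumns(table, columns, limit) {
    if (!Number.isInteger(limit) || limit <= 0) {
        throw new Error("limit must be a positive integer");
    }
    return "SELECT " + columns.map(ident).join(", ") +
        " FROM " + ident(table) +
        " LIMIT " + limit;
}

// "weblogs", "url" and "hits" are made-up names for illustration.
console.log(listTables());
console.log(describeTable("weblogs"));
console.log(selectColumns("weblogs", ["url", "hits"], 10));
```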
12. HOSTED ELASTIC HADOOP SERVICE ON AZURE
• Elastic MapReduce, Hive, PigLatin, .NET, JavaScript, and integration with BI, DW, and Office Collaboration tools
• Simple management UI
• Full Hadoop compatibility
• Native support for Azure Blob Storage from HDFS
28. MICROSOFT’S APACHE HADOOP-BASED SOLUTION FOR WINDOWS
• All standard Hadoop modules supported:
Hadoop | HDFS | Pig | Hive | Monitoring Pages
• One-click installer
• Simplified cluster configuration
• Integration with Microsoft ecosystem
System Center | Active Directory | etc.
29. // MapReduce function in JavaScript
// ------------------------------------------------------------------
// map: split each input line on non-letters and emit (word, 1).
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};
// reduce: sum the counts emitted for one word.
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
ISOTOPE.JS: OUR VB MOMENT FOR BIG DATA
• Write MapReduce jobs in JavaScript
• Interactive development environment
• Interactive data query and analytics of petascale datasets
• HIVE command line for interactive HIVE
• Charting and graphing for insight and analytics visualization
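To see how the word-count job on this slide behaves, it can be exercised locally with a small in-memory harness. The sketch below is only an approximation: `runJob` and `makeIterator` are hypothetical helpers that imitate the `context.write`, `hasNext`, and `next` calls the slide's functions rely on, and they are not part of Isotope or Hadoop. Passing the line index as the map key is also an assumption.

```javascript
// The slide's word-count functions, repeated so this sketch is self-contained.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical helper: wraps an array in the hasNext()/next() interface
// that reduce() expects.
function makeIterator(items) {
    var i = 0;
    return {
        hasNext: function () { return i < items.length; },
        next: function () { return items[i++]; }
    };
}

// Hypothetical helper: runs map, groups values by key, then runs reduce,
// all in memory on a single machine.
function runJob(lines) {
    // Map phase: collect every (key, value) pair emitted by map().
    var groups = new Map();
    var mapContext = {
        write: function (k, v) {
            if (!groups.has(k)) {
                groups.set(k, []);
            }
            groups.get(k).push(v);
        }
    };
    lines.forEach(function (line, index) {
        map(index, line, mapContext);
    });

    // Reduce phase: call reduce() once per distinct key.
    var results = new Map();
    var reduceContext = {
        write: function (k, v) { results.set(k, v); }
    };
    groups.forEach(function (values, k) {
        reduce(k, makeIterator(values), reduceContext);
    });
    return results;
}

var counts = runJob(["Hello world", "hello Hadoop"]);
console.log(counts.get("hello")); // prints 2
```

A `Map` is used for grouping rather than a plain object so that words such as "constructor" do not collide with built-in object properties.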
30. “We are excited to work with Microsoft to help make Apache
Hadoop a compelling platform for storing and processing data.
Hortonworks welcomes Microsoft to the Hadoop ecosystem
and looks forward to lending our deep domain expertise to
help accelerate the delivery of Microsoft’s Apache Hadoop-
based solution for Windows Server and service for Windows
Azure.”
Eric Baldeschwieler
CEO
GIVING BACK AND PARTICIPATING IN THE HADOOP COMMUNITY
Microsoft will be working with the community to contribute back significant code to the Apache Foundation
Microsoft has announced a partnership with Hortonworks to help accelerate our open source support
31. SUMMARY
Please visit HadoopOnAzure.com to start using Microsoft’s elastic services for Apache Hadoop
Please visit www.microsoft.com/bigdata to learn more about project codename “Isotope” and the broader ecosystem of
products and services Microsoft is delivering in 2012 and beyond
Editor's Notes
Key Message: Big Data is a real problem, and Hadoop’s star is rising. It is economically transformative in the way LAMP (Linux, Apache, MySQL, PHP/Python) was in the previous decade.
Reference numbers are from McKinsey Global Institute, "Big data: The next frontier for innovation, competition, and productivity" (http://www.mckinsey.com/mgi/publications/big_data/index.asp) and from http://www.karmasphere.com/images/documents/Karmasphere-HadoopDeveloperResearch.pdf
Hadoop is moving into mainstream consciousness now. Businessweek recently had a special report dedicated to Hadoop, with half a dozen articles: http://www.businessweek.com/technology/special-reports/ceo-guide-to-hadoop.html
http://nosql.mypopescu.com/post/9621746531/a-definition-of-big-data
KEY POINT: Hadoop is part of the solution. Hadoop is an AND, not an OR. But it requires a certain philosophy that MSFT has not historically embraced. A key benefit of Hadoop is the large, vibrant open source community around it. To succeed, Microsoft needs to not only acknowledge but thrive in this community.
BIG self service BI
• Billions+ of data items
• Unstructured, semi-structured, log data
• Real-time feeds
• New analysis types leveraging large server clusters
• Leverage the Hadoop ecosystem and ride its momentum
IW centric design
• Give business users direct access to the Big Data store
• Deliver IW-centric experiences optimized for unstructured and semi-structured queries
• Create, enrich, visualize and share big data sets through fun and immersive experiences
• Do it all in the tool they already use: Excel
• Increase the number of questions, reduce the cost of exploratory mining to zero
Leverage new class of analytics and visualization
• Enable new types of questions with new types of data and visualizations
• Leverage analysis of text, sentiment, clickstream, time windows, classification, clustering
• Visualize big data in impactful ways: tag clouds, graphs, timelines, tree maps, etc.
Natural extension of our BI platform
• Maintain a consistent semantic model, consistent expression language
• Provide an iterative, experimental, business-driven workflow from the desktop to the Big Data cluster
• Build on existing IW skills with the Microsoft BI platform (Excel, PowerPivot, Crescent)
Optimized for cloud
• Integrate with Azure DataMarket to connect to Bing and other public data sources
• Host big data sets on Azure, integrated with MyData
• Leverage Isotope to run analytics clusters
Isotope is the all-up effort around Microsoft and Hadoop. It includes several components:
• A full distribution of Apache Hadoop that runs on standard Windows hardware
• A full version of Apache Hadoop that runs on the Azure cloud
• Connectors from Hadoop (any Hadoop, not just Microsoft’s) to Microsoft’s key products: SQL, Excel, PDW, etc.
• JScript shell for live scripting of Hadoop from the browser
• Admin, monitoring, and authoring tools to make Microsoft Hadoop best-in-class