According to IDC, Windows Server runs on more than 50% of the servers in the enterprise data center. Hortonworks has worked closely with Microsoft to port Apache Hadoop to Windows, enabling organizations to take advantage of this emerging Big Data technology. Join us in this informative webinar to hear about the new Hortonworks Data Platform for Windows.
In less than an hour, you’ll learn:
- Key capabilities available in Hortonworks Data Platform for Windows
- How HDP for Windows integrates with Microsoft tools
- Key workloads and use cases driving Hadoop adoption today
For the visual thinkers out there, let's expand our model with some concrete examples.

Transactions: ERP, SCM, CRM, and transactional web applications are classic examples of systems processing transactions. The highly structured data in these systems is typically stored in SQL databases.

Interactions are about how people and things interact with each other or with your business. Web logs, user click streams, social interactions and feeds, and user-generated content are classic places to find interaction data.

Observational data tends to come from the "Internet of Things." Sensors for heat, motion, and pressure, along with RFID and GPS chips embedded in things such as mobile devices, ATMs, and even aircraft engines, are just some examples of "things" that output observation data.

Most folks would agree that video is "big" data. The analysis of what's happening in that video (i.e., what you, I, and others are doing in it) may not be "big," but it is valuable and it fits under our umbrella. Moreover, business data feeds and publicly available data sets are also big data, so we should not limit our thinking to just the data that flows through an organization. For example, the mortgage-related data you may already have could benefit from being blended with external data from a source such as Zillow. The government, through efforts like the Open Data Initiative, is making more and more data publicly available. One use case I find interesting is predictive policing, where state and local law enforcement apply analytics to crime databases and other publicly available data to help predict where and when pockets of crime might spring up. These proactive analytics efforts have yielded real reductions in crime.

Anyhow, this is what Big Data means to me; hopefully it makes sense to you.
…an amount that exceeds previous forecasts by 5 ZB and represents 50-fold growth since the beginning of 2010.
At its core, Hadoop is about HDFS and MapReduce, two projects that provide distributed storage and data processing, the underpinnings of Hadoop. In addition to core Hadoop, we must identify and include the requisite "platform services" that are central to any piece of enterprise software: high availability, disaster recovery, security, and so on, which enable use of the technology for a much broader (and mission-critical) problem set. This is accomplished not by introducing new open source projects, but by ensuring that these aspects are addressed within the existing projects.
- HDFS: Self-healing, distributed file system for multi-structured data; breaks files into blocks and stores them redundantly across the cluster
- MapReduce: Framework for running large data processing jobs in parallel across many nodes and combining the results (a minimal sketch follows below)
- YARN: New application management framework that enables Hadoop to go beyond MapReduce apps
- Enterprise-ready platform services: High availability, disaster recovery, snapshots, security, and more
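To make the MapReduce programming model concrete, here is a minimal sketch of the canonical word-count job written against the org.apache.hadoop.mapreduce API (assuming Hadoop 2.x; input and output paths are passed as arguments, and all names here are illustrative). The map phase runs in parallel across HDFS blocks and the reduce phase combines the results, exactly the division of labor described above.

```java
// Minimal MapReduce word-count sketch (illustrative; paths are placeholders).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel across HDFS blocks, emitting (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: the framework groups pairs by word; we sum the counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```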
In summary, by addressing these elements we can provide an Enterprise Hadoop distribution that includes the core services, platform services, data services, and operational services required by the enterprise user. All of this is done in 100% open source, and tested at scale by our team (together with our partner Yahoo!) to bring enterprise process to an open source approach. And finally, this is the distribution endorsed by the ecosystem to ensure interoperability in your environment.
Not only is all of this backed by the architects, developers and operators of Hadoop, it is also supported by a world-class support team. With backgrounds from IBM, Oracle, MySQL and more, the team delivers 24x7 support and very mature support processes to ensure high-quality customer service and responsiveness.
Additionally, we are a leading provider of Hadoop training through Hortonworks University, with courses for both development and operations. If required, we can also provide expert consulting services, either directly or through our System Integrator partners. And for anyone looking to get hands-on with Hadoop, we recently introduced the Sandbox program, which lets users download a full instance of HDP together with guided tutorials covering both development and administration topics.
At Hortonworks today, our focus is very clear: we develop, distribute and support a 100% open source distribution of enterprise Apache Hadoop.
- We employ the core architects, builders and operators of Apache Hadoop and drive the innovation in the open source community.
- We distribute the only 100% open source Enterprise Hadoop distribution: the Hortonworks Data Platform.
- Given our operational expertise running some of the largest Hadoop infrastructure in the world at Yahoo!, our team is uniquely positioned to support you.
- Our approach is also uniquely endorsed by some of the biggest vendors in the IT market:
- Yahoo! is an investor, a customer, and most importantly a development partner. We partner to develop Hadoop, and no distribution of HDP is released without first being tested on Yahoo!'s infrastructure, using the same regression suite they have relied on for years as they grew the largest production cluster in the world.
- Microsoft has partnered with Hortonworks to include HDP both in its off-premises offering on Azure and in its on-premises offering, under the product name HDInsight. This includes integration with Visual Studio for application development and with System Center for operational management of the infrastructure.
- Teradata includes HDP in its products in order to provide the broadest possible range of options for its customers.
Talking points: HDP on Windows, HDP Server on Windows, and HDInsight on Azure.
- For the Microsoft customer who wants to leverage familiar Windows tools such as System Center: work with Hadoop as you would on Linux, and bring your own scripts.
- Cover what they will get and when they will get it.
- Integration with Microsoft tooling; Microsoft customers get choice because the underlying infrastructure bits are the same.
- So get started today.
- A key driver is the ISV application that is vertical in nature and needs the choice to deploy on Windows today.
- Field positioning.
Beyond core and platform services, we must add a set of data services that enable the full data lifecycle: the capabilities to store, process and access data. For example: how do we maintain the consistent metadata required to determine how best to query data stored in HDFS? The answer is a project called Apache HCatalog. Or how do we access data stored in Hadoop from SQL-oriented tools? With projects such as Hive, the de facto standard for accessing data stored in HDFS. All of these are broadly captured under the category of "data services."
- Apache HCatalog: Metadata and table management. A metadata service that enables users to access Hadoop data as a set of tables without needing to be concerned with where or how their data is stored. It enables consistent data sharing and interoperability across data processing tools such as Pig, MapReduce and Hive, as well as deep interoperability and data access with systems such as Teradata and SQL Server.
- Apache Hive: SQL interface for Hadoop. The de facto SQL-like interface for Hadoop, enabling data summarization, ad hoc query, and analysis of large datasets. Connects to Excel, MicroStrategy, PowerPivot, Tableau and other leading BI tools via the Hortonworks Hive ODBC Driver. Hive currently serves batch and non-interactive use cases; in 2013, Hortonworks is working with the Hive community to extend it to interactive query. Cloudera, on the other hand, has chosen to set Hive aside in favor of Cloudera Impala, a Cloudera-controlled technology aimed at the analytics market and focused solely on non-operational, interactive query use cases.
- Apache HBase: NoSQL database for interactive apps. A non-relational, columnar database that gives developers a way to create, read, update, and delete data in Hadoop with the performance interactive applications need. Commonly used to serve "intelligent applications" that predict user behavior, detect shifting usage patterns, or recommend ways for users to engage.
- WebHDFS: Web service interface for HDFS. A scalable REST API that enables easy access to HDFS: move files in and out, delete from HDFS, and perform file and directory functions, leveraging the parallelism of the cluster. Addressed as webhdfs://<HOST>:<HTTP PORT>/PATH. Included in versions 1.0 and 2.0 of Hadoop; created and driven by Hortonworkers. A minimal REST sketch follows this list.
- Talend Open Studio for Big Data: An open source ETL tool available as an optional download with HDP. Intuitive graphical data integration tools for HDFS, Hive, HBase, HCatalog and Pig; Oozie scheduling to manage and stage jobs; connectors for any database, business application or system; and integrated HCatalog storage.
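As an illustration of the WebHDFS REST API mentioned above, here is a minimal sketch that lists a directory over plain HTTP. The host name, port, and path are placeholders (50070 was the default NameNode HTTP port at the time), and op=LISTSTATUS is one of the documented WebHDFS operations; treat the details as an assumption to verify against your cluster's configuration.

```java
// Minimal WebHDFS sketch: list an HDFS directory over the REST API.
// Host, port, and path are placeholders; op=LISTSTATUS returns a JSON
// FileStatuses listing, which we print raw here.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsList {
  public static void main(String[] args) throws Exception {
    URL url = new URL(
        "http://namenode.example.com:50070/webhdfs/v1/user/demo?op=LISTSTATUS");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");

    // Read and print the JSON response body.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
    conn.disconnect();
  }
}
```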
Any data management platform operated at reasonable scale requires a management technology: SQL Server Management Studio for SQL Server, Oracle Enterprise Manager for Oracle Database, and so on. Hadoop is no exception, and for Hadoop that means Apache Ambari, which is increasingly recognized as foundational to the operation of Hadoop infrastructures. It allows users to provision, manage and monitor a cluster and provides a set of tools to visualize and diagnose operational issues. There are other projects in this category (such as Oozie), but Ambari is the most influential.
Apache Ambari: Management and monitoring
- Makes Hadoop clusters easy to operate
- Simplified cluster provisioning with a step-by-step install wizard
- Pre-configured operational metrics for insight into the health of Hadoop services
- Visualization of job and task execution for visibility into performance issues
- Complete RESTful API for integrating with existing operational tools (see the sketch after this list)
- Intuitive user interface that makes controlling a cluster easy and productive
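To show what that RESTful API looks like in practice, here is a minimal sketch that fetches a cluster's service list from Ambari. The host, cluster name, and credentials are placeholders (8080 is Ambari's default port and /api/v1 its documented API root); this is an illustrative sketch, not a definitive integration recipe.

```java
// Minimal Ambari REST sketch: fetch the service list for a cluster.
// Host, port, cluster name, and credentials are placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.bind.DatatypeConverter;

public class AmbariServices {
  public static void main(String[] args) throws Exception {
    URL url = new URL(
        "http://ambari.example.com:8080/api/v1/clusters/MyCluster/services");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();

    // Ambari uses HTTP Basic authentication; admin/admin is its
    // out-of-the-box default and should be changed in production.
    String creds = DatatypeConverter.printBase64Binary(
        "admin:admin".getBytes("UTF-8"));
    conn.setRequestProperty("Authorization", "Basic " + creds);

    // Print the JSON description of the cluster's services.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
    conn.disconnect();
  }
}
```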
That ecosystem endorsement is why we focus on HDP interoperability across all of these categories:
- Data systems: HDP is endorsed by and embedded with SQL Server, Teradata and more.
- BI tools: HDP is certified for use with the packaged applications you already use, from Microsoft to Tableau, MicroStrategy, Business Objects and more.
- Development tools: For .NET developers, Visual Studio, used to build more than half the custom applications in the world, certifies with HDP so Microsoft app developers can build custom apps with Hadoop. For Java developers, Spring for Apache Hadoop enables quick and easy development of Hadoop-based applications with HDP. A sketch of this style of programmatic access follows this list.
- Operational tools: Integration with System Center and with Teradata Viewpoint.
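To illustrate the kind of SQL-oriented access these BI and development tools build on, here is a minimal sketch that queries HDP through the Hive JDBC driver (HiveServer2). The host, table, and credentials are placeholders and the query is invented for illustration; 10000 is HiveServer2's default port.

```java
// Minimal Hive JDBC sketch: run a SQL-like query against HDP.
// Host, database, table, and credentials are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcQuery {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://hiveserver.example.com:10000/default", "hive", "");
    Statement stmt = conn.createStatement();

    // Hive compiles this SQL-like query into MapReduce jobs on the cluster.
    ResultSet rs = stmt.executeQuery(
        "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page");
    while (rs.next()) {
      System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}
```

BI tools such as Excel or Tableau go through the equivalent ODBC path (the Hortonworks Hive ODBC Driver mentioned earlier) rather than writing this code by hand, but the underlying access pattern is the same.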