SlideShare a Scribd company logo
1 of 29
Building Web Analytics on Hadoop at CBS Interactive Michael Sun [email_address] Hadoop World November 8 2011
$2  What is fog computing? Deep Thoughts $1  What is cloud computing? Convenient, on-demand, scalable, network accessible service of the shared pool of computing resources Vaporware $5  What is vapor computing? Local cloud computing
About Me ,[object Object],[object Object],[object Object],[object Object],[object Object]
Brands and Websites of CBS interactive, Samples GAMES & MOVIES  TECH, BIZ & NEWS  SPORTS ENTERTAINMENT  MUSIC
CBSi Scale ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],1 - Source: comScore, March 2011
Web Analytics Processing ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],1 - Source: comScore, March 2011
Modernize the platform ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Hadoop to the Rescue / Research ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
The Plan ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Web Analytics Hadoop External data sources HDFS Python-ETL MapReduce Hive DW Database Sites Apache Logs Distribute log by Fido Web metrics Billers Data mining CMS Systems
The Python ETL Framework Lumberjack on Hadoop ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
The Python ETL Framework Lumberjack on Hadoop ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
The Python ETL Framework Lumberjack on Hadoop, Example Python schema schema = Schema(( SchemaField(u'anon_cookie', 'unicode', False, default=u'-', maxlen=19, io_encoding='utf-8', encoding_errors='replace', cache_size=0, on_null='strict', on_type_error='strict', on_range_error='strict'), SchemaField(u'client_ip_addr', 'int', False, default=0, signed=False, bits=32, precision=None, cache_size=1, on_null='strict', on_type_error='strict', on_range_error='strict'), SchemaField(u'session_id', 'int', False, default=-1, signed=True, bits=64, precision=None, cache_size=1, on_null='strict', on_type_error='strict', on_range_error='strict'), SchemaField(u'session_start_dt_ht', 'datetime', False, default='1970-01-01 08:00:00', timezone='PRC', io_format='%Y-%m-%d %H:%M:%S', cache_size=-1, on_null='strict', on_type_error='strict', on_range_error='strict'), SchemaField(u'session_start_dt_ut', 'datetime', False, default='1970-01-01 00:00:00', timezone='UTC', io_format='%Y-%m-%d %H:%M:%S', cache_size=-1, on_null='strict', on_type_error='strict', on_range_error='strict'), ))
The Python ETL Framework Lumberjack on Hadoop, Example Pipeline pl = etl.Pipeline( etl.file.DelimitedInput(stdin, input_schema, drop=('empty1', 'empty2', 'bytes_sent'), errors_policy=etl.file.DelimitedInputSkipAndCountErrors( etl.counter.HadoopStreamingCounter('Dropped Records',  group='DelimitedInput'))) | etl.transform.StringCleaner() | dw_py_mod_utils.IPFilter('ip_address', 'CNET_Exclude_Addresses.txt', None, None) | cnclear_parser.CnclearParser() | etl.transform.StringCleaner(fields=('xref', 'src_url', 'title')) | parse_url.ParseURL('src_url') | event_url_title_cleaner.CNClearURLTitleCleaner() | dw_py_mod_utils.Sha1Hash(['user_agent', 'skc_url', 'title']) | parse_ua.ParseUA('user_agent') | etl.transform.ProjectToOutputSchema(output_schema) | etl.file.DelimitedOutput(stdout) ) pl.setup() pl.run()
The Python ETL Framework Lumberjack on Hadoop ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Web log Processing by Hadoop Streaming and Python-ETL ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Sessionize on Hadoop in Detail ,[object Object],[object Object],[object Object]
Input to Sessionize Take output data from parsed data of type: page impression, click-payable, click-nonpayable, video tracking, optimization event types
Prep-sessionize (Mapper) ,[object Object],[object Object],[object Object],[object Object]
Sorting before Sessionize ,[object Object],[object Object],[object Object],[object Object],[object Object]
Sessionize (Reducer) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Output (Mapper) from Sessionize ,[object Object],[object Object],[object Object]
Web Analytics Hadoop External data sources HDFS Python-ETL MapReduce Hive DW Database Sites Apache Logs Distribute log by Fido Web metrics Billers Data mining CMS Systems
Benefits to Ops ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Conclusions I – Create Tool Appropriate to the Job if it doesn’t have what you want ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Conclusions II – Power and Flexibility for Processing Big Data ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
The Team (alphabetical order) Batu Ulug Dan Lescohier Jim Haas, presenting “Hadoop in Mission-critical Environment” Michael Sun Richard Zhang Ron Mahoney Slawomir Krysiak
Questions? [email_address] Follow up Lumberjack [email_address]
Abstract CBS Interactive successfully adopted Hadoop as the web analytics platform, processing one Billion weblogs daily from hundreds of web site properties that CBS Interactive oversees. After introducing Lumberjack—the Extraction, Transformation and Loading framework we built based on python and streaming, which is under review for Open-Source release—Michael will talk about web metrics processing on Hadoop, focusing on weblog harvesting, parsing, dimension look-up, sessionization, and loading into a database. Since migrating processing from a proprietary platform to Hadoop, CBS Interactive achieved robustness, fault-tolerance and scalability, and significant reduction of processing time to reach SLA (over six hours reduction so far).

More Related Content

What's hot

Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Julian Hyde
 
Spark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSpark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSigmoid
 
U-SQL Killer Scenarios: Taming the Data Science Monster with U-SQL and Big Co...
U-SQL Killer Scenarios: Taming the Data Science Monster with U-SQL and Big Co...U-SQL Killer Scenarios: Taming the Data Science Monster with U-SQL and Big Co...
U-SQL Killer Scenarios: Taming the Data Science Monster with U-SQL and Big Co...Michael Rys
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / OptiqJulian Hyde
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveJulian Hyde
 
U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...
U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...
U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...Michael Rys
 
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...Michael Rys
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Julian Hyde
 
U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)Michael Rys
 
U-SQL Reading & Writing Files (SQLBits 2016)
U-SQL Reading & Writing Files (SQLBits 2016)U-SQL Reading & Writing Files (SQLBits 2016)
U-SQL Reading & Writing Files (SQLBits 2016)Michael Rys
 
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Michael Rys
 
SparkSQL and Dataframe
SparkSQL and DataframeSparkSQL and Dataframe
SparkSQL and DataframeNamgee Lee
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineChester Chen
 
Tactical data engineering
Tactical data engineeringTactical data engineering
Tactical data engineeringJulian Hyde
 
pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data AnalysisAndrew Henshaw
 
The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)
The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)
The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)Michael Rys
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
 
U-SQL User-Defined Operators (UDOs) (SQLBits 2016)
U-SQL User-Defined Operators (UDOs) (SQLBits 2016)U-SQL User-Defined Operators (UDOs) (SQLBits 2016)
U-SQL User-Defined Operators (UDOs) (SQLBits 2016)Michael Rys
 
Hadoop - Stock Analysis
Hadoop - Stock AnalysisHadoop - Stock Analysis
Hadoop - Stock AnalysisVaibhav Jain
 

What's hot (20)

Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
 
Spark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSpark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. Jyotiska
 
U-SQL Killer Scenarios: Taming the Data Science Monster with U-SQL and Big Co...
U-SQL Killer Scenarios: Taming the Data Science Monster with U-SQL and Big Co...U-SQL Killer Scenarios: Taming the Data Science Monster with U-SQL and Big Co...
U-SQL Killer Scenarios: Taming the Data Science Monster with U-SQL and Big Co...
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / Optiq
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...
U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...
U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...
 
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)
 
U-SQL Reading & Writing Files (SQLBits 2016)
U-SQL Reading & Writing Files (SQLBits 2016)U-SQL Reading & Writing Files (SQLBits 2016)
U-SQL Reading & Writing Files (SQLBits 2016)
 
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
 
SparkSQL and Dataframe
SparkSQL and DataframeSparkSQL and Dataframe
SparkSQL and Dataframe
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
 
Tactical data engineering
Tactical data engineeringTactical data engineering
Tactical data engineering
 
pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data Analysis
 
The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)
The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)
The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
 
AfterGlow
AfterGlowAfterGlow
AfterGlow
 
U-SQL User-Defined Operators (UDOs) (SQLBits 2016)
U-SQL User-Defined Operators (UDOs) (SQLBits 2016)U-SQL User-Defined Operators (UDOs) (SQLBits 2016)
U-SQL User-Defined Operators (UDOs) (SQLBits 2016)
 
Hadoop - Stock Analysis
Hadoop - Stock AnalysisHadoop - Stock Analysis
Hadoop - Stock Analysis
 

Viewers also liked

Mongo db and hadoop driving business insights - final
Mongo db and hadoop   driving business insights - finalMongo db and hadoop   driving business insights - final
Mongo db and hadoop driving business insights - finalMongoDB
 
MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector
MongoDB Days Silicon Valley: MongoDB and the Hadoop ConnectorMongoDB Days Silicon Valley: MongoDB and the Hadoop Connector
MongoDB Days Silicon Valley: MongoDB and the Hadoop ConnectorMongoDB
 
Pentaho Analytics for MongoDB - presentation from MongoDB World 2014
Pentaho Analytics for MongoDB - presentation from MongoDB World 2014Pentaho Analytics for MongoDB - presentation from MongoDB World 2014
Pentaho Analytics for MongoDB - presentation from MongoDB World 2014Pentaho
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelTakahiro Inoue
 
Dealing with Changed Data in Hadoop
Dealing with Changed Data in HadoopDealing with Changed Data in Hadoop
Dealing with Changed Data in HadoopDataWorks Summit
 
Use cases for Hadoop and Big Data Analytics - InfoSphere BigInsights
Use cases for Hadoop and Big Data Analytics - InfoSphere BigInsightsUse cases for Hadoop and Big Data Analytics - InfoSphere BigInsights
Use cases for Hadoop and Big Data Analytics - InfoSphere BigInsightsGord Sissons
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Webinar: MongoDB + Hadoop
Webinar: MongoDB + HadoopWebinar: MongoDB + Hadoop
Webinar: MongoDB + HadoopMongoDB
 
Upgrading the Curriculum
Upgrading the CurriculumUpgrading the Curriculum
Upgrading the CurriculumJanet Hale
 
DieHarder (CCS 2010, WOOT 2011)
DieHarder (CCS 2010, WOOT 2011)DieHarder (CCS 2010, WOOT 2011)
DieHarder (CCS 2010, WOOT 2011)Emery Berger
 
Presentacion de economica politica
Presentacion de economica politicaPresentacion de economica politica
Presentacion de economica politicaabelardoac
 
Presentacion de economica politica
Presentacion de economica politicaPresentacion de economica politica
Presentacion de economica politicaabelardoac
 
VMware title
VMware titleVMware title
VMware titletlevers
 
Las 27-maneras-en-que-la-mente-distorciona-la-realidad
Las 27-maneras-en-que-la-mente-distorciona-la-realidadLas 27-maneras-en-que-la-mente-distorciona-la-realidad
Las 27-maneras-en-que-la-mente-distorciona-la-realidadsteelman182
 
How to Create Creative Commons Licensing Buttons for Your Website
How to Create Creative Commons Licensing Buttons for Your Website  How to Create Creative Commons Licensing Buttons for Your Website
How to Create Creative Commons Licensing Buttons for Your Website NET:101
 
Xstrata Article
Xstrata ArticleXstrata Article
Xstrata ArticleVicki Shaw
 
Week 2: Setting up your Account
Week 2: Setting up your AccountWeek 2: Setting up your Account
Week 2: Setting up your AccountEdel14201341
 

Viewers also liked (20)

Mongo db and hadoop driving business insights - final
Mongo db and hadoop   driving business insights - finalMongo db and hadoop   driving business insights - final
Mongo db and hadoop driving business insights - final
 
MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector
MongoDB Days Silicon Valley: MongoDB and the Hadoop ConnectorMongoDB Days Silicon Valley: MongoDB and the Hadoop Connector
MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector
 
Hadoop to spark-v2
Hadoop to spark-v2Hadoop to spark-v2
Hadoop to spark-v2
 
Pentaho Analytics for MongoDB - presentation from MongoDB World 2014
Pentaho Analytics for MongoDB - presentation from MongoDB World 2014Pentaho Analytics for MongoDB - presentation from MongoDB World 2014
Pentaho Analytics for MongoDB - presentation from MongoDB World 2014
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
 
Dealing with Changed Data in Hadoop
Dealing with Changed Data in HadoopDealing with Changed Data in Hadoop
Dealing with Changed Data in Hadoop
 
Use cases for Hadoop and Big Data Analytics - InfoSphere BigInsights
Use cases for Hadoop and Big Data Analytics - InfoSphere BigInsightsUse cases for Hadoop and Big Data Analytics - InfoSphere BigInsights
Use cases for Hadoop and Big Data Analytics - InfoSphere BigInsights
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Webinar: MongoDB + Hadoop
Webinar: MongoDB + HadoopWebinar: MongoDB + Hadoop
Webinar: MongoDB + Hadoop
 
Upgrading the Curriculum
Upgrading the CurriculumUpgrading the Curriculum
Upgrading the Curriculum
 
DieHarder (CCS 2010, WOOT 2011)
DieHarder (CCS 2010, WOOT 2011)DieHarder (CCS 2010, WOOT 2011)
DieHarder (CCS 2010, WOOT 2011)
 
Presentacion de economica politica
Presentacion de economica politicaPresentacion de economica politica
Presentacion de economica politica
 
Presentacion de economica politica
Presentacion de economica politicaPresentacion de economica politica
Presentacion de economica politica
 
VMware title
VMware titleVMware title
VMware title
 
Las 27-maneras-en-que-la-mente-distorciona-la-realidad
Las 27-maneras-en-que-la-mente-distorciona-la-realidadLas 27-maneras-en-que-la-mente-distorciona-la-realidad
Las 27-maneras-en-que-la-mente-distorciona-la-realidad
 
How to Create Creative Commons Licensing Buttons for Your Website
How to Create Creative Commons Licensing Buttons for Your Website  How to Create Creative Commons Licensing Buttons for Your Website
How to Create Creative Commons Licensing Buttons for Your Website
 
Xstrata Article
Xstrata ArticleXstrata Article
Xstrata Article
 
Week 2: Setting up your Account
Week 2: Setting up your AccountWeek 2: Setting up your Account
Week 2: Setting up your Account
 

Similar to Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interactive - Michael Sun - CBSi

Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceHortonworks
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesDataWorks Summit
 
Big dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlBig dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlKhanderao Kand
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 
Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010Yahoo Developer Network
 
Testing Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopTesting Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopRTTS
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchHortonworks
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopJosh Patterson
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015Christopher Curtin
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big dealeduarderwee
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Imam Raza
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 

Similar to Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interactive - Michael Sun - CBSi (20)

Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers Conference
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! Perspectives
 
Big dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlBig dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosql
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010
 
OOP 2014
OOP 2014OOP 2014
OOP 2014
 
Testing Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopTesting Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of Hadoop
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big deal
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 

Recently uploaded (20)

Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 

Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interactive - Michael Sun - CBSi

  • 1. Building Web Analytics on Hadoop at CBS Interactive Michael Sun [email_address] Hadoop World November 8 2011
  • 2. $2 What is fog computing? Deep Thoughts $1 What is cloud computing? Convenient, on-demand, scalable, network accessible service of the shared pool of computing resources Vaporware $5 What is vapor computing? Local cloud computing
  • 3.
  • 4. Brands and Websites of CBS interactive, Samples GAMES & MOVIES TECH, BIZ & NEWS SPORTS ENTERTAINMENT MUSIC
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10. Web Analytics Hadoop External data sources HDFS Python-ETL MapReduce Hive DW Database Sites Apache Logs Distribute log by Fido Web metrics Billers Data mining CMS Systems
  • 11.
  • 12.
  • 13. The Python ETL Framework Lumberjack on Hadoop, Example Python schema schema = Schema(( SchemaField(u'anon_cookie', 'unicode', False, default=u'-', maxlen=19, io_encoding='utf-8', encoding_errors='replace', cache_size=0, on_null='strict', on_type_error='strict', on_range_error='strict'), SchemaField(u'client_ip_addr', 'int', False, default=0, signed=False, bits=32, precision=None, cache_size=1, on_null='strict', on_type_error='strict', on_range_error='strict'), SchemaField(u'session_id', 'int', False, default=-1, signed=True, bits=64, precision=None, cache_size=1, on_null='strict', on_type_error='strict', on_range_error='strict'), SchemaField(u'session_start_dt_ht', 'datetime', False, default='1970-01-01 08:00:00', timezone='PRC', io_format='%Y-%m-%d %H:%M:%S', cache_size=-1, on_null='strict', on_type_error='strict', on_range_error='strict'), SchemaField(u'session_start_dt_ut', 'datetime', False, default='1970-01-01 00:00:00', timezone='UTC', io_format='%Y-%m-%d %H:%M:%S', cache_size=-1, on_null='strict', on_type_error='strict', on_range_error='strict'), ))
  • 14. The Python ETL Framework Lumberjack on Hadoop, Example Pipeline pl = etl.Pipeline( etl.file.DelimitedInput(stdin, input_schema, drop=('empty1', 'empty2', 'bytes_sent'), errors_policy=etl.file.DelimitedInputSkipAndCountErrors( etl.counter.HadoopStreamingCounter('Dropped Records', group='DelimitedInput'))) | etl.transform.StringCleaner() | dw_py_mod_utils.IPFilter('ip_address', 'CNET_Exclude_Addresses.txt', None, None) | cnclear_parser.CnclearParser() | etl.transform.StringCleaner(fields=('xref', 'src_url', 'title')) | parse_url.ParseURL('src_url') | event_url_title_cleaner.CNClearURLTitleCleaner() | dw_py_mod_utils.Sha1Hash(['user_agent', 'skc_url', 'title']) | parse_ua.ParseUA('user_agent') | etl.transform.ProjectToOutputSchema(output_schema) | etl.file.DelimitedOutput(stdout) ) pl.setup() pl.run()
  • 15.
  • 16.
  • 17.
  • 18. Input to Sessionize Take output data from parsed data of type: page impression, click-payable, click-nonpayable, video tracking, optimization event types
  • 19.
  • 20.
  • 21.
  • 22.
  • 23. Web Analytics Hadoop External data sources HDFS Python-ETL MapReduce Hive DW Database Sites Apache Logs Distribute log by Fido Web metrics Billers Data mining CMS Systems
  • 24.
  • 25.
  • 26.
  • 27. The Team (alphabetical order) Batu Ulug Dan Lescohier Jim Haas, presenting “Hadoop in Mission-critical Environment” Michael Sun Richard Zhang Ron Mahoney Slawomir Krysiak
  • 28. Questions? [email_address] Follow up Lumberjack [email_address]
  • 29. Abstract CBS Interactive successfully adopted Hadoop as the web analytics platform, processing one Billion weblogs daily from hundreds of web site properties that CBS Interactive oversees. After introducing Lumberjack—the Extraction, Transformation and Loading framework we built based on python and streaming, which is under review for Open-Source release—Michael will talk about web metrics processing on Hadoop, focusing on weblog harvesting, parsing, dimension look-up, sessionization, and loading into a database. Since migrating processing from a proprietary platform to Hadoop, CBS Interactive achieved robustness, fault-tolerance and scalability, and significant reduction of processing time to reach SLA (over six hours reduction so far).

Editor's Notes

  1. CBSi has a number of brands, this slide shows the biggest ones.
  2. We have a lot of traffic and data. We’ve been using Hadoop quite extensively for a few years now.