SlideShare ist ein Scribd-Unternehmen logo
1 von 80
Downloaden Sie, um offline zu lesen
Cloud Computing Era
Trend Micro
Three Major Trends to Chang the World



                 Cloud Computing




           Big Data           Mobile
什麼是雲端運算?
美國國家標準技術研究所 (NIST)的定義:


                                     Essential
                                     Characteristics




                                     Service Models




                                     Deployment Models




以服務(as-a-service)的商業模式,透過Internet技術,提供具有擴充
性(scalable)和彈性(elastic)的IT相關功能給使用者
It’s About the Ecosystem

                        Structured, Semi-structured

                                                      Cloud Computing

                                                            SaaS
   Enterprise Data Warehouse
                                                            PaaS

                                                            IaaS




                                                                                 Generate




                                                                             Big Data

                                                                                  Lead




                                                                          Business Insights

                                                                                  create




                                                                        Competition, Innovation,
                                                                             Productivity
What is Big Data?
What is the problem

• Getting the data to the processors
  becomes the bottleneck


• Quick calculation
   – Typical disk data transfer rate:
      • 75MB/sec
   – Time taken to transfer 100GB of data
     to the processor:
      • approx. 22   minutes!
The Era of Big Data – Are You Ready




        Data for business commercial analysis
        • 2011: multi-terabyte (TB)
        • 2020: 35.2 ZB (1 ZB = 1 billion TB)
Who Needs It?
           Enterprise Database                          Hadoop




When to use?                          When to use?
•   Ad-hoc Reporting (<1sec)          •   Affordable Storage/Compute
•   Multi-step Transactions           •   Unstructured or Semi-structured
•   Lots of Inserts/Updates/Deletes   •   Resilient Auto Scalability
Hadoop!
– inspired by
• Apache Hadoop project
  – inspired by Google's MapReduce and Google File System
    papers.
• Open sourced, flexible and available architecture for
  large scale computation and data processing on a
  network of commodity hardware
• Open Source Software + Hardware Commodity
  – IT Costs Reduction
Hadoop Core




              MapReduce
              HDFS




              © 2011 Cloudera, Inc. All Rights Reserved.
HDFS

• Hadoop Distributed File System
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Write Once, Read Many Times
• Java API
• Command Line Tool



                   © 2011 Cloudera, Inc. All Rights Reserved.
MapReduce

 • Two Phases of Functional Programming
 • Redundancy
 • Fault Tolerant
 • Scalable
 • Self Healing
 • Java API




13
                    © 2011 Cloudera, Inc. All Rights Reserved.
Hadoop Core

                                                            Java
Java

               MapReduce
               HDFS
                                                             Java
      Java

 14
               © 2011 Cloudera, Inc. All Rights Reserved.
Word Count Example




       Key: offset
       Value: line

                            Key: word      Key: word
                            Value: count   Value: sum of count



0:The cat sat on the mat
22:The aardvark sat on the sofa
The Hadoop Ecosystems
The Ecosystem is the System

• Hadoop has become the kernel of the distributed
  operating system for Big Data
• No one uses the kernel alone
• A collection of projects at Apache
Relation Map

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
Zookeeper – Coordination Framework

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
What is ZooKeeper

• A centralized service for maintaining
  – Configuration information
  – Providing distributed synchronization
• A set of tools to build distributed applications that can
  safely handle partial failures
• ZooKeeper was designed to store coordination data
  – Status information
  – Configuration
  – Location information
Flume / Sqoop – Data Integration Framework

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
What’s the problem for data collection

• Data collection is currently a priori and ad hoc
• A priori – decide what you want to collect ahead of time
• Ad hoc – each kind of data source goes through its own
  collection path
(and how can it help?)



• A distributed data collection service
• It efficiently collecting, aggregating, and moving large
  amounts of data
• Fault tolerant, many failover and recovery mechanism
• One-stop solution for data collection of all formats
Flume Architecture


Log                                                        Log
                                         ...


Flume Node                                                 Flume Node


HDFS


              © 2011 Cloudera, Inc. All Rights Reserved.
Flume Sources and Sinks

• Local Files
• HDFS
• Stdin, Stdout
• Twitter
• IRC
• IMAP




                  © 2011 Cloudera, Inc. All Rights Reserved.
Sqoop

• Easy, parallel database import/export
• What you want do?
  – Insert data from RDBMS to HDFS
  – Export data from HDFS back into RDBMS
Sqoop


        HDFS


        Sqoop


        RDBMS


28
        © 2011 Cloudera, Inc. All Rights Reserved.
Sqoop Examples

 $ sqoop import --connect jdbc:mysql://localhost/world --
 username root --table City
 ...

 $ hadoop fs -cat City/part-m-00000
 1,Kabul,AFG,Kabol,17800002,Qandahar,AFG,Qandahar,2375003,He
 rat,AFG,Herat,1868004,Mazar-e-
 Sharif,AFG,Balkh,1278005,Amsterdam,NLD,Noord-Holland,731200
 ...




29
                    © 2011 Cloudera, Inc. All Rights Reserved.
Pig / Hive – Analytical Language

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
Why Hive and Pig?

• Although MapReduce is very powerful, it can also be
  complex to master
• Many organizations have business or data analysts who
  are skilled at writing SQL queries, but not at writing Java
  code
• Many organizations have programmers who are skilled
  at writing code in scripting languages
• Hive and Pig are two projects which evolved separately
  to help such people analyze huge amounts of data via
  MapReduce
  – Hive was initially developed at Facebook, Pig at Yahoo!
Hive     – Developed by

• What is Hive?
  – An SQL-like interface to Hadoop
• Data Warehouse infrastructure that provides data
  summarization and ad hoc querying on top of Hadoop
  – MapRuduce for execution
  – HDFS for storage
• Hive Query Language
  – Basic-SQL : Select, From, Join, Group-By
  – Equi-Join, Muti-Table Insert, Multi-Group-By
  – Batch query
 SELECT * FROM purchases WHERE price > 100 GROUP BY storeid
Hive


       SQL


       Hive


       MapReduce


33
       © 2011 Cloudera, Inc. All Rights Reserved.
Pig                 – Initiated by




• A high-level scripting language (Pig Latin)
• Process data one step at a time
• Simple to write MapReduce program
• Easy understand
• Easy debug         A = load ‘a.txt’ as (id, name, age, ...)
                     B = load ‘b.txt’ as (id, address, ...)
                     C = JOIN A BY id, B BY id;STORE C into ‘c.txt’
Pig


      Script


      Pig


      MapReduce


       © 2011 Cloudera, Inc. All Rights Reserved.
Hive vs. Pig

                  Hive                   Pig
Language          HiveQL (SQL-like)      Pig Latin, a scripting language
Schema            Table definitions      A schema is optionally defined
                  that are stored in a   at runtime
                  metastore
Programmait Access JDBC, ODBC            PigServer
WordCount Example

• Input
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop
• For the given sample input the map emits
  < Hello, 1>
  < World, 1>
  < Bye, 1>
  < World, 1>
  < Hello, 1>
  < Hadoop, 1>
  < Goodbye, 1>
  < Hadoop, 1>

• the reduce just sums up the values
   < Bye, 1>
  < Goodbye, 1>
  < Hadoop, 2>
  < Hello, 2>
  < World, 2>
WordCount Example In MapReduce
public class WordCount {
 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
         word.set(tokenizer.nextToken());
         context.write(word, one);
      }
    }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
         sum += val.get();
      }
      context.write(key, new IntWritable(sum));
  }
}

public static void main(String[] args) throws Exception {
 Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
}
WordCount Example By Pig


A = LOAD 'wordcount/input' USING PigStorage as (token:chararray);

B = GROUP A BY token;

C = FOREACH B GENERATE group, COUNT(A) as count;

DUMP C;
WordCount Example By Hive

CREATE TABLE wordcount (token STRING);

LOAD DATA LOCAL INPATH ’wordcount/input'
OVERWRITE INTO TABLE wordcount;


SELECT count(*) FROM wordcount GROUP BY token;
The Story So Far


    SQL         Hive                               Pig        Script

    Java        MapReduce
    Java        HDFS
                Sqoop Flume
    SQL         RDBMS FS                                      Posix


4
1                © 2011 Cloudera, Inc. All Rights Reserved.
Hbase – Column NoSQL DB

                                               Hue                                   Mahout
                                           (Web Console)                         (Data Mining)


                                                                 Oozie
                                                       (Job Workflow & Scheduling)
   (Coordination)
                    Zookeeper




                                          Sqoop/Flume
                                                                         Pig/Hive (Analytical Language)
                                         (Data integration)



                                MapReduce Runtime
                                (Dist. Programming Framework)                         Hbase
                                                                               (Column NoSQL DB)


                                             Hadoop Distributed File System (HDFS)
Structured-data vs Raw-data
I – Inspired by

• Coordinated by Zookeeper
• Low Latency
• Random Reads And Writes
• Distributed Key/Value Store
• Simple API
  –   PUT
  –   GET
  –   DELETE
  –   SCAN
Hbase – Data Model

• Cells are “versioned”
• Table rows are sorted by row key
• Region – a row range [start-key:end-key]
HBase Examples

hbase>   create 'mytable', 'mycf‘
hbase>   list
hbase>   put 'mytable', 'row1', 'mycf:col1', 'val1‘
hbase>   put 'mytable', 'row1', 'mycf:col2', 'val2‘
hbase>   put 'mytable', 'row2', 'mycf:col1', 'val3‘
hbase>   scan 'mytable‘
hbase>   disable 'mytable‘
hbase>   drop 'mytable'




                     © 2011 Cloudera, Inc. All Rights Reserved.
Oozie – Job Workflow & Scheduling

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
What is                       ?

• A Java Web Application
• Oozie is a workflow scheduler for Hadoop
• Crond for Hadoop
• Triggered            Job 1 Job 2
  – Time
  – Data
                            Job 3

                       Job 4 Job 5
Oozie Features

• Component Independent
  –   MapReduce
  –   Hive
  –   Pig
  –   SqoopStreaming




                       © 2011 Cloudera, Inc. All Rights Reserved.
Mahout – Data Mining

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
What is

• Machine-learning tool
• Distributed and scalable machine learning algorithms on
  the Hadoop platform
• Building intelligent applications easier and faster
Mahout Use Cases

• Yahoo: Spam Detection
• Foursquare: Recommendations
• SpeedDate.com: Recommendations
• Adobe: User Targetting
• Amazon: Personalization Platform




                  © 2011 Cloudera, Inc. All Rights Reserved.
Hue – developed by

• Hadoop User Experience
• Apache Open source project
• HUE is a web UI for Hadoop
• Platform for building custom applications with a nice UI
  library
Hue

• HUE comes with a suite of applications
  – File Browser: Browse HDFS; change permissions and
    ownership; upload, download, view and edit files.
  – Job Browser: View jobs, tasks, counters, logs, etc.
  – Beeswax: Wizards to help create Hive tables, load data, run and
    manage Hive queries, and download results in Excel format.
Hue: File Browser UI
Hue: Beewax UI
Use case Example

• Predict what the user likes based on
  – His/Her historical behavior
  – Aggregate behavior of people similar to him
Conclusion

Today, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sort of problems can be solved with Hadoop
• What other projects are included in the Hadoop
  ecosystem
Recap – Hadoop Ecosystem

                                               Hue                                   Mahout
                                           (Web Console)                         (Data Mining)


                                                                 Oozie
                                                       (Job Workflow & Scheduling)
   (Coordination)
                    Zookeeper




                                          Sqoop/Flume
                                                                         Pig/Hive (Analytical Language)
                                         (Data integration)



                                MapReduce Runtime
                                (Dist. Programming Framework)                         Hbase
                                                                               (Column NoSQL DB)


                                             Hadoop Distributed File System (HDFS)
趨勢科技雲端防毒 Case Study
Collaboration in the underground
網路威脅呈現爆炸性的成長
各式各樣的變種病毒、垃圾郵件、不明的下載來源等等,這些來自網路上
的威脅,躲過傳統安全防護系統的偵測,一直持續呈現爆炸性的成長,形
成嚴重的資安威脅
         New Unique Malware Discovered
                                            1M
                                          unique
                                         Malwares
                                          every
                                          month
New Design Concept for Threat Intelligence
                 CDN / xSP                             Human
                                                       Intelligence




 Honeypot


                                                                      Web Crawler




       Trend Micro
       Mail Protection
                                                 Trend Micro
                             Trend Micro         Endpoint Protection
                             Web Protection

                   150M+ Worldwide Endpoints/Sensors
Challenges We Are Faced
             The Concept is Great but ….
  6TB of data and 15B lines of logs received daily by




           It becomes the Big Data Challenge!
Issues to Address




                                          Threat
 Raw Data           Information   Intelligence/Solution


 Volume: Infinite
 Time: No Delay
 Target: Keep Changing Threats
SPN
                     Feedback
                                              SPAM
                                                                  CDN Log              SPN High Level Architecture
                             HTTP POST

                                   L4


                        Log               Log
                      Receiver          Receiver

                                   L4


                       Log Post          Log Post           Log Post
                      Processing        Processing         Processing
SPN infrastructure




                                                   Adhoc-Query (Pig)


                         MapReduce                      HBase
                                                                             Circus
                          Hadoop Distributed File System                    (Ambari)
                                     (HDFS)

                     Feedback Information

                                                                                         Message Bus
Application




                       Email Reputation Service                                        Web Reputation     File Reputation
                                                                                          Service              Service
Trend Micro Big Data process capacity

雲端防毒每日需要處理的資料量
• 85 億個 Web Reputation 查詢
• 30 億個 Email Reputation查詢
• 70 億個 File Reputation 查詢
• 處理 6 TB 從全世界收集到的 raw logs
• 來自1.5億台終端裝置的連線
Trend Micro: Web Reputation Services
 Technology                         Process                           Operation


    Trend Micro               User Traffic | Honeypot
Products / Technology
                                                              8 billions/day
                                      Akamai                             40% filtered
CDN Cache
                                                             4.8 billions/day
                               Rating Server for Known




                                                                                           15 Minutes
High Throughput Web Service            Threats                           82% filtered

                                Unknown & Prefilter
Hadoop Cluster
                                                             860 millions/day
                                  Page Download
Web Crawling
                                                                         99.98% filtered
                                       Threat
                                      Analysis
Machine Learning
Data Mining
                                                         25,000 malicious URL /day


Block malicious URL within 15 minutes once it goes online!
Big Data Cases
Google vs. Hadoop Ecosystem


Chubby vs. Zookeeper

MapReduce vs. MapReduce

BigTable vs. HBase

GFS vs. HDFS

 Ref: Google Cluster
Pioneer of Big Data Infrastructure – Google
Hbase use Case@Facebook - Messages
 HBase Use Cases @ Facebook

            Facebook Insights   Operational Data Store
            Self-service        More Analytics/Hashout apps
 Messages   Hashout             Site Integrity




    2010        2011                 2012            2013

                                                  Social Graph Search Indexing
                                                  Realtime Hive Updates
                                                  Cross-system Tracing
                                                  … and more
Flagship App:Facebook Messages
Monthly data volume prior to launch
• Monthly data volume prior to launch



                 15B x 1,024 bytes = 14TB



                 120B x 100 bytes = 11TB
Facebook Messages Now
book Messages NOW
    •
StatsQuick Stats
                                Messages   Chats
       – 11B+ messages/day
         • 90B+ data accesses
+ messages/day
         • Peak:1.5M ops/sec
0B+ data~55% Read, 45% Write
         •
            accesses
eak: 1.5M ops/sec
55%Rd, 45% Wr data
      – 20PB+ of total
                                 Emails    SMS
         • Grows 400TB/month

B+ of total data
rows 400TB/month
Facebook Messages:Requirements

• Very High Write Volume
  – Previously, chat was not persisted to disk
• Ever-growing data sets(Old data rarely gets
  accessed)
• Elasticity & automatic failover
• Strong consistency within a single data center
• Large scans/map-reduce support for migrations &
  schema conversions
• Bulk import data
Physical Multi-tenancy

• Real-time Ads Insights
  – Real-time analytics for social plugins on top of Hbase
  – Publishers get real-time distribution/engagement metrics:
     • # of impressions, likes
     • analytics by domain/URL/demographics and time periods
  – Uses HBase capabilities:
     • Efficient counters (single-RPC increments)
     • TTL for purging old data
  – Needs massive write throughput & low latencies
     • Billions of URLs
     • Millions of counter increments/second

• Operational Data Store
Facebook Open Source Stack



• Memcached --> App Server Cache
• ZooKeeper --> Small Data Coordination Service
• HBase --> Database Storage Engine
• HDFS --> Distributed FileSystem
• Hadoop --> Asynchronous Map-Reduce Jobs
Questions?
Thank you!

Weitere ähnliche Inhalte

Was ist angesagt?

Optimization on Key-value Stores in Cloud Environment
Optimization on Key-value Stores in Cloud EnvironmentOptimization on Key-value Stores in Cloud Environment
Optimization on Key-value Stores in Cloud EnvironmentFei Dong
 
Compaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloCompaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloHortonworks
 
Hadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchHadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchMapR Technologies
 
2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalogAdam Muise
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera, Inc.
 
Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Richard McDougall
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & HadoopEdureka!
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013WANdisco Plc
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Hadoop Overview
Hadoop Overview Hadoop Overview
Hadoop Overview EMC
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataWANdisco Plc
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Managementrightsize
 

Was ist angesagt? (20)

cosbench-openstack.pdf
cosbench-openstack.pdfcosbench-openstack.pdf
cosbench-openstack.pdf
 
HUG slides on NFS and ODBC
HUG slides on NFS and ODBCHUG slides on NFS and ODBC
HUG slides on NFS and ODBC
 
Optimization on Key-value Stores in Cloud Environment
Optimization on Key-value Stores in Cloud EnvironmentOptimization on Key-value Stores in Cloud Environment
Optimization on Key-value Stores in Cloud Environment
 
Hadoop
HadoopHadoop
Hadoop
 
Compaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloCompaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache Accumulo
 
Hadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchHadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 March
 
2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
 
Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Hadoop Overview
Hadoop Overview Hadoop Overview
Hadoop Overview
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big Data
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Management
 

Ähnlich wie Zh tw cloud computing era

HDP-1 introduction for HUG France
HDP-1 introduction for HUG FranceHDP-1 introduction for HUG France
HDP-1 introduction for HUG FranceSteve Loughran
 
Hadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondHadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondTeradata Aster
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandRichard McDougall
 
Apache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingApache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingm_hepburn
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Hortonworks
 
NYC-Meetup- Introduction to Hadoop Echosystem
NYC-Meetup- Introduction to Hadoop EchosystemNYC-Meetup- Introduction to Hadoop Echosystem
NYC-Meetup- Introduction to Hadoop EchosystemAL500745425
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop StoryMichael Rys
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...yaevents
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: RevealedSachin Holla
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Milos Milovanovic
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoopShashwat Shriparv
 
Hadoop for shanghai dev meetup
Hadoop for shanghai dev meetupHadoop for shanghai dev meetup
Hadoop for shanghai dev meetupRoby Chen
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Darko Marjanovic
 
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Big Data Spain
 

Ähnlich wie Zh tw cloud computing era (20)

Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
HDP-1 introduction for HUG France
HDP-1 introduction for HUG FranceHDP-1 introduction for HUG France
HDP-1 introduction for HUG France
 
Hadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondHadoop - Now, Next and Beyond
Hadoop - Now, Next and Beyond
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
 
Apache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingApache hadoop bigdata-in-banking
Apache hadoop bigdata-in-banking
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011
 
NYC-Meetup- Introduction to Hadoop Echosystem
NYC-Meetup- Introduction to Hadoop EchosystemNYC-Meetup- Introduction to Hadoop Echosystem
NYC-Meetup- Introduction to Hadoop Echosystem
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop Story
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Big data
Big dataBig data
Big data
 
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: Revealed
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoop
 
Hadoop for shanghai dev meetup
Hadoop for shanghai dev meetupHadoop for shanghai dev meetup
Hadoop for shanghai dev meetup
 
Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014Hadoop and IoT Sinergija 2014
Hadoop and IoT Sinergija 2014
 
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
 

Kürzlich hochgeladen

A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 

Kürzlich hochgeladen (20)

A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 

Zh tw cloud computing era

  • 2. Three Major Trends to Chang the World Cloud Computing Big Data Mobile
  • 3. 什麼是雲端運算? 美國國家標準技術研究所 (NIST)的定義: Essential Characteristics Service Models Deployment Models 以服務(as-a-service)的商業模式,透過Internet技術,提供具有擴充 性(scalable)和彈性(elastic)的IT相關功能給使用者
  • 4. It’s About the Ecosystem Structured, Semi-structured Cloud Computing SaaS Enterprise Data Warehouse PaaS IaaS Generate Big Data Lead Business Insights create Competition, Innovation, Productivity
  • 5. What is Big Data?
  • 6. What is the problem • Getting the data to the processors becomes the bottleneck • Quick calculation – Typical disk data transfer rate: • 75MB/sec – Time taken to transfer 100GB of data to the processor: • approx. 22 minutes!
  • 7. The Era of Big Data – Are You Ready Data for business commercial analysis • 2011: multi-terabyte (TB) • 2020: 35.2 ZB (1 ZB = 1 billion TB)
  • 8. Who Needs It? Enterprise Database Hadoop When to use? When to use? • Ad-hoc Reporting (<1sec) • Affordable Storage/Compute • Multi-step Transactions • Unstructured or Semi-structured • Lots of Inserts/Updates/Deletes • Resilient Auto Scalability
  • 10. – inspired by • Apache Hadoop project – inspired by Google's MapReduce and Google File System papers. • Open sourced, flexible and available architecture for large scale computation and data processing on a network of commodity hardware • Open Source Software + Hardware Commodity – IT Costs Reduction
  • 11. Hadoop Core MapReduce HDFS © 2011 Cloudera, Inc. All Rights Reserved.
  • 12. HDFS • Hadoop Distributed File System • Redundancy • Fault Tolerant • Scalable • Self Healing • Write Once, Read Many Times • Java API • Command Line Tool © 2011 Cloudera, Inc. All Rights Reserved.
  • 13. MapReduce • Two Phases of Functional Programming • Redundancy • Fault Tolerant • Scalable • Self Healing • Java API 13 © 2011 Cloudera, Inc. All Rights Reserved.
  • 14. Hadoop Core Java Java MapReduce HDFS Java Java 14 © 2011 Cloudera, Inc. All Rights Reserved.
  • 15. Word Count Example Key: offset Value: line Key: word Key: word Value: count Value: sum of count 0:The cat sat on the mat 22:The aardvark sat on the sofa
  • 17. The Ecosystem is the System • Hadoop has become the kernel of the distributed operating system for Big Data • No one uses the kernel alone • A collection of projects at Apache
  • 18. Relation Map Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 19. Zookeeper – Coordination Framework Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 20. What is ZooKeeper • A centralized service for maintaining – Configuration information – Providing distributed synchronization • A set of tools to build distributed applications that can safely handle partial failures • ZooKeeper was designed to store coordination data – Status information – Configuration – Location information
  • 21. Flume / Sqoop – Data Integration Framework Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 22. What’s the problem for data collection • Data collection is currently a priori and ad hoc • A priori – decide what you want to collect ahead of time • Ad hoc – each kind of data source goes through its own collection path
  • 23. (and how can it help?) • A distributed data collection service • It efficiently collecting, aggregating, and moving large amounts of data • Fault tolerant, many failover and recovery mechanism • One-stop solution for data collection of all formats
  • 24. Flume Architecture Log Log ... Flume Node Flume Node HDFS © 2011 Cloudera, Inc. All Rights Reserved.
  • 25. Flume Sources and Sinks • Local Files • HDFS • Stdin, Stdout • Twitter • IRC • IMAP © 2011 Cloudera, Inc. All Rights Reserved.
  • 26. Sqoop • Easy, parallel database import/export • What you want do? – Insert data from RDBMS to HDFS – Export data from HDFS back into RDBMS
  • 27. Sqoop HDFS Sqoop RDBMS 28 © 2011 Cloudera, Inc. All Rights Reserved.
  • 28. Sqoop Examples $ sqoop import --connect jdbc:mysql://localhost/world -- username root --table City ... $ hadoop fs -cat City/part-m-00000 1,Kabul,AFG,Kabol,17800002,Qandahar,AFG,Qandahar,2375003,He rat,AFG,Herat,1868004,Mazar-e- Sharif,AFG,Balkh,1278005,Amsterdam,NLD,Noord-Holland,731200 ... 29 © 2011 Cloudera, Inc. All Rights Reserved.
  • 29. Pig / Hive – Analytical Language Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 30. Why Hive and Pig? • Although MapReduce is very powerful, it can also be complex to master • Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code • Many organizations have programmers who are skilled at writing code in scripting languages • Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce – Hive was initially developed at Facebook, Pig at Yahoo!
  • 31. Hive – Developed by • What is Hive? – An SQL-like interface to Hadoop • Data Warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop – MapRuduce for execution – HDFS for storage • Hive Query Language – Basic-SQL : Select, From, Join, Group-By – Equi-Join, Muti-Table Insert, Multi-Group-By – Batch query SELECT * FROM purchases WHERE price > 100 GROUP BY storeid
  • 32. Hive SQL Hive MapReduce 33 © 2011 Cloudera, Inc. All Rights Reserved.
  • 33. Pig – Initiated by • A high-level scripting language (Pig Latin) • Process data one step at a time • Simple to write MapReduce program • Easy understand • Easy debug A = load ‘a.txt’ as (id, name, age, ...) B = load ‘b.txt’ as (id, address, ...) C = JOIN A BY id, B BY id;STORE C into ‘c.txt’
  • 34. Pig Script Pig MapReduce © 2011 Cloudera, Inc. All Rights Reserved.
  • 35. Hive vs. Pig Hive Pig Language HiveQL (SQL-like) Pig Latin, a scripting language Schema Table definitions A schema is optionally defined that are stored in a at runtime metastore Programmait Access JDBC, ODBC PigServer
  • 36. WordCount Example • Input Hello World Bye World Hello Hadoop Goodbye Hadoop • For the given sample input the map emits < Hello, 1> < World, 1> < Bye, 1> < World, 1> < Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1> • the reduce just sums up the values < Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>
  • 37. WordCount Example In MapReduce public class WordCount { public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } } } public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); }
  • 38. WordCount Example By Pig A = LOAD 'wordcount/input' USING PigStorage as (token:chararray); B = GROUP A BY token; C = FOREACH B GENERATE group, COUNT(A) as count; DUMP C;
  • 39. WordCount Example By Hive CREATE TABLE wordcount (token STRING); LOAD DATA LOCAL INPATH ’wordcount/input' OVERWRITE INTO TABLE wordcount; SELECT count(*) FROM wordcount GROUP BY token;
  • 40. The Story So Far SQL Hive Pig Script Java MapReduce Java HDFS Sqoop Flume SQL RDBMS FS Posix 4 1 © 2011 Cloudera, Inc. All Rights Reserved.
  • 41. Hbase – Column NoSQL DB Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 43. I – Inspired by • Coordinated by Zookeeper • Low Latency • Random Reads And Writes • Distributed Key/Value Store • Simple API – PUT – GET – DELETE – SCAN
  • 44. Hbase – Data Model • Cells are “versioned” • Table rows are sorted by row key • Region – a row range [start-key:end-key]
  • 45. HBase Examples hbase> create 'mytable', 'mycf‘ hbase> list hbase> put 'mytable', 'row1', 'mycf:col1', 'val1‘ hbase> put 'mytable', 'row1', 'mycf:col2', 'val2‘ hbase> put 'mytable', 'row2', 'mycf:col1', 'val3‘ hbase> scan 'mytable‘ hbase> disable 'mytable‘ hbase> drop 'mytable' © 2011 Cloudera, Inc. All Rights Reserved.
  • 46. Oozie – Job Workflow & Scheduling Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 47. What is ? • A Java Web Application • Oozie is a workflow scheduler for Hadoop • Crond for Hadoop • Triggered Job 1 Job 2 – Time – Data Job 3 Job 4 Job 5
  • 48. Oozie Features • Component Independent – MapReduce – Hive – Pig – SqoopStreaming © 2011 Cloudera, Inc. All Rights Reserved.
  • 49. Mahout – Data Mining Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 50. What is • Machine-learning tool • Distributed and scalable machine learning algorithms on the Hadoop platform • Building intelligent applications easier and faster
  • 51. Mahout Use Cases • Yahoo: Spam Detection • Foursquare: Recommendations • SpeedDate.com: Recommendations • Adobe: User Targetting • Amazon: Personalization Platform © 2011 Cloudera, Inc. All Rights Reserved.
  • 52. Hue – developed by • Hadoop User Experience • Apache Open source project • HUE is a web UI for Hadoop • Platform for building custom applications with a nice UI library
  • 53. Hue • HUE comes with a suite of applications – File Browser: Browse HDFS; change permissions and ownership; upload, download, view and edit files. – Job Browser: View jobs, tasks, counters, logs, etc. – Beeswax: Wizards to help create Hive tables, load data, run and manage Hive queries, and download results in Excel format.
  • 56. Use case Example • Predict what the user likes based on – His/Her historical behavior – Aggregate behavior of people similar to him
  • 57. Conclusion Today, we introduced: • Why Hadoop is needed • The basic concepts of HDFS and MapReduce • What sort of problems can be solved with Hadoop • What other projects are included in the Hadoop ecosystem
  • 58. Recap – Hadoop Ecosystem Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 60. Collaboration in the underground
  • 62.
  • 63. New Design Concept for Threat Intelligence CDN / xSP Human Intelligence Honeypot Web Crawler Trend Micro Mail Protection Trend Micro Trend Micro Endpoint Protection Web Protection 150M+ Worldwide Endpoints/Sensors
  • 64. Challenges We Are Faced The Concept is Great but …. 6TB of data and 15B lines of logs received daily by It becomes the Big Data Challenge!
  • 65. Issues to Address Threat Raw Data Information Intelligence/Solution  Volume: Infinite  Time: No Delay  Target: Keep Changing Threats
  • 66. SPN Feedback SPAM CDN Log SPN High Level Architecture HTTP POST L4 Log Log Receiver Receiver L4 Log Post Log Post Log Post Processing Processing Processing SPN infrastructure Adhoc-Query (Pig) MapReduce HBase Circus Hadoop Distributed File System (Ambari) (HDFS) Feedback Information Message Bus Application Email Reputation Service Web Reputation File Reputation Service Service
  • 67. Trend Micro Big Data process capacity 雲端防毒每日需要處理的資料量 • 85 億個 Web Reputation 查詢 • 30 億個 Email Reputation查詢 • 70 億個 File Reputation 查詢 • 處理 6 TB 從全世界收集到的 raw logs • 來自1.5億台終端裝置的連線
  • 68. Trend Micro: Web Reputation Services Technology Process Operation Trend Micro User Traffic | Honeypot Products / Technology 8 billions/day Akamai 40% filtered CDN Cache 4.8 billions/day Rating Server for Known 15 Minutes High Throughput Web Service Threats 82% filtered Unknown & Prefilter Hadoop Cluster 860 millions/day Page Download Web Crawling 99.98% filtered Threat Analysis Machine Learning Data Mining 25,000 malicious URL /day Block malicious URL within 15 minutes once it goes online!
  • 70. Google vs. Hadoop Ecosystem Chubby vs. Zookeeper MapReduce vs. MapReduce BigTable vs. HBase GFS vs. HDFS Ref: Google Cluster
  • 71. Pioneer of Big Data Infrastructure – Google
  • 72. Hbase use Case@Facebook - Messages HBase Use Cases @ Facebook Facebook Insights Operational Data Store Self-service More Analytics/Hashout apps Messages Hashout Site Integrity 2010 2011 2012 2013 Social Graph Search Indexing Realtime Hive Updates Cross-system Tracing … and more
  • 73. Flagship App:Facebook Messages Monthly data volume prior to launch • Monthly data volume prior to launch 15B x 1,024 bytes = 14TB 120B x 100 bytes = 11TB
  • 74. Facebook Messages Now book Messages NOW • StatsQuick Stats Messages Chats – 11B+ messages/day • 90B+ data accesses + messages/day • Peak:1.5M ops/sec 0B+ data~55% Read, 45% Write • accesses eak: 1.5M ops/sec 55%Rd, 45% Wr data – 20PB+ of total Emails SMS • Grows 400TB/month B+ of total data rows 400TB/month
  • 75. Facebook Messages:Requirements • Very High Write Volume – Previously, chat was not persisted to disk • Ever-growing data sets(Old data rarely gets accessed) • Elasticity & automatic failover • Strong consistency within a single data center • Large scans/map-reduce support for migrations & schema conversions • Bulk import data
  • 76. Physical Multi-tenancy • Real-time Ads Insights – Real-time analytics for social plugins on top of Hbase – Publishers get real-time distribution/engagement metrics: • # of impressions, likes • analytics by domain/URL/demographics and time periods – Uses HBase capabilities: • Efficient counters (single-RPC increments) • TTL for purging old data – Needs massive write throughput & low latencies • Billions of URLs • Millions of counter increments/second • Operational Data Store
  • 77. Facebook Open Source Stack • Memcached --> App Server Cache • ZooKeeper --> Small Data Coordination Service • HBase --> Database Storage Engine • HDFS --> Distributed FileSystem • Hadoop --> Asynchronous Map-Reduce Jobs
  • 78.