2. Who are you?
> Masahiro Nakagawa
> github/twitter: @repeatedly
> Treasure Data, Inc.
> Senior Software Engineer
> Fluentd / td-agent developer
> Living at OSS :)
> D language - Phobos committer
> Fluentd - Main maintainer
> MessagePack / RPC - D and Python (RPC only)
> The organizer of several meetups (Presto, DTM, etc…)
> etc…
4. What’s Fluentd?
> Data collector for unified logging layer
> Streaming data transfer based on JSON
> Written in Ruby
> Various plugins distributed as gems (install example below)
> http://www.fluentd.org/plugins
> Working in production
> http://www.fluentd.org/testimonials
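Since plugins are plain Ruby gems, setup is just gem installs; for instance (fluent-plugin-mongo is one example from the plugin list above):

gem install fluentd
gem install fluent-plugin-mongo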
13. Why JSON / MessagePack? (1)
> Schema on Write (traditional MPP DB)
> write data using a schema to improve query performance
> Pros
> minimal query overhead
> Cons
> need to design the schema and workload beforehand
> data loading is an expensive operation
14. Why JSON / MessagePack? (2)
> Schema on Read (Hadoop)
> write data without a schema and map the schema at query time
> Pros
> robust against schema and workload changes
> data loading is a cheap operation
> Cons
> high overhead at query time
16. Core Plugins
Common concerns (handled by the Fluentd core):
> Divide & Conquer
> Buffering & Retrying
> Error handling
> Message routing
> Parallelism
Use-case specific (handled by plugins; see the sketch below):
> Read / receive data
> Parse data
> Filter data
> Buffer data
> Format data
> Write / send data
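The use-case-specific steps are exactly what plugins implement. As a minimal sketch of the shape of an output plugin against the v0.12 API (the name 'sample' and its behavior are illustrative, not a shipped plugin):

# loaded by fluentd itself, so no explicit requires are needed here
module Fluent
  class SampleOutput < Output
    # registers the plugin so "type sample" works inside <match>
    Plugin.register_output('sample', self)

    # called with a tag and a set of (time, record) events
    def emit(tag, es, chain)
      es.each do |time, record|
        log.info "sample: #{tag} #{time} #{record}"
      end
      chain.next # acknowledge success so Fluentd does not retry
    end
  end
end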
18. Event structure (log message)
✓ Tag
> for message routing
> where is the event from?
✓ Time
> second unit by default
> from the data source
✓ Record
> JSON format
> MessagePack internally
> schema-free
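For illustration, a single hypothetical Apache access event carries all three parts:

tag:    web.access  (set by the input plugin, used for routing)
time:   1430668800  (Unix time, in seconds)
record: {"host":"127.0.0.1","method":"GET","path":"/index.html","code":200}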
20. Configuration and operation
> Apache-like configuration syntax
> No central / master node
> @include helps with configuration sharing (example below)
> Operation depends on your environment
> use your own daemon / deploy tools
> we use Chef at Treasure Data
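A minimal sketch of @include sharing a common snippet across configs (paths are hypothetical):

# /etc/fluent/fluent.conf
@include conf.d/sources.conf  # shared <source> definitions

<match **>
  type stdout
</match>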
23. Treasure Agent (td-agent)
> Treasure Data distribution of Fluentd
> includes Ruby, popular plugins, etc.
> Treasure Agent 2 is the current stable release
> v2 is recommended over v1
> rpm, deb and dmg packages
> Latest version is 2.2.0, bundling Fluentd v0.12
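On RHEL/CentOS, for example, td-agent 2 installs via the documented curl one-liner, and extra plugins go through the bundled gem wrapper (the script URL reflects the docs of that era; check the current docs before running):

curl -L https://toolbelt.treasuredata.com/sh/install-redhat-td-agent2.sh | sh
sudo td-agent-gem install fluent-plugin-elasticsearch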
35. # logs from a file
<source>
  type tail
  path /var/log/httpd.log
  pos_file /tmp/pos_file
  format apache2
  tag backend.apache
</source>

# logs from client libraries
<source>
  type forward
  port 24224
</source>

# store logs to MongoDB
<match backend.*>
  type mongo
  database fluent
  collection test
</match>
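The forward source above is what application-side logger libraries talk to. A minimal sketch with the fluent-logger gem (the tag prefix "backend" makes events match <match backend.*>; field values are made up):

require 'fluent-logger'

# connect to the forward input on port 24224
log = Fluent::Logger::FluentLogger.new('backend', host: 'localhost', port: 24224)

# emitted as tag "backend.app" and routed to MongoDB by the match above
log.post('app', {'action' => 'login', 'user' => 42})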
39. # logs from a file
<source>
  type tail
  path /var/log/httpd.log
  pos_file /tmp/pos_file
  format apache2
  tag web.access
</source>

# logs from client libraries
<source>
  type forward
  port 24224
</source>

# store logs to ES and HDFS
<match web.*>
  type copy
  <store>
    type elasticsearch
    logstash_format true
  </store>
  <store>
    type webhdfs
    host namenode
    port 50070
    path /path/on/hdfs/
  </store>
</match>
40. CEP for Stream Processing
Norikra is a SQL-based CEP (complex event processing) engine: http://norikra.github.io/
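As a sketch, a Norikra query is plain SQL over a stream target with an optional window; the target web_access and its fields below are hypothetical:

SELECT path, COUNT(*) AS cnt
FROM web_access.win:time_batch(60 sec)
GROUP BY path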
49. fluent-bit
> Made for embedded Linux
> OpenEmbedded & Yocto Project
> Intel Edison, Raspberry Pi & BeagleBone Black boards
> https://github.com/fluent/fluent-bit
> Standalone application or library mode
> Built-in plugins
> input: cpu, kmsg; output: fluentd
> First released at the end of March 2015
50. fluentd-forwarder
> Forwarding agent written in Go
> focused on forwarding logs to Fluentd
> works on Windows
> Bundles TCP input/output and TD output
> No flexible plugin mechanism
> there are plans to add more inputs/outputs
> Similar products
> fluent-agent-lite, fluent-agent-hydra, ik
53. The problems at Treasure Data
> Treasure Data Service runs in the cloud
> Customers want to try Treasure Data, but
> SEs write scripts to bulk load their data. Hard work :(
> Customers want to migrate their big data, but
> hard work :(
> Fluentd solved streaming data collection, but
> bulk data loading is another problem.
54. Embulk
> The bulk-load counterpart of Fluentd
> Pluggable architecture
> JRuby, JVM languages
> High-performance parallel processing
> Share your scripts as plugins
> https://github.com/embulk
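Embulk configs are YAML, and embulk guess can fill in parser details from a sample of the data. A hedged sketch of the workflow (paths below are hypothetical):

# seed.yml: just enough to point at the data
in:
  type: file
  path_prefix: /data/csv/sample_
out:
  type: stdout

embulk guess seed.yml -o config.yml   # infer parser settings / schema
embulk preview config.yml             # check a sample of parsed records
embulk run config.yml                 # execute the bulk load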
55. The problems of bulk load
> Data cleaning (normalization)
> How to normalize broken records?
> Error handling
> How to remove broken records?
> Idempotent retrying
> How to retry without loading duplicates?
> Performance optimization
68. Other cases
> Treasure Data
> Embulk worker for automatic import
> Web services
> Send existing logs to Elasticsearch
> Business / Batch systems
> Database to Database
> etc…