These slides present a Twitter pipeline built on the ELK stack (Elasticsearch, Logstash, Kibana), available at https://github.com/melvynator/ELK_twitter and show how to integrate machine learning into your Twitter pipeline.
1. Machine Learning in a Twitter ETL using ELK
(ELASTICSEARCH, LOGSTASH, KIBANA)
MELVYN PEIGNON
melvynpeignon@gmail.com
2. What is an ETL?
• ETL: Extract, Transform, Load
[Diagram: Source (Extract) → raw data → Transformation (Transform) → processed data → Data warehouse / data store (Load)]
3. Our use case: An ETL for Twitter
https://github.com/melvynator/ELK_twitter
Goals:
•Simplify a recurring task for several members of the lab
•Normalize the data collection (have a “universal” format)
•Have tweets analyzed the way we want (emoji, punctuation)
•Include machine learning models in our ETL
4. Our tools: ELK
•E : Elasticsearch
Elasticsearch is a distributed, RESTful search and analytics engine
•L : Logstash
Logstash is an open source, server-side data processing pipeline that
simultaneously ingests data from multiple sources, transforms it, and then
sends it to your favorite “stash”.
•K: Kibana
Kibana lets you visualize your Elasticsearch data and navigate the Elastic Stack
Source: Elastic website
6. Logstash
Logstash allows you to ingest data from multiple sources, transform
the data and then store your processed data.
Notions you need to know:
•Event
•Input
•Filter
•Output
9. Logstash: Event
An event is a single unit of raw data traveling through the
pipeline, for example:
•A Log
•A Line
•A JSON
•…
10. Logstash: Input
The input plugins consume data from a source.
•File
•Elasticsearch
•Twitter API
•GitHub API
•…
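As an illustration of an input plugin, here is a minimal sketch of a Twitter input. The option names come from the Logstash twitter input plugin; the environment variable names and the keywords are hypothetical placeholders, not values taken from the ELK_twitter project.

```
input {
  twitter {
    # Credentials read from environment variables (variable names are hypothetical)
    consumer_key => "${TWITTER_CONSUMER_KEY}"
    consumer_secret => "${TWITTER_CONSUMER_SECRET}"
    oauth_token => "${TWITTER_OAUTH_TOKEN}"
    oauth_token_secret => "${TWITTER_OAUTH_TOKEN_SECRET}"
    # Only tweets matching these keywords are collected (example values)
    keywords => ["elasticsearch", "logstash"]
    # Keep the complete tweet object instead of a reduced set of fields
    full_tweet => true
  }
}
```

Keeping the credentials out of the configuration file makes it easier to share the file between lab members.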
11. Logstash: Filter
A filter plugin performs intermediary processing on an event. Filters are
often applied conditionally depending on the characteristics of the
event.
•Clone
•Mutate
•Ruby
•…
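Conditional filtering can be sketched as follows. The field name "lang" and the tag "english" are assumptions chosen for illustration; any field present on your events can be tested the same way.

```
filter {
  # Apply the mutate filter only to events whose "lang" field equals "en"
  if [lang] == "en" {
    mutate {
      # add_tag is a common option available on every filter plugin
      add_tag => ["english"]
    }
  }
}
```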
12. Logstash: Output
An output plugin sends event data to a particular destination.
•Elasticsearch
•MongoDB
•File
•…
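A minimal sketch of an output section that sends events to Elasticsearch and also prints them to the console. The host address and the index name are assumptions for a local setup, not the project's actual settings.

```
output {
  elasticsearch {
    # Address of the Elasticsearch node (assumes a local instance)
    hosts => ["localhost:9200"]
    # One index per day, named after the event's @timestamp (hypothetical name)
    index => "tweets-%{+YYYY.MM.dd}"
  }
  # Also print each event to the console, which helps while debugging
  stdout { codec => rubydebug }
}
```

Several outputs can be listed in the same section; each event is sent to all of them.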
15. Example: Building the configuration file
input {
  file {
    # The file input only accepts absolute paths
    path => "/usr/local/Cellar/logstash/5.5.2/tutorial_input.txt"
    start_position => "beginning"  # read the file from the start instead of tailing it
    sincedb_path => "/dev/null"    # do not remember the read position between runs
    codec => "json"                # the file contains JSON documents
  }
}
filter {} # Empty for the moment (this is the only section we will fill in)
output { stdout { codec => rubydebug } } # Print the result to the console
17. Example: Building “name” key
ruby { # Allows you to run Ruby code on the event
  # event.get() retrieves the value for a given key; event.set() creates or overwrites one
  code => 'event.set("name", event.get("[first_name]") + " " + event.get("[last_name]"))'
}
mutate {
  # mutate applies basic transformations to events; here we remove four keys
  remove_field => ["first_name", "last_name", "host", "path"]
}
The order of the filters in the configuration matters! Here “ruby” is applied first, then “mutate”:
1. Merge “first_name” and “last_name” into “name”
2. Remove “first_name” and “last_name”
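Note that the Ruby code in this example fails when “first_name” or “last_name” is missing, because event.get() then returns nil and nil cannot be concatenated with a string. A more defensive variant might look like the sketch below; it is an illustration of one possible approach, not the version used in the tutorial.

```
filter {
  ruby {
    code => '
      # Keep only the name parts that are actually present on the event
      parts = [event.get("first_name"), event.get("last_name")].compact
      # Create "name" only when at least one part exists
      event.set("name", parts.join(" ")) unless parts.empty?
    '
  }
}
```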
18. Example: Output
{
"@timestamp" => 2017-09-04T08:33:26.599Z,
"school" => "Stanford",
"degree" => "Master",
"@version" => "1",
"name" => "John Doe",
"comment" => "A guy from somewhere",
"age" => 25
}
19. Example: Building two events out of one
clone {
  # Duplicate the event; the copy gets a "type" key with the value "education"
  clones => ["education"]
}
if ([type] == "education") {
  # The copy keeps only the education-related fields
  mutate {
    remove_field => ["name", "age", "comment", "type"]
  }
}
else {
  # The original event keeps only the personal fields
  mutate {
    remove_field => ["school", "degree"]
  }
}
21. Example: Changing the name of a field
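mutate {
  # @version and @timestamp are created for every event
  remove_field => ["@version"]
  # Change the name of a field
  rename => { "@timestamp" => "created_at" }
}
The options inside a single mutate block are not applied in the order
in which you write them; the plugin uses its own internal order.
Do not rely on “remove_field” being applied before or after “rename”
because of where it appears in the block.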
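When the order of two mutations matters, one way to make it explicit is to put them in separate mutate blocks, since filters run in the order in which they appear in the configuration. A minimal sketch reusing the fields from this example:

```
filter {
  # First block: runs first because it appears first in the configuration
  mutate {
    rename => { "@timestamp" => "created_at" }
  }
  # Second block: runs only after the rename above has been applied
  mutate {
    remove_field => ["@version"]
  }
}
```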
24. Logstash: Our filters
There are many filters, but the overall goals of the filters are to:
•Remove deprecated fields
•Divide the tweet into two or three events (users and tweet)
•Remove the nesting of the JSON
•Remove unused fields
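Removing the nesting can be sketched with mutate, using Logstash's [outer][inner] field reference syntax. The nested keys below follow the layout of a tweet's user object, and the target names are hypothetical; the actual filters in the ELK_twitter repository may differ.

```
filter {
  mutate {
    # Move nested values to top-level keys (target names are hypothetical)
    rename => {
      "[user][screen_name]" => "user_screen_name"
      "[user][followers_count]" => "user_followers_count"
    }
  }
  mutate {
    # Drop what remains of the nested object once the useful values are extracted
    remove_field => ["user"]
  }
}
```

Two separate mutate blocks are used so that the rename is guaranteed to happen before the nested object is removed.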
29. Time to do it yourself
•https://github.com/melvynator/Logstash_tutorial
•https://docs.google.com/forms/d/e/1FAIpQLSdDdFxmT5ZCXInaSohJBBo6fKKmCg3KLegeOOrxl1l4_sc-7g/viewform