Machine Learning in a Twitter
ETL using ELK
(ELASTICSEARCH, LOGSTASH, KIBANA)
MELVYN PEIGNON
melvynpeignon@gmail.com
What is an ETL?
• ETL: Extract, Transform, Load
[Diagram: Source → Extract → Raw data → Transform → Processed data → Load → Data warehouse / data store]
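To make the three stages concrete, here is a minimal sketch in Python. It is only an illustration of the idea: the record fields (`id`, `text`, `noise`) and the in-memory dict used as a "data store" are assumptions, not part of the talk.

```python
import json


def extract(lines):
    """Extract: parse each raw JSON line into a dict (one record per line)."""
    return [json.loads(line) for line in lines if line.strip()]


def transform(records):
    """Transform: keep only the fields we care about and normalize the text."""
    return [{"id": r["id"], "text": r["text"].strip().lower()} for r in records]


def load(records, store):
    """Load: write the processed records into the data store, keyed by id."""
    for record in records:
        store[record["id"]] = record
    return store


# Hypothetical raw data standing in for a real source.
raw_lines = [
    '{"id": 1, "text": "  Hello ETL  ", "noise": true}',
    '{"id": 2, "text": "Second Row"}',
]
store = load(transform(extract(raw_lines)), {})
print(store)
```

In the rest of the deck, Logstash plays the role of all three functions, and Elasticsearch plays the role of `store`.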
Our use case: An ETL for Twitter
https://github.com/melvynator/ELK_twitter
Goals:
•Simplify a recurring task for several members of the lab
•Normalize data collection (use one “universal” format)
•Analyze tweets the way we want (emoji, punctuation)
•Include a machine learning model in our ETL
Our tools: ELK
•E : Elasticsearch
Elasticsearch is a distributed, RESTful search and analytics engine
•L : Logstash
Logstash is an open source, server-side data processing pipeline that
simultaneously ingests data from multiple sources, transforms it, and then
sends it to your favorite “stash”.
•K: Kibana
Kibana lets you visualize your Elasticsearch data and navigate the Elastic Stack
Source: Elastic website
Our model
[Diagram: Twitter → Logstash → Elasticsearch → Kibana. Logstash calls the Sentiment API. Stages: Extract, Transform, Load]
Logstash
Logstash allows you to ingest data from multiple sources, transform
the data and then store your processed data.
Concepts you need to know:
•Event
•Input
•Filter
•Output
Petrol Pipeline
[Diagram, petrol pipeline analogy. Labels: Truck, Refinery, Pipeline, Storage, Petrol, Chemical]
Logstash Pipeline
[Diagram: Twitter API → Twitter input → Filters → Elasticsearch output → Elasticsearch. The data flows through Logstash.]
Logstash: Event
An event is a single piece of raw data traveling through the
pipeline, for example:
•A Log
•A Line
•A JSON
•…
Logstash: Input
The input plugins consume data from a source.
•File
•Elasticsearch
•Twitter API
•Github API
•…
Logstash: Filter
A filter plugin performs intermediary processing on an event. Filters are
often applied conditionally depending on the characteristics of the
event.
•Clone
•Mutate
•Ruby
•…
Logstash: Output
An output plugin sends event data to a particular destination.
•Elasticsearch
•MongoDB
•File
•…
Logstash: Flow
[Diagram: Events → Input → Filters → Output → Processed events]
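The flow can be pictured with a small Python sketch. This is a conceptual model only, not how Logstash is implemented; `run_pipeline`, `add_length` and `drop_empty` are hypothetical names chosen for the illustration.

```python
def run_pipeline(input_events, filters, output):
    """Push every event from the input through each filter in order, then to the output.

    A filter takes one event (a dict) and returns a list of events, so it can
    modify an event (one out), drop it (empty list) or split it (several out).
    """
    for event in input_events:
        batch = [event]
        for apply_filter in filters:
            batch = [out for ev in batch for out in apply_filter(ev)]
        for ev in batch:
            output(ev)


def add_length(event):
    """Filter that adds a field to the event."""
    return [{**event, "length": len(event["message"])}]


def drop_empty(event):
    """Filter that drops events with an empty message."""
    return [event] if event["message"] else []


collected = []
run_pipeline(
    input_events=[{"message": "a log line"}, {"message": ""}],
    filters=[drop_empty, add_length],
    output=collected.append,
)
print(collected)
```

Modeling a filter as "one event in, a list of events out" covers the three behaviors used later in the deck: modifying an event, removing it, and cloning it.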
Example: tutorial_input.txt
What we have:
{
  "first_name": "John",
  "last_name": "Doe",
  "age": 25,
  "degree": "Master",
  "school": "Stanford",
  "comment": "A guy from somewhere"
}
What we want:
{
  "name": "John Doe",
  "age": 25,
  "comment": "A guy from somewhere",
  "created_at": "2017-09-04T08:39:54.847Z"
}
{
  "school": "Stanford",
  "degree": "Master",
  "created_at": "2017-09-04T08:39:54.877Z"
}
Example: Building the configuration file
input {
  file {
    # The input is a file containing JSON (the file input only accepts absolute paths)
    path => "/usr/local/Cellar/logstash/5.5.2/tutorial_input.txt"
    start_position => "beginning"
    sincedb_path => "/dev/null" # do not remember the read position between runs
    codec => "json"
  }
}
filter {} # Empty for now (this is the only section we will fill in)
output { stdout { codec => rubydebug } } # Print the result to the console
Example: Output
{
"path" => "/usr/local/Cellar/logstash/5.5.2/tutorial_input.txt",
"@timestamp" => 2017-09-04T07:03:26.388Z,
"degree" => "Master",
"@version" => "1",
"host" => "Melvyns-MacBook-Pro.local",
"last_name" => "Doe",
"school" => "Stanford",
"comment" => "A guy from somewhere",
"first_name" => "John",
"age" => 25
}
Example: Building “name” key
ruby { # Lets you run Ruby code on the event
  # event.get() retrieves the value for a key; event.set() creates or overwrites one
  code => 'event.set("name", event.get("[first_name]") + " " + event.get("[last_name]"))'
}
mutate { # mutate performs basic transformations on events; here we remove 4 keys
  remove_field => ["first_name", "last_name", "host", "path"]
}
The order of the filters matters! Here “ruby” is applied first, then “mutate”:
1. Merge “first_name” and “last_name” into “name”
2. Remove “first_name” and “last_name”
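To see why the order matters, the two steps can be mimicked on a plain Python dict. This is an analogy, not Logstash code: `build_name` and `remove_fields` are hypothetical helpers, and the `host` and `path` values are made up.

```python
def build_name(event):
    """Counterpart of the ruby filter: create "name" from first_name and last_name."""
    event["name"] = event["first_name"] + " " + event["last_name"]
    return event


def remove_fields(event, fields=("first_name", "last_name", "host", "path")):
    """Counterpart of mutate/remove_field: drop the listed keys if present."""
    for field in fields:
        event.pop(field, None)
    return event


event = {
    "first_name": "John",
    "last_name": "Doe",
    "age": 25,
    "host": "localhost",
    "path": "/tmp/tutorial_input.txt",
}

# Correct order: merge first, then remove.
result = remove_fields(build_name(dict(event)))
print(result)

# Reversed order: the source fields are gone before the merge can read them.
try:
    build_name(remove_fields(dict(event)))
    reversed_ok = True
except KeyError:
    reversed_ok = False
print(reversed_ok)
```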
Example: Output
{
"@timestamp" => 2017-09-04T08:33:26.599Z,
"school" => "Stanford",
"degree" => "Master",
"@version" => "1",
"name" => "John Doe",
"comment" => "A guy from somewhere",
"age" => 25
}
Example: Building two events out of one
clone {
  # Duplicates the event; the replica gets a "type" key with the value "education"
  clones => ["education"]
}
if [type] == "education" {
  mutate {
    remove_field => ["name", "age", "comment", "type"]
  }
} else {
  mutate {
    remove_field => ["school", "degree"]
  }
}
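The same split can be traced on a plain Python dict. This sketch mirrors the behavior described on the slide (the replica carries a "type" key); `clone_filter` and `prune` are hypothetical names, not Logstash APIs.

```python
import copy


def clone_filter(event, clone_type):
    """Return the original event plus a deep copy whose "type" is set to clone_type."""
    replica = copy.deepcopy(event)
    replica["type"] = clone_type
    return [event, replica]


def prune(event):
    """Keep the education fields on the replica and the personal fields on the original."""
    if event.get("type") == "education":
        drop = ("name", "age", "comment", "type")
    else:
        drop = ("school", "degree")
    return {key: value for key, value in event.items() if key not in drop}


source = {
    "name": "John Doe",
    "age": 25,
    "comment": "A guy from somewhere",
    "school": "Stanford",
    "degree": "Master",
}
person, education = [prune(e) for e in clone_filter(source, "education")]
print(person)
print(education)
```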
Example: Output
{
"@timestamp" => 2017-09-04T08:39:54.847Z,
"name" => "John Doe",
"comment" => "A guy from somewhere",
"@version" => "1",
"age" => 25
}
{
"@timestamp" => 2017-09-04T08:39:54.897Z,
"school" => "Stanford",
"@version" => "1",
"degree" => "Master"
}
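Example: Changing the name of a field
mutate {
  remove_field => ["@version"] # @version and @timestamp are created for every event
  rename => { "@timestamp" => "created_at" } # changes the name of a field
}
The order in which options are written inside a single mutate block is
not the order in which they are applied: Logstash applies them in its
own fixed order. Do not rely on “remove_field” running before or after
“rename”; when the order matters, use separate mutate blocks.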
Example: Output
{
"created_at" => 2017-09-04T08:39:54.847Z,
"name" => "John Doe",
"comment" => "A guy from somewhere",
"age" => 25
}
{
"created_at" => 2017-09-04T08:39:54.897Z,
"school" => "Stanford",
"degree" => "Master"
}
Logstash: Twitter input
Only one input:
input {
  twitter {
    consumer_key => "<YOUR-KEY>"
    consumer_secret => "<YOUR-KEY>"
    oauth_token => "<YOUR-KEY>"
    oauth_token_secret => "<YOUR-KEY>"
    keywords => ["random", "word"]
    full_tweet => true
    type => "tweet"
  }
}
Logstash: Our filters
We use many filters, but their overall goals are to:
•Remove deprecated fields
•Divide the tweet into two or three events (users and tweet)
•Remove the nesting of the JSON
•Remove unused fields
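The last three goals can be sketched in Python as follows. This is not the project's actual filter chain: the sample tweet, the `unused` field list and the `user_id` link field are assumptions made for the illustration.

```python
def flatten(obj, parent="", sep="_"):
    """Turn nested dicts into one flat dict, joining key paths with `sep`."""
    flat = {}
    for key, value in obj.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def split_tweet(tweet, unused=("contributors",)):
    """Split one raw tweet into a flat tweet event and a flat user event."""
    tweet = {key: value for key, value in tweet.items() if key not in unused}
    user = tweet.pop("user", {})
    tweet_event = flatten(tweet)
    tweet_event["user_id"] = user.get("id")  # keep a link back to the author
    return tweet_event, flatten(user)


# Hypothetical, heavily trimmed tweet payload.
raw = {
    "id": 42,
    "text": "Hello",
    "contributors": None,
    "entities": {"hashtags": ["elk"]},
    "user": {"id": 7, "screen_name": "someone"},
}
tweet_event, user_event = split_tweet(raw)
print(tweet_event)
print(user_event)
```

Flat events with a shared `user_id` are easier to index and aggregate than one deeply nested document, which is the motivation behind removing the nesting.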
Our model
[Diagram: Twitter → Logstash → Elasticsearch → Kibana. Logstash calls the Sentiment API.]
Building the API
•Python
•Flask
•Build your model from labeled data
•Design an endpoint that receives the data you want predictions for
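A minimal sketch of such an endpoint follows. The talk builds the API with Flask; to stay dependency-free, this sketch swaps in the standard-library WSGI interface instead, and the handler body would map onto a Flask route. The word-list "model", the `/predict` path handling and the `{"submit": "<text>"}` JSON body are assumptions for illustration, not the author's implementation.

```python
import io
import json

# Toy stand-in for a model trained on labeled data.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "awful", "hate", "sad"}


def predict_sentiment(text):
    """Score a text by counting positive and negative words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"sentiment": label, "score": score}


def app(environ, start_response):
    """WSGI app exposing POST /predict with a JSON body like {"submit": "<text>"}."""
    if environ["REQUEST_METHOD"] != "POST" or environ.get("PATH_INFO") != "/predict":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    size = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(size) or b"{}")
    body = json.dumps(predict_sentiment(payload.get("submit", ""))).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]


def call(path, payload):
    """Call the app in-process, without a network socket, and decode the reply."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {
        "REQUEST_METHOD": "POST",
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    status = []
    body = b"".join(app(environ, lambda s, headers: status.append(s)))
    return status[0], json.loads(body)


print(call("/predict", {"submit": "I love this great tool"}))
```

To serve it over HTTP on the port used later in the deck, one option is `wsgiref.simple_server.make_server("localhost", 5000, app).serve_forever()`.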
Logstash REST filter
•https://github.com/lucashenning/logstash-filter-rest
•Lets you call RESTful resources from inside Logstash
•Calls your machine learning API
•Adds the returned information to your events
Logstash REST filter: Example
rest {
  request => {
    url => "http://localhost:5000/predict"
    method => "post"
    params => {
      "submit" => "%{tweet_content}"
    }
    headers => {
      "Content-Type" => "application/json"
    }
  }
  target => 'rest_result'
}
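As a rough mental model of what this filter does to each event, here is a Python sketch. It is not the plugin's implementation: `rest_filter`, `http_post_json`, the JSON request body and the `_restfailure` tag used on errors are all assumptions made for the illustration.

```python
import json
import urllib.request


def http_post_json(url, payload):
    """POST `payload` as JSON to `url` and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def rest_filter(event, url, target="rest_result", post=http_post_json):
    """Send the tweet text to the API and store the answer under `target`."""
    enriched = dict(event)
    try:
        enriched[target] = post(url, {"submit": event["tweet_content"]})
    except (OSError, ValueError) as error:
        # Hypothetical failure handling: tag the event instead of dropping it.
        enriched["tags"] = list(event.get("tags", [])) + ["_restfailure"]
        enriched["rest_error"] = str(error)
    return enriched


def fake_post(url, payload):
    """Stand-in for the real API so the sketch runs without a server."""
    return {"sentiment": "positive", "echo": payload["submit"]}


event = {"tweet_content": "I love Logstash"}
enriched_event = rest_filter(event, "http://localhost:5000/predict", post=fake_post)
print(enriched_event)
```

Passing the HTTP call in as the `post` argument keeps the enrichment logic testable without a running API; leaving the default in place sends a real request to the given URL.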
Time to do it yourself
•https://github.com/melvynator/Logstash_tutorial
•https://docs.google.com/forms/d/e/1FAIpQLSdDdFxmT5ZCXInaSohJBBo6fKKmCg3KLegeOOrxl1l4_sc-7g/viewform
Thank you!
Questions?
