BUSINESS
DASHBOARDS
using Bonobo, Airflow and Grafana
makersquad.fr
Romain Dorgueil

romain@makersquad.fr


Maker ~0 years
Containers ~5 years
Python ~10 years
Web ~15 years
Linux ~20 years
Code ~25 years
rdorgueil
Content
Intro. Product
1. Plan.
2. Implement.
3. Visualize.
4. Monitor.
5. Iterate.
Outro. References, Pointers, Sidekick
DISCLAIMERS
If you build a product, do your own research.

Take time to learn and understand tools you consider.





I don’t know nothing, and I recommend nothing.

I assume things and change my mind when I hit a wall.

I’m the creator and main developer of Bonobo ETL.

I’ll try to be objective, but there is a bias here.
PRODUCT
GET https://aprc.it/api/800x600/ep2018.europython.eu/
January 2009
timedelta(years=9)
February 2018
March 2018
April 2018
June 2018
July 2018
not expected
UNDER THE HOOD
[Architecture diagram]
- Load Balancer (TCP / L4) → Reverse Proxy (HTTP2 / L7)
- Website (django), API Server (tornado, several instances)
- Janitor (asyncio), Spiders (asyncio, several instances)
- “Events” Message Queue, “Orders” Message Queue
- Database (postgres), Object Storage, Redis, Local Cache
- Messages: “CRAWL”, “CREATED”, “MISS”
- Transports: AMQP/RabbitMQ, HTTP/HTTPS, SQL/Storage
[Architecture diagram]
- Load Balancer (TCP / L4) → Reverse Proxy (HTTP2 / L7); HTTP/HTTPS, SQL/Storage
- MANAGEMENT SERVICES: Prometheus, AlertManager, Grafana, Weblate
- EXTERNAL SERVICES: Google Analytics, Stripe, Slack, Sentry, Mailgun, Drift, MixMax, …
- PROMETHEUS EXPORTERS: Prometheus, Kubernetes, RabbitMQ, Redis, PostgreSQL, NGINX + VTS, Apercite, …
- Database (postgres)
PLAN
« If you can’t measure it, you can’t improve it. »
- Peter Drucker (?)
« What gets measured gets improved. »
- Peter Drucker (?)
Planning
- Take your time to choose metrics wisely.
- Cause vs. Effect.
- Less is More.
- One at a time. Or one per team.
- There is no single answer to this question.
Vanity metrics will waste your time.
Planning
- What business are you in?
- What stage are you at?
- Start with a framework.
- You may build your own, later.
Pirate Metrics (based on Ash Maurya’s version)
Lean Analytics (book by Alistair Croll & Benjamin Yoskovitz)
Plan A
- What business? Software as a Service
- What stage? Empathy / Stickiness
- What metric matters?
  - Rate from acquisition to activation.
  - QoS (both for display and to measure improvements).
IMPLEMENT
Idea
Data sources (anything) → aggregated as (dims, metrics) → database (metric → value)
Model
- Metric: (id) → name
- HourlyValue: (metric, date, hour) → value
- DailyValue: (metric, date) → value
One Metric has many HourlyValues and many DailyValues (1:n).
Quick to write.
Not the best.
Keywords to read more: Star and Snowflake Schemas
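
For illustration, here is a minimal sketch of that model using SQLAlchemy’s declarative API; the table names, column types and constraints are my assumptions, not the project’s actual schema:

from sqlalchemy import Column, Date, Float, ForeignKey, Integer, String, UniqueConstraint
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Metric(Base):
    __tablename__ = 'metric'
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True, nullable=False)

class HourlyValue(Base):
    __tablename__ = 'hourly_value'
    id = Column(Integer, primary_key=True)
    metric_id = Column(Integer, ForeignKey('metric.id'), nullable=False)
    date = Column(Date, nullable=False)
    hour = Column(Integer, nullable=False)
    value = Column(Float, nullable=False)
    # One row per (metric, date, hour): the natural key the writers upsert on.
    __table_args__ = (UniqueConstraint('metric_id', 'date', 'hour'),)

class DailyValue(Base):
    __tablename__ = 'daily_value'
    id = Column(Integer, primary_key=True)
    metric_id = Column(Integer, ForeignKey('metric.id'), nullable=False)
    date = Column(Date, nullable=False)
    value = Column(Float, nullable=False)
    # One row per (metric, date).
    __table_args__ = (UniqueConstraint('metric_id', 'date'),)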
Bonobo
Extract Transform Load
Select('''
    SELECT *
    FROM …
    WHERE …
''')

def qualify(row):
    yield (
        row,
        'active' if …
        else 'inactive'
    )

Join('''
    SELECT count(*)
    FROM …
    WHERE uid = %(id)s
''')

def report(row):
    send_email(
        render(
            'email.html', row
        )
    )
Bonobo
- Independent threads.
- Data is passed first in, first out.
- Supports any kind of directed acyclic graphs.
- Standard Python callables and iterators.
- Getting started still fits in here (ok, barely)
$ pip install bonobo
$ bonobo init somejob.py
$ python somejob.py
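
For orientation, a minimal Bonobo job follows the pattern below (a sketch of what `bonobo init` scaffolds; the node names are placeholders):

import bonobo

def extract():
    # Any generator can act as an extractor node.
    yield 'hello'
    yield 'world'

def transform(row):
    # Any callable can act as a transformation node.
    yield row.upper()

def get_graph():
    # Nodes chain into a directed acyclic graph; rows flow FIFO between them.
    return bonobo.Graph(extract, transform, print)

if __name__ == '__main__':
    bonobo.run(get_graph())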
Bonobo
Let’s write our jobs.
Extract
import datetime

from bonobo.config import use_context, Service
from bonobo_sqlalchemy.readers import Select

@use_context
class ObjectCountsReader(Select):
    engine = Service('website.engine')
    query = '''
        SELECT count(%(0)s.id) AS cnt
        FROM %(0)s
    '''
    output_fields = ['dims', 'metrics']

    def formatter(self, input_row, row):
        now = datetime.datetime.now()
        return ({
            'date': now.date(),
            'hour': now.hour,
        }, {
            'objects.{}.count'.format(input_row[1]): row['cnt']
        })
… counts from website’s database
Extract
# AsIs (psycopg2) injects the table names as raw SQL identifiers, not quoted values.
from psycopg2.extensions import AsIs

TABLES_METRICS = {
    AsIs('apercite_account_user'): 'users',
    AsIs('apercite_account_userprofile'): 'profiles',
    AsIs('apercite_account_apikey'): 'apikeys',
}

def get_readers():
    return [
        TABLES_METRICS.items(),
        ObjectCountsReader(),
    ]
… counts from website’s database
Normalize
bonobo.SetFields(['dims', 'metrics'])
All data should look the same
Load
class AnalyticsWriter(InsertOrUpdate):
    dims = Option(required=True)
    filter = Option(required=True)

    @property
    def discriminant(self):
        return ('metric_id', *self.dims)

    def get_or_create_metrics(self, context, connection, metrics):
        …

    def __call__(self, connection, table, buffer, context, row, engine):
        dims, metrics = row
        if not self.filter(dims, metrics):
            return
        # Get database rows for metric objects.
        db_metrics_ids = self.get_or_create_metrics(context, connection, metrics)
        # Insert or update values.
        for metric, value in metrics.items():
            yield from self._put(table, connection, buffer, {
                'metric_id': db_metrics_ids[metric],
                **{dim: dims[dim] for dim in self.dims},
                'value': value,
            })
Compose
def get_graph():
    normalize = bonobo.SetFields(['dims', 'metrics'])
    graph = bonobo.Graph(*get_readers(), normalize)

    graph.add_chain(
        AnalyticsWriter(
            table_name=HourlyValue.__tablename__,
            dims=('date', 'hour',),
            filter=lambda dims, metrics: 'hour' in dims,
            name='Hourly',
        ),
        _input=normalize,
    )
    graph.add_chain(
        AnalyticsWriter(
            table_name=DailyValue.__tablename__,
            dims=('date',),
            filter=lambda dims, metrics: 'hour' not in dims,
            name='Daily',
        ),
        _input=normalize,
    )
    return graph
Configure
def get_services():
    return {
        'sqlalchemy.engine': EventsDatabase().create_engine(),
        'website.engine': WebsiteDatabase().create_engine(),
    }
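
Services are resolved by name at run time; a minimal sketch of wiring the graph and services together (using the get_graph/get_services functions above):

import bonobo

if __name__ == '__main__':
    # Nodes declaring Service('website.engine') or 'sqlalchemy.engine'
    # receive the matching entries from this dict at execution time.
    bonobo.run(get_graph(), services=get_services())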
Inspect
$ bonobo inspect --graph job.py | dot -o graph.png -T png
Run
$ python -m apercite.analytics read objects --write
- dict_items in=1 out=3 [done]
- ObjectCountsReader in=3 out=3 [done]
- SetFields(['dims', 'metrics']) in=3 out=3 [done]
- HourlyAnalyticsWriter in=3 out=3 [done]
- DailyAnalyticsWriter in=3 [done]
Got it.
Let’s add readers.
We’ll run through, you’ll have the code.
Google Analytics
@use('google_analytics')
def read_analytics(google_analytics):
    reports = google_analytics.reports().batchGet(
        body={…}
    ).execute().get('reports', [])

    for report in reports:
        dimensions = report['columnHeader']['dimensions']
        metrics = report['columnHeader']['metricHeader']['metricHeaderEntries']
        rows = report['data']['rows']
        for row in rows:
            dim_values = zip(dimensions, row['dimensions'])
            yield (
                {
                    GOOGLE_ANALYTICS_DIMENSIONS.get(dim, [dim])[0]:
                        GOOGLE_ANALYTICS_DIMENSIONS.get(dim, [None, IDENTITY])[1](val)
                    for dim, val in dim_values
                },
                {
                    GOOGLE_ANALYTICS_METRICS.get(metric['name'], metric['name']):
                        GOOGLE_ANALYTICS_TYPES[metric['type']](value)
                    for metric, value in zip(metrics, row['metrics'][0]['values'])
                },
            )
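
The request body is elided above; for reference, a Google Analytics Reporting API v4 batchGet body generally has this shape (the view id, date range, metrics and dimensions below are placeholders):

body = {
    'reportRequests': [{
        'viewId': '<view-id>',  # placeholder
        'dateRanges': [{'startDate': '7daysAgo', 'endDate': 'today'}],
        'metrics': [{'expression': 'ga:sessions'}, {'expression': 'ga:users'}],
        'dimensions': [{'name': 'ga:date'}, {'name': 'ga:hour'}],
    }]
}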
Prometheus
class PrometheusReader(Configurable):
    http = Service('http')
    endpoint = 'http://{}:{}/api/v1'.format(PROMETHEUS_HOST, PROMETHEUS_PORT)
    queries = […]

    def __call__(self, *, http):
        start_at, end_at = self.get_timerange()
        for query in self.queries:
            for result in http.get(…).json().get('data', {}).get('result', []):
                metric = result.get('metric', {})
                for ts, val in result.get('values', []):
                    name = query.target.format(**metric)
                    _date, _hour = …
                    yield {
                        'date': _date,
                        'hour': _hour,
                    }, {
                        name: float(val)
                    }
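
The http.get(…) call is elided; underneath it is a plain Prometheus range query over the HTTP API, roughly as sketched below (the PromQL expression is a placeholder, and start_at/end_at are assumed to be datetimes from get_timerange):

import requests

resp = requests.get(endpoint + '/query_range', params={
    'query': 'sum(rate(http_requests_total[5m]))',  # placeholder PromQL
    'start': start_at.timestamp(),
    'end': end_at.timestamp(),
    'step': '1h',
})
# Same shape the reader consumes: data.result[].metric / data.result[].values
results = resp.json().get('data', {}).get('result', [])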
Spider counts
class SpidersReader(Select):
    kwargs = Option()
    output_fields = ['row']

    @property
    def query(self):
        return '''
            SELECT spider.value AS name,
                   spider.created_at AS created_at,
                   spider_status.attributes AS attributes,
                   spider_status.created_at AS updated_at
            FROM spider
            JOIN …
            WHERE spider_status.created_at > %(now)s
            ORDER BY spider_status.created_at DESC
        '''

    def formatter(self, input_row, row):
        return (row, )
Spider counts
def spider_reducer(left, right):
    result = dict(left)
    result['spider.total'] += len(right.attributes)
    for worker in right.attributes:
        if 'stage' in worker:
            result['spider.active'] += 1
        else:
            result['spider.idle'] += 1
    return result
Spider counts
now = datetime.datetime.utcnow() - datetime.timedelta(minutes=30)

def get_readers():
    return (
        SpidersReader(kwargs={'now': now}),
        Reduce(spider_reducer, initializer={
            'spider.idle': 0,
            'spider.active': 0,
            'spider.total': 0,
        }),
        (lambda x: ({'date': now.date(), 'hour': now.hour}, x)),
    )
You got the idea.
Inspect
We can generate ETL graphs with all readers or only a few.
Run
$ python -m apercite.analytics read all --write
- read_analytics in=1 out=91 [done]
- EventsReader in=1 out=27 [done]
- EventsTimingsReader in=1 out=2039 [done]
- group_timings in=2039 out=24 [done]
- format_timings_for_metrics in=24 out=24 [done]
- SpidersReader in=1 out=1 [done]
- Reduce in=1 out=1 [done]
- <lambda> in=1 out=1 [done]
- PrometheusReader in=1 out=3274 [done]
- dict_items in=1 out=3 [done]
- ObjectCountsReader in=3 out=3 [done]
- SetFields(['dims', 'metrics']) in=3420 out=3420 [done]
- HourlyAnalyticsWriter in=3420 out=3562 [done]
- DailyAnalyticsWriter in=3420 out=182 [done]
Easy to build.
Easy to add or replace parts.
Easy to run.
Told ya, slight bias.
VISUALIZE
Grafana
Analytics & Monitoring
Dashboards
[Dashboard screenshots]
Quality of Service
Public Dashboards
Acquisition Rate (User Counts + New Sessions)
We’re just getting started.
MONITOR
Iteration 0
- Cron job runs everything every 30 minutes.
- No way to know if something fails.
- Expensive tasks.
- Hard to run manually.
I’ll handle that.
Monitoring?
Airflow
« Airflow is a platform to programmatically author, schedule and monitor workflows. »
- Official docs
Airflow
- Created by Airbnb, joined the Apache incubator.
- Schedules & monitors jobs.
- Distributes workloads through Celery, Dask, K8S…
- Can run anything, not just Python.
Airflow
[Architecture diagram] Webserver, Scheduler, Metadata database, Workers (several)
Simplified to show the high-level concept.
Depends on executor (celery, dask, k8s, local, sequential, …)
DAGs

import shlex

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

def _get_bash_command(*args, module='apercite.analytics'):
    return '(cd /usr/local/apercite; /usr/local/env/bin/python -m {} {})'.format(
        module, ' '.join(map(shlex.quote, args)),
    )

def build_dag(name, *args, schedule_interval='@hourly'):
    # default_args and env are defined elsewhere in the module (see Data Sources).
    dag = DAG(
        name,
        schedule_interval=schedule_interval,
        default_args=default_args,
        catchup=False,
    )
    # Passing dag=dag registers the task on the DAG; operators, not DAGs,
    # are what gets chained with >>.
    BashOperator(
        dag=dag,
        task_id=args[0],
        bash_command=_get_bash_command(*args),
        env=env,
    )
    return dag
DAGs
# Build datasource-to-metrics-db related dags.
for source in ('google-analytics', 'events', 'events-timings', 'spiders',
               'prometheus', 'objects'):
    name = 'apercite.analytics.' + source.replace('-', '_')
    globals()[name] = build_dag(name, 'read', source, '--write')

# Cleanup dag.
name = 'apercite.analytics.cleanup'
globals()[name] = build_dag(name, 'clean', 'all', schedule_interval='@daily')
Data Sources
from airflow.models import Connection
from airflow.settings import Session

session = Session()
website = session.query(Connection).filter_by(conn_id='apercite_website').first()
events = session.query(Connection).filter_by(conn_id='apercite_events').first()
session.close()

env = {}

if website:
    env['DATABASE_HOST'] = str(website.host)
    env['DATABASE_PORT'] = str(website.port)
    env['DATABASE_USER'] = str(website.login)
    env['DATABASE_NAME'] = str(website.schema)
    env['DATABASE_PASSWORD'] = str(website.password)

if events:
    env['EVENT_DATABASE_USER'] = str(events.login)
    env['EVENT_DATABASE_NAME'] = str(events.schema)
    env['EVENT_DATABASE_PASSWORD'] = str(events.password)
Warning: sub-optimal
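
One lighter alternative (my suggestion, not what the slides do): resolve connections through Airflow’s hooks API instead of opening a raw metadata-database session at DAG-parse time. A sketch with Airflow 1.10-era imports:

from airflow.hooks.base_hook import BaseHook

# Raises if the connection id does not exist, unlike .first() returning None.
website = BaseHook.get_connection('apercite_website')
events = BaseHook.get_connection('apercite_events')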
Airflow
- Where to store the DAGs?
- Build: separate virtualenv.
- Everything we do runs locally.
- Deployment.
Learnings
- Multiple services, not trivial
- Helm charts :-(
- Astronomer Distro :-)
- Read the Source, Luke
ITERATE
Plan N+1
- Create a framework for
experiments.
- Timebox constraint
- Objective & Key Result
- Decide upon results.
Plan N+1 (read: Scaling Lean, by Ash Maurya)
Tech Side
- Month on Month
- Year on Year
- % Growth
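
All three boil down to the relative change between two periods; a trivial helper, with illustrative names:

def growth(previous, current):
    # Percentage growth between two periods (month on month, year on year…).
    return (current - previous) / previous * 100.0

growth(1000, 1200)  # -> 20.0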
Ideas …
- Revenue (stripe, billing …)
- Traffic & SEO (analytics, console …)
- Conversion (AARRR)
- Quality of Service
- Processing (AMQP, …)
- Service Level (HTTP statuses, Time of Requests …)
- Vanity metrics
- Business metrics
Pick one.
Rinse.
Repeat.
OUTRO
Airflow helps you manage the whole factory.

Does not care about the jobs’ content.

Bonobo helps you build assembly lines.

Does not care about the surrounding factory.
References
Feedback + Bonobo ETL Sprint
Slides, resources, feedback …
apercite.fr/europython
romain@makersquad.fr rdorgueil
