RTAS 2023: Building a Real-Time IoT Application
https://rtasummit.com/
Apache Pulsar, Apache Pinot, Apache Flink, Apache Kafka, Apache NiFi, FLaNK Stack, IoT
https://rtasummit.com/session/building-a-real-time-iot-application-with-apache-pulsar-and-apache-pinot/
Building a Real-Time IoT Application with Apache Pulsar and Apache Pinot
Timothy Spann
Cloudera
Time: Wednesday, April 26, 11:00 am
Location: Nikko Ballroom I & II, 3rd Floor
We will walk step by step, with live code and demos, through how to build a real-time IoT application with Pinot + Pulsar.
First, we stream sensor data from an edge device monitoring location conditions to Pulsar via a Python application.
We have our Apache Pinot “realtime” table connected to Pulsar via the pinot-pulsar stream ingestion connector.
Our data then streams into the Pinot table, and we visualize it with Superset.
https://medium.com/@tspann/building-a-real-time-iot-application-with-apache-pulsar-and-apache-pinot-1e3baf8c1824
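On the producer side, here is a minimal sketch in Python using the pulsar-client library (the broker URL and field values are illustrative; the full application is in the source repo below):

import json
import pulsar

# Connect to a local standalone Pulsar broker.
client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer('persistent://public/default/thermalsensors')

# An illustrative reading shaped like the sample record shown later.
reading = {
    'uuid': 'thrml_qsx_20221121215610',
    'temperature': 27.9069,
    'humidity': 24.89,
    'co2': 698.0,
    'ts': 1669067775
}

# Publish the reading as a JSON-encoded message.
producer.send(json.dumps(reading).encode('utf-8'))

client.close()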
Source Code
https://github.com/tspannhw/pulsar-thermal-pinot
References
https://docs.pinot.apache.org/basics/data-import/pinot-stream-ingestion/apache-pulsar
https://dev.startree.ai/docs/pinot/recipes/pulsar
Create Topic in Pulsar
# Remove the topic if it already exists
bin/pulsar-admin topics delete persistent://public/default/thermalsensors
# Create it as a partitioned topic with a single partition
# (use "topics create" instead for a non-partitioned topic)
bin/pulsar-admin topics create-partitioned-topic --partitions 1 persistent://public/default/thermalsensors
Consume Data in Pulsar
bin/pulsar-client consume "persistent://public/default/thermalsensors" -s "thrmlsnosconsumer" -n 0
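The same data can be consumed from Python; a minimal sketch with the pulsar-client library, reusing the subscription name from the CLI example above:

import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# Subscribe with the same subscription name as the CLI consumer.
consumer = client.subscribe('persistent://public/default/thermalsensors',
                            subscription_name='thrmlsnosconsumer')

# Print each JSON message as it arrives, acknowledging on receipt.
while True:
    msg = consumer.receive()
    print(msg.data().decode('utf-8'))
    consumer.acknowledge(msg)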
Pulsar DevOps (Admin REST API)
curl http://localhost:8080/admin/v2/persistent/public/default
curl http://localhost:8080/admin/v2/persistent/public/default/thermalsensors-partition-0/stats
curl "http://localhost:8080/admin/v2/persistent/public/default/thermalsensors/partitions?createLocalTopicOnly=false"
Data
{
  "uuid": "thrml_qsx_20221121215610",
  "ipaddress": "192.168.1.179",
  "cputempf": 115,
  "runtime": 0,
  "host": "thermal",
  "hostname": "thermal",
  "macaddress": "e4:5f:01:7c:3f:34",
  "endtime": "1669067770.6400402",
  "te": "0.0005550384521484375",
  "cpu": 4.5,
  "diskusage": "102676.2 MB",
  "memory": 9.7,
  "rowid": "20221121215610_8e753591-cb7c-4e1c-886d-85cb3dba6c50",
  "systemtime": "11/21/2022 16:56:15",
  "ts": 1669067775,
  "starttime": "11/21/2022 16:56:10",
  "datetimestamp": "2022-11-21 21:56:14.404291+00:00",
  "temperature": 27.9069,
  "humidity": 24.89,
  "co2": 698.0,
  "totalvocppb": 0.0,
  "equivalentco2ppm": 65535.0,
  "pressure": 102048.65,
  "temperatureicp": 82.0
}
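The Pinot realtime table ingests this topic through the pinot-pulsar connector via the table's streamConfigs. Here is a sketch of the relevant section, with key names taken from the Pinot Pulsar ingestion docs linked above (verify against your Pinot version):

"streamConfigs": {
  "streamType": "pulsar",
  "stream.pulsar.topic.name": "persistent://public/default/thermalsensors",
  "stream.pulsar.bootstrap.servers": "pulsar://localhost:6650",
  "stream.pulsar.consumer.type": "lowlevel",
  "stream.pulsar.consumer.prop.auto.offset.reset": "smallest",
  "stream.pulsar.consumer.factory.class.name": "org.apache.pinot.plugin.stream.pulsar.PulsarConsumerFactory",
  "stream.pulsar.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder"
}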
Continuous Analytics with Flink SQL (Pulsar-Flink 1.15+ Connector)
Reference: https://github.com/tspannhw/pulsar-transit-function
CREATE CATALOG pulsar WITH (
    'type' = 'pulsar-catalog',
    'catalog-service-url' = 'pulsar://localhost:6650',
    'catalog-admin-url' = 'http://localhost:8080'
);
USE CATALOG pulsar;
SHOW DATABASES;
SHOW CURRENT DATABASE;
SET 'table.dynamic-table-options.enabled' = 'true';
Slides
1. Building a Real-Time IoT Application
Tim Spann
Principal Developer Advocate
26-April-2023
3. FLiPN-FLaNK Stack
Tim Spann
@PaasDev // Blog: www.datainmotion.dev
Principal Developer Advocate, Cloudera
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://github.com/tspannhw/EverythingApacheNiFi
https://medium.com/@tspann
Apache NiFi x Apache Kafka x Apache Flink x Java
5.
● Introduction to Pinot
● Introduction to Apache Pulsar
● NiFi to Pulsar to Pinot (FLiPN)
● NiFi to Kafka to Pinot (P-FLaNK)
● FLaNK Ingest
● Demos
14. STREAMING FROM ... TO ... WHILE ...
Data distribution as a first-class citizen.
[Diagram: a Universal Data Distribution layer (ingest, transform, deliver) connects IoT devices, log data sources (app logs, laptops/servers, mobile apps, security agents), and on-prem data sources to big data cloud services, cloud business process services, cloud data analytics services (Cloudera DW), and cloud warehouses, via ingest processors, an ingest gateway, router/filter/transform processors, and destination processors.]
15. End-to-End Streaming Pipeline Example
[Diagram: enterprise sources, clickstream, market data, machine logs, social, weather, and stocks feed an ETL, SQL, and analytics pipeline that produces errors, aggregates, and alerts.]
19. CSP Community Edition
● Kafka, Kafka Connect, SMM, Schema Registry, Flink, and SSB, all running in Docker
● Docker compose file of CSP to run from the command line without any dependencies
  ○ $> docker compose up
● Try new features quickly
● Develop applications locally
● Licensed under the Cloudera Community License
● Unsupported
● Community Group Hub for CSP
● Find it on docs.cloudera.com under Applications
21. Cloudera Flow and Edge Management
Enable easy ingestion, routing, management, and delivery of any data anywhere (edge, cloud, data center) to any downstream system, with built-in end-to-end security and provenance. Advanced tooling to industrialize flow development (Flow Development Life Cycle).
ACQUIRE (FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG)
• Over 300 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merge & Route
• Throttle & Backpressure
PROCESS (HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ENCRYPT, TAIL, EVALUATE, EXECUTE, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, ROUTE RATE, DISTRIBUTE LOAD)
DELIVER (FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG)
• Guaranteed Delivery
• Full data provenance from acquisition to delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
22. Cloudera DataFlow: Universal Data Distribution Service
Connect to any data source anywhere, then process and deliver to any destination.
[Diagram: data born in the cloud is pulled through connectors; data born outside the cloud is sent through a gateway endpoint (active and passive ingest); the service then processes (route, filter, enrich, transform) and distributes through connectors to any destination.]
24. Apache NiFi
Enable easy ingestion, routing, management, and delivery of any data anywhere (edge, cloud, data center) to any downstream system, with built-in end-to-end security and provenance. Advanced tooling to industrialize flow development (Flow Development Life Cycle).
ACQUIRE (FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG)
• Over 300 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merge & Route
• Throttle & Backpressure
PROCESS (HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, CONTROL RATE, DISTRIBUTE LOAD, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ENCRYPT, TAIL, EVALUATE, EXECUTE)
DELIVER (FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG)
• Guaranteed Delivery
• Full data provenance from acquisition to delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
28. Flink SQL

-- specify Kafka partition key on output
SELECT foo AS _eventKey FROM sensors;

-- use event time timestamp from Kafka (exactly-once compatible)
SELECT eventTimestamp FROM sensors;

-- access nested structures (nested columns must be quoted)
SELECT foo.`bar` FROM `table`;

-- timestamps
SELECT * FROM payments
WHERE eventTimestamp > CURRENT_TIMESTAMP - INTERVAL '10' SECOND;

-- unnest
SELECT b.*, u.*
FROM bgp_avro b,
UNNEST(b.path) AS u(pathitem);

-- aggregations and windows
SELECT card,
       MAX(amount) AS theamount,
       TUMBLE_END(eventTimestamp, INTERVAL '5' MINUTE) AS ts
FROM payments
WHERE lat IS NOT NULL
  AND lon IS NOT NULL
GROUP BY card,
         TUMBLE(eventTimestamp, INTERVAL '5' MINUTE)
HAVING COUNT(*) > 4; -- more than 4 in the window == fraud

-- try to do this, ksql!
SELECT us_west.user_score + ap_south.user_score
FROM kafka_in_zone_us_west us_west
FULL OUTER JOIN kafka_in_zone_ap_south ap_south
ON us_west.user_id = ap_south.user_id;
Key Takeaway: Rich SQL grammar with advanced time and aggregation tools