RTAS 2023: Building a Real-Time IoT Application
https://rtasummit.com/
Apache Pulsar, Apache Pinot, Apache Flink, Apache Kafka, Apache NiFi, FLaNK Stack, IoT
https://rtasummit.com/session/building-a-real-time-iot-application-with-apache-pulsar-and-apache-pinot/
Building a Real-Time IoT Application with Apache Pulsar and Apache Pinot
Timothy Spann
Cloudera
Time: Wednesday, April 26, 11:00 am
Location: Nikko Ballroom I & II, 3rd Floor
We will walk step by step, with live code and demos, through how to build a real-time IoT application with Pinot + Pulsar.
First, we stream sensor data from an edge device monitoring location conditions to Pulsar via a Python application.
We have our Apache Pinot “realtime” table connected to Pulsar via the pinot-pulsar stream ingestion connector.
Our data then streams into the Pinot table, and we visualize it with Superset.
https://medium.com/@tspann/building-a-real-time-iot-application-with-apache-pulsar-and-apache-pinot-1e3baf8c1824
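On the producer side, here is a minimal sketch in Python using the pulsar-client library (the broker URL and field values are illustrative; the full application is in the source repo below):

import json
import pulsar

# Connect to a local standalone Pulsar broker.
client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer('persistent://public/default/thermalsensors')

# An illustrative reading shaped like the sample record shown later.
reading = {
    'uuid': 'thrml_qsx_20221121215610',
    'temperature': 27.9069,
    'humidity': 24.89,
    'co2': 698.0,
    'ts': 1669067775
}

# Publish the reading as a JSON-encoded message.
producer.send(json.dumps(reading).encode('utf-8'))

client.close()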
Source Code
https://github.com/tspannhw/pulsar-thermal-pinot
References
https://docs.pinot.apache.org/basics/data-import/pinot-stream-ingestion/apache-pulsar
https://dev.startree.ai/docs/pinot/recipes/pulsar
Create Topic in Pulsar
# Remove the topic if it already exists
bin/pulsar-admin topics delete persistent://public/default/thermalsensors
# Create it as a partitioned topic with a single partition
# (use "topics create" instead for a non-partitioned topic)
bin/pulsar-admin topics create-partitioned-topic --partitions 1 persistent://public/default/thermalsensors
Consume Data in Pulsar
bin/pulsar-client consume "persistent://public/default/thermalsensors" -s "thrmlsnosconsumer" -n 0
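The same data can be consumed from Python; a minimal sketch with the pulsar-client library, reusing the subscription name from the CLI example above:

import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# Subscribe with the same subscription name as the CLI consumer.
consumer = client.subscribe('persistent://public/default/thermalsensors',
                            subscription_name='thrmlsnosconsumer')

# Print each JSON message as it arrives, acknowledging on receipt.
while True:
    msg = consumer.receive()
    print(msg.data().decode('utf-8'))
    consumer.acknowledge(msg)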
Pulsar DevOps (Admin REST API)
curl http://localhost:8080/admin/v2/persistent/public/default
curl http://localhost:8080/admin/v2/persistent/public/default/thermalsensors-partition-0/stats
curl "http://localhost:8080/admin/v2/persistent/public/default/thermalsensors/partitions?createLocalTopicOnly=false"
Data
{
  "uuid": "thrml_qsx_20221121215610",
  "ipaddress": "192.168.1.179",
  "cputempf": 115,
  "runtime": 0,
  "host": "thermal",
  "hostname": "thermal",
  "macaddress": "e4:5f:01:7c:3f:34",
  "endtime": "1669067770.6400402",
  "te": "0.0005550384521484375",
  "cpu": 4.5,
  "diskusage": "102676.2 MB",
  "memory": 9.7,
  "rowid": "20221121215610_8e753591-cb7c-4e1c-886d-85cb3dba6c50",
  "systemtime": "11/21/2022 16:56:15",
  "ts": 1669067775,
  "starttime": "11/21/2022 16:56:10",
  "datetimestamp": "2022-11-21 21:56:14.404291+00:00",
  "temperature": 27.9069,
  "humidity": 24.89,
  "co2": 698.0,
  "totalvocppb": 0.0,
  "equivalentco2ppm": 65535.0,
  "pressure": 102048.65,
  "temperatureicp": 82.0
}
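The Pinot realtime table ingests this topic through the pinot-pulsar connector via the table's streamConfigs. Here is a sketch of the relevant section, with key names taken from the Pinot Pulsar ingestion docs linked above (verify against your Pinot version):

"streamConfigs": {
  "streamType": "pulsar",
  "stream.pulsar.topic.name": "persistent://public/default/thermalsensors",
  "stream.pulsar.bootstrap.servers": "pulsar://localhost:6650",
  "stream.pulsar.consumer.type": "lowlevel",
  "stream.pulsar.consumer.prop.auto.offset.reset": "smallest",
  "stream.pulsar.consumer.factory.class.name": "org.apache.pinot.plugin.stream.pulsar.PulsarConsumerFactory",
  "stream.pulsar.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder"
}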
Continuous Analytics with Flink SQL (Pulsar-Flink 1.15+ Connector)
Reference: https://github.com/tspannhw/pulsar-transit-function
CREATE CATALOG pulsar WITH (
    'type' = 'pulsar-catalog',
    'catalog-service-url' = 'pulsar://localhost:6650',
    'catalog-admin-url' = 'http://localhost:8080'
);
USE CATALOG pulsar;
SHOW DATABASES;
SHOW CURRENT DATABASE;
SET 'table.dynamic-table-options.enabled' = 'true';
Slides
1. Building a Real-Time IoT Application
Tim Spann
Principal Developer Advocate
26-April-2023
3. FLiPN-FLaNK Stack
Tim Spann
@PaasDev // Blog: www.datainmotion.dev
Principal Developer Advocate, Cloudera
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://github.com/tspannhw/EverythingApacheNiFi
https://medium.com/@tspann
Apache NiFi x Apache Kafka x Apache Flink x Java
5.
● Introduction to Pinot
● Introduction to Apache Pulsar
● NiFi to Pulsar to Pinot (FLiPN)
● NiFi to Kafka to Pinot (P-FLaNK)
● FLaNK Ingest
● Demos
14. STREAMING FROM ... TO ... WHILE ...
Data distribution as a first-class citizen.
[Diagram: a Universal Data Distribution layer (ingest, transform, deliver) connects IoT devices, log data sources (app logs, laptops/servers, mobile apps, security agents), and on-prem data sources to big data cloud services, cloud business process services, cloud data analytics services (Cloudera DW), and cloud warehouses, via ingest processors, an ingest gateway, router/filter/transform processors, and destination processors.]
15. End-to-End Streaming Pipeline Example
[Diagram: enterprise sources, clickstream, market data, machine logs, social, weather, and stocks feed an ETL, SQL, and analytics pipeline that produces errors, aggregates, and alerts.]
19. CSP Community Edition
● Kafka, Kafka Connect, SMM, Schema Registry, Flink, and SSB, all running in Docker
● Docker compose file of CSP to run from the command line without any dependencies
  ○ $> docker compose up
● Try new features quickly
● Develop applications locally
● Licensed under the Cloudera Community License
● Unsupported
● Community Group Hub for CSP
● Find it on docs.cloudera.com under Applications
21. Cloudera Flow and Edge Management
Enable easy ingestion, routing, management, and delivery of any data anywhere (edge, cloud, data center) to any downstream system, with built-in end-to-end security and provenance. Advanced tooling to industrialize flow development (Flow Development Life Cycle).
ACQUIRE (FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG)
• Over 300 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merge & Route
• Throttle & Backpressure
PROCESS (HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ENCRYPT, TAIL, EVALUATE, EXECUTE, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, ROUTE RATE, DISTRIBUTE LOAD)
DELIVER (FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG)
• Guaranteed Delivery
• Full data provenance from acquisition to delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
22. Cloudera DataFlow: Universal Data Distribution Service
Connect to any data source anywhere, then process and deliver to any destination.
[Diagram: data born in the cloud is pulled through connectors; data born outside the cloud is sent through a gateway endpoint (active and passive ingest); the service then processes (route, filter, enrich, transform) and distributes through connectors to any destination.]
24. Apache NiFi
Enable easy ingestion, routing, management, and delivery of any data anywhere (edge, cloud, data center) to any downstream system, with built-in end-to-end security and provenance. Advanced tooling to industrialize flow development (Flow Development Life Cycle).
ACQUIRE (FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG)
• Over 300 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merge & Route
• Throttle & Backpressure
PROCESS (HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, CONTROL RATE, DISTRIBUTE LOAD, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ENCRYPT, TAIL, EVALUATE, EXECUTE)
DELIVER (FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG)
• Guaranteed Delivery
• Full data provenance from acquisition to delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
28. Flink SQL

-- specify Kafka partition key on output
SELECT foo AS _eventKey FROM sensors;

-- use event time timestamp from Kafka (exactly-once compatible)
SELECT eventTimestamp FROM sensors;

-- access nested structures (nested columns must be quoted)
SELECT foo.`bar` FROM `table`;

-- timestamps
SELECT * FROM payments
WHERE eventTimestamp > CURRENT_TIMESTAMP - INTERVAL '10' SECOND;

-- unnest
SELECT b.*, u.*
FROM bgp_avro b,
UNNEST(b.path) AS u(pathitem);

-- aggregations and windows
SELECT card,
       MAX(amount) AS theamount,
       TUMBLE_END(eventTimestamp, INTERVAL '5' MINUTE) AS ts
FROM payments
WHERE lat IS NOT NULL
  AND lon IS NOT NULL
GROUP BY card,
         TUMBLE(eventTimestamp, INTERVAL '5' MINUTE)
HAVING COUNT(*) > 4; -- more than 4 in the window == fraud

-- try to do this, ksql!
SELECT us_west.user_score + ap_south.user_score
FROM kafka_in_zone_us_west us_west
FULL OUTER JOIN kafka_in_zone_ap_south ap_south
ON us_west.user_id = ap_south.user_id;
Key Takeaway: Rich SQL grammar with advanced time and aggregation tools