SlideShare ist ein Scribd-Unternehmen logo
1 von 139
Downloaden Sie, um offline zu lesen
Streaming Data Platforms for
Cloud-Native Event-driven Applications
Apache Pulsar +
Apache NiFi +
Apache Flink +
Cloudera
Tim Spann
Agenda
1. Welcome
2. Introduction to Apache Pulsar
• Basics of Pulsar
• Use Cases
3. Apache NiFi <-> Apache Pulsar
4. Apache Flink <-> Apache Pulsar
5. Let’s Build an App!
• Demo
4. Resources
5. Q&A
Tim Spann
Developer Advocate
Tim Spann
Developer Advocate at StreamNative
● FLiP(N) Stack = Flink, Pulsar and NiFi Stack
● Streaming Systems & Data Architecture Expert
● Experience:
○ 15+ years of experience with streaming technologies including Pulsar, Flink, Spark, NiFi, Big
Data, Cloud, MXNet, IoT, Python and more.
○ Today, he helps to grow the Pulsar community sharing rich technical knowledge and experience
at both global conferences and through individual conversations.
FLiP Stack Weekly
This week in Apache Flink, Apache Pulsar, Apache
NiFi, Apache Spark and open source friends.
https://bit.ly/32dAJft
Proprietary & Confidential |
Apache Pulsar made easy.
Apache Pulsar is a Cloud-Native
Messaging and Event-Streaming Platform.
CREATED
Originally
developed inside
Yahoo! as Cloud
Messaging
Service
GROWTH
10x Contributors
10MM+ Downloads
Ecosystem Expands
Kafka on Pulsar
AMQ on Pulsar
Functions
. . .
2012 2016 2018 TODAY
APACHE TLP
Pulsar
becomes
Apache top
level project.
OPEN SOURCE
Pulsar
committed
to open source.
Apache Pulsar Timeline
Evolution of Pulsar Growth
Pulsar Has a Built-in Super Set of OSS
Features
Durability
Scalability Geo-Replication
Multi-Tenancy
Unified Messaging
Model
Reduced Vendor Dependency
Functions
Open-Source Features
Schema Registry
schema-1 (value=Avro/Protobuf/JSON) schema-2
(value=Avro/Protobuf/JSON)
schema-3
(value=Avro/Protobuf/JSON)
Schema
Data
ID
Local Cache
for Schemas
+
Schema
Data
ID +
Local Cache
for Schemas
Send schema-1
(value=Avro/Protobuf/JSON) data
serialized per schema ID
Send (register)
schema (if not in
local cache)
Read schema-1
(value=Avro/Protobuf/JSON) data
deserialized per schema ID
Get schema by ID (if
not in local cache)
Producers Consumers
Integrated Schema Registry
Ideal for app and data tiers
Less sprawl and better
utilization
Cloud-native scalability
Build globally without
the complexity
Cost effective long-term
storage
Pulsar across the
organization
Joining Streams in SQL
Perform in Real-Time
Ordering and Arrival
Concurrent Consumers
Change Data Capture
Data Streaming
Streaming
Consumer
Consumer
Consumer
Subscription
Shared
Failover
Consumer
Consumer
Subscription
In case of failure in
Consumer B-0
Consumer
Consumer
Subscription
Exclusive
X
Consumer
Consumer
Key-Shared
Subscription
Pulsar
Topic/Partition
Messaging
Messages - the basic unit of Pulsar
Component Description
Value / data payload The data carried by the message. All Pulsar messages contain raw bytes, although
message data can also conform to data schemas.
Key Messages are optionally tagged with keys, used in partitioning and also is useful for
things like topic compaction.
Properties An optional key/value map of user-defined properties.
Producer name The name of the producer who produces the message. If you do not specify a producer
name, the default name is used. Message De-Duplication.
Sequence ID Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID of
the message is its order in that sequence. Message De-Duplication.
Connectivity
• Libraries - (Java, Python, Go, NodeJS,
WebSockets, C++, C#, Scala, Rust,...)
• Functions - Lightweight Stream
Processing (Java, Python, Go)
• Connectors - Sources & Sinks
(Cassandra, Kafka, …)
• Protocol Handlers - AoP (AMQP), KoP
(Kafka), MoP (MQTT)
• Processing Engines - Flink, Spark,
Presto/Trino via Pulsar SQL
• Data Offloaders - Tiered Storage - (S3)
hub.streamnative.io
Schema Registry
Schema Registry
schema-1 (value=Avro/Protobuf/JSON) schema-2 (value=Avro/Protobuf/JSON) schema-3
(value=Avro/Protobuf/JSON)
Schema
Data
ID
Local Cache
for Schemas
+
Schema
Data
ID +
Local Cache
for Schemas
Send schema-1
(value=Avro/Protobuf/JSON) data
serialized per schema ID
Send (register)
schema (if not in
local cache)
Read schema-1
(value=Avro/Protobuf/JSON) data
deserialized per schema ID
Get schema by ID (if
not in local cache)
Producers Consumers
Apache Pulsar has a vibrant community
560+
Contributors
10,000+
Commits
7,000+
Slack Members
1,000+
Organizations
Using Pulsar
Unified
Messaging
Platform
Guaranteed
Message
Delivery
Resiliency Infinite
Scalability
The Basics
● Serverless computing framework.
● Unbounded storage, multi-tiered
architecture, and tiered-storage.
● Streaming & Pub/Sub messaging
semantics.
● Multi-protocol support.
● Open Source
● Cloud-Native
● Multi-Tenant
Features
Message Queuing
Data Streaming
Unified Messaging & Streaming Platform
Apache Pulsar features
Cloud native with decoupled
storage and compute layers.
Built-in compatibility with your
existing code and messaging
infrastructure.
Geographic redundancy and high
availability included.
Centralized cluster management
and oversight.
Elastic horizontal and vertical
scalability.
Seamless and instant partitioning
rebalancing with no downtime.
Flexible subscription model
supports a wide array of use cases.
Compatible with the tools you use
to store, analyze, and process data.
● “Bookies”
● Stores messages and cursors
● Messages are grouped in
segments/ledgers
● A group of bookies form an
“ensemble” to store a ledger
● “Brokers”
● Handles message routing and
connections
● Stateless, but with caches
● Automatic load-balancing
● Topics are composed of
multiple segments
●
● Stores metadata for both
Pulsar and BookKeeper
● Service discovery
Store
Messages
Metadata &
Service Discovery
Metadata &
Service Discovery
Pulsar Cluster
Metadata
Storage
Pulsar Cluster
Tenants
(Compliance)
Tenants
(Data Services)
Namespace
(Microservices)
Topic-1
(Cust Auth)
Topic-1
(Location Resolution)
Topic-2
(Demographics)
Topic-1
(Budgeted Spend)
Topic-1
(Acct History)
Topic-1
(Risk Detection)
Namespace
(ETL)
Namespace
(Campaigns)
Namespace
(ETL)
Tenants
(Marketing)
Namespace
(Risk Assessment)
Pulsar Cluster
Tenant - Namespaces - Topics
Component Description
Value / data payload The data carried by the message. All Pulsar messages contain raw bytes, although message data
can also conform to data schemas.
Key Messages are optionally tagged with keys, used in partitioning and also is useful for things like
topic compaction.
Properties An optional key/value map of user-defined properties.
Producer name The name of the producer who produces the message. If you do not specify a producer name, the
default name is used.
Sequence ID Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID of the
message is its order in that sequence.
Messages
Streaming
Consumer
Consumer
Consumer
Subscription
Shared
Failover
Consumer
Consumer
Subscription
In case of failure in
Consumer B-0
Consumer
Consumer
Subscription
Exclusive
X
Consumer
Consumer
Key-Shared
Subscription
Pulsar
Topic/Partition
Messaging
Schema Registry
schema-1 (value=Avro/Protobuf/JSON) schema-2
(value=Avro/Protobuf/JSON)
schema-3
(value=Avro/Protobuf/JSON)
Schema
Data
ID
Local Cache
for Schemas
+
Schema
Data
ID +
Local Cache
for Schemas
Send schema-1
(value=Avro/Protobuf/JSON) data
serialized per schema ID
Send (register)
schema (if not in
local cache)
Read schema-1
(value=Avro/Protobuf/JSON) data
deserialized per schema ID
Get schema by ID (if
not in local cache)
Producers Consumers
Integrated Schema Registry
Flexible Pub/Sub API for Pulsar - Shared
Consumer consumer = client.newConsumer()
.topic("my-topic")
.subscriptionName("work-q-1")
.subscriptionType(SubType.Shared)
.subscribe();
Flexible Pub/Sub API for Pulsar - Failover
Consumer consumer = client.newConsumer()
.topic("my-topic")
.subscriptionName("stream-1")
.subscriptionType(SubType.Failover)
.subscribe();
Reader Interface
byte[] msgIdBytes = // Some byte
array
MessageId id =
MessageId.fromByteArray(msgIdBytes);
Reader<byte[]> reader =
pulsarClient.newReader()
.topic(topic)
.startMessageId(id)
.create();
Create a reader that will read from
some message between earliest and
latest.
Reader
Built-in
Back
Pressure
Producer<String> producer = client.newProducer(Schema.STRING)
.topic("hellotopic")
.blockIfQueueFull(true) // enable blocking
// when queue is full
.maxPendingMessages(10) // max queue size
.create();
// During Back Pressure: the sendAsync call blocks
// with no room in queues
producer.newMessage()
.key("mykey")
.value("myvalue")
.sendAsync(); // can be a blocking call
• Functions - Lightweight Stream
Processing (Java, Python, Go)
• Connectors - Sources & Sinks
(Cassandra, Kafka, …)
• Protocol Handlers - AoP (AMQP), KoP
(Kafka), MoP (MQTT)
• Processing Engines - Flink, Spark,
Presto/Trino via Pulsar SQL
• Data Offloaders - Tiered Storage - (S3)
Tenant - Namespaces - Topics
Kafka on Pulsar (KoP)
MQTT on Pulsar (MoP)
AMQP on Pulsar (AoP)
Pulsar Functions
● Lightweight computation
similar to AWS Lambda.
● Specifically designed to use
Apache Pulsar as a message
bus.
● Function runtime can be
located within Pulsar Broker.
A serverless event streaming
framework
● Consume messages from one
or more Pulsar topics.
● Apply user-supplied
processing logic to each
message.
● Publish the results of the
computation to another topic.
● Support multiple
programming languages (Java,
Python, Go)
● Can leverage 3rd-party
libraries to support the
execution of ML models on
the edge.
Pulsar Functions
Moving Data In and Out of Pulsar
IO/Connectors are a simple way to integrate with external systems and move
data in and out of Pulsar. https://pulsar.apache.org/docs/en/io-jdbc-sink/
● Built on top of Pulsar Functions
● Built-in connectors - hub.streamnative.io
Source Sink
Kafka-on-Pulsar (Kop)
Apache NiFi Pulsar Connector
https://github.com/streamnative/pulsar-nifi-bundle
Apache NiFi Pulsar Connector
https://github.com/david-streamlio/pulsar-nifi-bundle
Apache NiFi Pulsar Connector
https://www.datainmotion.dev/2021/11/producing-and-consuming-pulsar-messages.html
Apache NiFi Pulsar Connector
Apache NiFi Pulsar Connector
Apache NiFi Pulsar Connector
https://github.com/david-streamlio/pulsar-nifi-bundle/releases/tag/v1.14.0
Events <->
Streaming FLiPS Apps
StreamNative Hub
StreamNative Cloud
Unified Batch and Stream COMPUTING
Batch
(Batch + Stream)
Unified Batch and Stream STORAGE
Offload
(Queuing + Streaming)
Tiered Storage
Pulsar
---
KoP
---
MoP
---
Websocket
Pulsar
Sink
Streaming
Edge Gateway
Protocols
<-> Events <->
CDC
Apps
Building Real-Time Requires a Team
Building An App
Code Along With Tim
<<DEMO>>
RESOURCES
Here are resources to continue your journey
with Apache Pulsar
Links
● https://github.com/tspannhw/Flip-iot
● https://github.com/streamnative/examples/tree/master/cloud/go
● https://pulsar.apache.org/docs/en/client-libraries-go/
● https://github.com/apache/pulsar-client-go
● https://github.com/tspannhw/Meetup-YourFirstEventDrivenApp
● https://github.com/tspannhw/SpeakerProfile
● https://github.com/tspannhw/pulsar-flinksql-1.13.2
● https://github.com/tspannhw/pulsar-pychat-function
● https://www.datainmotion.dev/
● https://www.meetup.com/SF-Bay-Area-Apache-Pulsar-Meetup/events/283083105/
● https://medium.com/@tspann
● https://dzone.com/users/297029/bunkertor.html
● https://dev.to/tspannhw
● https://linktr.ee/tspannhw
FLiP Stack Weekly
This week in Apache Flink, Apache Pulsar, Apache
NiFi, Apache Spark and open source friends.
https://bit.ly/32dAJft
Scan the QR code
to learn more about
Apache Pulsar and
StreamNative.
Scan the QR code
to build your own
apps today.
Deploying AI With an
Event-Driven
Platform
https://dzone.com/trendreports/enterprise-ai-1
Let’s Keep
in Touch!
Tim Spann
Developer Advocate
@PaaSDev
https://www.linkedin.com/in/timothyspann
https://github.com/tspannhw
@TimPulsar@hachyderm.io
https://hachyderm.io/@TimPulsar
61
Additional Q&A
Data Aggregation
Concentrate data for multiple clusters into a single one where main processing
happens
Active - Active
Processing happens identical in both clusters.
Active - Passive
Only active cluster processes data - Passive cluster is for failover
Java Function
● Lightweight computation similar
to AWS Lambda.
● Specifically designed to use
Apache Pulsar as a message
bus.
● Function runtime can be
located within Pulsar Broker.
● Java Functions
A serverless event
streaming framework
Pulsar Functions
Integrated with pulsar-admin CLI
● Consume messages from one or
more Pulsar topics.
● Apply user-supplied processing
logic to each message.
● Publish the results of the
computation to another topic.
● Support multiple programming
languages (Java, Python, Go)
● Can leverage 3rd-party libraries
to support the execution of ML
models on the edge.
Pulsar Functions
Simple Native Java Function
Simple Function with Pulsar SDK
from pulsar import Function
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import json
class Chat(Function):
def __init__(self):
pass
def process(self, input, context):
fields = json.loads(input)
sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(fields["comment"])
row = { }
row['id'] = str(msg_id)
if ss['compound'] < 0.00:
row['sentiment'] = 'Negative'
else:
row['sentiment'] = 'Positive'
row['comment'] = str(fields["comment"])
json_string = json.dumps(row)
return json_string
Entire Function
Pulsar Python NLP Function
https://www.influxdata.com/integration/mqtt-monitoring/
https://www.influxdata.com/integration/mqtt-monitoring/
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Supports push and pull
models
• Hundreds of processors
• Visual command and
control
• Over a 300 components
• Flow templates
• Pluggable/multi-role
security
• Designed for extension
• Clustering
• Version Control
Apache NiFi Basics
Apache NiFi - Apache Pulsar Connector
Apache NiFi - Producing to Pulsar
Apache NiFi - Live Metrics
● Unified computing engine
● Batch processing is a special case of stream processing
● Stateful processing
● Massive Scalability
● Flink SQL for queries, inserts against Pulsar Topics
● Streaming Analytics
● Continuous SQL
● Continuous ETL
● Complex Event Processing
● Standard SQL Powered by Apache Calcite
Apache Flink
Apache Flink + Apache Pulsar
Kafka to Pulsar
Apache Flink Job Running Against Apache Pulsar
● Java, Scala, Python Support
● Strong ETL/ELT
● Diverse ML support
● Scalable Distributed compute
● Apache Zeppelin and Jupyter Notebooks
● Fast connector for Apache Pulsar
Apache Spark
val dfPulsar = spark.readStream.format("
pulsar")
.option("
service.url", "pulsar://pulsar1:6650")
.option("
admin.url", "http://pulsar1:8080
")
.option("
topic", "persistent://public/default/airquality").load()
val pQuery = dfPulsar.selectExpr("*")
.writeStream.format("
console")
.option("truncate", false).start()
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/___/ .__/_,_/_/ /_/_ version 3.2.0
/_/
Using Scala version 2.12.15
(OpenJDK 64-Bit Server VM, Java 11.0.11)
Apache Spark + Apache Pulsar
Apache Spark Running Against Apache Pulsar
Fanout, Queueing, or Streaming
In Pulsar, you have flexibility to use different subscription modes.
If you want to... Do this... And use these subscription
modes...
achieve fanout messaging
among consumers
specify a unique subscription name
for each consumer
exclusive (or failover)
achieve message queuing
among consumers
share the same subscription name
among multiple consumers
shared
allow for ordered consumption
(streaming)
distribute work among any number of
consumers
exclusive, failover or key shared
84
Unified Messaging Model
Simplify your data infrastructure
and enable new use cases with
queuing and streaming capabilities
in one platform.
Ideal for app and data tiers
Less sprawl and better utilization
Cloud-native scalability
Build globally without the
complexity
Cost effective long-term storage
Pulsar across the
organization
Starting a Function - Distributed Cluster
Once compiled into a JAR, start a Pulsar Function in a distributed cluster:
Customers on StreamNative Cloud
Moving Data In and Out of Pulsar
IO/Connectors are a simple way to integrate with external systems and move
data in and out of Pulsar. https://pulsar.apache.org/docs/en/io-jdbc-sink/
● Built on top of Pulsar Functions
● Built-in connectors - hub.streamnative.io
Source Sink
Why a Table Abstraction?
● In many use cases, applications want to consume data from a
Pulsar Topic as if it were a database table, where each new
message is an “update” to the table.
● Up until now, applications used Pulsar consumers or readers to
fetch all the updates from a topic and construct a map with the
latest value of each key for received messages.
● The Table Abstraction provides a standard implementation of this
message consumption pattern for any given keyed topic.
TableView
● New Consumer type added in Pulsar 2.10 that provides a continuously
updated key-value map view of compacted topic data.
● An abstraction of a changelog stream from a primary-keyed table, where
each record in the changelog stream is an update on the primary-keyed table
with the record key as the primary key.
● READ ONLY DATA STRUCTURE!
What is it?
How does it work?
● When you create a TableView,
and additional compacted
topic is created.
● In a compacted topic, only the
most recent value associated
with each key is retained.
● A background reader
consumes from the compacted
topic and updates the map
when new messages arrive.
Event-Based Microservices
• When your microservices use a message bus to communicate, it is
referred to as an “event-driven” architecture.
• In an event-driven architecture, when a service performs some
piece of work that other services might be interested in, that
service produces an event. Other services consume those events
so that they can perform their own tasks.
• Messages are stored in intermediate topics to prevent message
loss. If a service fails, the message remains in the topic.
Event-Based Microservices Application
• A basic order entry use case
for a food delivery service
will involve several different
microservices.
• They communicate via
messages sent to Pulsar
topics.
• A service can also subscribe
to the events published by
another service.
Use Apache Pulsar For Ingest
Founded by the original creators of Apache Pulsar.
StreamNative has more experience designing,
deploying, and running large-scale Apache Pulsar
instances than any team in the world.
StreamNative employs more than 50% of the
active core committers to Apache Pulsar.
Pulsar and other
Ecosystems
Real-Time
Processor
Long-Term
Storage
Subscription B Group 0
Subscription B Group 20
...
Pulsar will feed data into
many different systems and
data format is crucial.
Pulsar feeds real-time
engines and data format
makes this integration easier.
Data is offloaded from Pulsar
to S3/HDFS and benefits
from a correct data format.
Any improvement in size
reduction is multiplied by the
number of consumers.
Apache Avro
Avro is a data serialization system used across Big Data ecosystem.
Avro provides:
● Rich data structures
● A compact, fast, binary data format
● A container file, to store persistent data
● Remote procedure call (RPC)
● Simple integration with dynamic languages
For more information: https://avro.apache.org/docs/current/
Messaging
versus
Streaming
Messaging Use Cases Streaming Use Cases
Service x commands service y to
make some change.
Example: order service removing
item from inventory service
Moving large amounts of data to
another service (real-time ETL).
Example: logs to elasticsearch
Distributing messages that
represent work among n
workers.
Example: order processing not in
main “thread”
Periodic jobs moving large amounts of
data and aggregating to more
traditional stores.
Example: logs to s3
Sending “scheduled” messages.
Example: notification service for
marketing emails or push
notifications
Computing a near real-time aggregate
of a message stream, split among n
workers, with order being important.
Example: real-time analytics over page
views
Differences in consumption
Messaging Use Case Streaming Use Case
Retention The amount of data retained is
relatively small - typically only a day
or two of data at most.
Large amounts of data are retained,
with higher ingest volumes and
longer retention periods.
Throughput Messaging systems are not designed
to manage big “catch-up” reads.
Streaming systems are designed to
scale and can handle use cases
such as catch-up reads.
WHAT? WHY?
{
"title" : "BUS 76 - Oct 17, 2022 04:38:41 PM",
"description" : "Bus route 76 is operating on a detour in Hasbrouck Heights. Terrace Avenue
is closed from Paterson Avenue to Madison Avenue for utility work.Buses will use Paterson",
"link" : "https://www.njtransit.com/node/1540960",
"guid" : "https://www.njtransit.com/node/1540960",
"advisoryAlert" : 0,
"pubDate" : "Oct 17, 2022 04:38:41 PM",
"ts" : "1666039443600",
"companyname" : "newjersey",
"uuid" : "3b19dd32-db1e-4320-ba9a-f6bfd4a87bb9",
"servicename" : "bus"
}
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet href="latest_ob.xsl" type="text/xsl"?>
<current_observation><temp_c>20.0</temp_c>
<pressure_string>1019.1 mb</pressure_string>
<pressure_mb>1019.1</pressure_mb>
<pressure_in>30.17</pressure_in>
<dewpoint_string>26.6 F</dewpoint_string>
</current_observation>
Data
Information?
'aircraft': [{'hex': 'ae6d7a',
'alt_baro': 25000, 'mlat': [],
'tisb': [], 'messages': 177,
'seen': 0.1, 'rssi': -22.7}
In Ancient Times
In Modern Times
Trains, Planes and Automobiles +++
Information Needed Data Feed(s)
Local weather conditions ● XML, JSON, RSS
Mass transit status & alerts ● XML, JSON, RSS
Regional highways & tunnels ● GeoRSS, XML, ProtoBuf, JSON
Local social media ● JSON
ADS-B Plane Data ● JSON
Local air quality ● JSON
HOW?
https://streamnative.io/blog/engineering/2022-04-14-what-the-flip-is-the-flip-stack/
TRANSCOM
Traffic
Sensors
Aggregates
Status
SQL
Analytics
APIs
REST
INGEST: Apache NiFi
(Queuing + Streaming)
https://streamnative.io/apache-nifi-connector/
Apache NiFi Pulsar Connector
Apache NiFi <-> Apache Pulsar
Apache NiFi - Data Lineage / Provenance
Apache NiFi - Producing to Pulsar
TRANSIT:
Infinite Message Bus with Apache Pulsar
Streaming
Consumer
Consumer
Consumer
Subscription
Shared
Failover
Consumer
Consumer
Subscription
In case of failure in
Consumer B-0
Consumer
Consumer
Subscription
Exclusive
X
Consumer
Consumer
Key-Shared
Subscription
Pulsar
Topic/Partition
Messaging
Unified Messaging
Model
Cloud native with decoupled
storage and compute layers.
Built-in compatibility with your
existing code and messaging
infrastructure.
Geographic redundancy and high
availability included.
Centralized cluster management
and oversight.
Elastic horizontal and vertical
scalability.
Seamless and instant partitioning
rebalancing with no downtime.
Flexible subscription model
supports a wide array of use cases.
Compatible with the tools you use
to store, analyze, and process data.
Apache Pulsar features
Schema Registry
Schema Registry
schema-1 (value=Avro/Protobuf/JSON) schema-2 (value=Avro/Protobuf/JSON) schema-3
(value=Avro/Protobuf/JSON)
Schema
Data
ID
Local Cache
for Schemas
+
Schema
Data
ID +
Local Cache
for Schemas
Send schema-1
(value=Avro/Protobuf/JSON) data
serialized per schema ID
Send (register)
schema (if not in
local cache)
Read schema-1
(value=Avro/Protobuf/JSON) data
deserialized per schema ID
Get schema by ID (if
not in local cache)
Producers Consumers
MQTT
On Pulsar
(MoP)
Kafka On Pulsar
(KoP)
Use Pulsar to Stream from Lakehouses
Use Pulsar to Stream to Lakehouses
ETL:
Streaming with Apache Spark
Building Spark SQL View
val dfPulsar = spark.readStream.format("pulsar")
.option("service.url", "pulsar://pulsar1:6650")
.option("admin.url", "http://pulsar1:8080")
.option("topic", "persistent://public/default/weather")
.load()
dfPulsar.printSchema()
val pQuery = dfPulsar.selectExpr("*")
.writeStream.format("console")
.option("truncate", false)
.start()
Example Spark Code
val dfPulsar = spark.readStream.format("pulsar")
.option("service.url", "pulsar://pulsar1:6650")
.option("admin.url", "http://pulsar1:8080")
.option("topic", "persistent://public/default/airquality").load()
val pQuery = dfPulsar.selectExpr("*").writeStream.format("parquet")
.option("truncate", false).start()
ENRICH, ML & ROUTE:
Pulsar Functions
Why Pulsar Functions for Microservices?
Desired Characteristic Pulsar Functions…
Highly maintainable and testable ● Small pieces of code in Java, Python, or Go.
● Easily maintained in source control repositories and tested
with existing frameworks automatically.
Loosely coupled with other services ● Not directly linked to one another and communicate via
messages.
Independently deployable ● Designed to be deployed independently
Can be developed by a small team ● Often developed by a single developer.
Inter-service Communication ● Support all message patterns using Pulsar as the underlying
message bus.
Deployment & Composition ● Can run as individual threads, processes, or K8s pods.
● Function Mesh allows you to deploy multiple Pulsar
Functions as a single unit.
Pulsar Functions
● Route
● Enrich
● Convert
● Lookups
● Run
Machine Learning
● Logging
● Auditing
● Parse
● Split
● Convert
Pulsar Functions
● Consume messages from one
or more Pulsar topics.
● Apply user-supplied
processing logic to each
message.
● Publish the results of the
computation to another topic.
● Support multiple programming
languages (Java, Python, Go)
● Can leverage 3rd-party
libraries to support the
execution of ML models on the
edge.
Java Pulsar Function
from pulsar import Function
import json
class Chat(Function):
def __init__(self):
pass
def process(self, input, context):
logger = context.get_logger()
logger.info("Message Content: {0}".format(input))
msg_id = context.get_message_id()
row = { }
row['id'] = str(msg_id)
json_string = json.dumps(row)
return json_string
Python Pulsar Function
CONTINUOUS ANALYTICS:
Apache Flink SQL
Flink SQL -> Apache Pulsar
SQL
select aqi, parameterName, dateObserved, hourObserved, latitude,
longitude, localTimeZone, stateCode, reportingArea from
airquality;
select max(aqi) as MaxAQI, parameterName, reportingArea from
airquality group by parameterName, reportingArea;
select max(aqi) as MaxAQI, min(aqi) as MinAQI, avg(aqi) as
AvgAQI, count(aqi) as RowCount, parameterName, reportingArea
from airquality group by parameterName, reportingArea;

Weitere ähnliche Inhalte

Ähnlich wie Princeton Dec 2022 Meetup_ StreamNative and Cloudera Streaming

OSA Con 2022: Streaming Data Made Easy
OSA Con 2022:  Streaming Data Made EasyOSA Con 2022:  Streaming Data Made Easy
OSA Con 2022: Streaming Data Made EasyTimothy Spann
 
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...Altinity Ltd
 
JConf.dev 2022 - Apache Pulsar Development 101 with Java
JConf.dev 2022 - Apache Pulsar Development 101 with JavaJConf.dev 2022 - Apache Pulsar Development 101 with Java
JConf.dev 2022 - Apache Pulsar Development 101 with JavaTimothy Spann
 
[March sn meetup] apache pulsar + apache nifi for cloud data lake
[March sn meetup] apache pulsar + apache nifi for cloud data lake[March sn meetup] apache pulsar + apache nifi for cloud data lake
[March sn meetup] apache pulsar + apache nifi for cloud data lakeTimothy Spann
 
Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)
Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)
Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)Timothy Spann
 
Python web conference 2022 apache pulsar development 101 with python (f li-...
Python web conference 2022   apache pulsar development 101 with python (f li-...Python web conference 2022   apache pulsar development 101 with python (f li-...
Python web conference 2022 apache pulsar development 101 with python (f li-...Timothy Spann
 
Living the Stream Dream with Pulsar and Spring Boot
Living the Stream Dream with Pulsar and Spring BootLiving the Stream Dream with Pulsar and Spring Boot
Living the Stream Dream with Pulsar and Spring BootTimothy Spann
 
Living the Stream Dream with Pulsar and Spring Boot
Living the Stream Dream with Pulsar and Spring BootLiving the Stream Dream with Pulsar and Spring Boot
Living the Stream Dream with Pulsar and Spring BootTimothy Spann
 
Princeton Dec 2022 Meetup_ NiFi + Flink + Pulsar
Princeton Dec 2022 Meetup_ NiFi + Flink + PulsarPrinceton Dec 2022 Meetup_ NiFi + Flink + Pulsar
Princeton Dec 2022 Meetup_ NiFi + Flink + PulsarTimothy Spann
 
NYC Dec 2022 Meetup_ Building Real-Time Requires a Team
NYC Dec 2022 Meetup_ Building Real-Time Requires a TeamNYC Dec 2022 Meetup_ Building Real-Time Requires a Team
NYC Dec 2022 Meetup_ Building Real-Time Requires a TeamTimothy Spann
 
Apache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonApache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonTimothy Spann
 
DevNexus: Apache Pulsar Development 101 with Java
DevNexus:  Apache Pulsar Development 101 with JavaDevNexus:  Apache Pulsar Development 101 with Java
DevNexus: Apache Pulsar Development 101 with JavaTimothy Spann
 
Big mountain data and dev conference apache pulsar with mqtt for edge compu...
Big mountain data and dev conference   apache pulsar with mqtt for edge compu...Big mountain data and dev conference   apache pulsar with mqtt for edge compu...
Big mountain data and dev conference apache pulsar with mqtt for edge compu...Timothy Spann
 
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache PulsarApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache PulsarTimothy Spann
 
CODEONTHEBEACH_Streaming Applications with Apache Pulsar
CODEONTHEBEACH_Streaming Applications with Apache PulsarCODEONTHEBEACH_Streaming Applications with Apache Pulsar
CODEONTHEBEACH_Streaming Applications with Apache PulsarTimothy Spann
 
Let's keep it simple and streaming
Let's keep it simple and streamingLet's keep it simple and streaming
Let's keep it simple and streamingTimothy Spann
 
Let's keep it simple and streaming.pdf
Let's keep it simple and streaming.pdfLet's keep it simple and streaming.pdf
Let's keep it simple and streaming.pdfVMware Tanzu
 
[Conf42-KubeNative] Building Real-time Pulsar Apps on K8
[Conf42-KubeNative] Building Real-time Pulsar Apps on K8[Conf42-KubeNative] Building Real-time Pulsar Apps on K8
[Conf42-KubeNative] Building Real-time Pulsar Apps on K8Timothy Spann
 
Streaming Data with Apache Kafka
Streaming Data with Apache KafkaStreaming Data with Apache Kafka
Streaming Data with Apache KafkaMarkus Günther
 
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...Timothy Spann
 

Ähnlich wie Princeton Dec 2022 Meetup_ StreamNative and Cloudera Streaming (20)

OSA Con 2022: Streaming Data Made Easy
OSA Con 2022:  Streaming Data Made EasyOSA Con 2022:  Streaming Data Made Easy
OSA Con 2022: Streaming Data Made Easy
 
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
 
JConf.dev 2022 - Apache Pulsar Development 101 with Java
JConf.dev 2022 - Apache Pulsar Development 101 with JavaJConf.dev 2022 - Apache Pulsar Development 101 with Java
JConf.dev 2022 - Apache Pulsar Development 101 with Java
 
[March sn meetup] apache pulsar + apache nifi for cloud data lake
[March sn meetup] apache pulsar + apache nifi for cloud data lake[March sn meetup] apache pulsar + apache nifi for cloud data lake
[March sn meetup] apache pulsar + apache nifi for cloud data lake
 
Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)
Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)
Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)
 
Python web conference 2022 apache pulsar development 101 with python (f li-...
Python web conference 2022   apache pulsar development 101 with python (f li-...Python web conference 2022   apache pulsar development 101 with python (f li-...
Python web conference 2022 apache pulsar development 101 with python (f li-...
 
Living the Stream Dream with Pulsar and Spring Boot
Living the Stream Dream with Pulsar and Spring BootLiving the Stream Dream with Pulsar and Spring Boot
Living the Stream Dream with Pulsar and Spring Boot
 
Living the Stream Dream with Pulsar and Spring Boot
Living the Stream Dream with Pulsar and Spring BootLiving the Stream Dream with Pulsar and Spring Boot
Living the Stream Dream with Pulsar and Spring Boot
 
Princeton Dec 2022 Meetup_ NiFi + Flink + Pulsar
Princeton Dec 2022 Meetup_ NiFi + Flink + PulsarPrinceton Dec 2022 Meetup_ NiFi + Flink + Pulsar
Princeton Dec 2022 Meetup_ NiFi + Flink + Pulsar
 
NYC Dec 2022 Meetup_ Building Real-Time Requires a Team
NYC Dec 2022 Meetup_ Building Real-Time Requires a TeamNYC Dec 2022 Meetup_ Building Real-Time Requires a Team
NYC Dec 2022 Meetup_ Building Real-Time Requires a Team
 
Apache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonApache Pulsar Development 101 with Python
Apache Pulsar Development 101 with Python
 
DevNexus: Apache Pulsar Development 101 with Java
DevNexus:  Apache Pulsar Development 101 with JavaDevNexus:  Apache Pulsar Development 101 with Java
DevNexus: Apache Pulsar Development 101 with Java
 
Big mountain data and dev conference apache pulsar with mqtt for edge compu...
Big mountain data and dev conference   apache pulsar with mqtt for edge compu...Big mountain data and dev conference   apache pulsar with mqtt for edge compu...
Big mountain data and dev conference apache pulsar with mqtt for edge compu...
 
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache PulsarApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
 
CODEONTHEBEACH_Streaming Applications with Apache Pulsar
CODEONTHEBEACH_Streaming Applications with Apache PulsarCODEONTHEBEACH_Streaming Applications with Apache Pulsar
CODEONTHEBEACH_Streaming Applications with Apache Pulsar
 
Let's keep it simple and streaming
Let's keep it simple and streamingLet's keep it simple and streaming
Let's keep it simple and streaming
 
Let's keep it simple and streaming.pdf
Let's keep it simple and streaming.pdfLet's keep it simple and streaming.pdf
Let's keep it simple and streaming.pdf
 
[Conf42-KubeNative] Building Real-time Pulsar Apps on K8
[Conf42-KubeNative] Building Real-time Pulsar Apps on K8[Conf42-KubeNative] Building Real-time Pulsar Apps on K8
[Conf42-KubeNative] Building Real-time Pulsar Apps on K8
 
Streaming Data with Apache Kafka
Streaming Data with Apache KafkaStreaming Data with Apache Kafka
Streaming Data with Apache Kafka
 
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...
 

Mehr von Timothy Spann

Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...Timothy Spann
 
28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-PipelinesTimothy Spann
 
TCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTimothy Spann
 
2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-Profits2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-ProfitsTimothy Spann
 
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...Timothy Spann
 
Conf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python ProcessorsConf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python ProcessorsTimothy Spann
 
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...Timothy Spann
 
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI PipelinesTimothy Spann
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkTimothy Spann
 
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...Timothy Spann
 
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
OSACon 2023_ Unlocking Financial Data with Real-Time PipelinesOSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
OSACon 2023_ Unlocking Financial Data with Real-Time PipelinesTimothy Spann
 
Building Real-Time Travel Alerts
Building Real-Time Travel AlertsBuilding Real-Time Travel Alerts
Building Real-Time Travel AlertsTimothy Spann
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkTimothy Spann
 
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
[EN]DSS23_tspann_Integrating LLM with Streaming Data PipelinesTimothy Spann
 
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines DemoEvolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines DemoTimothy Spann
 
AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101Timothy Spann
 

Mehr von Timothy Spann (20)

Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
 
28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines
 
TCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI Pipelines
 
2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-Profits2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-Profits
 
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
 
Conf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python ProcessorsConf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python Processors
 
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
 
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
 
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
 
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
OSACon 2023_ Unlocking Financial Data with Real-Time PipelinesOSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
 
Building Real-Time Travel Alerts
Building Real-Time Travel AlertsBuilding Real-Time Travel Alerts
Building Real-Time Travel Alerts
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
 
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
 
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines DemoEvolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
 
AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101
 

Kürzlich hochgeladen

Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 

Kürzlich hochgeladen (20)

Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 

Princeton Dec 2022 Meetup_ StreamNative and Cloudera Streaming

  • 1. Streaming Data Platforms for Cloud-Native Event-driven Applications
  • 2. Apache Pulsar + Apache NiFi + Apache Flink + Cloudera Tim Spann
  • 3. Agenda 1. Welcome 2. Introduction to Apache Pulsar • Basics of Pulsar • Use Cases 3. Apache NiFi <-> Apache Pulsar 4. Apache Flink <-> Apache Pulsar 5. Let’s Build an App! • Demo 4. Resources 5. Q&A
  • 4. Tim Spann Developer Advocate Tim Spann Developer Advocate at StreamNative ● FLiP(N) Stack = Flink, Pulsar and NiFi Stack ● Streaming Systems & Data Architecture Expert ● Experience: ○ 15+ years of experience with streaming technologies including Pulsar, Flink, Spark, NiFi, Big Data, Cloud, MXNet, IoT, Python and more. ○ Today, he helps to grow the Pulsar community sharing rich technical knowledge and experience at both global conferences and through individual conversations.
  • 5. FLiP Stack Weekly This week in Apache Flink, Apache Pulsar, Apache NiFi, Apache Spark and open source friends. https://bit.ly/32dAJft
  • 6. Proprietary & Confidential | Apache Pulsar made easy.
  • 7. Apache Pulsar is a Cloud-Native Messaging and Event-Streaming Platform.
  • 8. CREATED Originally developed inside Yahoo! as Cloud Messaging Service GROWTH 10x Contributors 10MM+ Downloads Ecosystem Expands Kafka on Pulsar AMQ on Pulsar Functions . . . 2012 2016 2018 TODAY APACHE TLP Pulsar becomes Apache top level project. OPEN SOURCE Pulsar committed to open source. Apache Pulsar Timeline
  • 10. Pulsar Has a Built-in Super Set of OSS Features Durability Scalability Geo-Replication Multi-Tenancy Unified Messaging Model Reduced Vendor Dependency Functions Open-Source Features
  • 11. Schema Registry schema-1 (value=Avro/Protobuf/JSON) schema-2 (value=Avro/Protobuf/JSON) schema-3 (value=Avro/Protobuf/JSON) Schema Data ID Local Cache for Schemas + Schema Data ID + Local Cache for Schemas Send schema-1 (value=Avro/Protobuf/JSON) data serialized per schema ID Send (register) schema (if not in local cache) Read schema-1 (value=Avro/Protobuf/JSON) data deserialized per schema ID Get schema by ID (if not in local cache) Producers Consumers Integrated Schema Registry
  • 12. Ideal for app and data tiers Less sprawl and better utilization Cloud-native scalability Build globally without the complexity Cost effective long-term storage Pulsar across the organization
  • 13. Joining Streams in SQL Perform in Real-Time Ordering and Arrival Concurrent Consumers Change Data Capture Data Streaming
  • 14. Streaming Consumer Consumer Consumer Subscription Shared Failover Consumer Consumer Subscription In case of failure in Consumer B-0 Consumer Consumer Subscription Exclusive X Consumer Consumer Key-Shared Subscription Pulsar Topic/Partition Messaging
  • 15.
  • 16.
  • 17. Messages - the basic unit of Pulsar Component Description Value / data payload The data carried by the message. All Pulsar messages contain raw bytes, although message data can also conform to data schemas. Key Messages are optionally tagged with keys, used in partitioning and also is useful for things like topic compaction. Properties An optional key/value map of user-defined properties. Producer name The name of the producer who produces the message. If you do not specify a producer name, the default name is used. Message De-Duplication. Sequence ID Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID of the message is its order in that sequence. Message De-Duplication.
  • 18. Connectivity • Libraries - (Java, Python, Go, NodeJS, WebSockets, C++, C#, Scala, Rust,...) • Functions - Lightweight Stream Processing (Java, Python, Go) • Connectors - Sources & Sinks (Cassandra, Kafka, …) • Protocol Handlers - AoP (AMQP), KoP (Kafka), MoP (MQTT) • Processing Engines - Flink, Spark, Presto/Trino via Pulsar SQL • Data Offloaders - Tiered Storage - (S3) hub.streamnative.io
  • 19. Schema Registry Schema Registry schema-1 (value=Avro/Protobuf/JSON) schema-2 (value=Avro/Protobuf/JSON) schema-3 (value=Avro/Protobuf/JSON) Schema Data ID Local Cache for Schemas + Schema Data ID + Local Cache for Schemas Send schema-1 (value=Avro/Protobuf/JSON) data serialized per schema ID Send (register) schema (if not in local cache) Read schema-1 (value=Avro/Protobuf/JSON) data deserialized per schema ID Get schema by ID (if not in local cache) Producers Consumers
  • 20. Apache Pulsar has a vibrant community 560+ Contributors 10,000+ Commits 7,000+ Slack Members 1,000+ Organizations Using Pulsar
  • 22. ● Serverless computing framework. ● Unbounded storage, multi-tiered architecture, and tiered-storage. ● Streaming & Pub/Sub messaging semantics. ● Multi-protocol support. ● Open Source ● Cloud-Native ● Multi-Tenant Features
  • 23. Message Queuing Data Streaming Unified Messaging & Streaming Platform
  • 24. Apache Pulsar features Cloud native with decoupled storage and compute layers. Built-in compatibility with your existing code and messaging infrastructure. Geographic redundancy and high availability included. Centralized cluster management and oversight. Elastic horizontal and vertical scalability. Seamless and instant partitioning rebalancing with no downtime. Flexible subscription model supports a wide array of use cases. Compatible with the tools you use to store, analyze, and process data.
  • 25. ● “Bookies” ● Stores messages and cursors ● Messages are grouped in segments/ledgers ● A group of bookies form an “ensemble” to store a ledger ● “Brokers” ● Handles message routing and connections ● Stateless, but with caches ● Automatic load-balancing ● Topics are composed of multiple segments ● ● Stores metadata for both Pulsar and BookKeeper ● Service discovery Store Messages Metadata & Service Discovery Metadata & Service Discovery Pulsar Cluster Metadata Storage Pulsar Cluster
  • 26. Tenants (Compliance) Tenants (Data Services) Namespace (Microservices) Topic-1 (Cust Auth) Topic-1 (Location Resolution) Topic-2 (Demographics) Topic-1 (Budgeted Spend) Topic-1 (Acct History) Topic-1 (Risk Detection) Namespace (ETL) Namespace (Campaigns) Namespace (ETL) Tenants (Marketing) Namespace (Risk Assessment) Pulsar Cluster Tenant - Namespaces - Topics
  • 27. Component Description Value / data payload The data carried by the message. All Pulsar messages contain raw bytes, although message data can also conform to data schemas. Key Messages are optionally tagged with keys, used in partitioning and also is useful for things like topic compaction. Properties An optional key/value map of user-defined properties. Producer name The name of the producer who produces the message. If you do not specify a producer name, the default name is used. Sequence ID Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID of the message is its order in that sequence. Messages
  • 28. Streaming Consumer Consumer Consumer Subscription Shared Failover Consumer Consumer Subscription In case of failure in Consumer B-0 Consumer Consumer Subscription Exclusive X Consumer Consumer Key-Shared Subscription Pulsar Topic/Partition Messaging
  • 29. Schema Registry schema-1 (value=Avro/Protobuf/JSON) schema-2 (value=Avro/Protobuf/JSON) schema-3 (value=Avro/Protobuf/JSON) Schema Data ID Local Cache for Schemas + Schema Data ID + Local Cache for Schemas Send schema-1 (value=Avro/Protobuf/JSON) data serialized per schema ID Send (register) schema (if not in local cache) Read schema-1 (value=Avro/Protobuf/JSON) data deserialized per schema ID Get schema by ID (if not in local cache) Producers Consumers Integrated Schema Registry
  • 30. Flexible Pub/Sub API for Pulsar - Shared Consumer consumer = client.newConsumer() .topic("my-topic") .subscriptionName("work-q-1") .subscriptionType(SubType.Shared) .subscribe();
  • 31. Flexible Pub/Sub API for Pulsar - Failover Consumer consumer = client.newConsumer() .topic("my-topic") .subscriptionName("stream-1") .subscriptionType(SubType.Failover) .subscribe();
  • 32. Reader Interface byte[] msgIdBytes = // Some byte array MessageId id = MessageId.fromByteArray(msgIdBytes); Reader<byte[]> reader = pulsarClient.newReader() .topic(topic) .startMessageId(id) .create(); Create a reader that will read from some message between earliest and latest. Reader
  • 33. Built-in Back Pressure Producer<String> producer = client.newProducer(Schema.STRING) .topic("hellotopic") .blockIfQueueFull(true) // enable blocking // when queue is full .maxPendingMessages(10) // max queue size .create(); // During Back Pressure: the sendAsync call blocks // with no room in queues producer.newMessage() .key("mykey") .value("myvalue") .sendAsync(); // can be a blocking call
  • 34. • Functions - Lightweight Stream Processing (Java, Python, Go) • Connectors - Sources & Sinks (Cassandra, Kafka, …) • Protocol Handlers - AoP (AMQP), KoP (Kafka), MoP (MQTT) • Processing Engines - Flink, Spark, Presto/Trino via Pulsar SQL • Data Offloaders - Tiered Storage - (S3) Tenant - Namespaces - Topics
  • 36. MQTT on Pulsar (MoP)
  • 37. AMQP on Pulsar (AoP)
  • 38. Pulsar Functions ● Lightweight computation similar to AWS Lambda. ● Specifically designed to use Apache Pulsar as a message bus. ● Function runtime can be located within Pulsar Broker. A serverless event streaming framework
  • 39. ● Consume messages from one or more Pulsar topics. ● Apply user-supplied processing logic to each message. ● Publish the results of the computation to another topic. ● Support multiple programming languages (Java, Python, Go) ● Can leverage 3rd-party libraries to support the execution of ML models on the edge. Pulsar Functions
  • 40. Moving Data In and Out of Pulsar IO/Connectors are a simple way to integrate with external systems and move data in and out of Pulsar. https://pulsar.apache.org/docs/en/io-jdbc-sink/ ● Built on top of Pulsar Functions ● Built-in connectors - hub.streamnative.io Source Sink
  • 42. Apache NiFi Pulsar Connector https://github.com/streamnative/pulsar-nifi-bundle
  • 43. Apache NiFi Pulsar Connector https://github.com/david-streamlio/pulsar-nifi-bundle
  • 44. Apache NiFi Pulsar Connector https://www.datainmotion.dev/2021/11/producing-and-consuming-pulsar-messages.html
  • 45. Apache NiFi Pulsar Connector
  • 46. Apache NiFi Pulsar Connector
  • 47. Apache NiFi Pulsar Connector https://github.com/david-streamlio/pulsar-nifi-bundle/releases/tag/v1.14.0
  • 48.
  • 49. Events <-> Streaming FLiPS Apps StreamNative Hub StreamNative Cloud Unified Batch and Stream COMPUTING Batch (Batch + Stream) Unified Batch and Stream STORAGE Offload (Queuing + Streaming) Tiered Storage Pulsar --- KoP --- MoP --- Websocket Pulsar Sink Streaming Edge Gateway Protocols <-> Events <-> CDC Apps
  • 51.
  • 52. Building An App Code Along With Tim <<DEMO>>
  • 53. RESOURCES Here are resources to continue your journey with Apache Pulsar
  • 54. Links ● https://github.com/tspannhw/Flip-iot ● https://github.com/streamnative/examples/tree/master/cloud/go ● https://pulsar.apache.org/docs/en/client-libraries-go/ ● https://github.com/apache/pulsar-client-go ● https://github.com/tspannhw/Meetup-YourFirstEventDrivenApp ● https://github.com/tspannhw/SpeakerProfile ● https://github.com/tspannhw/pulsar-flinksql-1.13.2 ● https://github.com/tspannhw/pulsar-pychat-function ● https://www.datainmotion.dev/ ● https://www.meetup.com/SF-Bay-Area-Apache-Pulsar-Meetup/events/283083105/ ● https://medium.com/@tspann ● https://dzone.com/users/297029/bunkertor.html ● https://dev.to/tspannhw ● https://linktr.ee/tspannhw
  • 55. FLiP Stack Weekly This week in Apache Flink, Apache Pulsar, Apache NiFi, Apache Spark and open source friends. https://bit.ly/32dAJft
  • 56. Scan the QR code to learn more about Apache Pulsar and StreamNative.
  • 57. Scan the QR code to build your own apps today.
  • 58. Deploying AI With an Event-Driven Platform https://dzone.com/trendreports/enterprise-ai-1
  • 59.
  • 60. Let’s Keep in Touch! Tim Spann Developer Advocate @PaaSDev https://www.linkedin.com/in/timothyspann https://github.com/tspannhw @TimPulsar@hachyderm.io https://hachyderm.io/@TimPulsar
  • 62. Data Aggregation Concentrate data for multiple clusters into a single one where main processing happens
  • 63. Active - Active Processing happens identical in both clusters.
  • 64. Active - Passive Only active cluster processes data - Passive cluster is for failover
  • 66. ● Lightweight computation similar to AWS Lambda. ● Specifically designed to use Apache Pulsar as a message bus. ● Function runtime can be located within Pulsar Broker. ● Java Functions A serverless event streaming framework Pulsar Functions
  • 68. ● Consume messages from one or more Pulsar topics. ● Apply user-supplied processing logic to each message. ● Publish the results of the computation to another topic. ● Support multiple programming languages (Java, Python, Go) ● Can leverage 3rd-party libraries to support the execution of ML models on the edge. Pulsar Functions
  • 69. Simple Native Java Function
  • 70. Simple Function with Pulsar SDK
  • 71. from pulsar import Function from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer import json class Chat(Function): def __init__(self): pass def process(self, input, context): fields = json.loads(input) sid = SentimentIntensityAnalyzer() ss = sid.polarity_scores(fields["comment"]) row = { } row['id'] = str(msg_id) if ss['compound'] < 0.00: row['sentiment'] = 'Negative' else: row['sentiment'] = 'Positive' row['comment'] = str(fields["comment"]) json_string = json.dumps(row) return json_string Entire Function Pulsar Python NLP Function
  • 72. https://www.influxdata.com/integration/mqtt-monitoring/ https://www.influxdata.com/integration/mqtt-monitoring/ • Guaranteed delivery • Data buffering - Backpressure - Pressure release • Prioritized queuing • Flow specific QoS - Latency vs. throughput - Loss tolerance • Data provenance • Supports push and pull models • Hundreds of processors • Visual command and control • Over a 300 components • Flow templates • Pluggable/multi-role security • Designed for extension • Clustering • Version Control Apache NiFi Basics
  • 73. Apache NiFi - Apache Pulsar Connector
  • 74. Apache NiFi - Producing to Pulsar
  • 75. Apache NiFi - Live Metrics
  • 76. ● Unified computing engine ● Batch processing is a special case of stream processing ● Stateful processing ● Massive Scalability ● Flink SQL for queries, inserts against Pulsar Topics ● Streaming Analytics ● Continuous SQL ● Continuous ETL ● Complex Event Processing ● Standard SQL Powered by Apache Calcite Apache Flink
  • 77. Apache Flink + Apache Pulsar
  • 79. Apache Flink Job Running Against Apache Pulsar
  • 80. ● Java, Scala, Python Support ● Strong ETL/ELT ● Diverse ML support ● Scalable Distributed compute ● Apache Zeppelin and Jupyter Notebooks ● Fast connector for Apache Pulsar Apache Spark
  • 81. val dfPulsar = spark.readStream.format(" pulsar") .option(" service.url", "pulsar://pulsar1:6650") .option(" admin.url", "http://pulsar1:8080 ") .option(" topic", "persistent://public/default/airquality").load() val pQuery = dfPulsar.selectExpr("*") .writeStream.format(" console") .option("truncate", false).start() ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ version 3.2.0 /_/ Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.11) Apache Spark + Apache Pulsar
  • 82. Apache Spark Running Against Apache Pulsar
  • 83. Fanout, Queueing, or Streaming In Pulsar, you have flexibility to use different subscription modes. If you want to... Do this... And use these subscription modes... achieve fanout messaging among consumers specify a unique subscription name for each consumer exclusive (or failover) achieve message queuing among consumers share the same subscription name among multiple consumers shared allow for ordered consumption (streaming) distribute work among any number of consumers exclusive, failover or key shared
  • 84. 84 Unified Messaging Model Simplify your data infrastructure and enable new use cases with queuing and streaming capabilities in one platform.
  • 85. Ideal for app and data tiers Less sprawl and better utilization Cloud-native scalability Build globally without the complexity Cost effective long-term storage Pulsar across the organization
  • 86. Starting a Function - Distributed Cluster Once compiled into a JAR, start a Pulsar Function in a distributed cluster:
  • 88. Moving Data In and Out of Pulsar IO/Connectors are a simple way to integrate with external systems and move data in and out of Pulsar. https://pulsar.apache.org/docs/en/io-jdbc-sink/ ● Built on top of Pulsar Functions ● Built-in connectors - hub.streamnative.io Source Sink
  • 89. Why a Table Abstraction? ● In many use cases, applications want to consume data from a Pulsar Topic as if it were a database table, where each new message is an “update” to the table. ● Up until now, applications used Pulsar consumers or readers to fetch all the updates from a topic and construct a map with the latest value of each key for received messages. ● The Table Abstraction provides a standard implementation of this message consumption pattern for any given keyed topic.
  • 90. TableView ● New Consumer type added in Pulsar 2.10 that provides a continuously updated key-value map view of compacted topic data. ● An abstraction of a changelog stream from a primary-keyed table, where each record in the changelog stream is an update on the primary-keyed table with the record key as the primary key. ● READ ONLY DATA STRUCTURE!
  • 92. How does it work? ● When you create a TableView, and additional compacted topic is created. ● In a compacted topic, only the most recent value associated with each key is retained. ● A background reader consumes from the compacted topic and updates the map when new messages arrive.
  • 93. Event-Based Microservices • When your microservices use a message bus to communicate, it is referred to as an “event-driven” architecture. • In an event-driven architecture, when a service performs some piece of work that other services might be interested in, that service produces an event. Other services consume those events so that they can perform their own tasks. • Messages are stored in intermediate topics to prevent message loss. If a service fails, the message remains in the topic.
  • 94. Event-Based Microservices Application • A basic order entry use case for a food delivery service will involve several different microservices. • They communicate via messages sent to Pulsar topics. • A service can also subscribe to the events published by another service.
  • 95. Use Apache Pulsar For Ingest
  • 96. Founded by the original creators of Apache Pulsar. StreamNative has more experience designing, deploying, and running large-scale Apache Pulsar instances than any team in the world. StreamNative employs more than 50% of the active core committers to Apache Pulsar.
  • 97. Pulsar and other Ecosystems Real-Time Processor Long-Term Storage Subscription B Group 0 Subscription B Group 20 ... Pulsar will feed data into many different systems and data format is crucial. Pulsar feeds real-time engines and data format makes this integration easier. Data is offloaded from Pulsar to S3/HDFS and benefits from a correct data format. Any improvement in size reduction is multiplied by the number of consumers.
  • 98. Apache Avro Avro is a data serialization system used across Big Data ecosystem. Avro provides: ● Rich data structures ● A compact, fast, binary data format ● A container file, to store persistent data ● Remote procedure call (RPC) ● Simple integration with dynamic languages For more information: https://avro.apache.org/docs/current/
  • 99. Messaging versus Streaming Messaging Use Cases Streaming Use Cases Service x commands service y to make some change. Example: order service removing item from inventory service Moving large amounts of data to another service (real-time ETL). Example: logs to elasticsearch Distributing messages that represent work among n workers. Example: order processing not in main “thread” Periodic jobs moving large amounts of data and aggregating to more traditional stores. Example: logs to s3 Sending “scheduled” messages. Example: notification service for marketing emails or push notifications Computing a near real-time aggregate of a message stream, split among n workers, with order being important. Example: real-time analytics over page views
  • 100. Differences in consumption Messaging Use Case Streaming Use Case Retention The amount of data retained is relatively small - typically only a day or two of data at most. Large amounts of data are retained, with higher ingest volumes and longer retention periods. Throughput Messaging systems are not designed to manage big “catch-up” reads. Streaming systems are designed to scale and can handle use cases such as catch-up reads.
  • 102. { "title" : "BUS 76 - Oct 17, 2022 04:38:41 PM", "description" : "Bus route 76 is operating on a detour in Hasbrouck Heights. Terrace Avenue is closed from Paterson Avenue to Madison Avenue for utility work.Buses will use Paterson", "link" : "https://www.njtransit.com/node/1540960", "guid" : "https://www.njtransit.com/node/1540960", "advisoryAlert" : 0, "pubDate" : "Oct 17, 2022 04:38:41 PM", "ts" : "1666039443600", "companyname" : "newjersey", "uuid" : "3b19dd32-db1e-4320-ba9a-f6bfd4a87bb9", "servicename" : "bus" } <?xml version="1.0" encoding="ISO-8859-1"?> <?xml-stylesheet href="latest_ob.xsl" type="text/xsl"?> <current_observation><temp_c>20.0</temp_c> <pressure_string>1019.1 mb</pressure_string> <pressure_mb>1019.1</pressure_mb> <pressure_in>30.17</pressure_in> <dewpoint_string>26.6 F</dewpoint_string> </current_observation> Data Information? 'aircraft': [{'hex': 'ae6d7a', 'alt_baro': 25000, 'mlat': [], 'tisb': [], 'messages': 177, 'seen': 0.1, 'rssi': -22.7}
  • 105. Trains, Planes and Automobiles +++ Information Needed Data Feed(s) Local weather conditions ● XML, JSON, RSS Mass transit status & alerts ● XML, JSON, RSS Regional highways & tunnels ● GeoRSS, XML, ProtoBuf, JSON Local social media ● JSON ADS-B Plane Data ● JSON Local air quality ● JSON
  • 106.
  • 107. HOW?
  • 109.
  • 110.
  • 111.
  • 116. Apache NiFi <-> Apache Pulsar
  • 117. Apache NiFi - Data Lineage / Provenance
  • 118. Apache NiFi - Producing to Pulsar
  • 119. TRANSIT: Infinite Message Bus with Apache Pulsar
  • 120. Streaming Consumer Consumer Consumer Subscription Shared Failover Consumer Consumer Subscription In case of failure in Consumer B-0 Consumer Consumer Subscription Exclusive X Consumer Consumer Key-Shared Subscription Pulsar Topic/Partition Messaging Unified Messaging Model
  • 121. Cloud native with decoupled storage and compute layers. Built-in compatibility with your existing code and messaging infrastructure. Geographic redundancy and high availability included. Centralized cluster management and oversight. Elastic horizontal and vertical scalability. Seamless and instant partitioning rebalancing with no downtime. Flexible subscription model supports a wide array of use cases. Compatible with the tools you use to store, analyze, and process data. Apache Pulsar features
  • 122. Schema Registry Schema Registry schema-1 (value=Avro/Protobuf/JSON) schema-2 (value=Avro/Protobuf/JSON) schema-3 (value=Avro/Protobuf/JSON) Schema Data ID Local Cache for Schemas + Schema Data ID + Local Cache for Schemas Send schema-1 (value=Avro/Protobuf/JSON) data serialized per schema ID Send (register) schema (if not in local cache) Read schema-1 (value=Avro/Protobuf/JSON) data deserialized per schema ID Get schema by ID (if not in local cache) Producers Consumers
  • 125. Use Pulsar to Stream from Lakehouses
  • 126. Use Pulsar to Stream to Lakehouses
  • 128. Building Spark SQL View val dfPulsar = spark.readStream.format("pulsar") .option("service.url", "pulsar://pulsar1:6650") .option("admin.url", "http://pulsar1:8080") .option("topic", "persistent://public/default/weather") .load() dfPulsar.printSchema() val pQuery = dfPulsar.selectExpr("*") .writeStream.format("console") .option("truncate", false) .start()
  • 129. Example Spark Code val dfPulsar = spark.readStream.format("pulsar") .option("service.url", "pulsar://pulsar1:6650") .option("admin.url", "http://pulsar1:8080") .option("topic", "persistent://public/default/airquality").load() val pQuery = dfPulsar.selectExpr("*").writeStream.format("parquet") .option("truncate", false).start()
  • 130. ENRICH, ML & ROUTE: Pulsar Functions
  • 131. Why Pulsar Functions for Microservices? Desired Characteristic Pulsar Functions… Highly maintainable and testable ● Small pieces of code in Java, Python, or Go. ● Easily maintained in source control repositories and tested with existing frameworks automatically. Loosely coupled with other services ● Not directly linked to one another and communicate via messages. Independently deployable ● Designed to be deployed independently Can be developed by a small team ● Often developed by a single developer. Inter-service Communication ● Support all message patterns using Pulsar as the underlying message bus. Deployment & Composition ● Can run as individual threads, processes, or K8s pods. ● Function Mesh allows you to deploy multiple Pulsar Functions as a single unit.
  • 132.
  • 133. Pulsar Functions ● Route ● Enrich ● Convert ● Lookups ● Run Machine Learning ● Logging ● Auditing ● Parse ● Split ● Convert
  • 134. Pulsar Functions ● Consume messages from one or more Pulsar topics. ● Apply user-supplied processing logic to each message. ● Publish the results of the computation to another topic. ● Support multiple programming languages (Java, Python, Go) ● Can leverage 3rd-party libraries to support the execution of ML models on the edge.
  • 136. from pulsar import Function import json class Chat(Function): def __init__(self): pass def process(self, input, context): logger = context.get_logger() logger.info("Message Content: {0}".format(input)) msg_id = context.get_message_id() row = { } row['id'] = str(msg_id) json_string = json.dumps(row) return json_string Python Pulsar Function
  • 138. Flink SQL -> Apache Pulsar
  • 139. SQL select aqi, parameterName, dateObserved, hourObserved, latitude, longitude, localTimeZone, stateCode, reportingArea from airquality; select max(aqi) as MaxAQI, parameterName, reportingArea from airquality group by parameterName, reportingArea; select max(aqi) as MaxAQI, min(aqi) as MinAQI, avg(aqi) as AvgAQI, count(aqi) as RowCount, parameterName, reportingArea from airquality group by parameterName, reportingArea;