Princeton Dec 2022 Meetup: NiFi + Flink + Pulsar
Streaming Data Platform for cloud-native event-driven applications
https://github.com/tspannhw/pulsar-csp-ce/blob/main/weather.md
https://github.com/tspannhw/create-nifi-pulsar-flink-apps
https://medium.com/@tspann/using-apache-pulsar-with-cloudera-sql-builder-apache-flink-b518aa9eadff
https://www.meetup.com/new-york-city-apache-pulsar-meetup/events/289674210/
For non-locals, we will broadcast live via YouTube. Sign up and we will send out the link.
Location:
TigerLabs in Princeton, on the 2nd floor. Walk up and the door will be open. This is the same space we used for the old Future of Data - Princeton events in 2016-2019.
Parking at the school is free, and street parking nearby is free. There are meters on some streets, and a paid parking garage is a few blocks away.
We are joining forces with our friends at Cloudera again for an amazing FLiPN journey into Real-Time Streaming Applications with Apache Flink, Apache NiFi, and Apache Pulsar.
Discover how to stream data to and from your data lake or data mart using Apache Pulsar™ and Apache NiFi®. Learn how these cloud-native, scalable open-source projects built for streaming data pipelines work together to enable you to quickly build applications with minimal coding.
|WHAT THE SESSION WILL COVER|
Apache NiFi
Apache Pulsar
Apache Flink
Flink SQL
We will show you how to build apps, so download the software beforehand to Docker, Kubernetes, your laptop, or the cloud.
Cloudera CSP Setup
Getting Started with Cloudera Stream Processing Community Edition
You may download CSP-CE here:
Cloudera Stream Processing Community Edition
The Cloudera CDP User's page:
CDP Resources Page
https://youtu.be/s80sz3NWwHo
https://docs.cloudera.com/csp-ce/latest/index.html
https://www.cloudera.com/downloads/cdf/csp-community-edition.html
Apache Pulsar
https://pulsar.apache.org/docs/getting-started-standalone/
or
https://streamnative.io/free-cloud/
Cloudera + Pulsar
https://community.cloudera.com/t5/Cloudera-Stream-Processing-Forum/Using-Apache-Pulsar-with-SQL-Stream-Builder/m-p/349917
https://community.cloudera.com/t5/Community-Articles/Using-Apache-NiFi-with-Apache-Pulsar-for-Streaming/ta-p/337891
|AGENDA|
6:00 - 6:30 PM EST: Food, Drink, and Networking!!!
6:30 - 7:15 PM EST: Presentation - Tim Spann, StreamNative Developer Advocate
7:15 - 8:00 PM EST: Presentation - John Kuchmek, Cloudera Principal Solutions Engineer
8:00 - 8:30 PM EST: Round Table on Real-Time Streaming, Q&A
|ABOUT THE SPEAKERS|
John Kuchmek is a Principal Solutions Engineer for Cloudera. Before joining Cloudera, John transitioned to the Autonomous Intelligence team where he was in charge of integrating the platforms to allow data scientists to work with various types of data.
Tim Spann is a Developer Advocate for StreamNative. He works with StreamNative Cloud, Apache Pulsar™, Apache Flink®, Flink® SQL, Big Data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, dist
2. Proprietary & Confidential |
Agenda
● Welcome
● Introduction to Apache Pulsar
○ Basics of Pulsar
○ Use Cases
● Let’s Build an App!
○ Demo
● Resources
● Q&A
Tim Spann
Developer Advocate
at StreamNative
FLiP(N) Stack = Flink, Pulsar and NiFi Stack
Streaming Systems & Data Architecture Expert
Experience:
● 15+ years of experience with streaming technologies including Pulsar, Flink, Spark, NiFi, Big Data, Cloud, MXNet, IoT, Python and more.
● Today, he helps to grow the Pulsar community, sharing rich technical knowledge and experience at both global conferences and through individual conversations.
FLiP Stack Weekly
This week in Apache Flink, Apache Pulsar, Apache NiFi, Apache Spark and open source friends.
https://bit.ly/32dAJft
Agenda
• Introduction
• Demo
• What is Apache Pulsar?
• Apache Flink
• Pulsar to NiFi, Pulsar to SSB
• Q&A
StreamNative was founded by the original creators of Apache Pulsar.
StreamNative employs more than 50% of the active core committers to Apache Pulsar.
StreamNative has more experience designing, deploying, and running large-scale Apache Pulsar instances than any team in the world.
Apache Pulsar has a vibrant community
● 560+ contributors
● 10,000+ commits
● 7,000+ Slack members
● 1,000+ organizations using Pulsar
Pulsar Cluster
● “Bookies”
○ Store messages and cursors
○ Messages are grouped in segments/ledgers
○ A group of bookies forms an “ensemble” to store a ledger
● “Brokers”
○ Handle message routing and connections
○ Stateless, but with caches
○ Automatic load-balancing
○ Topics are composed of multiple segments
● Metadata Storage
○ Stores metadata for both Pulsar and BookKeeper
○ Service discovery
[Diagram labels: Store Messages; Metadata & Service Discovery; Metadata Storage]
Messages - the basic unit of Pulsar
Component | Description
Value / data payload | The data carried by the message. All Pulsar messages contain raw bytes, although message data can also conform to data schemas.
Key | Messages are optionally tagged with keys, which are used in partitioning and are also useful for things like topic compaction.
Properties | An optional key/value map of user-defined properties.
Producer name | The name of the producer that produces the message. If you do not specify a producer name, the default name is used.
Sequence ID | Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID of the message is its order in that sequence.
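The message components above can be sketched as a small data model. This is only an illustration in plain Python, not the Pulsar client API: the `Message` and `SketchProducer` classes, the `"default-producer"` name, and the example field values are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Message:
    """Illustrative model of the message components (not the client API)."""
    value: bytes                          # raw payload bytes
    key: Optional[str] = None             # optional; used for partitioning/compaction
    properties: Dict[str, str] = field(default_factory=dict)  # user-defined map
    producer_name: str = "default-producer"  # hypothetical default name
    sequence_id: int = 0                  # order of the message in its sequence


class SketchProducer:
    """Hypothetical producer that stamps each message with the next sequence ID."""

    def __init__(self, name: str = "default-producer"):
        self.name = name
        self._next_sequence_id = 0

    def new_message(self, value, key=None, properties=None):
        msg = Message(
            value=value,
            key=key,
            properties=dict(properties or {}),
            producer_name=self.name,
            sequence_id=self._next_sequence_id,
        )
        self._next_sequence_id += 1
        return msg
```

For example, two calls to `SketchProducer("weather-producer").new_message(...)` on the same producer object would yield messages with sequence IDs 0 and 1.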
Producer-Consumer
● Producer: the publisher sends data and doesn't know about the subscribers or their status.
● Topic: all interactions go through Pulsar, and it handles all communication.
● Consumer: the subscriber receives data from the publisher and never directly interacts with it.
Streaming and Messaging
[Diagram: a Pulsar topic/partition consumed through four subscription types: Exclusive, Failover (a second consumer is available in case of failure in Consumer B-0), Shared, and Key-Shared.]
Pulsar's Publish-Subscribe model
● Producers send messages.
● Topics are ordered, named channels that producers use to transmit messages to subscribed consumers.
● Messages belong to a topic and contain an arbitrary payload.
● Brokers handle connections and route messages between producers and consumers.
● Subscriptions are named configuration rules that determine how messages are delivered to consumers.
● Consumers receive messages.
[Diagram: Producer 1 and Producer 2 publish to a topic on the broker; a subscription delivers to Consumer 1, Consumer 2, and Consumer 3.]
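The publish-subscribe roles above can be illustrated with a toy in-memory topic. This is a minimal sketch and not how a broker is implemented: the `Topic` class and its methods are hypothetical, and each subscription is assumed to start reading from the beginning of the log.

```python
class Topic:
    """In-memory stand-in for a topic: an ordered, named channel."""

    def __init__(self, name):
        self.name = name
        self.messages = []   # ordered log of payloads
        self.cursors = {}    # subscription name -> index of next message to read

    def publish(self, payload):
        """Producer side: append a message; producers never see consumers."""
        self.messages.append(payload)

    def subscribe(self, subscription):
        """Create a named subscription with its own read position."""
        self.cursors.setdefault(subscription, 0)

    def receive(self, subscription):
        """Consumer side: next unread message for this subscription, or None."""
        index = self.cursors[subscription]
        if index >= len(self.messages):
            return None
        self.cursors[subscription] = index + 1
        return self.messages[index]
```

Because each subscription keeps its own position, two subscriptions on the same topic each see every message, while the producer only ever talks to the topic.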
Pulsar Subscription Modes
Different subscription modes have different semantics:
● Exclusive/Failover - guaranteed order, single active consumer
● Shared - multiple active consumers, no order
● Key_Shared - multiple active consumers, order for given key
[Diagram: Producer 1 and Producer 2 publish to a Pulsar topic with four subscriptions.
Subscription A (Exclusive): Consumer A.
Subscription B (Failover): Consumer B-1 and Consumer B-2, with Consumer B-2 used in case of failure in Consumer B-1.
Subscription C (Shared): Consumer C-1 and Consumer C-2; one receives <K1,V10>, <K2,V21>, <K1,V12> and the other receives <K2,V20>, <K1,V11>, <K2,V22>.
Subscription D (Key-Shared): Consumer D-1 and Consumer D-2; one receives <K1,V10>, <K1,V11>, <K1,V12> and the other receives <K2,V20>, <K2,V21>, <K2,V22>.]
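The difference between the subscription modes can be sketched as three dispatch functions over `(key, value)` pairs. This is a simplified model, not broker code: the function names are hypothetical, round-robin stands in for Shared delivery, and `zlib.crc32` is an assumed stand-in for whatever key hash the broker actually uses.

```python
import zlib
from itertools import cycle


def dispatch_single_active(messages, consumers, failed=()):
    """Exclusive/Failover: one active consumer receives everything, in order.

    With Failover, the next consumer in the list takes over if the first failed.
    """
    active = next(c for c in consumers if c not in failed)
    return {active: list(messages)}


def dispatch_shared(messages, consumers):
    """Shared: messages are spread across consumers; no per-key ordering."""
    out = {c: [] for c in consumers}
    for consumer, msg in zip(cycle(consumers), messages):
        out[consumer].append(msg)
    return out


def dispatch_key_shared(messages, consumers):
    """Key_Shared: every message with a given key goes to the same consumer."""
    out = {c: [] for c in consumers}
    for key, value in messages:
        # crc32 is only a deterministic stand-in hash for this sketch.
        index = zlib.crc32(key.encode("utf-8")) % len(consumers)
        out[consumers[index]].append((key, value))
    return out
```

In this model, Key_Shared keeps all `K1` messages on one consumer in publish order, while Shared may split them across consumers.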
Integrated Schema Registry
[Diagram: the schema registry holds schema-1, schema-2, and schema-3 (value=Avro/Protobuf/JSON). Producers and consumers each keep a local cache for schemas (schema data + ID).]
● Producers: send (register) the schema if it is not in the local cache, then send schema-1 (value=Avro/Protobuf/JSON) data serialized per schema ID.
● Consumers: get the schema by ID if it is not in the local cache, then read schema-1 (value=Avro/Protobuf/JSON) data deserialized per schema ID.
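The register-on-miss and fetch-on-miss flow above can be sketched with in-memory stand-ins. Everything here is hypothetical and simplified: the class names, the dictionary-based schema format with a `fields` list, the integer IDs, and the use of JSON for serialization are assumptions for illustration, not Pulsar's actual registry protocol.

```python
import json


class SchemaRegistry:
    """In-memory stand-in: assigns an ID to each distinct schema."""

    def __init__(self):
        self._by_id = {}
        self._ids = {}
        self.round_trips = 0   # counts registry calls, to show the effect of caching

    def register(self, schema):
        self.round_trips += 1
        canonical = json.dumps(schema, sort_keys=True)
        if canonical not in self._ids:
            schema_id = len(self._by_id) + 1
            self._ids[canonical] = schema_id
            self._by_id[schema_id] = schema
        return self._ids[canonical]

    def get(self, schema_id):
        self.round_trips += 1
        return self._by_id[schema_id]


class CachingProducer:
    def __init__(self, registry):
        self.registry = registry
        self._cache = {}   # canonical schema -> schema ID

    def send(self, schema, record):
        canonical = json.dumps(schema, sort_keys=True)
        if canonical not in self._cache:        # register only on a cache miss
            self._cache[canonical] = self.registry.register(schema)
        payload = json.dumps(record).encode("utf-8")
        return self._cache[canonical], payload  # (schema ID, serialized data)


class CachingConsumer:
    def __init__(self, registry):
        self.registry = registry
        self._cache = {}   # schema ID -> schema

    def read(self, schema_id, payload):
        if schema_id not in self._cache:        # fetch by ID only on a cache miss
            self._cache[schema_id] = self.registry.get(schema_id)
        record = json.loads(payload.decode("utf-8"))
        missing = [f for f in self._cache[schema_id]["fields"] if f not in record]
        if missing:
            raise ValueError(f"record is missing fields: {missing}")
        return record
```

Sending two records with the same schema costs one registry call on the producer side and one on the consumer side; every later message is served from the local caches.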
Pulsar Functions
A serverless event streaming framework
● Lightweight computation similar to AWS Lambda.
● Specifically designed to use Apache Pulsar as a message bus.
● Function runtime can be located within Pulsar Broker.
● Java Functions
Pulsar Functions
● Consume messages from one or more Pulsar topics.
● Apply user-supplied processing logic to each message.
● Publish the results of the computation to another topic.
● Support multiple programming languages (Java, Python, Go)
● Can leverage 3rd-party libraries to support the execution of ML models on the edge.
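The consume, process, publish pattern above can be sketched as a plain Python function. This assumes the language-native style of Python function, in which a module exposes a `process` function and the runtime publishes its return value to the output topic; the weather payload and the `temp_f` and `temp_c` field names are hypothetical.

```python
import json


def process(input):
    """Convert a JSON weather reading from Fahrenheit to Celsius.

    The runtime is assumed to call this once per message and to publish
    the returned string to the configured output topic.
    """
    reading = json.loads(input)
    # "temp_f" and "temp_c" are hypothetical field names for this sketch.
    celsius = (float(reading["temp_f"]) - 32.0) * 5.0 / 9.0
    reading["temp_c"] = round(celsius, 1)
    return json.dumps(reading)
```

Because the logic is an ordinary function, it can be unit-tested locally by calling `process` with a JSON string before it is deployed to a cluster.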