Enterprises are shifting toward the cloud, and with that shift come new expectations for a messaging system, especially when it is offered as a managed service in a large-scale, multi-tenant environment. A large enterprise needs to evaluate a messaging system carefully and be confident in it before expanding a complex distributed data system like Apache Pulsar from on-premise deployments to elastically scalable, fully managed cloud services. The aspects to consider include migration from and integration with large-scale on-premise clusters, security, cost efficiency, the cloud friendliness of the architecture, cost and capacity modeling, tenant isolation, deployment robustness, availability, and monitoring. Not every messaging system is built to be cloud-native and to run cost-efficiently as a managed service. We have been running large-scale Apache Pulsar at Yahoo for the last 8 years on various platforms and hardware configurations while meeting application SLAs and serving more than 1M topics in a cluster. In this talk, we cover Pulsar’s journey at Yahoo! from an on-premise platform to a hybrid cloud and on-premise system, along with the architecture and features that make Pulsar a good cloud-native messaging choice for any enterprise.
Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022
1. Pulsar's Cloud Journey in Yahoo!
On-prem, Cloud, and Hybrid
Rajan Dhabalia rdhabalia@yahooinc.com
Ludwig Pummer ludwig@yahooinc.com
Pulsar Summit 2022
3. Agenda
1. Pulsar in Yahoo!
2. Cloud challenges for a messaging system
3. Why platforms should choose Pulsar for public cloud
4. Why users choose Pulsar on cloud
5. Summary
6. Q&A
4. Pulsar's Journey in Yahoo!
● Developed by Yahoo! in 2014 to serve as a hosted pub-sub service
○ Open-sourced in 2016
● Global deployment
○ 6 data centers (Asia, Europe, US)
○ Public cloud presence on AWS
○ Full-mesh replication
● Mission-critical use cases
○ Low-latency bus for use by other low-latency services
○ Write availability
○ Sherpa (PNUTS), Mail, Finance, News, monitoring systems, etc.
5. Challenges on Cloud for Messaging Systems
Managed Service
• Multi-tenancy (shared by different use cases)
• Cost calculation
Security, Connectivity
• Data security (EKS support)
• Network security (VPC, Security Groups, Network ACLs)
• Auth
• Secured enterprise proxy support (ATS, HAProxy, etc.)
Performance, Reliability, Availability
• Availability even after all replicas crash
• Fault tolerance
• Durability (EBS) vs performance (local storage)
6. Why Platforms Should Choose Pulsar on Public Cloud
Cost effective
● High performance at lower cost
● Durability and availability without cost overhead
Availability and Performance
● High availability
● Low-latency bus
● Native load balancer and fault tolerance
● Data durability and no data loss
Managed Service
● Multi-tenancy
● Enterprise proxy support for secure connectivity
● Hard/soft isolation
● Cost management
Deployment and Monitoring
● Easy deployment
● Zero downtime
● Blue-green cluster support
● Stats and monitoring
7. Managed Service
● Multi-tenancy
○ Multiple use cases on the same cluster: low-latency publish, cold reads, and high fan-out are supported thanks to the bookie’s I/O isolation (Figure 1)
○ Soft and hard isolation at broker and bookie
● Cost calculation and management
● Enterprise proxy support to allow connectivity on cloud (PIP-60)
○ e.g. ATS, HAProxy, etc.
● Hybrid mode supported by syncing cluster and ACL metadata (PIP-136)
● Users do not need Pulsar expertise
● Lower maintenance and upgrade effort by maintaining a shared cluster
Figure 1: Bookie I/O isolation and WAL architecture. The writer appends to the journal on the journal device (high-performance, small EBS storage, e.g. io1/io2); cold reads are served from the data file on the data device (less expensive persistent EBS storage, e.g. gp3).
8. Availability and Performance
● Availability
- High availability during a rolling upgrade or node crash thanks to the segment-oriented architecture (Figure 2)
● Durability
- Bookies use highly durable EBS storage, which lets crashed bookie pods recover and serve reads
● Performance
- Maximize utilization and performance on EBS
- WAL (journal) on high-performance storage: small io2/gp3 volumes
- Data storage on less expensive storage: gp3
● Scalability
- Container-friendly deployment on Kubernetes
- Auto Scaling group for stateless brokers
Figure 2: Bookie segmentation
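The availability point on this slide rests on how a segment-oriented log reacts when a storage node disappears: the writer seals the current segment and opens a new one on healthy nodes instead of waiting for the failed replica. The Python sketch below is a toy model of that idea only. The `Topic` and `Segment` classes, the bookie names, and the replica count are hypothetical and are not BookKeeper's API.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One contiguous chunk of a topic's log, stored on a fixed set of bookies."""
    bookies: tuple
    entries: list = field(default_factory=list)
    closed: bool = False


class Topic:
    """Toy model of a segment-oriented log (hypothetical, not the BookKeeper API)."""

    def __init__(self, live_bookies, replicas=2, seed=0):
        self.live = set(live_bookies)
        self.replicas = replicas
        self.rng = random.Random(seed)
        self.segments = [self._new_segment()]

    def _new_segment(self):
        if len(self.live) < self.replicas:
            raise RuntimeError("not enough live bookies for the replica count")
        chosen = self.rng.sample(sorted(self.live), self.replicas)
        return Segment(bookies=tuple(chosen))

    def bookie_down(self, name):
        self.live.discard(name)

    def write(self, entry):
        current = self.segments[-1]
        # If any bookie of the open segment is gone, seal that segment and
        # roll over to a fresh one on healthy bookies instead of blocking.
        if not set(current.bookies) <= self.live:
            current.closed = True
            current = self._new_segment()
            self.segments.append(current)
        current.entries.append(entry)


topic = Topic(["bk1", "bk2", "bk3"], replicas=2)
topic.write("m1")
topic.bookie_down(topic.segments[0].bookies[0])  # simulate a crash or rolling restart
topic.write("m2")                                # the write still succeeds
```

After the simulated failure the topic holds two segments: the sealed one containing `m1` and a new one containing `m2`, placed only on bookies that are still up.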
9. Effortless Deployment
● Rolling upgrade: zero downtime and durability
- High availability during a rolling upgrade or node crash thanks to the segment-oriented architecture
- Bookies use highly durable EBS storage, which lets crashed bookie pods recover and serve reads
● Deployment components
- Deployment to an EKS cluster using a Helm chart
- Prometheus and monitoring dashboards for alerts and monitoring
● Blue-green deployment support
- Easy EKS cluster upgrade and migration using blue-green cluster migration support (PIP-188)
● Legacy on-prem topic migration with a custom topic factory
- Pulsar supports a custom topic factory to manage custom topic behavior for legacy topic migration (PIP-100)
10. Cost Effective
● EBS storage vs local storage on cloud
- Pulsar on EBS: cost effective, high performance, and durable
- EBS is more durable and cheaper than local storage, but local storage is faster
● High-performance WAL and cheaper durable storage
- Use high-performance EBS storage only for the WAL, which needs little capacity, to achieve low latency, e.g. io2, or gp3 with high IOPS and throughput thresholds
- Use cheaper durable EBS (e.g. gp3) for data storage, which does not affect publish latency
● Do not pay for an extra replica to maintain availability
- During deployment, a partition-oriented architecture requires an extra replica (RF=3) for availability, whereas segment-oriented bookies require only RF=2
- Bookie segments are created on the fly so that topic writes continue
● Cheaper broker compute for high fanout
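The storage argument on this slide is simple arithmetic: replicated data volume times a per-GB price, plus a small fast volume per bookie for the WAL. The sketch below works one example in Python. The prices are placeholder units, not AWS list prices, and the data size, WAL size, and node count are made up for illustration. Provisioned IOPS and throughput charges are ignored.

```python
def monthly_storage_cost(data_gb, replicas, data_price_per_gb,
                         wal_gb_per_node=0, wal_price_per_gb=0.0, nodes=0):
    """Replicated data cost plus a small fixed-size WAL volume per node."""
    data_cost = data_gb * replicas * data_price_per_gb
    wal_cost = wal_gb_per_node * wal_price_per_gb * nodes
    return data_cost + wal_cost


# Placeholder prices in arbitrary units per GB-month (NOT real AWS prices).
CHEAP_DURABLE = 1.0   # stands in for a gp3-class data volume
FAST_WAL = 4.0        # stands in for an io2-class journal volume

# Partition-oriented layout: three full replicas on the data volume type.
rf3 = monthly_storage_cost(1000, replicas=3, data_price_per_gb=CHEAP_DURABLE)

# Segment-oriented layout: two replicas plus a 20 GB fast WAL on each of 3 bookies.
rf2 = monthly_storage_cost(1000, replicas=2, data_price_per_gb=CHEAP_DURABLE,
                           wal_gb_per_node=20, wal_price_per_gb=FAST_WAL, nodes=3)

print(rf3, rf2)  # 3000.0 2240.0
```

With these assumed numbers the fast WAL volumes add 240 units while dropping one replica saves 1000, so the WAL cost stays small as long as the journal volume is small relative to the retained data.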
12. Secure Connectivity: Mutual TLS
● Mutual TLS for transport and authentication
- Each tenant has a distinct CN
- Cloud brokers and on-prem brokers have distinct CNs
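With distinct CNs per tenant and per broker fleet, the certificate presented during the TLS handshake is enough to tell peers apart. The Python sketch below shows the idea using the dictionary shape returned by the standard library's `ssl.SSLSocket.getpeercert()`. The CN values and the `ROLES` table are hypothetical, and this illustrates the concept rather than Pulsar's own authentication provider.

```python
def common_name(peer_cert):
    """Return the first commonName in a dict shaped like ssl.SSLSocket.getpeercert()."""
    for rdn in peer_cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


# Hypothetical CN -> role table; real CNs would come from the issuing CA's policy.
ROLES = {
    "tenant-a.example": "tenant:tenant-a",
    "broker.cloud.example": "broker:cloud",
    "broker.onprem.example": "broker:on-prem",
}


def role_for(peer_cert):
    """Map a verified peer certificate to a role, or None if the CN is unknown."""
    return ROLES.get(common_name(peer_cert))


# Example certificate subject in the format produced by getpeercert().
cert = {"subject": ((("organizationName", "Example Org"),),
                    (("commonName", "tenant-a.example"),))}
print(role_for(cert))  # tenant:tenant-a
```

Keeping cloud and on-prem broker CNs separate means a policy can treat the two fleets differently even though both authenticate with certificates from the same CA.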
18. Availability, Performance, and Price
● Availability
- Cluster online through all maintenance operations
- EKS and Pulsar recover nodes/pods/topics automatically
- Client library reconnects and retries automatically
● Persistence guarantee
- Every acknowledged message is fsynced to 2 EBS volumes
● Low latency
- < 8 ms 99th-percentile publish latency at 1 KB (c5.4xlarge with gp3), with mTLS and disk encryption enabled
● Price
- About one-seventh of MSK for equivalent MB/s
19. Security and Encryption
● End-to-end (Envelope) Encryption
- Encrypt/Decrypt available in client library
- Pulsar platform never sees your keys or plaintext
- Multi-tenant friendly
● Multi-tenant Authorization
- Granular authorization from namespace to subscription name
- You grant other tenants access to your topics
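Granular authorization here means a grant can be as broad as a namespace or as narrow as a single subscription name. The Python sketch below models that layering. The `NamespacePolicy` class, the role names, and the subscription names are hypothetical, and the data layout is an assumption for illustration rather than Pulsar's internal policy format.

```python
from dataclasses import dataclass, field


@dataclass
class NamespacePolicy:
    """Toy model of namespace-level grants narrowed by subscription name."""
    # role -> set of allowed actions ("produce", "consume")
    grants: dict = field(default_factory=dict)
    # subscription name -> roles allowed to use it; absent means unrestricted
    subscription_roles: dict = field(default_factory=dict)

    def can_produce(self, role):
        return "produce" in self.grants.get(role, set())

    def can_consume(self, role, subscription):
        if "consume" not in self.grants.get(role, set()):
            return False
        allowed = self.subscription_roles.get(subscription)
        return allowed is None or role in allowed


policy = NamespacePolicy()
policy.grants["tenant-b"] = {"consume"}          # the owner grants another tenant read access
policy.grants["tenant-c"] = {"consume"}
policy.subscription_roles["billing-sub"] = {"tenant-b"}  # reserve one subscription

print(policy.can_consume("tenant-b", "billing-sub"))  # True
print(policy.can_consume("tenant-c", "billing-sub"))  # False
print(policy.can_consume("tenant-c", "audit-sub"))    # True
print(policy.can_produce("tenant-b"))                 # False
```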
20. Security and Encryption
● Network Encryption
- Encrypted during transport
- Mutual TLS between client, brokers, and bookies
● Storage Encryption
- Encrypted at rest
- The overhead of encrypted EBS volumes is already included in the publish latency figure
● Network Security
- PrivateLink simplifies Network ACLs, Security Groups, and Routing
- SNI Routing + mTLS protects against MITM
21. Geo Replication and Hybrid Access
● Full-mesh replication under tenant control
- Cloud cluster and on-prem cluster are equals
- Publish anywhere, consume anywhere
- Replicate a topic into a new cloud cluster with one pulsar-admin command
● Hybrid access
- Tenant in cloud to Pulsar in cloud: PrivateLink with SNI Proxy
- Tenant in cloud to Pulsar on-prem: Pulsar Proxy or SNI Proxy
- Tenant on-prem to Pulsar in cloud: Public NLB with SNI Proxy
- Tenant on-prem to Pulsar on-prem: Direct Connect
- Same topic name
- Only the connection parameters change: Service URL, Proxy Scheme, Proxy Service URL
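The four hybrid access paths listed on this slide amount to a lookup from where the tenant runs and where the cluster runs to a route, with only three client settings changing. The Python sketch below captures that table. The hostnames, ports, topic name, and the `connect_params` helper are hypothetical, and the dictionary keys are illustrative labels rather than actual Pulsar client option names.

```python
# (tenant location, Pulsar location) -> access path, copied from the slide.
ACCESS_PATH = {
    ("cloud", "cloud"): "PrivateLink with SNI Proxy",
    ("cloud", "on-prem"): "Pulsar Proxy or SNI Proxy",
    ("on-prem", "cloud"): "Public NLB with SNI Proxy",
    ("on-prem", "on-prem"): "Direct Connect",
}


def connect_params(tenant_at, pulsar_at, service_url, proxy_service_url=None):
    """Collect the settings that differ per route; the topic name is not one of them."""
    params = {"path": ACCESS_PATH[(tenant_at, pulsar_at)], "service_url": service_url}
    if proxy_service_url is not None:
        # Assumption: an SNI proxy fronts the cluster on this route.
        params["proxy_scheme"] = "SNI"
        params["proxy_service_url"] = proxy_service_url
    return params


TOPIC = "persistent://tenant-a/ns1/events"  # hypothetical; identical on every route

cloud = connect_params("cloud", "cloud",
                       "pulsar+ssl://broker.cloud.example:6651",
                       "pulsar+ssl://sni-proxy.cloud.example:4443")
onprem = connect_params("on-prem", "on-prem",
                        "pulsar+ssl://broker.onprem.example:6651")

print(cloud["path"])   # PrivateLink with SNI Proxy
print(onprem["path"])  # Direct Connect
```

Because the topic name stays fixed, moving an application between environments is a configuration change rather than a code change.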
22. Summary
1. Cluster management requires few operational resources
2. Super secure ecosystem
3. Cost effective and highly performant
4. Multi and hybrid cloud geo replication
5. Happy platform and happy customers