
pulsar-neighborhood.github.io's Introduction

Pulsar Neighborhood web site

Welcome to the home of the Pulsar Neighborhood site. This is the source behind the magic. As an open community, we welcome all ideas and contributions, whether it's a new feature for the site or content you would like to see.

About the site's framework

Pulsar Neighborhood was created with the Hugo framework. It uses Bootstrap components to create the wondrous experience that it is. There is a mix of core Hugo functions, good ideas from seasoned Hugo experts, and the odd idea found out in the wild. We try to follow Hugo best practices for the site's folder structure and naming. If you are familiar with the framework, then this site is [hopefully] very easy to get started with.

If you see something about the site's design that could be a little better, please make a suggestion or PR that great idea right into the project.

The site has four content types: article, guide, video, and spotlight. Articles are the most common and represent a blog post or some other written story about Pulsar. Guides are step-by-step instructions on how to achieve some goal using Pulsar. Videos consist of a brief description of what is covered and a YouTube link. Spotlight is a special type used for the community spotlight section of the site.

"I love the spotlight" - said no software engineer, ever

Don't sweat this. It's less about you and more about what you've done with Pulsar. If you've got something to share but don't feel like writing it all out let's put it in the spotlight. Open a spotlight issue and give us a sentence or two about the cool things you've done. Share a link if it's something public.

Contributing ideas and content

It can be a simple idea, a written article, or a half thought-out step-by-step guide. However far along the content is, the community is here to help make it a Pulitzer candidate. Your options...

Opening an issue (simplest)

When you open an issue in the repo, you are asked to choose from a template. Use this to choose the type of content that best fits your idea. Paste your markdown within the "content" area of the new issue, or simply type out a few lines describing your idea. There are a few additional areas of interest like title, author info, image, etc. All of this is optional. A content moderator will guide you through getting everything right.

Running the site locally

If you're feeling adventurous (and have a basic understanding of Hugo), you could fork the repo and develop your idea locally, then open a pull request to suggest the idea to the main site. Below are commands you can use to get content started:

Create a new article:

hugo new articles/my-really-great-idea.md

Create a new guide:

hugo new guides/my-step-by-step-guide.md

Create a new video:

hugo new videos/my-cool-video.md
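To preview your changes as you write, you can run Hugo's built-in development server from the repo root. A minimal sketch, assuming a standard Hugo setup (the fork URL is a placeholder, substitute your own GitHub username):

```shell
# Clone your fork (placeholder URL) and enter the repo
git clone https://github.com/<your-username>/pulsar-neighborhood.github.io.git
cd pulsar-neighborhood.github.io

# Start Hugo's dev server; -D also renders draft content
hugo server -D

# The site is now served at http://localhost:1313/ and reloads on save
```

Hugo rebuilds the site in memory on every file change, so you can watch your new article render as you edit it.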

Content formatting examples

As you write content for the community, use the markdown examples below to take things to the next level. Spoiler: no one likes a wall of text, and everyone loves reading code.

Guide steps [Markdown] [Live] - use this to get a step by step guide started

Tabs [Markdown] [Live] - share your code in multiple runtimes

Code Snippets [Markdown] [Live] - include snippets throughout your content, maybe even highlight a line and comment on it

Blockquote [Markdown] [Live] - make a statement

Information Callout [Markdown] [Live] - include a call out box with further info about a topic

Warning Callout [Markdown] [Live] - include a call out box with fair warning about something

Danger Callout [Markdown] [Live] - include a call out box that warns of dragons or ill-tempered creatures ahead

Success Callout [Markdown] [Live] - include a call out box declaring a win

Table [Markdown] [Live] - markdown tables aren't the best, use this to go a little further
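As a sketch of what these shortcodes look like in practice: the YouTube shortcode below appears verbatim elsewhere in this repo's issues, while the callout shortcode name is illustrative only (check the linked Markdown examples for the exact names the site uses):

```
{{< youtube id="K2WXDwo1y0k" class="youtube" >}}

{{< callout type="info" >}}
A call out box with further info about a topic.
{{< /callout >}}
```

When a page is rendered, Hugo replaces each shortcode with the matching template from the site's layouts, which is what keeps raw Bootstrap markup out of the content files.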

pulsar-neighborhood.github.io's People

Contributors

ddieruf, aarondonwilliams, alexleventer

Stargazers

Ebere Abanonu, Kiryl Valkovich

Watchers

Ebere Abanonu

Forkers

wingchoi, compuguy

pulsar-neighborhood.github.io's Issues

[Video] Apache Pulsar and Machine Learning

Sep 16, 2021

Your Name

Apache Pulsar Neighborhood

Url of an image to represent you (or attach to the issue)

Level of Pulsar understanding

100 (beginner)|200 (intermediate)|300 (advanced)

Content (don't worry about formatting, the site moderators can help)

You have used Apache Pulsar for messaging and you know of AI/ML. Simba Khadder will be sharing his insights on why the Pulsar distributed messaging system and its separation of storage and broker layers are a great fit for AI/ML applications.

{{< youtube id="K2WXDwo1y0k" class="youtube" >}}

Happenings in the Pulsar Neighborhood April '22

Happenings in the Pulsar Neighborhood April '22

Your Name

Pulsar Neighborhood

Url of an image to represent you (or attach to the issue)

Level of Pulsar understanding

100 (beginner)

Content (don't worry about formatting, the site moderators can help)

Was this forwarded to you? Click here to get future copies of Happenings!

For this issue, ApacheCon ‘21, new Neighborhood article, a new committer, and our first in person event. Plus our normal features of a Stack Overflow question and some monthly community stats.

ApacheCon ‘21

ApacheCon North America dates and location have been announced. The event is October 3-6 in New Orleans. The Call for Papers has opened and there is an Apache Pulsar track! Other tracks include Big Data, Search, IoT, Cloud, Fintech, and Tomcat. And we bet that our neighbors can come up with a talk that would fit in each of these categories, so please submit your talks and let's have Pulsar take over ApacheCon!

Also, we want to take a moment to thank David Fisher, PMC member (and former mentor to Pulsar when it was in incubation) for proposing and captaining the CFP process for the Pulsar track.

New Neighborhood Article

We have published our first article under the Pulsar Neighborhood byline. It is titled Understanding the Differences Between Message Queues and Streaming.

“While message queues and streaming apply to similar use cases and use similar technologies, on a technical level they’re entirely different. We’ll compare them here and examine the pros and cons of each solution, touching on message brokers, publisher-subscriber (pub/sub) architecture, and event-driven scenarios.”

So please check it out and let us know what you think. We will be releasing three articles a month, but we also want your content. Do you have a Pulsar article that you would like us to publish? Let us know by submitting our form. Already wrote an article that has been published? Great, we want to help you promote it. Why do we want to promote it? Because that is what Neighbors do, they work together to improve their Neighborhood. So you can use the same form as above and just tell us the abstract and the URL or you can post it in the #blogs-articles channel on the Apache Pulsar Workspace.

If you have a suggestion on what we should write next, let us know via the form above. And if you see a Pulsar article, blog, tutorial, event, etc. let us know and let your friends know too, by liking it and tagging it with #apachePulsar. Together we can raise the awareness of Apache Pulsar.

New Committer Announced

The PMC announced that Andrey Yegorov of DataStax was named as a new committer. Andrey made his first contribution to Pulsar in Feb 2021. He has done great work for Pulsar, including Connector and Adaptor work plus updating dependencies for CVEs. Andrey, thank you so much for what you have done and we look forward to your next contributions to the Neighborhood.

In Person Meetup

Coming up very quickly is our first in person event. On 8 April, we will be hosting our first in person event at The Hacker Building in Amsterdam, Netherlands. Neighbors and Memgraph’s developer relations engineers Ivan Despot and Katarina Supe will be talking about how Pulsar connects Memgraph and the backend server.

The second talk will be by Neighbor Christophe Bornet, Sr. Software Engineer from DataStax. He will be doing a deep dive into the Pulsar Binary Protocol. This part will be virtual, so everyone can join. But beer will only be served at the in person event. Christophe’s talk will begin about 8 pm and you can get all of the information on the NL Pulsar Meetup group’s site.

New Website’s Survey

We have mentioned this a couple of times (we are just really excited about it), a group of our neighbors has been working hard to improve the Apache Pulsar home page. The complete redesign of the site is in Beta and they would like you to fill out a survey giving them feedback about the new site. So check out the new site and take a moment to help them make it even better.

Upcoming events…

April 8 - NL Pulsar Meetup Group (see above)
Oct - ApacheCon- CFP is out now.

Would you and some colleagues like to set up a Neighborhood Meetup group or maybe you have someone who you would like to hear speak at a future meetup? Let us know and we can give you some help. Visit us at our Neighborhood Meetup page or our slack channel #meetup and ask questions.

Great questions from the Apache Pulsar Stack Overflow

As you know, we have very active Slack and Stack Overflow neighborhoods. You can ask questions at both locations and get answers quickly. Slack does have two big weaknesses. One, it is limited to about 10k saved messages, and we hit that limit about every three months. Two, it is not searchable by Google. Thus, when you put the error message that you received into Google, you won't see that the question has already been answered once or twice on Slack. So to promote our great Stack Overflow channel, we thought that we would find a good question and include it here in Happenings.
Last month we pulled one from the archives, but for this month, we liked the newest one (well when this was published, it was the newest one.)
Question: non-persistent message is lost when throughput is high?
I found that non-persistent messages are lost sometimes even though the my pulsar client is up and running. Those non-persistent messages are lost when the throughput is high (more than 1000 messages within a very short period of time. I personally think that this is not high). If I increase the parameter receiverQueueSize or change the message type to persistent message, the problem is gone.
The question has been viewed well over 100 times and has an accepted answer. Do you know the answer? Do you agree with the answer? Can you improve it?

Stats of the Month

For March, we had just over 3k conversations from 290 unique people, and 84 people made 389 contributions to the code base and/or the documentation. Of the 84, 24 made their first contribution. To them, thank you for your contribution and we look forward to your next one!

Apache Pulsar in the News

Here are some blog posts that we have found from around the web. We think that they are good, but we might not have read them all. Let us know what you have written and we will share it. Post links on our blogs-articles channel on the Apache Pulsar Slack. Or to see more, plus presentations, go here.
Pulsar or Kafka? And the lessons from doing our own testing
Apache Pulsar Client Application Best Practices
The Path to Getting the Full Data Stack on Kubernetes
Building Asynchronous Microservices with ZIO

The Pulsar Neighborhood on Social Media

Follow us on: twitter, YouTube, Meetup, and website
To sign up to receive Happenings click here.

A better readme

As the site takes shape, write a readme that includes direction on how to:

  • Contribute articles & guides
  • Do markdown formatting in content (include example shortcodes)
  • Run the site locally and create new content
  • Become a member of the repo

Add a link to join the community

Similar to the Apache Airflow site, add a link in the footer that takes you to the "community" area of the Apache Pulsar site.

Happenings in the AP Neighborhood Dec. ‘21

Happenings in the AP Neighborhood Dec. ‘21

Your Name

Pulsar Neighborhood

Url of an image to represent you (or attach to the issue)

Level of Pulsar understanding

100

Content (don't worry about formatting, the site moderators can help)

Happenings in the AP Neighborhood Dec. ‘21


Was this forwarded to you? Click here to get future copies of Happenings!

Hello Everyone,

For this issue, we have three new committers, a new milestone, and lots of talks. Plus our normal features of a Stack Overflow question and some community stats.

New Committers

The Apache Pulsar PMC announced the addition of three new Committers, Marvin Cai, Jiwei Guo and Michael Marshall.

Marvin Cai is a Software Engineer from StreamNative and made his first PR in August 2020 and since then has made over 310 contributions (PR, comments, reviews, etc) to Apache Pulsar. You can check out his GitHub repo here.

Jiwei Guo is a Software Engineer from StreamNative and made his first PR in August 2020. Since then Jiwei has made over 250 contributions to Apache Pulsar. You can check out his GitHub repo here.

Michael Marshall is a Senior Software Engineer at DataStax. He made his first contribution in November 2020 and since then he has made over 360 contributions. His GitHub repo is located here.

So please join us in welcoming all three of our new committers and take a moment to check out their GitHub pages, check out the other work that they have done, and take a moment to follow them.

10k Stars on GitHub


And speaking of GitHub, Apache Pulsar hit a major milestone on Sunday 28 November. We surpassed 10,000 stars on GitHub. In February 2021, we had just over 7k stars, so in the last 3 quarters, we have increased our Stars by 45%. We wrote a short blog post about our history and how we have grown since our inception five years ago. At this rate, we will get our second 10,000 stars in about 18 months. If you haven’t already, please star Apache Pulsar and let’s see if we can get to 20k by the end of 2022!

Upcoming events…

As always, we have lots of talks going on in the community and even with a lot of holidays coming up, we have a lot of events coming up.

On 8 December, Pedro Silvestre of Imperial College London will be talking to the NorCal Neighborhood Meetup group about his blog post “On the Internals of Stream Processing”. So this talk will not be solely about Apache Pulsar but streaming in general. In his post, he used Apache Flink.

On 15 December Jowanza Joseph will be speaking at the Netherlands Apache Pulsar Meetup, about Apache Pulsar IO. Jowanza is a great speaker and very knowledgeable about Apache Pulsar, to the point that his book is coming out soon and is called Mastering Apache Pulsar: Cloud Native Event Streaming at Scale.

Enrico Olivelli will be speaking on December 21st (12:00 PST) at the Seattle Java Users Group. In this session, you will see how to use Pulsar in a Jakarta EE web application deployed on Apache TomEE via the JMS/EJB API, without installing any additional components to your cluster.

Our first event of 2022 will be with Rob Morrow of SigmaX on 12 Jan 2022 at the NorCal Neighborhood Meetup. The talk is titled “Using Open Source Software to Improve Streaming on the Edge” and about how to use Apache Pulsar and Apache Arrow to get sensor data from the billions of IoT devices into an IoT Gateway, because going to the Cloud is too slow and too costly.

Do you live in the Princeton, NJ area? The NYC Apache Pulsar Meetup is hosting a “Pulsar, Pizza, and Phun” event on 13 Jan 2022. Yes this is an in person event!

15–16 Jan ‘22- Pulsar Summit Asia

Would you and some colleagues like to set up a Neighborhood Meetup group or maybe you have someone who you would like to hear speak at a future meetup? Let us know and we can give you some help. Visit us at our Neighborhood Meetup page or our slack channel #meetup and ask questions.

Great questions from the Apache Pulsar Stack Overflow

As you know, we have very active Slack and Stack Overflow neighborhoods. You can ask questions at both locations and get answers quickly. Slack does have two big weaknesses. One, it is limited to about 10k saved messages, and we hit that limit about every three months. Two, it is not searchable by Google. Thus, when you put the error message that you received into Google, you won't see that the question has already been answered once or twice on Slack. So to promote our great Stack Overflow channel, we thought that we would find a good question and include it here in Happenings.

Question: I have an application that produces messages to Pulsar under a specific topic and shut down the application when it’s finished; at the same time, no consumer exists to read this topic.

After a while, when I create a consumer and want to read the written data out, I found all data are lost since the topic I’ve written been deleted by Pulsar.

How can I disable the auto-deletion of inactive topics in Pulsar?

Follow the link above to get the answer. Do you have something to add?

Stats of the Month

For Nov, we had 77 contributors making 387 contributions, with 19 of those contributors making their first contribution. We also had over 2k conversations from 244 different people. So the community is as busy as ever!

Apache Pulsar in the News

Here are some blog posts that we have found from around the web. We think that they are good, but we might not have read them all. Let us know what you have written and we will share it. Post links on our blogs-articles channel on the Apache Pulsar Slack. Or to see more, plus presentations, go here

Distributed Locks With Apache Pulsar
Announcing Memgraph 2.1
Infinite Scale without Fail
Apache BookKeeper Observability — Part 1 of 5

Apache Pulsar Neighborhood on Social Media

Follow us on: twitter, YouTube, Meetup, and Medium
To sign up to receive Happenings click here.

Adding a spotlight page

Would it be possible to add a spotlight page? The idea is to show everyone who has submitted their image.

This page would also include a link to add "yourself to the spotlight". Clicking an image would take you to that person's main page, much the same way clicking a picture on the first page does.

Maybe turn it on when we get 10-20 people.

Need a mission statement

We need a mission statement that we can link to from the main page of website.
Goal is to lay out what the TPN is and what it is not.
First Pass:

Mission Statement

The Pulsar Neighborhood community enables Apache Pulsar users and developers to learn about Pulsar and how to use it in their professional lives; via useful, interactive content (including blogs, Meetups, and YouTube videos). This is all driven by a passionate, vendor neutral, and inclusive community of members.

What We Are:

The Pulsar Neighborhood is an independent community (Neighborhood) of Apache Pulsar users, developers, and enthusiasts looking to improve their Apache Pulsar skills, find new features & best practices of the project, and promote Apache Pulsar & its community enabling the project’s continued growth and prosperity. This is done both through creation of content (blogs, tutorials, videos, and speaking engagements) and by the consumption of content created by other “neighbors”. This is all driven by a passionate, vendor neutral community of members for the betterment of the Apache Pulsar community.

What We Are Not:

The Pulsar Neighborhood is not the official location of Apache Pulsar information, nor is it part of the Project Management Committee (PMC). Our “neighbors” have no more influence over the Apache Pulsar project than any other community member. While we might have PMC members, Committers, contributors, and other community members attend speaking engagements, create blogs, tutorials, videos, etc., they are not doing so in their official roles. The official way to influence the Apache Pulsar project (and all ASF projects) is by creating discussion on the Apache Pulsar @dev and @user mailing lists. Formal proposals are made by creating a PIP (Pulsar Improvement Proposal).

Our Values-

“Community over Code”- this is the mantra of the ASF and The Pulsar Neighborhood (TPN) fully embraces this mantra by creating a Neighborhood where all are welcome to contribute.
Promote Apache Pulsar and its Neighborhood- like all ASF projects, Apache Pulsar is only as strong as the community (Neighborhood) around it.
Amplify the content of our fellow Neighbors- By working as a Neighborhood, we can better promote the great work of our neighbors in a louder and more effective way than we could as individuals.
Openness- Everything that we do is open to all, and all of our content is freely available. (Just give the Neighborhood and the individual neighbor or neighbors credit!)
Vendor Neutrality- We believe in it and our actions show it.
Success- When one neighbor succeeds, the entire Pulsar Neighborhood succeeds.

labels are not being added for Bug and feature

Describe the bug
A clear and concise description of what the bug is.
Is a label supposed to be added when someone creates a bug or feature request? I can add it afterwards, but I thought that it would be automatic.

To Reproduce
Steps to reproduce the behavior:

  1. Go to https://github.com/pulsar-neighborhood/pulsar-neighborhood.github.io/issues/new/choose
  2. select bug
  3. click submit

Expected behavior
A clear and concise description of what you expected to happen.
When I hit submit, the new issue would be labeled as a bug.

[Video] Running Apache Pulsar in Multiple Regions

Jul 20, 2021

Your Name

Apache Pulsar Neighborhood

Url of an image to represent you (or attach to the issue)

Level of Pulsar understanding

100 (beginner)|200 (intermediate)|300 (advanced)

Content (don't worry about formatting, the site moderators can help)

Zeke Dean walks us through how to run Apache Pulsar in multiple regions

{{< youtube id="HKdjPpsDUPc" class="youtube" >}}

Apache Pulsar is #5 in Commits to ASF Projects

Apache Pulsar is #5 in Commits to ASF Projects

Your Name

Pulsar Neighborhood

Url of an image to represent you (or attach to the issue)

Level of Pulsar understanding

100

Content (don't worry about formatting, the site moderators can help)

Apache Pulsar is #5 in Commits to ASF Projects

Source: Apache by the digits

The Apache Software Foundation (ASF) just released its annual blog post called “APACHE by the digits”. In it, there are a lot of interesting facts, like there are over 283 Billion lines of code across 2,300 repositories, which is just amazing. But we all know that lines of code is not the best metric to measure a project. If you want to use a metric from Git, it really should be commits. And there were a lot of those over the last year with over 200,000 commits across all Apache Projects. Apache Pulsar had 4632 of them, 2.2% of all ASF commits, putting us in fifth place for the number of commits!


Source: APACHE by the digits

There are a lot of other interesting insights in the ASF numbers. For example, 42% of contributors to ASF projects have worked with the project for under a year. There are over 15k commits each month to an ASF project, with November and March being the two busiest months and October being the least active. Check out the post here and let us know what you find interesting in the comments below.

Some Quick Apache Pulsar Numbers from 2021

In 2021, we had 350 different people make a contribution to the project in some way, with an average of 9.6 per person. What is really amazing is that 250 people made their first contribution in 2021. That means that over 70% of our neighbors who made a contribution to Apache Pulsar in 2021, made their first contribution in the last 12 months. And that makes up over half of the people (475) who have ever made a contribution to Apache Pulsar.

Our top ten Neighbors and the number of contributions in 2021:
Lari Hotari (268)
LI Li (161)
Lin Lin (160)
Matteo Merli (155)
Enrico Olivelli (132)
Penghui Li (106)
Bo Cong (94)
HeZhangJian (90)
Michael Marshall (78)
Jiwei Guo (78)

Apache Pulsar Neighborhood on Social Media

Follow us on: twitter, YouTube, Meetup, and Medium
To sign up to receive Happenings click here.

Create getting started type

Getting started with Pulsar should be an ever-growing area, not just a collection of guides.

  • Create a new area called "get-started" with its own archetype and _index.md
  • Add a button to the navigation labeled "Get Started Now" that takes you to _index

The index can be organized by different platforms & infrastructures one might be interested in deploying Pulsar to. Within each is a list of applicable guides.

Suggested:
Desktop

  • binaries
  • docker

Kubernetes

  • pulsar instance (manual)
  • pulsar instance (helm)
  • pulsar operator (and instance)

AWS

  • EC2
  • EKS
  • ECS ?

Azure

  • VMs
  • AKS

GCP

  • VMs
  • GKE
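If this structure is adopted, scaffolding it would follow the same pattern as the other content types. A minimal sketch, assuming the archetype is named "get-started" as proposed (the leaf-page file names are illustrative placeholders):

```shell
# Create the section landing page from the new archetype
hugo new get-started/_index.md

# Example leaf pages, one per platform (names are placeholders)
hugo new get-started/desktop/docker.md
hugo new get-started/kubernetes/pulsar-instance-helm.md
```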

9,966 Stars, who will be the 10,000?

9,966 Stars, who will be the 10,000?

Your Name

Pulsar Neighborhood

Url of an image to represent you (or attach to the issue)

Level of Pulsar understanding

100

Content (don't worry about formatting, the site moderators can help)

9,966 Stars, who will be the 10,000?

In the just over 5 years since the first cut of the code was pushed to GitHub, and the 3 years since Apache Pulsar became a top-level project, the Neighborhood has grown incredibly. We thought that since it is a little quiet here in the Neighborhood and we are about to pass 10,000 stars on GitHub, we would take a moment to highlight some of the milestones of the Apache Pulsar community.

In Oct of 2019, the neighborhood passed 100 people making a contribution of some type, with that number doubling to over 200 just five months later. In August of 2020, we passed 500, and then passed 1,000 neighbors by May of this year. And today, just 5 months later, we have over 1,600 members and are on pace to reach 2,000 by March 2022. These 1,600 members have had over 31,000 conversations, and over 400 of them have made 4,300 contributions on GitHub in some way.

And our next milestone is coming up quickly, 10,000 stars on GitHub!


Who will be the neighbor that puts us over the 10,000 star mark? When will it happen? Today? Visit our GitHub to see the latest number.

FYI: to see the accurate number, click the Star (Unstar) button and it will open up. Just make sure that you leave it as “starred”.

*Update- On 28 November 2021, we passed 10,000! Can we get to 20k by the end of 2022?

Apache Pulsar Neighborhood on Social Media

Follow us on: twitter, YouTube, Meetup, and Medium

Spotlight Shiv

Your name

Shivji Kumar Jha

Shiv is an architect at Nutanix and runs the stream platforms team supporting multiple Nutanix products. Shiv is a Pulsar geek who contributes code, docs, blogs, and talks to the Pulsar codebase and community. Shiv loves spending time on data stores (databases, streams, analytics, etc.), is an avid reader (tech, fiction, sports, economics, and leadership, to name a few), and is always looking at ways to simplify software architectures.

Url to the image that represents you

shiv_400x400

(all the below are optional)

GitHub URL

Github: https://github.com/shiv4289

Twitter URL

https://twitter.com/ShivjiJha

LinkedIn URL

https://www.linkedin.com/in/shivjijha/

Facebook URL

YouTube URL

https://www.youtube.com/watch?v=Bx4csRi1b8Y&list=PLA7KYGkuAD071myyg4X5ShsDHsOaIpHOq

Twitch URL

Email address

Other URL

https://sessionize.com/shivjijha/

Apache Pulsar Versus Apache Kafka

Your Name

[Image1](https://user-images.githubusercontent.com/16946028/163627966-dc0b4e18-d7ac-4d75-889f-2844926d11be.png)

Url of an image to represent you (or attach to the issue)

Level of Pulsar understanding

100 (beginner)|200 (intermediate)|300 (advanced)

Content (don't worry about formatting, the site moderators can help)

In today’s modern age of social media and online interconnectivity, real-time information relies on efficient systems that quickly and consistently distribute data to a wide range of consumers.

Pulsar and Kafka are both industry-standard messaging systems.

In 2011, LinkedIn developed Apache Kafka and published it as an open source system. Since then, it has surged in popularity and established itself as the industry standard for a highly scalable publish/subscribe (pub/sub) messaging platform. It’s still a fantastic free and open source technology for distributed streaming applications. However, there’s now an alternative.

Apache Pulsar began as a Yahoo project in 2013. It was open sourced in 2016 and adopted by the Apache Software Foundation. It’s a cloud-native distributed messaging and streaming platform. Hundreds of businesses — including Verizon Media, Tencent, Comcast, and Overstock to name a few — have embraced this solution.

In this article, we'll evaluate the features, architectures, performance, and use cases of Apache Pulsar versus Kafka, to help you decide which is the better solution for you.

Apache Pulsar Versus Apache Kafka

Speed and Performance

Pulsar is considerably faster than Kafka. It can provide higher throughput with reduced latency in real-world situations.

Confluent found that Kafka offers the lowest latency at higher throughputs, with high availability and durability. Kafka’s efficiency and fewer moving parts also lower its cost.

However, StreamNative countered that Confluent’s tests used a narrow set of parameters that did not measure Pulsar’s performance accurately. StreamNative’s benchmarks actually indicate that Pulsar surpasses Kafka in tasks that are more useful in the real world, like streaming messages and events.

For example, Pulsar’s catch-up read throughput was 3.5 times faster. Unlike Kafka, Pulsar was unaffected by increasing the number of partitions and changing durability levels. Pulsar also had lower latency — at milliseconds versus Kafka’s seconds — and was less impacted by the number of topics and subscriptions.

Pulsar also comes out on top when it comes to pricing and performance, according to GigaOm: 35 percent higher performance at a lower 3-year cost than Kafka, and 81 percent savings when handling higher data volumes.

Geo-Replication

Pulsar's most significant native capability, geo-replication, truly differentiates it from Kafka. Geo-replication systems use geographically distributed data centers to increase availability and disaster resilience. The built-in geo-replication function can synchronize data between clusters, usually located in different geographical regions, by duplicating topics. This approach supports diverse service availability needs, such as disaster recovery, data migration, and data backup.

Because Pulsar’s geo-replication is built-in, it does not require complicated setups or add-ons. If you publish a message to a topic in a replicated namespace, Pulsar automatically copies the message to the selected location or multiple locations.
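As a rough illustration of how little setup is involved, the usual flow with the `pulsar-admin` CLI is to grant a tenant access to the clusters and then set the replication clusters on a namespace. A minimal sketch, where the tenant, namespace, and cluster names are placeholders:

```shell
# Allow the tenant to use both clusters (all names are placeholders)
bin/pulsar-admin tenants create my-tenant \
  --allowed-clusters us-west,us-east

# Create a namespace and set it to replicate across the clusters
bin/pulsar-admin namespaces create my-tenant/replicated-ns
bin/pulsar-admin namespaces set-clusters my-tenant/replicated-ns \
  --clusters us-west,us-east
```

From then on, any message published to a topic in that namespace is copied to both clusters without further application changes.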

Cloud-Native Approach

Cloud-native means building, running, and distributing applications on the cloud’s distributed computing architecture. Developers build cloud-native applications using various technologies, including Docker, APIs, serverless functions, and microservices. Kubernetes is the industry standard for cloud-native orchestration.

Apache Pulsar integrates seamlessly with Kubernetes, supporting rolling upgrades, automatic monitoring, and horizontal scalability. Its multilayer architecture also integrates well with cloud infrastructures, separating computing (which is handled by brokers) from storage (which is managed by Apache BookKeeper).

Pulsar is cloud-native at its core. Kafka wasn’t a fully cloud-native solution from the start, so you need services like Confluent Operator to work with cloud services.

Messaging and Event-Streaming Architecture

Apache Pulsar's multi-layered design completely decouples the message routing and storage layers, allowing each to scale independently. It also integrates the advantages of classic messaging systems such as RabbitMQ with the benefits of publish/subscribe systems such as Kafka.

The publish/subscribe pattern lies at Pulsar's heart: producers publish messages to the server, and consumers subscribe to receive them.

In the publish/subscribe design pattern, message publishers don’t deliver messages to specific subscribers. Instead, message consumers subscribe to topics of interest. Each time a publisher publishes a message linked with that topic, Pulsar promptly sends it to all subscribers.
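To make the pattern concrete, here is a minimal, illustrative sketch of publish/subscribe in plain Python. This is an in-memory toy model, not the Pulsar client API; the topic name and messages are hypothetical.

```python
from collections import defaultdict

class ToyBroker:
    """Tiny in-memory model of the publish/subscribe pattern."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        # Consumers register interest in a topic, not in a specific publisher.
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The publisher never addresses subscribers directly; the broker
        # forwards the message to every subscriber of the topic.
        for callback in self.subscribers[topic]:
            callback(message)

broker = ToyBroker()
received = []
broker.subscribe("sensor-data", received.append)
broker.subscribe("sensor-data", lambda m: received.append(m.upper()))
broker.publish("sensor-data", "reading-42")
print(received)  # both subscribers got the message
```

The key property the sketch shows is the decoupling: the publisher only knows the topic name, never who (or how many) is listening.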

Pulsar's all-in-one platform combines publish/subscribe messaging, queues, and streams in one place, an advantage over Kafka's publish/subscribe-only system.

Available Clients

Pulsar exposes a client API with language bindings for Java, Go, Python, C++, and C#. The client API optimizes and encapsulates Pulsar's client-broker communication protocol and exposes a simple and intuitive API for applications to use.

Because Kafka is older, it has more developed client libraries. However, it’s only a matter of time before Pulsar catches up given its fast-developing community.

Additionally, if you cannot locate a client library for your preferred language, you can use Pulsar's WebSocket proxy. You can also use Pulsar Beam, a standalone service that enables applications to interact with Apache Pulsar over HTTP.

As we can see, Kafka may be a better choice depending on which language you use, although Pulsar has its workarounds.

Storage

Kafka employs a distributed commit log as its storage layer. Pulsar, in contrast, employs an index-based storage method that maintains data in a tree structure, allowing quick access when addressing individual messages.

Both Pulsar and Kafka provide long-term or permanent message storage. However, their mechanisms differ. Kafka stores data in logs shared across brokers, while Pulsar stores data in Apache BookKeeper.

Kafka's storage costs are higher than Pulsar's, because Pulsar provides tiered storage. Tiered storage lets you keep outdated and infrequently accessed data on low-cost storage alternatives.

Pulsar also allows you to add new brokers without altering or re-partitioning the data. Additionally, it addresses latency when working with massive data sets by using a quorum-based replication technique, resulting in more consistent latencies.

So, while both Pulsar and Kafka offer long-term storage, Pulsar’s approach may be more cost-effective.

Brokers

Pulsar and Kafka operate as clusters, with nodes called brokers. Brokers can act as either leaders or replicas to offer the system high availability and fault tolerance.

Kafka has many significant limitations compared to Pulsar. Most significantly, its broker-tied storage limits scalability.

Each broker in Kafka maintains a complete partition log. Brokers must synchronize data with all other brokers responsible for the same partition and their duplicates. In contrast, Pulsar keeps the state outside the brokers, separating them from the data storage layer.

Apache Pulsar's stateless brokers are a significant competitive advantage over Kafka: you can launch them rapidly and in vast numbers to meet increased demand.

Documentation and Community Support

While both solutions may accomplish your objectives, other factors like access to support and community resources are often equally or more relevant.

Kafka's community is far more extensive and active than Pulsar's because of Kafka’s popularity and longstanding presence. Therefore, many enterprises consider Kafka a more logical choice.

Yet, Pulsar has rich documentation. Plus, its community is also growing daily as more organizations migrate from Kafka and other messaging systems to Pulsar for its built-in features and improvements over Kafka.

Pulsar Use Cases

Pulsar excels in various applications, including real-time machine learning using Pulsar Functions. It offers many advantages, such as its quorum-based replication algorithm for consistent latency, tiered storage, and distinct storage and broker layers.

Pulsar Functions is an integrated, easy-to-use stream processor. It helps reduce the operational complexity of configuring and managing a standalone stream-processing engine such as Apache Heron or Apache Storm. For more complex use cases, Pulsar's architecture can still be extended with such an additional engine.
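As an illustration of how lightweight this is, a native Python Pulsar Function can be as small as a module exposing a `process` function: Pulsar invokes it for each message on the input topic and publishes the return value to the output topic. The file name is hypothetical, and deployment details (input/output topics, etc.) are supplied separately via `pulsar-admin functions create`; this is a sketch, not a full deployment recipe.

```python
# hypothetical file: uppercase_function.py
def process(input):
    """Native Pulsar Function entry point (sketch).

    Pulsar calls process() once per message consumed from the input
    topic; the return value is published to the output topic.
    Here we simply uppercase the message payload.
    """
    return str(input).upper()

# The transformation itself is plain Python and can be exercised locally:
print(process("sensor reading"))
```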

Pulsar excels in stream processing for various applications, including real-time health monitoring, financial transaction processing, Internet of Things (IoT) stream analysis, and more. Additionally, Pulsar's focus on machine learning and artificial intelligence makes it a more viable option for modern enterprises that value data as a strategic resource.

Conclusion

Pulsar is gaining traction among modern enterprises seeking a more current, innovative solution that fulfills modern requirements. Additionally, Pulsar features a Kafka-compatible API, making migration simple for developers.

Pulsar is user-friendly and feature-rich, plus it’s more scalable and faster than Kafka. It alleviates the discomfort and operational expenses associated with deploying several systems offering similar services.

Although there may be times when Kafka comes out on top, such as for built-in language support, overall Pulsar looks more suitable for most use cases.

[Video] Leveraging Pulsar's Next Gen Streaming Capabilities from a JavaEE Application

Sep 1, 2021

Your Name

Apache Pulsar Neighborhood

Url of an image to represent you (or attach to the issue)

Level of Pulsar understanding

100 (beginner)|200 (intermediate)|300 (advanced)

Content (don't worry about formatting, the site moderators can help)

For a long time, the Java Message Service (JMS) has been the API for messaging systems in the Java world, and now the messaging ecosystem is moving toward the next generation of streaming services like Apache Pulsar.

Why? Because Pulsar is free, open source, and cloud native, and it comes with new features that traditional JMS vendors do not support well.

In this session you will see how to use Pulsar from a Jakarta EE web application deployed on Apache TomEE via the JMS/EJB API, without installing any additional components in your cluster.

{{< youtube id="0NA0BIvkQrs" class="youtube" >}}

Tracking the official client's supported features

It would be nice to have a dedicated page listing the official Pulsar clients and the features they support. Not all clients support all features and it's something that could save a lot of discovery time.

Of course, someone would need to maintain it...

Pulsar: Building a High Available Messaging System - A Step by Step CookBook


Your Name

Ivan Garcia


Level of Pulsar understanding

200 (intermediate)


Pulsar: Building a High Available Messaging System

 

A step by step cookbook


✏️ Ivan Garcia / 📆 Feb 2022 / 🕑 20 min read

https://github.com/IvanGDR/Pulsar-Building-A-High-Available-Messaging-System


 

This second part of this blog series on Apache Pulsar is all about building the stack. The intention is to provide a cookbook-style guide so you can do it yourself.
It is important to highlight that I am aiming for a deployment with high-availability properties, hence the use of a ZooKeeper ensemble. According to the documentation, a single ZooKeeper node may be acceptable, since the workload pushed from Pulsar and BookKeeper is not expected to be a real constraint, but that approach means accepting a single point of failure.

 

Context

From an architectural point of view, the aim is to build a Pulsar instance with multi-cluster properties and the functions worker enabled within the brokers.

This guide will walk you step by step through deploying a Pulsar instance with one cluster, already prepared to be extended with more Pulsar clusters at a later stage.
Additionally, the whole stack is built up separately: ZooKeeper, BookKeeper, and Pulsar are each deployed and installed on their own cluster/ensemble and configured so they can interact with each other.

The procedure followed here is for a multi-cluster bare-metal deployment.
 

Resources

The open-source software to be used consists of the following binaries:

  • apache-pulsar-2.8.1-bin.tar.gz
  • bookkeeper-server-4.11.1-bin.tar.gz
  • apache-zookeeper-3.5.8-bin.tar.gz

In terms of hardware:

  • Zookeeper ensemble (3 nodes)
  • BookKeeper cluster (3 bookies)
  • Pulsar cluster (3 brokers)
     

Guide Summary

- Deploying Binaries on Each Node
  - Creating directories
  - Changing directory ownership
  - Moving tar binary (scp)
  - Untar Binary
  - Remove tar binary
- Zookeeper Configuration
  - Cluster Info
  - Setting up Local Zookeeper for Pulsar
    - Cluster Info
    - Configuring Local Zookeeper
    - Setting up zoo.cfg file for Local Zookeeper
    - Start/Stop Local Zookeeper
    - Launch Client Zookeeper for Local Zookeeper
    - Creating Znode for Local Zookeeper Metadata
  - Setting up Global Zookeeper for Pulsar (store)
    - Cluster Info
    - Configuring Global Zookeeper
    - Setting up zoo_pulsar_global.cfg file for Global Zookeeper
      - Instructions to add new Pulsar cluster Zookeeper Configuration
    - Start/Stop Global Zookeeper
    - Launch Client Zookeeper for Global Zookeeper
    - Creating Znode for Global Zookeeper Metadata
- Bookkeeper Configuration
  - Cluster Info
  - Creating a Znode for BookKeeper metadata in Zookeeper Local
  - Setting up bk_server.conf file for BookKeeper
  - Sending BookKeeper metadata to Zookeeper Local
  - Start BookKeeper
- Pulsar Configuration
  - Cluster Info
  - Setting up broker.conf file for Pulsar (brokers)
    - Enabling Functions within Brokers
  - Sending Pulsar metadata to Zookeeper (Local and Global) and Registering BookKeeper
  - Start Pulsar Broker
  - Confirming Brokers available
- Conclusion

 

1) Deploying Binaries On Each Node

The following assumes you can SSH into your remote machines. Install the binaries on each node of your ensemble/clusters: three times for ZooKeeper, three times for BookKeeper, and three times for Pulsar.

Creating directories

igdr@<ip-hostname>:/opt$ sudo mkdir <directory_name>

Do this on each of your nodes, where <ip-hostname> is the node's hostname and <directory_name> is zookeeper, bookkeeper, or pulsar.

Changing directory ownership

igdr@<ip-hostname>:/opt$ sudo chown -R igdr:igdr <directory_name>

Copying tar binary to destination nodes (scp)

Instead of downloading the binaries individually on each remote machine (using wget, for example), I downloaded the bin.tar.gz files once to my local machine and sent them to the remote machines using the scp command.

/Downloads/Project_Pulsar: scp -i /path_to/ssh_key \
file-name.bin.tar.gz igdr@<ip-hostname>:/opt/<directory_name>/

Here <directory_name> is the directory created initially.

Untar binary

igdr@<ip-hostname>:/opt/<directory_name>$ tar xvzf file-name.bin.tar.gz

Remove tar binary

igdr@<ip-hostname>:/opt/<directory_name>$ rm file-name.bin.tar.gz

2) Zookeeper Configuration

Cluster Info (Local and Global)

Nodes:
Node 0: public hostname: 101.36.207
Node 1: public hostname: 101.36.165
Node 2: public hostname: 101.36.179

2.A) Setting up Local Zookeeper for Pulsar

Configuring Local Zookeeper

Creating myid file within datadir=/opt/zookeeper/data

Notes for configuration of "myid" files within each Zookeeper node (Local)

Node 0: 101.36.207 -> insert 1
Node 1: 101.36.165 -> insert 2
Node 2: 101.36.179 -> insert 3

echo "1" > /opt/zookeeper/data/myid
echo "2" > /opt/zookeeper/data/myid
echo "3" > /opt/zookeeper/data/myid

Setting up zoo.cfg files for Local Zookeeper

Main variables, according to the Pulsar documentation:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/zookeeper/data
clientPort=2181
admin.enableServer=true
admin.serverPort=9990
#maxClientCnxns=60
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
forceSync=yes
sslQuorum=false
portUnification=false
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
metricsProvider.httpPort=8000
metricsProvider.exportJvmInfo=true
server.1=101.36.207:2888:3888
server.2=101.36.165:2888:3888
server.3=101.36.179:2888:3888

Same file content for each of the Zookeeper nodes.

Start/Stop Local Zookeeper

Start local Zookeeper
Do this for each node; this example corresponds to ZK server.1 only

igdr@ip-101-36-207:/opt/zookeeper/apache-zookeeper-3.5.8-bin$
./bin/zkServer.sh \
start /opt/zookeeper/apache-zookeeper-3.5.8-bin/conf/zoo.cfg

 
Stop local Zookeeper
Do this for each node; this example corresponds to ZK server.1 only

igdr@ip-101-36-207:/opt/zookeeper/apache-zookeeper-3.5.8-bin$
./bin/zkServer.sh \
stop /opt/zookeeper/apache-zookeeper-3.5.8-bin/conf/zoo.cfg 

Launch Client Zookeeper for Local Zookeeper

igdr@ip-101-36-207:/opt/zookeeper/apache-zookeeper-3.5.8-bin$
./bin/zkCli.sh -server 101.36.207:2181

Creating Znode for Local Zookeeper Metadata

Creating the Znode from one Zk client is enough

[zk: 101.36.207:2181(CONNECTED) 0] create /PulsarZkLocal
Created /PulsarZkLocal

Verifying Znode has been created as expected:

[zk: 101.36.207:2181(CONNECTED) 1] ls /
[PulsarZkLocal, zookeeper]

2.B) Setting up Global Zookeeper for Pulsar (store)

Configuring Global Zookeeper

Creating myid file within datadir=/opt/zookeeper/data_global

Notes for configuration of "myid" files within each Zookeeper node (Global)

Node 0: 101.36.207 -> insert 1
Node 1: 101.36.165 -> insert 2
Node 2: 101.36.179 -> insert 3

echo "1" > /opt/zookeeper/data_global/myid
echo "2" > /opt/zookeeper/data_global/myid
echo "3" > /opt/zookeeper/data_global/myid

Setting up zoo_pulsar_global.cfg file for Global Zookeeper

Main variables, according to Pulsar documentation:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/zookeeper/data_global
clientPort=2184
admin.enableServer=true
admin.serverPort=9991
#maxClientCnxns=60
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
server.1=101.36.207:2889:3889
server.2=101.36.165:2889:3889
server.3=101.36.179:2889:3889

Note 1: It is important to note that a separate ZooKeeper ensemble for the global Pulsar store would require as many machines per cluster as the local Pulsar ZooKeeper ensemble, which is not cost-effective considering the workload on the global store is very small. For that reason, the same machines host both the local and the global ensembles here.

Note 2: If another cluster is added to the Pulsar instance, the IPs of the new ZK nodes will need to be added to this file, and we may use the :observer option. For example, adding two more three-node clusters produces a nine-server list; marking one node per new region as an observer leaves seven voting members, so we can tolerate up to three voting nodes down, or an entire region being down (2 × 3 + 1 = 7). The modified file would look like this:

 

Instructions to add new Pulsar cluster Zookeeper Configuration (modified: zoo_pulsar_global.cfg file)

peerType=observer
server.1=101.36.207:2889:3889
server.2=101.36.165:2889:3889
server.3=101.36.179:2889:3889

server.4=zk-4IP-Region2:2889:3889
server.5=zk-5IP-Region2:2889:3889
server.6=zk-6IP-Region2:2889:3889:observer

server.7=zk-7IP-Region3:2889:3889
server.8=zk-8IP-Region3:2889:3889
server.9=zk-9IP-Region3:2889:3889:observer 

When adding two new clusters as shown above, the new configuration file must be the same on all ZK nodes. Following the example, the myid files for the two new local ZK ensembles (2181) will contain 1, 2, 3 in each new cluster, while for the global ZK (2184) they will contain 4, 5, 6 and 7, 8, 9 respectively.
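Only voting members count toward the ZooKeeper quorum; observers follow the ensemble without voting. The fault-tolerance arithmetic behind the note above can be sanity-checked with a quick calculation (illustrative Python; the server counts mirror the example configuration):

```python
def tolerated_failures(voting_members: int) -> int:
    """A ZooKeeper ensemble needs a strict majority of voting
    members to stay available, so an ensemble of n voters
    tolerates floor((n - 1) / 2) failures."""
    return (voting_members - 1) // 2

# 9 servers total in the example file, 2 marked as observers
total_servers = 9
observers = 2
voters = total_servers - observers
print(voters, tolerated_failures(voters))  # 7 voting members tolerate 3 down
```

This is why adding observers lets you spread the ensemble across regions without inflating the quorum size (and the write latency) along with it.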

Start/Stop Global Zookeeper

Start global Zookeeper

igdr@ip-101-36-207:/opt/zookeeper/apache-zookeeper-3.5.8-bin$
./bin/zkServer.sh \
start /opt/zookeeper/apache-zookeeper-3.5.8-bin/conf/zoo_pulsar_global.cfg

 
Stop global Zookeeper

igdr@ip-101-36-207:/opt/zookeeper/apache-zookeeper-3.5.8-bin$
./bin/zkServer.sh \
stop /opt/zookeeper/apache-zookeeper-3.5.8-bin/conf/zoo_pulsar_global.cfg

Launch Client Zookeeper for Global Zookeeper

igdr@ip-101-36-207:/opt/zookeeper/apache-zookeeper-3.5.8-bin$
./bin/zkCli.sh -server 101.36.207:2184

Creating Znode for Global Zookeeper Metadata

Creating the Znode from one Zk client is enough

[zk: 101.36.207:2184(CONNECTED) 1] create /PulsarZkGlobal
Created /PulsarZkGlobal

Verifying Znode for Global Pulsar metadata has been created:

[zk: 101.36.207:2184(CONNECTED) 2] ls /
[PulsarZkGlobal, zookeeper]

3) Bookkeeper Configuration

Cluster Info

Nodes:
Node 0: public hostname: 101.33.97
Node 1: public hostname: 101.35.236
Node 2: public hostname: 101.32.196

Creating a Znode for BookKeeper metadata in Zookeeper Local

After connecting with the ZK client to the local ZK ensemble:

[zk: 101.36.207:2181(CONNECTED) 3] create /PulsarZkBk
Created /PulsarZkBk

furthermore:

[zk: 101.36.207:2181(CONNECTED) 4] create /PulsarZkBk/ledgers
Created /PulsarZkBk/ledgers

Verifying Znode for BookKeeper metadata has been created:

[zk: 101.36.207:2181(CONNECTED) 1] ls /
[PulsarZkLocal, PulsarZkBk, zookeeper]

Setting up bk_server.conf file for BookKeeper

Main variables, according to Pulsar documentation:

bookiePort=3181
advertisedAddress=101.33.97
journalDirectories=/opt/bookkeeper/data/bk-journals
ledgerStorageClass=org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage
ledgerDirectories=/opt/bookkeeper/data/bk-ledgers
metadataServiceUri=zk+hierarchical:
//101.36.207:2181;101.36.165:2181;101.36.179:2181/PulsarZkBk/ledgers
#`zkServers` is deprecated in favor of using `metadataServiceUri`
#zkServers=localhost:2181
Important:
Note 1: metadataServiceUri points to the local ZK IPs.
Note 2: in metadataServiceUri, use ";" instead of "," to separate IP:PORT entries.

Sending BookKeeper metadata to Zookeeper Local

Just from one BookKeeper node

igdr@ip-101-33-97:/opt/bookkeeper/bookkeeper-server-4.11.1$
./bin/bookkeeper shell metaformat

The output confirming this:

INFO  BookKeeper metadata driver manager initialized
INFO  Initialize zookeeper metadata driver at metadata service uri
zk+hierarchical:
//101.36.207:2181;101.36.165:2181;101.36.179/PulsarZkBk/ledgers : 
zkServers = 101.36.207:2181,101.36.165:2181,101.36.179,
ledgersRootPath = /PulsarZkBk/ledgers.
Ledger root already exists. Are you sure to format bookkeeper metadata?
This may cause data loss. (Y or N) Y
INFO  Successfully formatted BookKeeper metadata

Additionally, check the BookKeeper Znode in one of the “local” Zookeeper servers:

[zk: 101.36.207:2181(CONNECTED) 8] ls /PulsarZkBk/ledgers
[INSTANCEID, LAYOUT, available]

Start BookKeeper

Do this for each node/hostname

igdr@ip-101-33-97:/opt/bookkeeper/bookkeeper-server-4.11.1$
./bin/bookkeeper bookie

Output received:

INFO  - [main:Main@274] - Hello, I'm your bookie, listening on port 3181.
Metadata service uri is zk+hierarchical:
//101.36.207:2181;101.36.165:2181;101.36.179/PulsarZkBk/ledgers.
Journals are in [/opt/bookkeeper/data/bk-journals].
Ledgers are stored in /opt/bookkeeper/data/bk-ledgers.
INFO  - [main:Bookie@991] - Finished reading journal, starting bookie
INFO  - [main:ComponentStarter@86] - Started component bookie-server.

4) Pulsar Configuration

Cluster Info

Nodes:
Node 0: public hostname: 101.32.178
Node 1: public hostname: 101.34.49
Node 2: public hostname: 101.34.42

Setting up broker.conf file for Pulsar (brokers)

Main variables, according to Pulsar documentation:

zookeeperServers=
101.36.207:2181,101.36.165:2181,101.36.179:2181/PulsarZkLocal
configurationStoreServers=
101.36.207:2184,101.36.165:2184,101.36.179:2184/PulsarZkGlobal
brokerServicePort=6650
brokerServicePortTls=6651
webServicePort=8080
webServicePortTls=8443
bindAddress=0.0.0.0
advertisedAddress=101.32.178
clusterName=Chinchaysuyo
bookkeeperMetadataServiceUri=zk+hierarchical:
//101.36.207:2181;101.36.165:2181;101.36.179:2181/PulsarZkBk/ledgers

Enabling Functions within Brokers

Additionally, to enable the functions worker in the brokers, set the following in the broker.conf file:

### --- Functions --- ###
# Enable Functions Worker Service in Broker
functionsWorkerEnabled=true

In functions_worker.yml file

################################
# Function package management
################################
numFunctionPackageReplicas: 2

Sending Pulsar metadata to Zookeeper (Local and Global) and Registering BookKeeper

igdr@ip-101-32-178:/opt/pulsar/apache-pulsar-2.8.1$ ./bin/pulsar \
initialize-cluster-metadata \
--cluster Chinchaysuyo \
--zookeeper 101.36.207:2181,101.36.165:2181,101.36.179:2181/PulsarZkLocal \
--configuration-store
101.36.207:2184,101.36.165:2184,101.36.179:2184/PulsarZkGlobal \ 
--existing-bk-metadata-service-uri "zk+hierarchical:
//101.36.207:2181;101.36.165:2181;101.36.179:2181/PulsarZkBk/ledgers" \
--web-service-url http:
//101.32.178:8080,101.34.49:8080,101.34.42:8080 \
--web-service-url-tls https:
//101.32.178:8443,101.34.49:8443,101.34.42:8443 \
--broker-service-url pulsar:
//101.32.178:6650,101.34.49:6650,101.34.42:6650 \
--broker-service-url-tls pulsar+ssl:
//101.32.178:6651,101.34.49:6651,101.34.42:6651

The output after execution:

INFO  Setting up cluster Chinchaysuyo with zk
=101.36.207:2181,101.36.165:2181,101.36.179:2181/PulsarZkLocal
configuration-store=
101.36.207:2184,101.36.165:2184,101.36.179:2184/PulsarZkGlobal
INFO  EventThread shut down for session: 0x10016a8eb830004
INFO  Pulsar Cluster metadata for 'Chinchaysuyo' setup correctly

Start Pulsar Broker

Do this for all the Pulsar nodes/brokers

igdr@ip-101-32-178:/opt/pulsar/apache-pulsar-2.8.1$ ./bin/pulsar broker

Output after execution:

INFO  org.apache.pulsar.broker.PulsarService - Starting Pulsar Broker service;
version: '2.8.1'
INFO  org.apache.pulsar.PulsarBrokerStarter - PulsarService started.

Confirming Brokers available:

igdr@ip-101-32-178:/opt/pulsar/apache-pulsar-2.8.1$ ./bin/pulsar-admin \
brokers list Chinchaysuyo

Output after execution:

"101.32.178:8080"
"101.34.49:8080"
"101.34.42:8080"

Conclusion

Implementing a highly available Pulsar instance is relatively easy. All the configurations shown in this guide need to be repeated on as many nodes as the Pulsar instance has, except for the metadata sent from BookKeeper to Zookeeper and from Pulsar to Zookeeper, which is only done from one node. The Pulsar cluster should then be ready to publish and consume messages and, additionally, to use I/O functions.
In the next and last blog of this series, I will analyse the logs generated when each of these components is initialised.

Pulsar: Queuing and Streaming - An All in One Messaging System


Your Name

Ivan Garcia


Level of Pulsar understanding

100 (beginner)


Pulsar: Queuing and Streaming

An All in One Messaging System


✏️ Ivan Garcia / 📆 Feb 2022 / 🕑 15 min read

https://github.com/IvanGDR/Pulsar-Queuing-and-Streaming


 

In this first installment of this blog series on Apache Pulsar, the aim is to provide an introductory overview of the topic.
I would also like to focus on what Pulsar is and how it works on its own merits, rather than comparing it against Apache Kafka. I am taking this approach because there is plenty of literature doing that already, and because, from a messaging-architecture standpoint, the two paradigms are, up to a point, different.

 

What is Apache Pulsar?

Apache Pulsar is an open-source, distributed publish-subscribe messaging system built on a robust, highly available, high-performance streaming platform. It offers message consumption, acknowledgement, and retention facilities, as well as tenant implementation options.

Tenant Implementation

At a high level, Apache Pulsar has been conceived to support multi-tenant deployments. The key pieces that allow this configuration topology are properties and namespaces.

!!! An analogy for this would be:
Pulsar Cluster(s) = Company (Regional Branches)
Propert(y/ies)    = Department(s)
Namespace(s)      = Operative action(s)
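This hierarchy is visible directly in Pulsar's topic naming scheme, {persistent|non-persistent}://tenant/namespace/topic (the tenant corresponds to the "property" in the analogy above). A small illustrative parser in plain Python — the topic name used here is hypothetical:

```python
def parse_topic(full_name: str):
    """Split a Pulsar topic name of the form
    {persistent|non-persistent}://tenant/namespace/topic
    into its hierarchy components."""
    scheme, rest = full_name.split("://", 1)
    tenant, namespace, topic = rest.split("/", 2)
    return {"persistence": scheme, "tenant": tenant,
            "namespace": namespace, "topic": topic}

parts = parse_topic("persistent://acme/billing/invoices")
print(parts["tenant"], parts["namespace"], parts["topic"])
```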

 

Fig. 1: Apache Pulsar tenant facilities.

 

Message Consumption

At a lower level, the Apache Pulsar model is based on the following components:

Producer → Topic ←→ Subscription ← Consumer

 
1. Producer: the application that sends messages to a topic; each message sent by the publisher is stored only once on a topic partition. The producer uses a routing mode to determine which internal topic (partition) a message should be published to.

 
2. Topic (partition): the place where all the messages arrive. Each topic partition (stateless) is backed by a distributed log stored in Apache BookKeeper (stateful). The topic is the core resource in Pulsar and inherits its critical administrative settings from the namespace in which it resides; these namespaces can have a local or global scope. By default, Pulsar topics are created as non-partitioned, but this can be changed to obtain the high throughput needed to manage big data. Because producers and consumers are partition-agnostic, a non-partitioned topic can be converted into a partitioned one on the fly (by modifying the broker process); it is a purely administrative task.

 
3. Subscription: determines which consumer(s) a message should be delivered to. Each topic can have multiple subscriptions, and a subscription can have one or more consumer groups. There are three types of subscription that can co-exist on the same topic:
 

  • Exclusive subscription: for streaming use cases. Only one consumer, in order to respect ordered consumption.
  • Failover subscription: for streaming use cases. Two consumers, but only one active, in order to respect ordered consumption.
  • Shared subscription: for queuing use cases. Multiple consumers can be active, allowing consumption in an unordered manner.

 
4. Consumer: the application that reads messages from Pulsar. Consumers are grouped together for consuming messages, and each consumer group can have its own way of consuming them.
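The three subscription types differ mainly in how the broker hands messages to consumers. The following toy dispatcher sketches that difference in plain Python (this is an illustrative model, not the Pulsar API, and it only covers the simple healthy-consumer case):

```python
def dispatch(messages, consumers, sub_type):
    """Toy model of how a subscription type routes messages.

    exclusive: only the single consumer receives anything (ordered).
    failover:  same as exclusive while the active consumer is healthy;
               the standby takes over only on failure (not modeled here).
    shared:    messages are spread round-robin across consumers
               (higher throughput, no per-consumer ordering guarantee).
    """
    deliveries = {c: [] for c in consumers}
    if sub_type in ("exclusive", "failover"):
        active = consumers[0]          # the single active consumer
        deliveries[active].extend(messages)
    elif sub_type == "shared":
        for i, msg in enumerate(messages):
            deliveries[consumers[i % len(consumers)]].append(msg)
    else:
        raise ValueError(sub_type)
    return deliveries

msgs = ["m1", "m2", "m3", "m4"]
print(dispatch(msgs, ["c1", "c2"], "failover"))
print(dispatch(msgs, ["c1", "c2"], "shared"))
```

Running it shows why exclusive/failover preserve ordering (one active reader) while shared trades ordering for parallel consumption.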

 

Fig. 2: Apache Pulsar Model and Message Consumption

 

Message Acknowledgement

These policies apply when there is a system failure and messages cannot be delivered in time.

Pulsar provides both cumulative acknowledgment and individual acknowledgment. In cumulative acknowledgement, any message before the acknowledged message will not be redelivered or consumed again. In individual acknowledgment, only the messages marked as acknowledged will not be redelivered in the case of failure.

For exclusive and failover subscriptions (streaming), either cumulative or individual acknowledgement can be applied. For shared subscriptions (queuing), only individual acknowledgement is applicable.
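The difference between the two acknowledgement modes can be sketched as a small model of which messages become eligible for redelivery after a failure (illustrative Python; the message IDs are hypothetical):

```python
def redelivered_after_failure(messages, ack_mode, acked):
    """Toy model: return the messages a consumer would see again.

    cumulative: 'acked' is the ID of the last cumulatively acknowledged
                message; everything up to and including it is settled.
    individual: 'acked' is a set of individually acknowledged IDs;
                only those specific messages are settled.
    """
    if ack_mode == "cumulative":
        settled = messages[: messages.index(acked) + 1]
        return [m for m in messages if m not in settled]
    if ack_mode == "individual":
        return [m for m in messages if m not in acked]
    raise ValueError(ack_mode)

msgs = ["m1", "m2", "m3", "m4", "m5"]
print(redelivered_after_failure(msgs, "cumulative", "m3"))          # m4, m5 again
print(redelivered_after_failure(msgs, "individual", {"m2", "m4"}))  # m1, m3, m5
```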
 

Fig. 3: Cumulative (above) vs Individual (below) Message Acknowledgment, blue cells will be redelivered.

 

Message Retention and Time To Live (TTL) Policies

A Pulsar cluster consists of two fundamental layers: a set of brokers as the serving layer, and a set of bookie nodes as the persistent storage layer. The brokers, as stateless components, handle the partitioned parts of topics: they store received messages in the cluster, retrieve messages from the cluster, and send them to consumers on demand. The physical storage of the messages is handled by the “bookie” nodes, which form the persistent storage of the Pulsar cluster and are managed by Apache BookKeeper. Since the broker layer and the bookie layer are separated, scaling one layer is independent of scaling the other.

Having said that, in Pulsar a message can only be deleted after all the subscriptions have consumed it (that is, the message is marked as acknowledged in their cursors). However, Pulsar also allows you to keep messages for a longer time, even after all subscriptions have consumed them, by configuring a message retention period.
Additionally, Pulsar supports TTL: a message is automatically marked as acknowledged if it is not consumed by any consumer within the configured TTL period.
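The interaction of acknowledgement, retention, and TTL can be sketched as a small decision model for a single message (illustrative Python; the timestamps and thresholds are hypothetical, in seconds):

```python
def can_delete(acked_by_all, age_seconds, retention_seconds):
    """Toy model of Pulsar's deletion rule for one message:
    a message becomes deletable only after every subscription has
    acknowledged it AND any configured retention period has elapsed."""
    return acked_by_all and age_seconds > retention_seconds

def ttl_acknowledged(age_seconds, ttl_seconds, consumed):
    """Toy model of TTL: an unconsumed message is auto-acknowledged
    once it is older than the configured TTL."""
    return consumed or age_seconds > ttl_seconds

# Acked by all subscriptions but still inside a 1-hour retention window:
print(can_delete(True, age_seconds=600, retention_seconds=3600))            # stays
# A 300 s TTL marks a 400 s old, unconsumed message as acknowledged:
print(ttl_acknowledged(age_seconds=400, ttl_seconds=300, consumed=False))   # acked
```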
 

Fig. 4: Pulsar message retention policies.

 

!!! Fig 4 interpretation
Figure 4 shows how a topic partition behaves with and without a retention policy in place.
Without a retention policy, grey cells can be deleted, as they have been acknowledged by both subscriptions A and B.
Blue cells cannot be deleted yet, as they are not acknowledged by subscription A.
Green cells cannot be deleted, as they are not acknowledged by any subscription.

With a retention policy, only grey cells can be marked as retained, for the configured time period, as both
subscriptions A and B have consumed those messages.

On this topic partition, subscription B has a TTL in place: cell M10 has been marked as acknowledged even though
it has not been consumed yet.

 

Pulsar Architecture

In order to configure a highly available Pulsar multicluster, the following is required:

  • A ZooKeeper ensemble (local ZK for Pulsar and BookKeeper metadata, global ZK for the multi-cluster configuration store)
  • A BookKeeper cluster (stateful storage layer)
  • A Pulsar broker cluster (stateless coordinator)
     

Fig. 5: Pulsar architecture stack.

 

Conclusion

In this first part of a three-part blog series, I reviewed the principal concepts of Apache Pulsar, briefly covering high- and low-level descriptions and highlighting the main benefits it offers: a robust unified messaging system, support for both streaming and queuing paradigms, multi-tenancy facilities, geo-replication, retention policies, high availability, and big-data performance, among others.
In my next blogs I will walk through the installation of a highly available Pulsar cluster, and I will conclude the series with an overview of the logs' location and analysis, as an attempt to understand how the stack is built up, so that you can troubleshoot and solve any connection problems that may happen in one or more of the components that Pulsar utilises.

[Video] Using Open Source Software to improve Streaming on the Edge

Nov 2, 2021

Your Name

Apache Pulsar Neighborhood




IoT devices are expected to number in the billions, each device enabled by one or more sensors generating data about the device and its surrounding environment. Analyzing this data, for example with machine learning (ML)/artificial intelligence (AI) models, frequently requires transferring it to the cloud. However, for many applications the latency is too high and the throughput too low. And given the amount of data being transferred and the custom solutions required, the cost would be prohibitive.

However, what if you could stream this data using open-source solutions, such as Apache Pulsar and Apache Arrow, to compute nodes optimized for IoT/Industry 4.0? You would be able to harness the power of Apache Arrow's throughput and data format, which is understood everywhere (no translation penalty) and handled “on wire,” together with Apache Pulsar's end-to-end encryption, multi-tenancy, and geo-replication. Robert Morrow, CEO and founder of SigmaX, will walk us through how they combine open-source software and optimized computing to give IIoT 20x faster throughput.

Robert is the CEO and Founder of SigmaX.ai. SigmaX.ai has a combined hardware and software approach when it comes to solving Enterprise Data Management problems at-scale. SigmaX builds and extends Apache Open Source Software so that their customers can benefit from the combined development efforts of open source without the obligation of vendor lock-in.

This was a joint event with IoTHub Meetup. You can check them out at https://www.meetup.com/IoT-Hub/

{{< youtube id="P3j1pJj4-oY" class="youtube" >}}

Incorrect link?

Describe the bug
I think that the link on the main page "Improve this page" goes to the wrong URL.
Should it go here: https://github.com/pulsar-neighborhood/pulsar-neighborhood.github.io (the main ReadMe), to the GitHub issues page (https://github.com/pulsar-neighborhood/pulsar-neighborhood.github.io/issues), or maybe here: https://github.com/pulsar-neighborhood/pulsar-neighborhood.github.io/issues/new/choose ? I am thinking the middle one (the normal GitHub issues page).

To Reproduce
Steps to reproduce the behavior:

  1. Go to the home page and scroll to the bottom
  2. Click the "Improve this page" link
  3. it will take you to https://github.com/pulsar-neighborhood/pulsar-neighborhood.github.io/blob/main/content/_index.md

Expected behavior
A clear and concise description of what you expected to happen.

  • expected to be taken to a place where I can create an issue

Screenshots
If applicable, add screenshots to help explain your problem.
none needed

Desktop (please complete the following information):
Mac/Chrome although it would be the same on all

Smartphone (please complete the following information):
same

Additional context
This is a good issue page, nice and simple but has a good reminder for the steps. I just kept filling it out, even though I am pretty sure that it wasn't needed for me to fill this much out...

Happenings in the Pulsar Neighborhood March '22

Your Name

Aaron Williams

Url of an image to represent you (or attach to the issue)

Level of Pulsar understanding

none

Content (don't worry about formatting, the site moderators can help)

Was this forwarded to you? Click here to get future copies of Happenings!

For this issue, another new milestone, a preview of the new website, 2 new committers, the community technical meetings, and new events. Plus our normal features of a Stack Overflow question and some monthly community stats.

Another New Milestone

In November, we passed 10k stars on GitHub, which was pretty amazing (currently at 10.5k). In January, we passed 6k members on Slack (currently 6,224). And as March starts, we just passed 500 contributors on GitHub. The graph below can be found here.

Screen Shot 2022-03-05 at 10 41 36 AM

As you can tell from the graph, the number of contributions from our neighbors continues to grow. In 2021, the ASF ranked us #5 in contributions. Already in 2022, we have had over 700 contributions (over 55% more than 2021) from 125 unique neighbors (a 25% increase from 2021). But these numbers only capture about 10-20% of our neighbors. They miss all of the users who have spoken at events, tweeted/retweeted messages, or simply put Apache Pulsar as one of their skills on their LinkedIn page. Each of these things helps spread the word about Apache Pulsar. The simplest thing you can do to help Pulsar grow is to amplify Pulsar on your social media platforms: star the GitHub repo, tweet blog posts, like articles on LinkedIn, and add Pulsar as one of your skills on LinkedIn.

New Website

Shhh… don’t tell anyone, but the new website beta has been released.

A group of our neighbors has been working hard to improve the Apache Pulsar home page. The complete redesign of the site should be finished soon and will correspond with the upgrade of the documentation. Go here if you would like to catch a sneak peek of the new site, and to see what is being discussed, visit the website channel of the Pulsar Slack workspace.

Community Technical Meetings

Did you know that there are Community Technical Meetings every two weeks? During these meetings, items such as PIPs (Pulsar Improvement Proposals) and other technical topics are discussed, and questions about the more technical aspects of Pulsar are asked. Many of the committers and contributors are present and give their answers/opinions in this informal setting. Everyone is welcome, and if you just want to dial in and listen, that is welcome too! But a quick reminder: per ASF rules, no final decisions are made at these meetings. Final decisions must be made on the dev@ mailing list, where they can be tracked and commented upon by the entire neighborhood. But as we all know, sometimes it is just nice to ask a question and get an answer in real time. Neighbor Michael Marshall posted his notes to the mailing list; they can be found here. If you would like to attend the meetings, they are on Tuesdays from 4:00-4:50 pm and Thursdays from 8:30-9:30 am (Pacific Time). For more information and to subscribe to the calendar, please visit the Events page on the website.

New Committers

The PMC also announced the addition of two new committers, Aloys Zhang of Tencent and Li Li of StreamNative.

Aloys has made a number of contributions to the code base, making his first contribution in March of 2020. You can check out his GitHub page here.

Li Li has made the second most contributions over the last year and made his first contribution in May 2021. You can check out his GitHub page here.

Thank you to both Aloys and Li for all of your past work and contributions to the neighborhood and we look forward to your future contributions!

Upcoming Events

Some of our neighbors have a couple of great events coming up.
March 10: FLiPN stack for Cloud Data Lakes
March 22: Event Streaming with Apache Pulsar and Kotlin
March 30: Open Source Bristol (hybrid event)

Do you speak/understand Italian, or just want to listen to native speakers talk about Pulsar? Well, you are in luck. We have the recording of Enrico Olivelli giving an introduction to Pulsar at the Java User Group, Bologna. The link to the video might change in the future, but you can always find it by going to the neighborhood YouTube page (and hit subscribe to get the latest videos and reminders when we go live).

Would you and some colleagues like to set up a Neighborhood Meetup group or maybe you have someone who you would like to hear speak at a future meetup? Let us know and we can give you some help. Visit us at our Neighborhood Meetup page or our slack channel #meetup and ask questions.

Great questions from the Apache Pulsar Stack Overflow

As you know, we have very active Slack and Stack Overflow neighborhoods. You can ask questions in both locations and get answers quickly. Slack does have two big weaknesses, though. One, it is limited in the number of messages that can be saved (about 10k), and we hit that limit about every three months. Two, it is not searchable by Google, so when you put the error message you received into Google, you won't see that the question has already been answered once or twice on Slack. So to promote our great Stack Overflow channel, we thought we would find a good question and include it here in Happenings.
Last month we pulled one from the archives, but for this month, we liked the newest one (well when this was published, it was the newest one.)

Question: How to consume only the latest message published from the topic and ignore all previously published message in Pulsar?

The question has some code examples and the expected output from the author. Go check it out. Did you know the answer? Can you improve upon it?

Stats of the Month

We have talked a lot about stats above and here we will just pull some of the stats from Feb 2022. There were over 3k conversations across all of our channels from 288 unique neighbors. There were 320 contributions from 90 neighbors, with about one third of those from people making their first contribution!

#ApachePulsar is growing

We have seen #ApachePulsar appearing in some new places. We asked Meetup to add it as a tag that you can apply to your events, and they have just created the tag. Also, we have seen #ApachePulsar appear on Peloton, so if you are part of that community, add in #ApachePulsar and ride with others from the neighborhood. If you have seen #ApachePulsar somewhere new, please let us know.

Apache Pulsar in the News

Here are some blog posts that we have found from around the web. We think that they are good, but we might not have read them all. Let us know what you have written and we will share it. Post links on our blogs-articles channel on the Apache Pulsar Slack. Or to see more, plus presentations, go here.

Pulsar in Python on Pi for Sensors - DZone IoT
Using Apache Pulsar WebSockets for Real-Time Messaging in Front-End Applications
Multi-Tenancy Systems: Apache Pulsar vs. Kafka
Podcast of Neighbor Jowanza Joseph being interviewed about his new Pulsar book.
Generating Simulated Streaming Data

Apache Pulsar Neighborhood on Social Media
Follow us on: Twitter, YouTube, Meetup, and Medium
To sign up to receive Happenings click here.

[Video] Apache Bookkeeper and Apache Zookeeper for Apache Pulsar

Oct 13, 2021

Your Name

Apache Pulsar Neighborhood

Url of an image to represent you (or attach to the issue)

Level of Pulsar understanding

100 (beginner)|200 (intermediate)|300 (advanced)

Content (don't worry about formatting, the site moderators can help)

Enrico Olivelli speaks to the Japan Apache Pulsar User Group about how Apache Pulsar uses Apache ZooKeeper and Apache BookKeeper, with a particular focus on how Pulsar guarantees consistency even in spite of common failures in distributed systems.

{{< youtube id="PWnO6YQ1HiM" class="youtube" >}}

a link on the home page to submit content

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
There should be a link on the home page to submit content. (Also, I am testing out the feature request submission button.)

Describe the solution you'd like
A clear and concise description of what you want to happen.
We want people to submit content. Even if they aren't here today to submit content, we want them to know that they can...

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
-none

Additional context
Add any other context or screenshots about the feature request here.
-none

Apache Pulsar Schema versus Schemaless — Who’s the Winner?

Your Name

Pulsar Neighborhood

Url of an image to represent you (or attach to the issue)

Level of Pulsar understanding

200

Content (don't worry about formatting, the site moderators can help)

Apache Pulsar is a cloud-native, open-source distributed streaming and messaging platform. Yahoo designed the system to fulfill the need for durability and scalability. One of its unique and critical components is the Apache Pulsar Schema.

A schema defines a data structure that a message should follow to be understood and processed across a network. Essentially, it provides a blueprint of how the message should look.

Initially, data streaming did not use a schema; raw bytes were efficient and neutral vehicles. However, developers had to overlay their own serialization mechanisms to ensure that processes on the receiving end could read and interpret the bytes fed into the system. This required creating an extra layer to monitor the bytes flowing through the system.

Another challenge that raw bytes presented was the pipeline's inability to cope with changes in data structures. If a developer slightly changed a field, the entire system started throwing errors, because the end systems' binaries store all the information related to those structures.

Pulsar Schema solved all of these problems by introducing a new sender-receiver coordination mechanism. It has since become a system of choice for many developers. Verizon, Tencent, Comcast, Overstock.com, and numerous other enterprises have adopted this method.

In this article, we’ll examine how Pulsar Schemas work and contrast them with schemaless systems to determine the best approach. We’ll also demonstrate how to use Java clients with Pulsar.

How Schemas Work

Pulsar Schemas have two endpoints: a producer side that sends data and a consumer side that receives data. The two connect through brokers that communicate with the back-end processes.

Although the producer and consumer don’t connect directly, a contract specifies that the data a producer writes should follow a schema that a consumer can read. A schema registry enforces this contract by verifying the compatibility of schemas that each forwards to the brokers. It applies and enforces Pulsar Schemas at the topic level.

A topic is just an abstraction in the form of a Uniform Resource Identifier (URI). It groups messages based on context. A topic structure looks like this:

{type}://{tenant}/{namespace}/{topic}
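
For instance, a topic named fun-topic in Pulsar's default tenant and namespace (public and default) would have the full URI below; the type is either persistent or non-persistent:

```
persistent://public/default/fun-topic
```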

How a Schema Works on the Producer Side

Before a producer sends data, it forwards the schema info to a broker. The broker checks the schema storage to verify that the received schema has been registered.

If the schema is registered, the broker sends the schema version back to the producer. If it is not registered, the broker checks whether it is allowed to automatically register the schema in that namespace. If so, the broker creates and registers the schema and returns the schema version to the producer. If not, the producer's connection to the broker is rejected.

Below is a producer-side schema verification chart:

producer-side-schema-verification

How a Schema Works on the Consumer Side

First, the application uses a schema instance to create a consumer instance that has the schema info. The consumer then forwards the schema information to the broker.

The broker checks if the topic has any of the following:

  • A schema
  • Data
  • A local consumer
  • A local producer

If the topic has none of these and isAllowAutoUpdateSchema is set to true, the broker registers the schema and the consumer connects to the broker. If isAllowAutoUpdateSchema is false, the broker rejects the connection.

If the topic already has any of the listed components, the broker performs a schema compatibility check. If the schema is compatible, the consumer is allowed to connect to the broker; if it doesn't pass the compatibility check, the broker rejects the connection.

consumer-side-schema-verification
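
As a sketch of how this flag is managed (assuming a local Pulsar installation; the tenant and namespace names below are placeholders), the namespace-level isAllowAutoUpdateSchema setting can be toggled and inspected with pulsar-admin:

```shell
# Allow brokers to auto-register new schemas for topics in this namespace
bin/pulsar-admin namespaces set-is-allow-auto-update-schema --enable my-tenant/my-namespace

# Inspect the current value of the flag
bin/pulsar-admin namespaces get-is-allow-auto-update-schema my-tenant/my-namespace
```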

When to Use Schemaless

A Pulsar instance natively supports multiple clusters. It can seamlessly geo-replicate messages across clusters and scale out to more than a million topics with low latency.

Nonetheless, schemas are not a one-size-fits-all solution. In some cases, schemaless is more efficient and sustainable.

Consider a situation wherein we want to manage documents instead of uniform data structures. If we have documents in JSON with fields that are not uniform, we want a flexible solution that can adjust accordingly.
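
As a minimal sketch of the schemaless approach (the service URL and topic name are placeholders, and a running broker is assumed), a producer can use Schema.BYTES to publish arbitrary JSON documents with no schema enforcement:

```java
import java.nio.charset.StandardCharsets;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class SchemalessProducerSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder service URL
                .build();

        // Schema.BYTES skips schema enforcement: any payload shape is accepted,
        // so non-uniform JSON documents can flow through the same topic.
        Producer<byte[]> producer = client.newProducer(Schema.BYTES)
                .topic("document-topic") // placeholder topic name
                .create();

        producer.send("{\"title\":\"doc1\",\"extraField\":42}"
                .getBytes(StandardCharsets.UTF_8));

        producer.close();
        client.close();
    }
}
```

The trade-off, as discussed below, is that the consumer must know how to deserialize these bytes on its own.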

Using Java Clients with Pulsar

We have explored how a consumer and producer communicate with a schema registry. Now, we’ll create a Java application that uses Apache Pulsar.

We’ll start by creating a Maven project that contains an Apache Pulsar client as a dependency:

<dependencies>
   <dependency>
       <groupId>org.apache.pulsar</groupId>
       <artifactId>pulsar-client</artifactId>
       <version>2.4.1</version>
   </dependency>
</dependencies>

Adding a Producer

Before we can create a producer, we must instantiate the Pulsar client using this code (note that Pulsar's binary protocol listens on port 6650 by default, while 8080 is the HTTP admin port):

PulsarClient pulsarClient = PulsarClient.builder()
       .serviceUrl("pulsar://localhost:6650")
       .build();

We can now create a producer client that is attached to a topic. We’ll use the Pulsar client we created above to initiate the new producer client:

Producer<String> producer = pulsarClient.newProducer(Schema.STRING)
       .topic("fun-topic")
       .create();

At this point, the broker will block the send() method until the schema registry verifies the schema. After the broker sends an acknowledgment, we can call the send() function like this:

producer.send("Hello there!");

After that, we close the producer using this code:

producer.close();

Adding a Consumer

We’ll use the Pulsar client to create a new consumer. In this example, we’ll create an exclusive consumer specified by subscriptionType():

Consumer<String> consumer = pulsarClient.newConsumer(Schema.STRING)
       .topic("fun-topic")
       .subscriptionName("fun-subscription")
       .subscriptionType(SubscriptionType.Exclusive)
       .subscribe();

Next, we'll call the receive() method in a while loop to fetch any produced messages on the subscribed topic:

while (true) {
   Message<String> myMessage = consumer.receive();
   try {
       // getValue() returns the typed payload (a String for Schema.STRING)
       System.out.printf("Here is the message: %s%n", myMessage.getValue());
       consumer.acknowledge(myMessage);
   } catch (Exception e) {
       consumer.negativeAcknowledge(myMessage);
   }
}

In the code above, the consumer acknowledges each message to the broker by invoking acknowledge(), and signals a processing failure (so the message can be redelivered) by invoking negativeAcknowledge().

That is all that’s required. We now have a simple Java application that communicates to Pulsar using a Pulsar client.

Pulsar Schema’s Pros and Cons

Schema Pros:

  • Pulsar stamps every piece of data passing through its system with a name and schema version. This makes all data passing through the system easily discoverable.
  • Schema provides an easy way for producers and consumers to coordinate their data structure. If the producer’s schema changes, the registry ensures it is compatible with the old consumer schemas. This approach enables us to create systems that can adapt to data structure changes without the message pipeline failing.

Schema Cons:

  • Schemas are relatively restrictive. We must know the data structure beforehand.

Schemaless Pros and Cons

Schemaless Pros:

  • Schemaless approaches are efficient and neutral to the data they transmit.
  • They enable us to create a streaming pipeline even when the data structure is unclear.

Schemaless Cons:

  • Data structures are stored locally on both ends, making it challenging to synchronize them with one another.
  • Schemaless requires significant work to change the data structure.

Conclusion

Pulsar Schema gives Pulsar's distributed data streaming platform an easy and efficient way of coordinating data producers and consumers. It can adapt to schema changes without shutting down the streaming pipeline.

Although it requires some effort to set up, Pulsar Schema provides a consistent system once complete. There are some use cases where schemaless is a more appropriate solution. In all other instances, Pulsar Schema offers a resilient, scalable data-streaming method that works across multiple clusters.

Change data capture guides

It would be nice to offer guides focused on CDC using

  • MySql
  • MsSQL
  • Postgres
  • Cassandra
  • ?? maybe some of the managed db solutions

[Video] Apache Pulsar Deep Dive- an End-to-end view of the Data Flow

Sep 20, 2021

Your Name

Apache Pulsar Neighborhood

Url of an image to represent you (or attach to the issue)

Level of Pulsar understanding

100 (beginner)|200 (intermediate)|300 (advanced)

Content (don't worry about formatting, the site moderators can help)

Enrico Olivelli gives us a tour of how data moves from the Producer to the Consumer. [Client -- to the proxy, broker, bookie and then all the way back to the client on the consumer side]

Enrico is a member of the Apache Pulsar PMC. He is also a member of the PMC for ZooKeeper and BookKeeper, a committer for Apache Maven and the PMC Chair of Apache Curator. Enrico is a Senior Software Engineer at DataStax.

{{< youtube id="oLXCCCGsrWM" class="youtube" >}}

Moving from Java Message Service (JMS) to Apache Pulsar

Your Name

Pulsar Community

Url of an image to represent you (or attach to the issue)

Level of Pulsar understanding

100

Content (don't worry about formatting, the site moderators can help)

Many mission-critical business applications rely on Java Message Service (JMS), a popular enterprise messaging API. Enterprise messaging systems provide a way to create, send, receive, and read messages in an application, and JMS defines a common way for Java programs to interact with messaging systems.

Despite its popularity, JMS may not be able to keep up with the needs of modern businesses. For example, it might not be suitable for real-time complex event processing, change data capture, cross-region message replication, and seamless horizontal scalability. Apache Pulsar is a newer message-oriented middleware (MOM) that addresses these modern business dynamics. To fill these needs, we can switch from JMS to Pulsar, or use Pulsar to fit JMS solutions to the needs of modern enterprises.

Let’s explore some advantages and challenges of moving from JMS to Pulsar.

Moving from JMS to Pulsar

Jakarta EE applications frequently use JMS, especially applications requiring distributed transactions. While not many developers are building new Jakarta EE applications now, vendors have evolved their runtimes and Jakarta EE servers by adding extensions, and enterprises continue to run their existing Jakarta EE applications. However, the technology is quickly falling behind the needs of modern data applications.

We’ll compare message consumption and change data capture (CDC) support in each technology to understand why business applications are shifting from JMS to Apache Pulsar.

Message Consumption

It's up to the provider to decide how to implement JMS and transfer data. So, we need the provider's library to consume JMS messages.

We need a web application server, and in fact an entire Jakarta EE application server, to use JMS. After deploying JMS into the Jakarta EE application server, we can use it to integrate applications. JMS, in this case, serves as a message broker.

We can use a modern open-source message broker, Apache Pulsar, as an alternative to JMS. Unlike JMS, Pulsar client libraries allow us to consume and produce messages using various programming languages, including Java, C++, Go, and Python. Also, we don’t need to buy a costly application server to use Pulsar.

Change Data Capture

Change data capture (CDC) records database changes, but it requires a messaging service to deliver change notifications to the relevant applications and systems in real time. With an event-driven architecture, CDC code treats these changes as events and sends them asynchronously.

JMS uses a message-driven bean to support asynchronous message processing. The bean acts as a JMS message listener and can only process JMS messages, not arbitrary events.

Although modern applications require a design pattern that can process changes in real-time, such as CDC, JMS doesn’t support the CDC architecture. Apache Pulsar 2.3.0 and newer versions support database CDC.

Pulsar CDC connectors integrate with Alibaba's open source Canal framework and Red Hat's Debezium CDC framework. We can use these connectors to capture log changes from common databases, like Oracle, PostgreSQL, MongoDB, MySQL, and MS SQL Server, and send them into Pulsar in real time.
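
As an illustrative sketch (the connector archive path, source name, topic, and config file below are placeholders, and a running Pulsar cluster is assumed), a Debezium MySQL source connector could be registered with pulsar-admin:

```shell
# Register a Debezium MySQL source that writes change events into a Pulsar topic
bin/pulsar-admin sources create \
  --name debezium-mysql-source \
  --archive connectors/pulsar-io-debezium-mysql.nar \
  --destination-topic-name mysql-cdc-topic \
  --source-config-file debezium-mysql-config.yaml
```

The config file carries the database connection details that Debezium needs (host, credentials, and the tables to watch).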

Modernizing JMS Systems

Since JMS systems fail to provide solutions for modern business challenges, we can use Pulsar to improve and modernize legacy JMS infrastructure. Fast JMS for Apache Pulsar is an example of a modernized JMS system.

Pulsar supports modern enterprises and provides a unified streaming and queuing design, so it is an ideal platform on which to build modern JMS infrastructure.

Real-Time Data

While JMS allows processing messages asynchronously, it uses the imperative programming paradigm. This means it fails to keep up with Pulsar’s execution speed. Its speed isn’t near real-time, and as such, it’s not suited for businesses that require high availability of real-time data.

Let’s consider a financial institution as an example. Real-time messaging is crucial for processing financial transactions. It helps alert staff to check suspicious transactions as they pass through the bank system. Also, with real-time data, we can automate anti-fraud systems to block fraudulent transactions as they occur.

Unlike JMS, Pulsar provides a stream processing framework that processes data as it enters the system. We can use the Pulsar IO API to build real-time streaming data pipelines for financial service systems.

We can create a real-time streaming data pipeline that runs new data through various processes by plugging adapters (connectors) into Pulsar using the Pulsar IO API. We can then send this modified data out of Pulsar by plugging sink adapters into the pipeline.

Challenges When Switching from JMS to Pulsar

Shifting from JMS to Pulsar involves some challenges. Let’s consider three that we might encounter: code refactoring, libraries, and architecture.

Code Refactoring

JMS is a Java API that provides a standard set of interfaces for typical enterprise messaging systems. So, we need to refactor the code when we switch from JMS to Pulsar.

Determining how to replace the JMS application programming interface (API) is the main challenge, as there’s no one-to-one mapping between JMS and Pulsar.

For example, Pulsar doesn’t have a concept of “message selectors.” Instead, we can partition messages using a “topic name.” If we used message selectors in JMS, we would need to refactor our code when migrating to Pulsar.

Luckily, Starlight for JMS helps avoid refactoring code when moving from JMS to Pulsar. Starlight for JMS is the first highly compliant JMS implementation designed to run on a modern streaming platform. It enables enterprises to take advantage of a modern streaming platform's scalability and resiliency to run their existing JMS applications.

Apache Pulsar, which powers Starlight for JMS, is open source and cloud-native. We can run it on-premises or in any cloud environment, public, private, or hybrid.

Relying on New Libraries

In the JMS client library, the client application connects to a broker and then creates a session where it can publish or consume messages. The JMS API provides a standard interface that all messaging systems supporting the API can implement.

The Pulsar Java client works slightly differently. The first step is creating a producer or consumer object that references a specific topic. The Pulsar client connects to the cluster and automatically discovers which brokers host the topics it wants to produce or consume messages from. Then it handles transparent load-balancing across those brokers.

The Apache Pulsar Java client library is now part of the Apache Pulsar codebase. We can use the JMS 2.0-compliant Pulsar Java client library to develop Java applications that communicate with Apache Pulsar.

Architecture Changes

JMS uses a broker-centric messaging system, whereas Pulsar uses a distributed architecture. In JMS, all producers and consumers connect to a single server called a broker or message queue. Brokers store messages that producers and consumers exchange. The brokers aren’t connected, so no data exchange is possible between brokers.

Pulsar doesn't have an equivalent of a JMS queue. Instead, it uses topics for message delivery. Topics can be divided into partitions, and within a given subscription, only one consumer consumes each partition.

There's no single broker or point of contact in Pulsar for producers and consumers. So, we can scale the system horizontally by adding new brokers. Producers can publish messages to any broker in the cluster, and consumers can subscribe to any topic in the cluster.

Additionally, Pulsar is generally more performant than traditional broker-based message queuing systems like JMS because of its unique architecture and tiered storage. Unlike other queuing systems that rely on a single layer of storage (typically disks), Pulsar uses tiered storage that combines disks with flash or dynamic random-access memory (DRAM). This approach allows for higher throughput and lower latencies than traditional architectures.

Final Thoughts

Pulsar offers a viable alternative to Java Message Service, especially for organizations and developers that want to do more with their messaging technology. By understanding how Pulsar compares to Java Message Service, JMS users can better prepare for the change and make the transition more easily.

Happenings in the Pulsar Neighborhood Feb '22

Title

Happenings in the Pulsar Neighborhood Feb '22

Search categories

  • Newsletter

Tags

  • Newsletter
  • Blogs

Level

100

Content

Was this forwarded to you? Click here to get future copies of Happenings!

For this issue, a new PMC member, 2 new committers, a call to help improve the quality of the code, a meetup in Italian, and a new and improved home page coming soon. Plus our normal features of a Stack Overflow question and some monthly community stats.

New PMC Member

For the second month in a row, the Apache Pulsar PMC announced a new member. Lari Hotari of DataStax was invited to join the PMC and accepted. Lari made his first contribution in April 2020, and since then he has been involved in more conversations and has made more contributions than any other Neighbor. His most recent high-profile contribution was his work on the Log4j vulnerability, where many technology press outlets picked up and reported on his tweets and GitHub testing applications. We want to thank Lari for all of his hard work and dedication to the Apache Pulsar project.

New Committers

The PMC also announced the addition of two new committers, Zhangjian He of Huawei and Haiting Jiang of DiDi.
Zhangjian made his first contribution almost 2 years ago in March of 2020. Since then, he has contributed in some manner over 500 times.

Haiting Jiang made his first contribution to Pulsar about 8 months ago and since then has been one of Pulsar’s most prolific contributors. He is also a committer to RocketMQ.

Thank you to both of you for your hard work over the last couple of years and we look forward to seeing what you contribute going forward.

Call to Action

Are you new to the Apache Pulsar Neighborhood, or have you been here for a while and are looking to learn how the internals of Apache Pulsar actually work? Well, Lari has put out a call for help working on Pulsar's flaky tests. By helping find issues with the tests, you will be digging deep into the code and gaining a great understanding of it. To learn more, read Lari's post on the dev@ mailing list.

Approaching Milestone

Slack approaching 6k

In November, we passed 10k stars on GitHub, which was pretty amazing (currently at 10,328). And as 2022 starts, we are about to pass 6k members on Slack. At the time of this writing, we are at 5,951, with an average of about 60 new members a week (a rate which is also growing), so very soon we will pass 6,000. If you are not on our Slack, you are missing out. Last week we had 592 messages, up over 20% from the previous week. To join the conversations, go to the Pulsar Contact page and follow the link to the Pulsar Slack workspace. While there, you can also sign up for the official dev@ and user@ Pulsar mailing lists.

New and Improved Website Coming Soon

A group of our neighbors has been working hard to improve the Apache Pulsar home page. The complete redesign of the site should be completed soon and will correspond with the upgrade of the documentation. If you would like to catch a sneak peek of the new site and what is being discussed, visit the website channel of the Pulsar Slack workspace.

Upcoming events…

February has been a little quiet for our meetups, with only one scheduled at the moment. The Bologna Java Users Group is going to host some of our Italian neighbors on 24 Feb. More details will be posted on our Neighborhood Meetup page.

The quietness does make for a great opportunity to go over to our YouTube channel and check out the events that have already happened. Also, we have playlists of our neighbors at other events. To be the first to learn when we post our next event, go to our Meetup page and follow it.

Would you and some colleagues like to set up a Neighborhood Meetup group or maybe you have someone who you would like to hear speak at a future meetup? Let us know and we can give you some help. Visit us at our Neighborhood Meetup page or our slack channel #meetup and ask questions.

Great questions from the Apache Pulsar Stack Overflow

As you know, we have very active Slack and Stack Overflow neighborhoods. You can ask questions in both places and get answers quickly. Slack does have two big weaknesses. First, it is limited to about 10k saved messages, and we hit that limit about every three months. Second, it is not searchable by Google, so when you put the error message you received into Google, you won't see that the question has already been answered once or twice on Slack. To promote our great Stack Overflow channel, we thought we would find a good question and include it here in Happenings.

For this month's question, we thought that we would go back and highlight an older question.
Question: Let's say that one has a Pulsar producer for a persistent topic topic1 (namespace and tenant are not relevant to the question). And let's say we have multiple consumers for the same topic (topic1) with different subscription names. Is it possible to configure the consumers to receive the same message? So, for example, if message msg1 is sent to the topic, both consumer1 and consumer2 receive this message? Both the consumers and the producer are written in Java, but the programming language is not important.

It has two good answers already, plus a link to a blog that goes into more detail, so it is a great way to learn a little bit more about Pulsar. While there, take a moment and see if you can answer one of the open questions.
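For intuition, the behavior being asked about (each subscription receives its own copy of every message) can be sketched with a plain in-memory stand-in; this is an illustration only, not the Pulsar client API:

```python
from collections import defaultdict, deque

class Topic:
    """Toy stand-in for a Pulsar topic: every subscription keeps its own
    queue of pending messages, so consumers on different subscriptions
    each receive every published message."""
    def __init__(self):
        self.subscriptions = defaultdict(deque)  # subscription name -> pending messages

    def subscribe(self, name):
        return self.subscriptions[name]  # creates the subscription if missing

    def publish(self, msg):
        # Fan the message out to every subscription.
        for queue in self.subscriptions.values():
            queue.append(msg)

topic1 = Topic()
consumer1 = topic1.subscribe("subscription-1")
consumer2 = topic1.subscribe("subscription-2")

topic1.publish("msg1")

# Both subscriptions see the same message.
print(consumer1.popleft(), consumer2.popleft())  # msg1 msg1
```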

Stats of the Month

For January, we had 81 contributors making 392 contributions, with 21 of those contributors making their first contribution. We also had just under 3k conversations from 293 different people. This is about 10% more interaction than in December and more than double the engagement of January 2021, which just shows how much the community is growing. Plus, we now have over 5,900 members on our Slack channel!

#ApachePulsar is growing

We have seen #ApachePulsar appearing in some new places. We asked Meetup to add it as a tag that you can attach to your events, and they just created the tag. Also, we have seen #ApachePulsar appear on Peloton, so if you are part of that community, add #ApachePulsar and ride with others from the neighborhood. If you have seen #ApachePulsar somewhere new, please let us know.

Apache Pulsar in the News

Here are some blog posts that we have found from around the web. We think that they are good, but we might not have read them all. Let us know what you have written and we will share it. Post links on our blogs-articles channel on the Apache Pulsar Slack. Or to see more, plus presentations, go here.

Proven: Starlight for JMS Can Send One Million Messages Per Second
Using Apache NiFi with Apache Pulsar For Fast Data On-Ramp
Kafka or Pulsar? A Battle of the Giants Concerning Streaming
Taager’s Foray in Messaging Part 1; Apache Kafka vs. Apache Pulsar
Auto-Scaling Pulsar Functions in Kubernetes Using Custom Metrics

Apache Pulsar Neighborhood on Social Media

Follow us on: Twitter, YouTube, Meetup, and Medium
To sign up to receive Happenings click here.

[Video] Internals of Stream Processing

Dec 30, 2021

Your Name

Apache Pulsar Neighborhood

Url of an image to represent you (or attach to the issue)

Level of Pulsar understanding

200

Content (don't worry about formatting, the site moderators can help)

Stream Processing Systems (SPSs) are an integral part of modern data-intensive companies. In a world where streams are becoming king, they are commonly employed for much more than data analytics. Yet most developers only use them and have never dived deep into the internals of the system.

This was true for Pedro Silvestre, who with colleagues at TU Delft started to create Clonos, a novel local-recovery and high-availability approach for SPSs.

" I knew very little about the internals of SPSs. But now, I needed to modify one quite heavily and that would require a good mental model of their design. I understood what SPSs did from a theoretical standpoint and I had even used them extensively in previous projects, but I had no clue about their internal design. I scoured the internet for resources, and while I was able to piece together a little bit of what was going on, in the end I would need to spend weeks looking through the code of an SPS before feeling confident enough to start designing Clonos."[1]

From this starting point, he needed to dig deep into the code to get the understanding needed to finish Clonos. Pedro will walk us through what he learned about SPSs and his recommendations for improving them.

Pedro F. Silvestre is a PhD student with the Large-Scale Data & Systems (LSDS) Group at Imperial College London. His research focuses on the interplay between Dataflow Systems and novel Deep Learning Use Cases. Previously he was a Research Engineer at TU Delft's Web Information Systems Group working on Clonos.

[1] https://www.doc.ic.ac.uk/~pms20/post/stream-processing-thread-model/

{{< youtube id="3suR1CV-zOE" class="youtube" >}}

Enrico Olivelli

Your name

Enrico Olivelli

Url to the image that represents you

Enrico_Pic

Bio

Enrico is a PMC member of Apache Pulsar, Sr. Software Engineer at DataStax, PMC for Apache ZooKeeper & BookKeeper projects, Committer for Apache Maven, and PMC Chair of Apache Curator. Enrico has done multiple talks at Neighborhood Meetups, plus outside talks at various JUGs.

(all the below are optional)

GitHub URL

https://github.com/eolivelli

Twitter URL

https://twitter.com/eolivelli

Aaron Williams

Your name

Aaron Williams

Url to the image that represents you

AaronPic-1

Bio

Aaron is a developer advocate and community manager focusing on expanding and growing the Apache Pulsar community. Aaron held a similar position for LF Edge (the Linux Foundation's Edge umbrella project) and previously was the Global Director for SAP's internal community and makerspace called the d-shop, overseeing the volunteers and staff of over 30 d-shops around the world.

(all the below are optional)

GitHub URL

https://github.com/aarondonwilliams

Twitter URL

https://twitter.com/aarondonw

LinkedIn URL

https://www.linkedin.com/in/aaron-don-williams/

Add a calendar of upcoming events to the site

It would be nice to have a searchable calendar that lists all Pulsar-related events. Ideally the list of events is maintained as markdown, using the folder structure to represent years and months.

/content
  /2021
    /January
      05.md
      10.md
    /February
      15.md
    /March
  /2022
    /January
      02.md
      30.md
... (you get the point)

Log4Shell - Security Update

Log4Shell - Security Update

Your Name

Pulsar Neighborhood

Url of an image to represent you (or attach to the issue)

Level of Pulsar understanding

100

Content (don't worry about formatting, the site moderators can help)

Log4Shell - Security Update

image

Image Credit: LunaSec

*There is a blog post on the Apache Pulsar website that has the latest instructions.*

Within the last 10 hours (current time 10:00 am Pacific, 10 December 21), a severe zero-day exploit has been found in the Java library Log4j that, when used, results in Remote Code Execution (RCE). This has been given the identifier CVE-2021-44228 (LunaSec has begun calling it Log4Shell). A detailed write-up of the issue can be found on the LunaSec site.
This affects all Log4j releases from 2.0 through 2.14.1, and therefore affects all Apache Pulsar versions, since we use an affected Log4j version.

That is the bad news; the good news is that since the Apache Pulsar Neighborhood is made up of residents from around the world, a workaround was quickly created, and soon after that, a fix. The fix will be in all upcoming releases (2.7.4, 2.8.2, and 2.9.1). In the meantime, Pulsar Neighbor and Apache Pulsar Committer Lari Hotari has posted instructions (and a second post about Helm and Docker) on the dev@ mailing list to mitigate this problem. We have copied parts of the email below, but recommend that you follow the link to his post (and subscribe to the dev@ mailing list) to see if there are any other updates.

We have not heard of any exploits affecting Apache Pulsar, but we highly recommend that you follow the instructions above and update your systems and then install the latest versions of Apache Pulsar once they are released.

By the way, a little side note on how fast all of this was completed. At time t=0 (about 10 pm EST), Log4j released 2.15.0 and announced the vulnerability. Neighbor and Apache Pulsar Committer ZhangJian He had created a PR for the latest version about 2 hours later. It was soon reviewed, with suggestions made by other Neighbors in Japan, China, Finland, Italy, and the US. By t=+7 hours, workarounds were created and the email was sent out. About this time, the vulnerability was given its number.

To everyone who helped with this and for doing it so quickly, on behalf of all your Neighbors, a big THANK YOU for your hard work!

For everyone running Apache Pulsar, please update your systems. And if you find a security issue, please let us know by emailing private{a}pulsar.apache.org or security{a}apache.org.

From the email:

This [..] affects all Pulsar versions after 2.0.0-incubating since a vulnerable Log4J version is used. I'm not aware of a confirmed exploit for Pulsar. The fix to Pulsar is to upgrade to Log4J 2.15.0. The PR is https://github.com/apache/pulsar/pull/13226. The fix will be released as part of Pulsar 2.8.2, 2.7.4 and 2.9.1. Before the fixed version is available, there's an immediate workaround to mitigate the security issue. I'd like to share mitigation instructions for this severe vulnerability:

- Add the -Dlog4j2.formatMsgNoLookups=true system property to the JVM arguments of all Pulsar processes. There are multiple ways to achieve this in Pulsar. It can be added to either the OPTS, PULSAR_GC or PULSAR_MEM environment variables.

- Upgrade to Pulsar 2.8.2, 2.7.4 or 2.9.1 once they are available.

There's a PR to handle the adding of the -Dlog4j2.formatMsgNoLookups=true system property in the Apache Pulsar Helm chart: https://github.com/apache/pulsar-helm-chart/pull/186. Until that is available, the recommended approach is to add "-Dlog4j2.formatMsgNoLookups=true" to OPTS, PULSAR_GC or PULSAR_MEM manually and ensure that the Java process picks up the system property. It's also necessary to check that the property doesn't have typos. The setting is case sensitive.
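As a concrete illustration of the first step, here is one way to append the flag via the OPTS environment variable in a shell startup script (a sketch; how your deployment sets JVM arguments may differ):

```shell
# Append the Log4Shell mitigation flag to OPTS before starting Pulsar.
# ${OPTS:+$OPTS } keeps any existing value and avoids a stray leading space.
export OPTS="${OPTS:+$OPTS }-Dlog4j2.formatMsgNoLookups=true"
echo "$OPTS"
```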

Apache Pulsar Neighborhood on Social Media
Follow us on: Twitter, YouTube, Meetup, and Medium
To sign up to receive Happenings click here.

Understanding The Differences Between Message Queues And Streaming

Your Name

Pulsar Neighborhood

Url of an image to represent you (or attach to the issue)

Ux-Design

Level of Pulsar understanding

100

Content (don't worry about formatting, the site moderators can help)

Almost any application that requires real-time or near-real-time data processing benefits from having a message queue or streaming data processing component in its architecture. Online food ordering apps, e-commerce sites, media streaming services, and online gaming are straightforward examples. But weather apps, smart cars, health status apps with smartwatch technology, and anything Internet of Things (IoT) typically rely on a message queue or streaming engine as well.

While message queues and streaming apply to similar use cases and use similar technologies, on a technical level they’re entirely different. We’ll compare them here and examine the pros and cons of each solution, touching on message brokers, publisher-subscriber (pub/sub) architecture, and event-driven scenarios.

We’ll touch on some use cases to highlight why sometimes one approach is better than the other. Finally, we’ll discuss how the open-source Apache Pulsar platform supports both message queues and streams, with a few subtle differences.

Differences Between Message Queues and Streaming

Let’s start by exploring the major differences between message queues and event streaming.

Message Queues

Message queues transport messages between application components, across applications, or across services in traditional monolith applications, containers, or microservices. Any online transaction processing (OLTP) is a good candidate for message queues.

Think of a message queue as a sequential list of data blocks waiting to be processed.
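In code terms, that mental model is just a FIFO buffer. A minimal sketch with Python's `collections.deque` (illustrative only, not any particular broker's API):

```python
from collections import deque

queue = deque()  # the sequential list of data blocks waiting to be processed

# Producer side: messages wait in arrival order.
for msg in ("order-1", "order-2", "order-3"):
    queue.append(msg)

# Consumer side: process strictly first-in, first-out.
processed = []
while queue:
    processed.append(queue.popleft())

print(processed)  # ['order-1', 'order-2', 'order-3']
```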

While queues are a fantastic way to send data across different application or service components, they also have some challenges. For example, they introduce latency: a message sits in the queue until a consumer picks it up.

Reliability may also be an issue if the message queue's unavailability affects the application's stability, and it further depends on how the application handles failed messages. Intelligence can be a challenge too, depending on how developers recognize whether the broker has already picked up or processed a message.

For example, application source A sends a message to a message broker. In turn, application target B picks up and processes the message from the message broker.

Message queues

Streaming

Stream processing typically involves a more significant stream of data events that have already occurred. The events go to a message bus, where the streaming service picks them up.

Streaming

Any workload generating a large flow of data (a *stream*) that needs to be processed in real time is well-suited for stream processing. A stream is an infinite sequence of messages that are generated and sent continuously.

Stream processing architectures do have their challenges. The first is performance, as the application must handle the load of incoming data streams. Other challenges include order and logic, as we have to determine how our application should process the input stream.

Finally, stream processing is real-time. While this may be what we want in most cases, we might also have a batch-processing requirement.

Message Brokers

Mapping both technologies with each other immediately introduces another component: the message broker or message bus. A message broker is an interface between the message’s originator — a producer or publisher — and the destination handling system — a receiver or consumer, sometimes called a subscriber. It handles the message queue.

However, message brokers can combine several queues, providing scalability and high availability. Apache Kafka is one example of a message broker system.

Brokers are sometimes considered the more intelligent part of the solution stack. They're typically responsible for message persistence and replication. So, if a message queue fails, the broker recognizes this and sends the incoming flow of messages to another queue. Since the message broker manages the communication between the producer and the consumer, neither component experiences downtime or interruptions in message handling.

To address one of the message queue challenges, the broker can also recognize the order in which messages arrive and decide how to process them.

Message Queue Versus Streaming Architecture

Performance is critical in a modern microservices architecture. So, we need to make sure we’re choosing an architecture that benefits us the most.

A message queue is asynchronous since messages move into a queue, waiting to be picked up. The receiving component may need to poll the message queue to find out if there are any new messages.

In contrast, an application should process a continuous stream of messages as they’re generated, using an active, ongoing process. Event-driven processing or event-based architecture often accomplishes this.

The magic keyword in event-based architecture is “trigger.” Whenever some event occurs, another process kicks off.

This trigger could move something to a queue as a stand-alone activity, like saving a camera image to storage or validating credit card details with the credit card company. However, event-based architecture can also work on a more significant stream, like checking for a robbery across hundreds of surveillance cameras writing to storage, or validating thousands of payment transactions to detect fraud. In these cases, the architecture moves data to a stream and performs real-time analytics.
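The "trigger" idea can be sketched as nothing more than callbacks registered against an event name. A toy Python illustration (the event names and handlers are hypothetical, not a real event bus):

```python
# Minimal event-driven sketch: registered handlers fire whenever their event occurs.
handlers = {}

def on(event, handler):
    handlers.setdefault(event, []).append(handler)

def emit(event, payload):
    for handler in handlers.get(event, []):
        handler(payload)

actions = []
# A stand-alone activity: store a camera image when one arrives.
on("camera.image", lambda img: actions.append(f"stored:{img}"))
# Another trigger: validate a payment; in a larger system this could feed a stream.
on("payment", lambda tx: actions.append(f"validated:{tx}"))

emit("camera.image", "frame-042")
emit("payment", "tx-9001")
print(actions)  # ['stored:frame-042', 'validated:tx-9001']
```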

Message queues and streaming are both valid solutions for event-based architectures. Deciding which is best depends on the nature of the application workload and solution and the foreseeable outcome. For example, it would be acceptable to use a message queue for a stand-alone credit card validation. However, it wouldn’t be a viable architecture for payment transaction fraud detection.

Let’s compare the benefits of message queues and stream processing.

| Message queue | Stream |
| --- | --- |
| Controls data volumes | Handles real-time data generation |
| Enables batch processing | Allows real-time analytics |
| Routing logic based on message brokers | Multiple subscribers to control message flow traffic |
| Asynchronous data processing | Synchronous, continuous data flow |

Message Queue and Streaming Solutions

There are many message queue and streaming solutions available. Let’s take a closer look at Apache Pulsar as an example.

Yahoo originally developed Apache Pulsar to enable various data flows within their cloud environment. Now, it’s open-sourced through the Apache Software Foundation.

Developers find it to be a robust and scalable messaging and streaming platform. We can deploy Pulsar on bare metal as physical or virtual machines, run it inside Docker containers, or scale it within Kubernetes clusters, depending on the organization or workload application’s needs.

Pulsar’s core is the publisher-subscriber (pub/sub) architecture. Producers create messages and publish them to topics. Consumers subscribe to topics to recognize the specific messages they must handle.

Message queue and streaming solutions

Each Pulsar instance contains one or more Pulsar clusters. The Pulsar clusters are message brokers, delivering messages from the producers to the consumers. Pulsar can replicate data across clusters to optimize performance and scale.

Apart from the message brokers, Pulsar also relies on Apache BookKeeper as a temporary message store and Apache ZooKeeper to orchestrate and coordinate across Pulsar clusters. Pulsar’s architecture provides everything we need to handle traditional message queues or resource-intensive continuous streams.

Let’s say our application needs direct communication between the producer and consumer. A message queue can manage the messages that the application can’t process immediately. Once the consumer acknowledges the message, it is removed from the queue. Payment transactions, food ordering apps, and e-commerce benefit from a Pulsar message queue architecture.
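That acknowledgment contract — a message is only removed once the consumer acks it, and can be redelivered otherwise — can be sketched in a few lines of in-memory Python (a simplified stand-in for a broker, not the Pulsar API):

```python
from collections import deque

class AckQueue:
    """Messages stay 'in flight' until acknowledged; unacked ones can be redelivered."""
    def __init__(self):
        self.pending = deque()
        self.in_flight = {}
        self.next_id = 0

    def publish(self, msg):
        self.pending.append((self.next_id, msg))
        self.next_id += 1

    def receive(self):
        msg_id, msg = self.pending.popleft()
        self.in_flight[msg_id] = msg   # leased to a consumer, not removed yet
        return msg_id, msg

    def ack(self, msg_id):
        del self.in_flight[msg_id]     # acknowledged: now truly gone

    def redeliver(self):
        # e.g. on consumer timeout: put unacked messages back on the queue
        for msg_id, msg in sorted(self.in_flight.items()):
            self.pending.append((msg_id, msg))
        self.in_flight.clear()

q = AckQueue()
q.publish("payment-1")
msg_id, msg = q.receive()
q.redeliver()              # consumer crashed before acking...
msg_id, msg = q.receive()  # ...so the message comes back
q.ack(msg_id)              # acknowledged: removed for good
print(len(q.pending), len(q.in_flight))  # 0 0
```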

Or, perhaps our application workload involves producers generating a vast amount of data that we need to analyze in real time, like IoT or payment fraud detection. Pulsar can integrate this streaming analytics processing as well. Pulsar's stream processing unit, Pulsar Functions, takes ownership of the analysis and handles the message routing between the producer and consumer.

In the past, developers needed to use different solutions for messaging and streams. Pulsar’s most significant benefit is combining traditional message queueing with stream processing within the same architecture. This ability simplifies the architecture, reducing the burden of maintaining skills across different platforms and solutions.

Conclusion

E-commerce order processing benefits from message queueing, while streaming offers real-time data analytics on a continuous influx of data, such as IoT or global payment transactions. Each solution has its challenges and benefits.

Your decision on which to use depends on your specific needs. If your application workload needs queuing and streaming, you may consider a highly-available, scalable, robust all-in-one solution like Pulsar.

Now that you understand the differences between these seemingly similar solutions, you're better equipped to decide which is best for your application.

Happenings in the AP Neighborhood Jan. ‘22

Title

Happenings in the Pulsar Neighborhood Jan. ‘22

Search categories

  • Newsletter

Tags

  • Newsletter
  • Blog

Level

100

Content

Was this forwarded to you? Click here to get future copies of Happenings!

Happy New Year Everyone!

For this issue, we have some end of the year stats, Log4j updates, a new PMC member, and lots of talks. Plus our normal features of a Stack Overflow question and some monthly community stats.

Apache Pulsar is #5 in Commits

The Apache Software Foundation (ASF) released their “Apache by the digits” report, and among the interesting details about the ASF is the stat that Apache Pulsar is #5 in commits. We wrote a short blog on the Neighborhood’s Medium page; check it out here and learn how we grew so much in 2021.

pieChart

Source: Apache by the digits

Log4j Update

The Log4Shell exploit is shaping up to be a very large issue going forward because of how widely used Log4j is. The exploit has even made it into the popular press and will keep popping up over the next year as companies realize that they didn’t update their systems and are exploited. That is the bad news. The good news is how quickly the Apache Pulsar community reacted. We wrote a quick post that captured those first hours here. And the quick actions by our neighbors were noticed by the wider open source community. Specifically, Lari Hotari was mentioned in a Sonatype article (and in many other places) talking about his tester and Docker updates.

The official mitigation actions that you need to take are located on the Apache Pulsar website, although the best way to keep up with this and other happenings is to sign up for the dev@ mailing list.

Year End Stats

The Apache Pulsar Neighborhood grew a lot this year. We had over 3.3k contributions from 351 different neighbors. For comparison, in 2020 we had 1,415 contributions (an increase of about 140%) from 212 unique contributors. For conversations in 2021, we had over 24k across GitHub, Slack, and Stack Overflow from 1.2k unique people. In 2020, we had 9,256 conversations from 665 speakers. As you can tell, the neighborhood more than doubled in 2021!
And a fun final little stat that you probably already heard about: on 29 November, we passed 10k stars on GitHub. We wrote a short blog about it in November. We are currently at 10,200, so if you haven’t starred the Pulsar GitHub repo, please take a moment and do so.

New PMC Member

The PMC announced the addition of a new member in December: Liu Yu from StreamNative. If you don’t recognize the name Liu Yu, you probably recognize Anonymitaet, who has made many contributions, especially around the documentation of the project. She brings the number of neighbors elevated to the PMC this year to four. The others were Hang Chen from StreamNative, Lin Lin from Tencent, and Enrico Olivelli from DataStax.

Upcoming Events

Our first event of 2022 will be with Rob Morrow of SigmaX on 12 Jan 2022 at the NorCal Neighborhood Meetup. The talk, titled “Using Open Source Software to Improve Streaming on the Edge,” covers how to use Apache Pulsar and Apache Arrow to get sensor data from the billions of IoT devices into an IoT gateway, because going to the cloud is too slow and too costly.

Do you live in the Princeton, NJ area? The NYC Apache Pulsar Meetup is hosting a “Pulsar, Pizza, and Phun” event on 13 Jan 2022. Yes, this is an in-person event!

15-16 Jan ’22: Pulsar Summit Asia. Note that this event is now virtual only.
More events are on the way in February and March.

If you missed any of the previous events, such as Pedro Silvestre of Imperial College London or Jowanza Joseph of Finicity, we just added both of their talks to the Neighborhood YouTube Channel. There you will find the rest of the talks that we recorded in 2021, plus links to the ApacheCon talks, and some links to other talks.

Would you and some colleagues like to set up a Neighborhood Meetup group, or is there someone you would like to hear speak at a future meetup? Let us know and we can give you some help. Visit us at our Neighborhood Meetup page or ask questions in our #meetup Slack channel.

Great questions from the Apache Pulsar Stack Overflow

As you know, we have very active Slack and Stack Overflow neighborhoods. You can ask questions in both places and get answers quickly. Slack does have two big weaknesses. First, it is limited to about 10k saved messages, and we hit that limit about every three months. Second, it is not searchable by Google, so when you put the error message you received into Google, you won't see that the question has already been answered once or twice on Slack. To promote our great Stack Overflow channel, we thought we would find a good question and include it here in Happenings.

Question: I am curious about how compression works in Pulsar. From the public doc, it states “You can compress messages published by producers during transportation.” Does it mean the client compresses the data and the data gets decompressed when it arrives at the broker, so the decompressed data is persisted and consumed later? Or does the compression happen end-to-end, with decompression happening at the consumer side?

Follow the link above to get the answer. Can you expand on the answer given?
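As general background on what "end-to-end" compression means in this context (a plain-Python illustration with zlib, not the Pulsar client; see the linked answer for the authoritative details):

```python
import zlib

payload = b'{"event": "order-created", "qty": 3}' * 10

# Producer side: compress the payload before handing it to the broker.
wire_bytes = zlib.compress(payload)

# In the end-to-end model, the broker persists wire_bytes as-is;
# it never decompresses the payload itself.
stored = wire_bytes

# Consumer side: decompress on receipt.
received = zlib.decompress(stored)

assert received == payload
print(len(payload), "->", len(wire_bytes), "bytes on the wire")
```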

Stats of the Month

For December, we had 91 contributors making 411 contributions, with 19 of those contributors making their first contribution. We also had just under 3k conversations from 277 different people. This is about 50% more interaction than in November, and given all of the holidays in December, it just shows how much the community is growing. Plus, we now have over 5,700 members on our Slack channel!

#ApachePulsar is Growing

We have seen the tag #ApachePulsar appearing in some new places. We asked Meetup to add it as a tag that you can attach to your events a couple of months ago, and they just added it. Also, we have seen #ApachePulsar appear on Peloton, so if you are part of that community, add #ApachePulsar and ride with others from the neighborhood.

Apache Pulsar in the News

Here are some blog posts that we have found from around the web. We think that they are good, but we might not have read them all. Let us know what you have written and we will share it. Post links on our blogs-articles channel on the Apache Pulsar Slack. Or to see more, plus presentations, go here.

Apache Pulsar Performance Testing with NoSQL Bench
Vulnerability Affecting Multiple Log4j Versions Permits RCE Exploit (Neighbor Lari Hotari mentioned)
Distributed Locks With Apache Pulsar
Announcing Memgraph 2.1
Infinite Scale without Fail
Apache BookKeeper Observability — Part 1 of 5

Apache Pulsar Neighborhood on Social Media

Follow us on: Twitter, YouTube, Meetup, and Medium
To sign up to receive Happenings click here.

Pedro Silvestre

Your name

Pedro Silvestre

Url to the image that represents you

PedroS

(all the below are optional)

Bio

Pedro is a PhD student with the Large-Scale Data & Systems (LSDS) Group at Imperial College London, where his research focuses on the interplay between Dataflow Systems and novel Deep Learning Use Cases. Pedro wrote a blog post titled "On the Internals of Stream Processing," which was the starting point for his talk at our meetup.

GitHub URL

https://github.com/PSilvestre

Twitter URL

https://twitter.com/pmfsilvestre

LinkedIn URL

https://www.linkedin.com/in/pedro-silvestre/

Imperial College London's Bio Page

https://www.doc.ic.ac.uk/~pms20/

Markdown examples folder

Create a folder of markdown files showing how to do different styling/layout in articles & guides
