Don’t be afraid of the Storm (Part 3)

This is the third part of the series on learning Apache Storm, and it covers tuple grouping. Here are the links to Part 1, which has the introduction, and Part 2, which has the code example.


In Part 2, we created a topology that reads a stream of sentences and counts words. This post covers the challenges of running multiple bolts of the same type and how to overcome them.

Let’s first discuss what a parallelism hint is and why we need it.

A parallelism hint is an Apache Storm concept. It specifies how many instances of a particular spout/bolt should be running in parallel. For example, a parallelism hint of 2 tells Storm that 2 instances of that particular spout/bolt should run in parallel in the cluster.
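As a minimal sketch (the component names are assumed, mirroring our example topology; kafkaSpout is built elsewhere), the parallelism hint is the last argument of setSpout/setBolt:

TopologyBuilder builder = new TopologyBuilder();
// The last argument is the parallelism hint, i.e. how many instances
// of that component Storm should run in parallel.
builder.setSpout("kafka-spout", kafkaSpout, 1);
builder.setBolt("sentence-splitter", new SentenceSplitterBolt(), 2)
       .shuffleGrouping("kafka-spout");
builder.setBolt("word-counter", new WordCounterBolt(), 2)
       .shuffleGrouping("sentence-splitter");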

One great thing about Storm is that you can change the number of instances at runtime; you don’t have to redeploy your topology to change it.
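For example, the storm command line can rebalance a running topology (the topology and component names below are just illustrative):

storm rebalance word-count-topology -e word-counter=4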

The next question is: why do we need it?

It is needed when our system cannot keep up with the incoming load, i.e. the incoming load is higher than what we can process. To solve this problem, we first need to find the “bottleneck”. By “bottleneck” I mean the particular spout/bolt(s) that cannot keep up with the incoming load. To increase performance, we then have to increase the number of instances of the bottleneck spout/bolt(s).

Now let’s dig into our example and learn more about parallelism in Storm.

Our topology consists of the following spout and bolts:

  1. kafka-spout
  2. sentence-splitter bolt
  3. word-counter bolt
  4. output-bolt

We will try to increase the parallelism hint of each spout/bolt and analyse its effect. Along the way, I will introduce the concepts of tuple grouping and bolt streams.


Increasing instances of spout

Sometimes the spout becomes the bottleneck, i.e. our bolts can process the load easily but they are not getting enough data to achieve our performance goal.

Our first instinct will be to increase the parallelism hint of the spout. But don’t be surprised if you find that your performance remains the same.

This is because the spout depends on an external entity for its input: if your source can feed only one spout, then increasing the number of spout instances to 10 or 20 won’t make a difference.

Let’s take Kafka as the input source. Each spout instance of our topology connects to a partition of the Kafka topic. So if our Kafka topic has only one partition, then running multiple spout instances won’t increase our performance. Our maximum spout parallelism therefore depends on the number of Kafka partitions.
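In other words (a sketch, assuming the spout is built elsewhere and the topic’s partition count is known), there is no point setting the spout’s parallelism hint higher than the number of partitions:

int partitionCount = 3; // assumed number of partitions of the input topic
// More than 3 spout instances would leave the extras with no partition to read.
builder.setSpout("kafka-spout", kafkaSpout, partitionCount);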

So whether adding spout instances improves performance depends on your input source.


Increasing instances of bolt

Increasing the instances of a bolt raises different challenges. Bolts are where we do our processing, and our use cases may not fit neatly with the way Storm distributes tuples by default. Let’s take the example of our topology.

sentence-splitter bolt

We will face no problem increasing the number of instances of the “sentence-splitter” bolt, because it just splits the incoming sentence into words and forwards them to the next bolt.

It keeps no context of previous sentences and does not depend on any saved state to process the current input.
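For reference, a splitter bolt along these lines is completely stateless (a sketch, not the exact code from Part 2; field names are assumed):

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Splits each incoming sentence into words. It keeps no state between tuples,
// so any instance can process any sentence.
public class SentenceSplitterBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String sentence = tuple.getStringByField("sentence");
        for (String word : sentence.split("\\s+")) {
            collector.emit(new Values(word, 1));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count_in_sentence"));
    }
}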

But why does that make it easy to increase the number of instances? What happens if we increase the instances of a bolt that needs previous context to process the current input?

To answer that, we need to know about the different types of tuple grouping.


Let’s start with “What is a Tuple Grouping?”

Data is transferred between bolts in the form of tuples. Each tuple has a number of fields, and the tuple grouping specifies how these tuples are distributed among the instances of the receiving bolt.

Types of tuple grouping:

  1. Shuffle Grouping : We don’t specify a field, and tuples get randomly distributed among the parallel bolt instances.
  2. Field Grouping : We specify one or more fields which Storm hashes to decide which bolt instance a tuple should go to. This way, tuples with the same field value(s) always go to the same bolt instance.
  3. Direct Grouping : In this grouping, the producer of the tuple decides which bolt instance receives it.
  4. Custom Grouping : Instead of specifying a field statically, you pass an implementation of CustomStreamGrouping and decide the distribution at runtime.

Coming back to our original discussion: in our topology we have specified “Shuffle Grouping”, so tuples can randomly go to any bolt instance.
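In code, that wiring looks roughly like this (a sketch with assumed names):

builder.setBolt("word-counter", new WordCounterBolt(), 2)
       .shuffleGrouping("sentence-splitter"); // tuples go to WC1 or WC2 at random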

word-counter bolt

Let’s take our “word-counter” bolt as an example. In that bolt we keep a count of each word seen since the topology started; when new data comes in, we add it to the stored count and forward the updated count to the next bolt.
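A sketch of such a bolt (field names assumed, not the exact code from Part 2) looks like this; note that the running counts live inside the bolt instance itself:

import java.util.HashMap;
import java.util.Map;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Keeps a running total per word in this instance's own memory, which is
// why the choice of grouping matters once multiple instances are running.
public class WordCounterBolt extends BaseBasicBolt {

    private final Map<String, Long> counts = new HashMap<>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getStringByField("word");
        int countInSentence = tuple.getIntegerByField("count_in_sentence");
        long total = counts.merge(word, (long) countInSentence, Long::sum);
        collector.emit(new Values(word, total));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "total_count"));
    }
}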

So let’s assume we have two instances of this bolt, WC1 and WC2, and we have selected shuffle grouping.

 

Now two tuples come in with the following values:

tuple format : {word, count_in_sentence}

tuple 1 : {"hello", 1}

tuple 2 : {"hello", 1}

Storm sends tuple 1 to WC1 and tuple 2 to WC2. When both bolts process them, each reports a count of 1, so the output bolt is told that the word “hello” occurred 1 time in that period, when it actually occurred twice.

This makes a perfect use case for “Field Grouping”.
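In code, the fix is to replace the shuffle grouping on the word-counter bolt with a fields grouping on the “word” field (a sketch with assumed names):

builder.setBolt("word-counter", new WordCounterBolt(), 2)
       .fieldsGrouping("sentence-splitter", new Fields("word"));
// Every tuple carrying the same word is now routed to the same instance,
// so each instance holds the complete count for the words it owns.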

There is one more use case that fits the bill for “Field Grouping”.

Let’s say you suddenly get bombarded with a huge amount of data from one user. If you use “userId” as one of your field grouping parameters, all of that traffic will go to a single bolt instance, so only one set of users will be affected while everyone else’s data keeps getting processed normally.


I hope this article gives you an understanding of tuple grouping in Storm. You can comment on this post if you are facing any specific issue related to it.

The next post will be about how multiple bolts can consume the same bolt’s output stream and when you would need that.

By,

Tiwan Punit

www.coviam.com


Also published on Medium.
