- Real-time Analytics with Storm and Cassandra
- Shilpi Saxena
- 889字
- 2025-04-04 21:00:14
Stream groupings
Next we need to get acquainted with various stream groupings (a stream grouping is basically the mechanism that defines how Storm partitions and distributes the streams of tuples amongst tasks of bolts) provided by Storm. Streams are the basic wiring component of a Storm topology, and understanding them provides a lot of flexibility to the developer to handle various problems in programs efficiently.
Local or shuffle grouping
Local or shuffle grouping is the most common grouping that randomly distributes the tuples emitted by the source ensuring equal distribution, that is, each instance of the bolt gets to process the same number of events. Load balancing is automatically taken care of by this grouping.
Due to the random nature of distribution of this grouping, it's useful only for atomic operations by specifying a single parameter—source of stream. The following snippet is from WordCount
topology (which we reated earlier), which demonstrates the usage of shuffle grouping:
TopologyBuilder myBuilder = new TopologyBuilder(); builder.setSpout("spout", new RandomSentenceSpout(), 5); builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout"); builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
In the following figure, shuffle grouping is depicted:

Here Bolt A and Bolt B both have a parallelism of two, each; so two instances of each of these bolts is spawned by the Storm framework. These bolts are wired together by shuffle grouping. We will now discuss the distribution of events.
The 50 percent events from Instance 1 of Bolt A would go to Instance 1 of Bolt B, and the remaining 50 percent would go to Instance 2 of Bolt B. Similarly, 50 percent of events emitted by Instance 2 of Bolt B would go to Instance 1 of Bolt B, and the remaining 50 percent would go to Instance 2 of Bolt B.
Fields grouping
In this grouping, we specify two parameters—the source of the stream and the fields. The values of the fields are actually used to control the routing of the tuples to various bolts. This grouping guarantees that for the same field's value, the tuple will always be routed to the same instance of the bolt.
In the following figure, field grouping is depicted between Bolt A and Bolt B, and each of these bolts have two instances each. Notice the flow of events based on the value of the field grouping parameter.

All the events from Instance 1 and Instance 2 of Bolt A, where the value of Field is P are sent to Instance 1 of Bolt B.
All the events from Instance 1 and Instance 2 of Bolt A, where the value of Field is Q are sent to Instance 2 of Bolt B.
All grouping
All grouping is a kind of broadcaster grouping that can be used in situations where the same message needs to be sent to all instances of the destination bolt. Here, each tuple is sent to all the instances of the bolt.
This grouping should be used in very specific cases, for specific streams, where we want the same information to be replicated to all bolt instances downstream. Let's take a use case that has some information related to a country and its currency value and the bolts following the bolt, which does need this information for some currency conversion. Now whenever currency bolt has any changes, it uses all grouping to publish it to all the instances of the following bolts:

Here we have a diagrammatic representation of all grouping, where all the tuples from Bolt A are sent to all the instances of Bolt B.
Global grouping
Global grouping makes sure that the entire stream from the source component (spout or bolt) goes to a single instance of target bolt, to be more precise and specific to the instance of the target bolt with the lowest ID. Well let's understand the concept with an example, let's say my topology is as follows:

I will assign the following parallelism to the components:

Also, I will use the following stream groupings:

Then, the framework will direct all data from the myboltA stream instances, that are emitting onto one instance of myboltB stream, which would be the one to which Storm has assigned a lower ID while instantiation:

As in the preceding figure, in the case of global grouping, all tuples from both instances of Bolt A would go to Instance 1 of Bolt B, assuming it has a lower ID than Instance 2 of Bolt B.
Custom grouping
Storm, being an extendible framework, provides the facility to developers to create their own stream grouping. This can be done by providing an implementation to the backtype.storm.grouping.CustomStreamGroupinginterface
class.
Direct grouping
In this kind of grouping, the Storm framework provides the ability to the sender
component (spout or bolt) to decide which task of the consumer bolt would receive the tuple while the sender component is emitting a tuple to the stream.
The tuple must be emitted to the stream using a special emitDirect
method to the stream, and the task of consuming a component has to be specified (note that the tasked can be fetched using the TopologyContext
method).