Book: Real-time Analytics with Storm and Cassandra
Author: Shilpi Saxena
Updated: 2025-04-04 21:00:14

Stream groupings

Next, we need to get acquainted with the various stream groupings provided by Storm. A stream grouping is the mechanism that defines how Storm partitions and distributes a stream of tuples among the tasks of a bolt. Streams are the basic wiring component of a Storm topology, and understanding them gives the developer a lot of flexibility to handle various problems efficiently.

Local or shuffle grouping

Shuffle grouping is the most common grouping. It distributes the tuples emitted by the source randomly across the instances of the target bolt, so that each instance gets to process approximately the same number of events. Load balancing is automatically taken care of by this grouping. The related local-or-shuffle variant (localOrShuffleGrouping) prefers target tasks that run in the same worker process as the sender, when there are any, and otherwise behaves like shuffle grouping.

Because the distribution is random, this grouping is suited only to atomic operations, in which each tuple can be processed independently of the others. It takes a single parameter: the source of the stream. The following snippet from the WordCount topology (which we created earlier) demonstrates the usage of shuffle grouping:

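import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
// Sentences are spread randomly over the 8 "split" tasks.
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));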

In the following figure, shuffle grouping is depicted:

Here, Bolt A and Bolt B each have a parallelism of two, so the Storm framework spawns two instances of each bolt. These bolts are wired together by shuffle grouping. We will now discuss the distribution of events.

About 50 percent of the events from Instance 1 of Bolt A would go to Instance 1 of Bolt B, and the remaining 50 percent would go to Instance 2 of Bolt B. Similarly, about 50 percent of the events emitted by Instance 2 of Bolt A would go to Instance 1 of Bolt B, and the remaining 50 percent would go to Instance 2 of Bolt B.
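
The even split described above can be checked with a small simulation. The following is a minimal plain-Java sketch, not Storm code: it assumes that each tuple is sent to a target task picked uniformly at random, and the class name, method name, and seed are hypothetical choices made for illustration.

```java
import java.util.Random;

// Plain-Java simulation of shuffle-style routing; this is not Storm code.
class ShuffleGroupingSketch {

    // Sends tupleCount tuples to numTasks target tasks, picking a task
    // uniformly at random for each tuple, and returns the per-task counts.
    static int[] distribute(int tupleCount, int numTasks, long seed) {
        Random random = new Random(seed);
        int[] received = new int[numTasks];
        for (int i = 0; i < tupleCount; i++) {
            received[random.nextInt(numTasks)]++;
        }
        return received;
    }

    public static void main(String[] args) {
        // Two target tasks stand in for Instance 1 and Instance 2 of Bolt B.
        int[] counts = distribute(10_000, 2, 42L);
        System.out.println("Instance 1 of Bolt B: " + counts[0]);
        System.out.println("Instance 2 of Bolt B: " + counts[1]);
    }
}
```

With a large number of tuples, both counts end up close to half of the total, although they are rarely exactly equal.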

Fields grouping

In this grouping, we specify two parameters: the source of the stream and the fields. The values of the fields are used to control the routing of the tuples to the various bolt instances. This grouping guarantees that tuples with the same value for the grouping fields are always routed to the same instance of the bolt.

In the following figure, fields grouping is depicted between Bolt A and Bolt B, each of which has two instances. Notice the flow of events based on the value of the fields grouping parameter.

All the events from Instance 1 and Instance 2 of Bolt A where the value of Field is P are sent to Instance 1 of Bolt B.

All the events from Instance 1 and Instance 2 of Bolt A where the value of Field is Q are sent to Instance 2 of Bolt B.
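
One common way to obtain this guarantee is to hash the field value and take the result modulo the number of target tasks. The following is a minimal plain-Java sketch of that idea, not Storm's own implementation; the class and method names are hypothetical, and the single field value is hashed with its own hashCode as a simplifying assumption.

```java
// Plain-Java sketch of hash-based routing; this is not Storm code.
class FieldsGroupingSketch {

    // Maps a grouping-field value to a task index in [0, numTasks).
    // floorMod keeps the index non-negative even for negative hash codes.
    static int chooseTask(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        // The values P and Q mirror the field values used in the figure.
        String[] fieldValues = {"P", "Q", "P", "Q", "P"};
        for (String value : fieldValues) {
            System.out.println(value + " -> task index " + chooseTask(value, 2));
        }
    }
}
```

Because the task index depends only on the field value, every tuple carrying P lands on one task and every tuple carrying Q lands on the other, regardless of which sender instance emitted it.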

All grouping

All grouping is a kind of broadcaster grouping that can be used in situations where the same message needs to be sent to all instances of the destination bolt. Here, each tuple is sent to all the instances of the bolt.

This grouping should be used only in very specific cases, for specific streams, where we want the same information to be replicated to all the bolt instances downstream. Take a use case in which one bolt holds information about countries and their currency values, and the bolts that follow it need this information for currency conversion. Whenever the currency bolt has a change, it uses all grouping to publish the change to all the instances of the following bolts.

Here we have a diagrammatic representation of all grouping, where all the tuples from Bolt A are sent to all the instances of Bolt B.
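
The currency use case can be mimicked with a short plain-Java sketch. It is not Storm code: each downstream bolt instance is modeled as a simple rate table, and the class name, method names, currency code, and rate value are all hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of broadcast delivery; this is not Storm code.
class AllGroupingSketch {

    // Creates one rate table per simulated downstream bolt instance.
    static List<Map<String, Double>> createInstances(int count) {
        List<Map<String, Double>> instances = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            instances.add(new HashMap<>());
        }
        return instances;
    }

    // Delivers the same update to every instance, as all grouping would.
    static void broadcast(List<Map<String, Double>> instances, String currency, double rate) {
        for (Map<String, Double> rates : instances) {
            rates.put(currency, rate);
        }
    }

    public static void main(String[] args) {
        List<Map<String, Double>> instances = createInstances(3);
        broadcast(instances, "EUR", 1.10); // illustrative value only
        System.out.println(instances);
    }
}
```

Every instance ends up with its own copy of the update, which is also why this grouping multiplies the traffic by the number of target instances.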

Global grouping

Global grouping makes sure that the entire stream from the source component (spout or bolt) goes to a single instance of the target bolt; to be precise, to the instance of the target bolt with the lowest ID. Let's understand the concept with an example. Let's say my topology is as follows:

I will assign the following parallelism to the components:

Also, I will use the following stream groupings:

The framework then directs all the data emitted by the instances of myboltA to a single instance of myboltB, namely the one to which Storm assigned the lowest ID at instantiation:

As in the preceding figure, in the case of global grouping, all tuples from both instances of Bolt A would go to Instance 1 of Bolt B, assuming it has a lower ID than Instance 2 of Bolt B.

Note

Storm assigns an ID to each instance of a bolt or spout that it creates in the topology. In global grouping, all the tuples are directed to the instance of the target bolt that has the lowest ID allocated by Storm.
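
The selection rule from the note can be written down in a few lines. The following is a minimal plain-Java sketch, not Storm code; the class name, method name, and task IDs are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java sketch of the "lowest ID wins" rule; this is not Storm code.
class GlobalGroupingSketch {

    // Returns the single task that receives the whole stream:
    // the one with the lowest ID among the target tasks.
    static int chooseTask(List<Integer> targetTaskIds) {
        return Collections.min(targetTaskIds);
    }

    public static void main(String[] args) {
        List<Integer> boltBTasks = Arrays.asList(7, 4, 9); // hypothetical task IDs
        System.out.println("All tuples go to task " + chooseTask(boltBTasks));
    }
}
```

The choice does not depend on the tuple or on the sender, so every instance of the source component picks the same target.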

Custom grouping

Storm, being an extensible framework, lets developers create their own stream groupings. This can be done by providing an implementation of the backtype.storm.grouping.CustomStreamGrouping interface.
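
The general shape of such an implementation is a setup step that receives the target tasks, followed by a per-tuple step that picks among them. The sketch below illustrates that shape in plain Java without depending on Storm. SimpleGrouping is a hypothetical, simplified stand-in and not the real interface, whose methods take additional arguments (check the Javadoc for your Storm version); the length-based routing rule is likewise invented for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-in for a custom grouping contract.
// It is NOT Storm's CustomStreamGrouping interface.
interface SimpleGrouping {
    void prepare(List<Integer> targetTasks);

    List<Integer> chooseTasks(List<Object> values);
}

// Routes a tuple according to the length of its first value.
class LengthGrouping implements SimpleGrouping {
    private List<Integer> targetTasks = Collections.emptyList();

    @Override
    public void prepare(List<Integer> targetTasks) {
        this.targetTasks = targetTasks;
    }

    @Override
    public List<Integer> chooseTasks(List<Object> values) {
        if (targetTasks.isEmpty()) {
            throw new IllegalStateException("prepare must be called with at least one task");
        }
        String word = String.valueOf(values.get(0));
        int index = word.length() % targetTasks.size();
        return Collections.singletonList(targetTasks.get(index));
    }

    public static void main(String[] args) {
        LengthGrouping grouping = new LengthGrouping();
        grouping.prepare(Arrays.asList(3, 5)); // hypothetical task IDs
        System.out.println(grouping.chooseTasks(Arrays.<Object>asList("storm")));
    }
}
```

Returning a list rather than a single ID leaves room for a grouping that replicates a tuple to several tasks.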

Direct grouping

In this kind of grouping, the Storm framework lets the sender component (spout or bolt) decide which task of the consumer bolt will receive the tuple at the time the sender emits it to the stream.

The tuple must be emitted to the stream using the special emitDirect method, and the task of the consuming component has to be specified (note that the task IDs can be fetched using the TopologyContext).
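
The sender-side decision can be illustrated with a small plain-Java sketch. It is not Storm code: the class below is a hypothetical model in which each consumer task has an inbox, and its emitDirect method only imitates the idea of naming the receiving task at emit time. In a real topology, as far as I know, the consumer would also have to subscribe to the stream with directGrouping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of sender-side task selection; this is not Storm code.
class DirectGroupingSketch {

    // One inbox per consumer task ID.
    private final Map<Integer, List<String>> inboxes = new LinkedHashMap<>();

    DirectGroupingSketch(List<Integer> consumerTaskIds) {
        for (Integer taskId : consumerTaskIds) {
            inboxes.put(taskId, new ArrayList<>());
        }
    }

    // Hypothetical analogue of a direct emit: the sender names the task.
    void emitDirect(int taskId, String tuple) {
        List<String> inbox = inboxes.get(taskId);
        if (inbox == null) {
            throw new IllegalArgumentException("Unknown consumer task ID: " + taskId);
        }
        inbox.add(tuple);
    }

    // Returns the tuples delivered to a task, or null for an unknown ID.
    List<String> received(int taskId) {
        return inboxes.get(taskId);
    }

    public static void main(String[] args) {
        DirectGroupingSketch consumer = new DirectGroupingSketch(Arrays.asList(11, 12)); // hypothetical IDs
        consumer.emitDirect(12, "hello");
        System.out.println("Task 11: " + consumer.received(11));
        System.out.println("Task 12: " + consumer.received(12));
    }
}
```

Only the task named by the sender receives the tuple; the other consumer tasks see nothing.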
