- Mastering Hadoop 3
- Chanchal Singh Manish Kumar
- 2025-04-04 14:54:50
Composite join
A composite join is a map-side join performed on very large datasets. Its advantage is the same as that of the map-side join discussed previously: there is no reducer, so the shuffle and sort phases are skipped. The one requirement is that the input data must be prepared in a specific way before it is processed.
The first condition is that each dataset must be sorted by the key used for the join. Each dataset must also be partitioned by that key, and both datasets must have the same number of partitions, so that a given key lands in the same partition number on both sides. Hadoop provides a special InputFormat, CompositeInputFormat, to read such datasets.
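To see why these conditions matter, here is a minimal plain-Java sketch (no Hadoop dependency) of what a map-side join can do once two matching partitions are both sorted by the join key: a single forward pass over each input is enough. The class name `SortedPartitionJoin`, the `{joinKey, value}` array layout, and the sample records are assumptions made for illustration; this is not how CompositeInputFormat is implemented internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: inner join of two partitions that are already sorted by key.
class SortedPartitionJoin {

    // Each record is a two-element array: {joinKey, value} (hypothetical layout).
    static List<String> mergeJoin(List<String[]> left, List<String[]> right) {
        List<String> joined = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            String key = left.get(i)[0];
            int cmp = key.compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++; // left key is smaller: no match possible, advance left
            } else if (cmp > 0) {
                j++; // right key is smaller: advance right
            } else {
                // Find the run of records sharing this key on each side.
                int iEnd = i;
                while (iEnd < left.size() && left.get(iEnd)[0].equals(key)) {
                    iEnd++;
                }
                int jEnd = j;
                while (jEnd < right.size() && right.get(jEnd)[0].equals(key)) {
                    jEnd++;
                }
                // Emit every left/right combination for the key.
                for (int a = i; a < iEnd; a++) {
                    for (int b = j; b < jEnd; b++) {
                        joined.add(key + "\t" + left.get(a)[1] + "\t" + right.get(b)[1]);
                    }
                }
                i = iEnd;
                j = jEnd;
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Both inputs are sorted by the join key (the first element).
        List<String[]> customers = Arrays.asList(
                new String[] {"1", "alice"},
                new String[] {"2", "bob"},
                new String[] {"4", "dave"});
        List<String[]> orders = Arrays.asList(
                new String[] {"1", "order-a"},
                new String[] {"1", "order-b"},
                new String[] {"3", "order-c"},
                new String[] {"4", "order-d"});
        mergeJoin(customers, orders).forEach(System.out::println);
    }
}
```

Because neither input is ever rewound or buffered in full, the pass works only if both sides are sorted by the same key, and it finds every match only if the two partitions being read together hold the same set of keys.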
Before using the following template, you must preprocess your input data so that it is in the format required for a composite join: sorted and partitioned by the join key. Let's look at a mapper and reducer that sort and partition the input data.
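Before turning to the MapReduce job itself, the effect of this preparation step can be pictured with a small plain-Java sketch. It assigns each record to a partition from its key and then sorts every partition by key. The class name `JoinInputPreparer` and the `{joinKey, value}` record layout are hypothetical. The partition formula follows the same shape as Hadoop's default HashPartitioner, but it is applied here to a Java `String`, so the partition numbers will not necessarily match those of a real job using `Text` keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: partition records by join key, then sort each partition.
class JoinInputPreparer {

    // Same key and same partition count always give the same partition number.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Each record is a two-element array: {joinKey, value} (hypothetical layout).
    static List<List<String[]>> prepare(List<String[]> records, int numPartitions) {
        List<List<String[]>> partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            partitions.add(new ArrayList<>());
        }
        for (String[] record : records) {
            partitions.get(partitionFor(record[0], numPartitions)).add(record);
        }
        // Sort every partition by the join key.
        for (List<String[]> partition : partitions) {
            partition.sort((a, b) -> a[0].compareTo(b[0]));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String[]> customers = Arrays.asList(
                new String[] {"4", "dave"},
                new String[] {"1", "alice"},
                new String[] {"3", "carol"},
                new String[] {"2", "bob"});
        List<List<String[]>> partitions = prepare(customers, 2);
        for (int p = 0; p < partitions.size(); p++) {
            for (String[] record : partitions.get(p)) {
                System.out.println("partition " + p + ": " + record[0] + "\t" + record[1]);
            }
        }
    }
}
```

Running `prepare` on both datasets with the same `numPartitions` is what guarantees that partition `p` of one dataset can be joined against partition `p` of the other without looking anywhere else.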