Book: Mastering Hadoop 3
Authors: Chanchal Singh, Manish Kumar
Last updated: 2025-04-04 14:54:50

Filtering patterns

The filtering pattern keeps only the records that satisfy a particular condition and discards the rest. Data cleansing is one of its most common applications: raw data may contain records with missing fields, or outright junk that cannot be used in further analysis, and filtering logic can validate each record and remove the bad ones. Another example is filtering web articles by a word or regex match; the matching articles can then be used in classification, tagging, or machine learning use cases. A further use case is keeping only the customers who have bought something worth more than 500 dollars, filtering out everyone else, and then processing that subset in further analysis. Let's look at the following regex filtering example:

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

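// Mapper that emits a record unchanged when it matches the regex and drops it otherwise.
public class RegexFilteringMapper extends Mapper<Object, Text, NullWritable, Text> {

    // Placeholder: replace with the actual regular expression to match against.
    private String regexPattern = "/* REGEX PATTERN HERE */";

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {

        // String.matches() succeeds only if the entire record matches the pattern.
        if (value.toString().matches(regexPattern)) {
            // The key carries no information, so NullWritable is emitted in its place.
            context.write(NullWritable.get(), value);
        }
    }
}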
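
The data cleansing and purchase-value use cases described earlier follow the same shape as the regex mapper: only the condition changes. The following is a minimal sketch of such a condition, kept free of Hadoop types so that it can be run on its own. The class name PurchaseRecordFilter, the comma-separated layout customerId,orderId,amount, and the constant names are all assumptions made for illustration; they do not come from the book.

```java
// Hypothetical record layout (an assumption): customerId,orderId,amount
final class PurchaseRecordFilter {

    private static final int EXPECTED_FIELDS = 3;    // assumed field count
    private static final double MIN_AMOUNT = 500.0;  // threshold from the text

    private PurchaseRecordFilter() { }

    // Returns true only for well-formed records whose amount is more than MIN_AMOUNT.
    static boolean keep(String line) {
        if (line == null) {
            return false;
        }
        // A limit of -1 keeps trailing empty fields, so "a,b," still yields 3 fields.
        String[] fields = line.split(",", -1);
        if (fields.length != EXPECTED_FIELDS) {
            return false;                             // missing or extra fields
        }
        for (String field : fields) {
            if (field.trim().isEmpty()) {
                return false;                         // a field is not present
            }
        }
        try {
            return Double.parseDouble(fields[2].trim()) > MIN_AMOUNT;
        } catch (NumberFormatException e) {
            return false;                             // amount is junk
        }
    }

    public static void main(String[] args) {
        String[] samples = {"c1,o1,750.00", "c2,o2,120.50", "c3,,900", "c4,o4,abc"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + keep(sample));
        }
    }
}
```

In a mapper shaped like RegexFilteringMapper, the condition in map() would become PurchaseRecordFilter.keep(value.toString()), with the context.write() call left as it is.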

Another example is random sampling of data, which many use cases require, such as preparing data for testing applications or training machine learning models. Another common use case is finding the top-k records based on a specific condition. In most organizations, it is important to identify outliers, such as the customers who are genuinely loyal to the merchant, and offer them good rewards, or to identify customers who have not used the application for a long time and offer them a good discount to get them to re-engage. Let's look into how we can find the top-k records using MapReduce based on a particular condition.
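
Before turning to top-k, the random sampling case mentioned above can be sketched as a filter whose condition is a coin flip rather than a test on the record's content. The sketch below assumes simple Bernoulli sampling, where each record is kept independently with a fixed probability; the class name RandomSampler and its constructor parameters are hypothetical, and the code is plain Java so that it can be run without a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper: keeps each record independently with probability `rate`.
final class RandomSampler {

    private final double rate;
    private final Random random;

    RandomSampler(double rate, long seed) {
        // Written this way so that NaN is rejected as well.
        if (!(rate >= 0.0 && rate <= 1.0)) {
            throw new IllegalArgumentException("rate must be in [0, 1]: " + rate);
        }
        this.rate = rate;
        this.random = new Random(seed);   // fixed seed gives a repeatable sample
    }

    // The filter condition: true with probability `rate`.
    boolean keep() {
        return random.nextDouble() < rate;
    }

    // Applies the condition to every record, as a mapper would per input line.
    List<String> sample(List<String> records) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (keep()) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            records.add("record-" + i);
        }
        RandomSampler sampler = new RandomSampler(0.1, 42L);
        System.out.println("kept " + sampler.sample(records).size() + " of " + records.size());
    }
}
```

The size of the sample is only approximately rate times the input size, since every record is decided independently. In a mapper, the rate would typically be read once from the job configuration in setup(), and keep() would guard the context.write() call in the same way the regex match does in RegexFilteringMapper.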