Getting ready

In this recipe, we will build a classifier that assigns a document to one of several discussion groups according to its topic. Such a classifier is useful for checking whether a post is relevant to the discussion group it appears in. We will use the 20 newsgroups dataset, available at the following URL: http://qwone.com/~jason/20Newsgroups/.

The dataset is a collection of about 20,000 newsgroup documents, divided into 20 different newsgroups. It was originally collected by Ken Lang and published in the paper Newsweeder: Learning to filter netnews, and it is particularly useful for text classification problems.
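To make the task concrete before touching the real data, the following is a minimal sketch of topic classification using only the Python standard library. It is an illustration under stated assumptions, not the method used later in the recipe: the four training sentences are invented stand-ins for newsgroup posts, the helper names (tokenize, train, predict) are hypothetical, and the model is a simple multinomial Naive Bayes with add-one smoothing, chosen here only because it is short. The two labels, rec.sport.baseball and sci.space, are names of groups in the 20 newsgroups dataset.

```python
import math
from collections import Counter, defaultdict

# Toy stand-in for newsgroup posts: the texts are invented for illustration,
# only the group names come from the 20 newsgroups dataset.
TRAIN = [
    ("the pitcher threw a fast ball in the ninth inning", "rec.sport.baseball"),
    ("home run wins the game for the team", "rec.sport.baseball"),
    ("the rocket reached orbit after launch", "sci.space"),
    ("nasa plans a new moon mission and shuttle launch", "sci.space"),
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train(samples):
    """Count words per group, documents per group, and collect the vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in samples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict(text, model):
    """Return the group with the highest smoothed log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: fraction of training documents in this group.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
prediction = predict("the shuttle launch was delayed", model)
print(prediction)
```

With a corpus this small the result only shows the mechanics: a new post is assigned to the group whose training vocabulary it matches best. The real dataset, with about 20,000 documents spread over 20 groups, calls for proper feature extraction and a library classifier.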