How to do it...

  1. Create a new Python file and import the following packages (a sketch of the splitter helper from the chunking recipe follows the imports):
import numpy as np 
from nltk.corpus import brown 
from chunking import splitter 
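The splitter function is imported from the chunking.py module built in the earlier recipe on chunking. If that file is not at hand, the following is a minimal sketch of what it is assumed to do, namely split the input text into chunks of num_of_words words each; only the name and signature come from the import above, and the body is an assumption modeled on that recipe:
# chunking.py -- assumed helper that splits text into chunks of num_of_words words
def splitter(content, num_of_words):
    words = content.split(' ')
    chunks = []
    current_words = []
    current_count = 0
    for word in words:
        current_words.append(word)
        current_count += 1
        if current_count == num_of_words:
            chunks.append(' '.join(current_words))
            current_words = []
            current_count = 0
    # append whatever is left over as the final chunk
    chunks.append(' '.join(current_words))
    return chunks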
  2. Define the main function and read the input data from the Brown corpus:
if __name__ == '__main__': 
    # Read the first 10,000 words of the Brown corpus as a single string
    content = ' '.join(brown.words()[:10000]) 
  3. Split the text content into chunks:
    num_of_words = 2000 
    num_chunks = [] 
    count = 0 
    texts_chunk = splitter(content, num_of_words) 
  4. Build a dictionary for each of these text chunks, keeping track of the chunk index (an optional count check follows the loop):
    for text in texts_chunk: 
        num_chunk = {'index': count, 'text': text} 
        num_chunks.append(num_chunk) 
        count += 1
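As an optional check (not part of the original recipe code), you can print how many chunks were produced before building the matrix:
    print "Number of text chunks =", len(num_chunks)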
  5. Extract a document-term matrix, which counts the number of occurrences of each word in the document. We will use scikit-learn's CountVectorizer for this:
from sklearn.feature_extraction.text import CountVectorizer
  6. Define the vectorizer object and extract the document-term matrix (a note on the min_df and max_df parameters follows the code):
vectorizer = CountVectorizer(min_df=5, max_df=.95) 
matrix = vectorizer.fit_transform([num_chunk['text'] for num_chunk in num_chunks]) 
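The min_df and max_df parameters control which words enter the vocabulary: min_df=5 keeps only words whose document frequency is at least five chunks, while max_df=.95 discards words that appear in more than 95% of the chunks. The toy example below, with made-up sentences and reusing the CountVectorizer class imported in the previous step, is only an illustration of the min_df filter and is not part of the recipe:
# Illustration only: min_df filtering on three tiny documents
toy_docs = ['the cat sat', 'the cat ran', 'the dog ran']
toy_vectorizer = CountVectorizer(min_df=2)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)
# 'sat' and 'dog' appear in only one document each, so they are dropped
print toy_vectorizer.get_feature_names()   # expected vocabulary: cat, ran, the
print toy_matrix.toarray()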
  7. Extract the vocabulary and print it:
vocabulary = np.array(vectorizer.get_feature_names()) 
print "\nVocabulary:" 
print vocabulary 
  8. Print the document-term matrix:
print "\nDocument term matrix:" 
chunks_name = ['Chunk-0', 'Chunk-1', 'Chunk-2', 'Chunk-3', 'Chunk-4'] 
formatted_row = '{:>12}' * (len(chunks_name) + 1) 
print '\n', formatted_row.format('Word', *chunks_name), '\n' 
  9. Iterate over the words and print the number of occurrences of each word in the various chunks:
for word, item in zip(vocabulary, matrix.T): 
    # 'item' is a 'csr_matrix' data structure 
    result = [str(x) for x in item.data] 
    print formatted_row.format(word, *result)
  10. The result obtained after executing the bag-of-words model is shown as follows:

In order to understand how it works on a given sentence, refer to the following:
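One way to see this on a single sentence (an added illustration, not code from the original recipe) is to run CountVectorizer on one example sentence and print how often each word occurs; every distinct word becomes a feature, and its value is simply its count:
# Bag-of-words on a single example sentence (illustration only)
from sklearn.feature_extraction.text import CountVectorizer
sentence = ["The brown dog is running. The black dog is in the black room."]
sentence_vectorizer = CountVectorizer()
sentence_matrix = sentence_vectorizer.fit_transform(sentence)
counts = sentence_matrix.toarray()[0]
for word, count in zip(sentence_vectorizer.get_feature_names(), counts):
    print word, count
# 'the' appears 3 times; 'black', 'dog', and 'is' appear twice; the rest once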