- Import the required Python packages:
import numpy as np
from nltk.corpus import brown
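If the Brown corpus is not already available locally, NLTK raises a LookupError on first use; a one-time download fixes this:
import nltk
nltk.download('brown')  # one-time download of the Brown corpus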
- Define a function that splits text into chunks:
# Split a text into chunks
def splitter(content, num_of_words):
    words = content.split(' ')
    result = []
- Initialize the variables that track the words and word count of the current chunk:
    current_count = 0
    current_words = []
- Iterate over the words, appending each one to the current chunk:
    for word in words:
        current_words.append(word)
        current_count += 1
- Once the chunk contains the required number of words, store it and reset the variables:
        if current_count == num_of_words:
            result.append(' '.join(current_words))
            current_words = []
            current_count = 0
- Append any leftover words as the final chunk and return the result (the check avoids appending an empty chunk when the text divides evenly):
    if current_words:
        result.append(' '.join(current_words))
    return result
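As a quick sanity check, the function can be tried on a short string (illustrative only, not part of the recipe's script):
print(splitter('the quick brown fox jumps', 2))
# ['the quick', 'brown fox', 'jumps']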
- Read the data from the Brown corpus and take the first 10,000 words:
if __name__ == '__main__':
    # Read the data from the Brown corpus
    content = ' '.join(brown.words()[:10000])
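The Brown corpus is pre-tokenized, so brown.words() returns individual tokens, which is why they are joined with spaces here. For example:
print(brown.words()[:5])  # e.g. ['The', 'Fulton', 'County', 'Grand', 'Jury']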
- Set the number of words in each chunk:
    # Number of words in each chunk
    num_of_words = 1600
- Initialize a list for the chunks and a counter (declared here, though this snippet does not use them further):
    chunks = []
    counter = 0
- Call the splitter function and print the number of chunks:
    num_text_chunks = splitter(content, num_of_words)
    print("Number of text chunks =", len(num_text_chunks))
- The script reports the number of chunks; with 10,000 words and 1,600 words per chunk, that is six full chunks plus a final chunk of 400 words, so it should print:
Number of text chunks = 7
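The same chunking can also be expressed more compactly with list slicing. The following is a minimal alternative sketch (the name chunker is illustrative, not part of the recipe):
def chunker(content, num_of_words):
    words = content.split(' ')
    # Step through the word list num_of_words at a time and rejoin each slice
    return [' '.join(words[i:i + num_of_words])
            for i in range(0, len(words), num_of_words)]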