Classification I – Detecting Poor Answers

A continuous challenge for owners of Q&A sites is maintaining a decent level of quality in the posted content. Sites such as StackOverflow make considerable effort to encourage users: they provide diverse ways to score content, and they offer badges and bonus points that motivate users to spend more energy on carving out a question or crafting a possible answer.

One particularly successful incentive is the asker's ability to flag one answer to their question as the accepted answer (there are incentives for the asker to flag answers as such, too). Doing so results in a higher score for the author of the flagged answer.

Would it not be very useful for users to see immediately how good their answer is while they are typing it? That would mean the website continuously evaluates the user's work-in-progress answer and provides feedback as to whether the answer has room for improvement. Such feedback would encourage the user to put more effort into writing the answer (such as providing a code example or even including an image), and thus improve the overall system.

Let's build such a mechanism! In this chapter, we'll cover the following topics:

  • Fetching and preprocessing the raw data
  • Creating a first nearest-neighbor classifier
  • Looking into how to improve the classifier's performance
  • Switching from nearest-neighbor to logistic regression
  • Learning about precision and recall to better understand the classifier's performance
  • Thinking about the necessary steps for shipping it