Machine Learning: Where to Start with Building a Text Classification Model

Topic Modeling with PoC2Ops

In this story I will take you through analyzing a statement to decide whether it gets a Thumbs Up or a Thumbs Down. The key to building an efficient model is a good set of training data. I’m assuming you have already prepared your own, but I will show an example of the training data I used. This story is part of a larger series in which I break down the steps of the Machine Learning process. It covers Data Prep, Training the Model, and Testing the Model; additional stories cover Tuning the Model and Using the Model.

Process Overview

Data Prep -> Training the Model -> Testing the Model

Tools Used in this Story

  1. Python
  2. Jupyter Notebooks
  3. spaCy
  4. AWS DynamoDB

Data Prep

For this example we are going to evaluate a statement and give it a Thumbs Up or Thumbs Down based on whether it fits our business need. We load the Training Data into a DynamoDB table that has three fields: trainingDataRequirementId (the key), feedback, and requirementText.

Start by creating a new DynamoDB Table

DynamoDB Create Table — AWS
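If you would rather script the table than click through the console, a minimal boto3 sketch could look like the following. The table name `TrainingData`, the string key type, the region, and on-demand billing are my assumptions for illustration, so adjust them to your own setup.

```python
# Assumed values for illustration; not taken from the original setup.
TABLE_NAME = "TrainingData"
KEY_NAME = "trainingDataRequirementId"


def table_definition(table_name=TABLE_NAME):
    """Build the keyword arguments for DynamoDB's create_table call."""
    return {
        "TableName": table_name,
        # Only the key attribute is declared. feedback and requirementText
        # are ordinary attributes and need no schema in DynamoDB.
        "AttributeDefinitions": [
            {"AttributeName": KEY_NAME, "AttributeType": "S"}  # assuming a string key
        ],
        "KeySchema": [{"AttributeName": KEY_NAME, "KeyType": "HASH"}],
        "BillingMode": "PAY_PER_REQUEST",  # on-demand, so no capacity planning
    }


def create_table(region_name="us-east-1"):
    """Create the table and wait until it is ready (needs AWS credentials)."""
    import boto3  # imported here so table_definition() works without boto3

    client = boto3.client("dynamodb", region_name=region_name)
    client.create_table(**table_definition())
    client.get_waiter("table_exists").wait(TableName=TABLE_NAME)
```

Calling `create_table()` once is enough; the waiter returns when the table is ready to accept items.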

Next, populate the DynamoDB Table. In this scenario I populated it with a set of statements from a ReactJS application, which I’ll cover in another story on Data Prep Techniques. The feedback is binary: 1 (Thumbs Up) or 0 (Thumbs Down).

Example Data in the DynamoDB Table
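If you do not have an application feeding the table, the rows can also be loaded from a short script. This is only a sketch: the helper names `make_item` and `load_items` are hypothetical, the generated UUID key is my assumption, and `table` is expected to be a boto3 Table resource.

```python
import uuid


def make_item(requirement_text, feedback, item_id=None):
    """Build one table row; feedback is 1 (Thumbs Up) or 0 (Thumbs Down)."""
    if feedback not in (0, 1):
        raise ValueError("feedback must be 0 or 1")
    return {
        # Assumption: a random UUID string is used when no id is supplied.
        "trainingDataRequirementId": item_id or str(uuid.uuid4()),
        "feedback": feedback,
        "requirementText": requirement_text,
    }


def load_items(table, rows):
    """Write (text, feedback) pairs through a boto3 Table resource."""
    with table.batch_writer() as batch:  # groups the writes into batches
        for text, feedback in rows:
            batch.put_item(Item=make_item(text, feedback))
```

With credentials configured, `load_items(boto3.resource("dynamodb").Table("TrainingData"), rows)` would write a list of `(statement, feedback)` pairs, where the table name is again an assumed one.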

Training the Model

For this step we assume you have a set of data in the DynamoDB table described above. We are now going to train a spaCy model with this data, and I’m going to use Jupyter Notebooks to take you through the steps. Open Jupyter Notebooks and install spaCy from a terminal window. You will also need to download the “en_core_web_md” model; we start from this model and resume training on it.

conda install -c conda-forge spacy
python -m spacy download en_core_web_md

Next we will create a cell to import the required data. You will need to insert your access key and secret access key to pull from your DynamoDB instance.
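A cell along these lines would do the job. Treat it as a sketch under a few assumptions of mine: the table is called `TrainingData`, the keys are read from environment variables instead of being pasted into the notebook, and the label names `THUMBS_UP` and `THUMBS_DOWN` are my own choice.

```python
import os


def scan_all(table):
    """Read every item from a boto3 Table, following pagination."""
    response = table.scan()
    items = list(response["Items"])
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])
    return items


def to_training_data(items):
    """Turn table rows into (text, {"cats": {...}}) tuples for spaCy."""
    data = []
    for item in items:
        # DynamoDB returns numbers as Decimal, so normalise before comparing.
        thumbs_up = float(int(item["feedback"]) == 1)
        cats = {"THUMBS_UP": thumbs_up, "THUMBS_DOWN": 1.0 - thumbs_up}
        data.append((item["requirementText"], {"cats": cats}))
    return data


def load_training_data(table_name="TrainingData", region_name="us-east-1"):
    """Connect with keys taken from the environment and build the tuples."""
    import boto3  # imported here so the helpers above work without boto3

    dynamodb = boto3.resource(
        "dynamodb",
        region_name=region_name,
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    )
    return to_training_data(scan_all(dynamodb.Table(table_name)))
```

In the notebook, finish the cell with `TRAINING_DATA = load_training_data()` and `print(TRAINING_DATA[0])`.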

This will give you an output at the end. For this example I’m printing just one row of the training data; in this case the record was a Thumbs Down.

Jupyter Notebooks Output

We now have a Python variable, “TRAINING_DATA”, that holds a list of tuples we can use to begin re-training the model.
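The training step itself can be sketched as follows. I am assuming the spaCy v3 API here (older versions differ), tuples shaped like `(text, {"cats": {"THUMBS_UP": 1.0, "THUMBS_DOWN": 0.0}})`, and label names of my own choosing. The `split_data` helper is a hypothetical addition that holds back a few rows for testing later.

```python
import random

LABELS = ("THUMBS_UP", "THUMBS_DOWN")  # assumed label names


def split_data(data, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the data and hold out a slice for testing."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


def train_textcat(train_data, base_model="en_core_web_md", n_iter=10):
    """Add a text classifier to the base model and train only that component."""
    import spacy  # spaCy v3 assumed; imported here so split_data() works alone
    from spacy.training import Example
    from spacy.util import minibatch

    nlp = spacy.load(base_model)
    textcat = nlp.add_pipe("textcat")
    for label in LABELS:
        textcat.add_label(label)

    examples = [
        Example.from_dict(nlp.make_doc(text), annotations)
        for text, annotations in train_data
    ]
    textcat.initialize(lambda: examples, nlp=nlp)
    optimizer = nlp.resume_training()

    # Leave the pretrained components untouched while textcat learns.
    with nlp.select_pipes(enable="textcat"):
        for _ in range(n_iter):
            random.shuffle(examples)
            losses = {}
            for batch in minibatch(examples, size=8):
                nlp.update(batch, sgd=optimizer, losses=losses)
            print(f"textcat loss: {losses['textcat']:.3f}")
    return nlp
```

A typical call would be `train, held_out = split_data(TRAINING_DATA)` followed by `nlp = train_textcat(train)`. The iteration count and batch size are placeholder values, not tuned settings.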

Testing the Model

We now have a model trained on our examples and can test a statement to see how the model classifies it. In this case the model scores the statement as 64% Thumbs Down. In the training data we gave a Thumbs Down to one statement about Video Teleconference, but a couple of other statements related to Video Teleconference did not get one, which likely explains why the score is not more decisive.
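A check like the one described above can be sketched in a few lines. It assumes `nlp` is a trained spaCy pipeline with a text classifier, and that the labels are named `THUMBS_UP` and `THUMBS_DOWN`; both helper names are hypothetical.

```python
def top_label(cats):
    """Return the highest-scoring label and its score from doc.cats."""
    label = max(cats, key=cats.get)
    return label, cats[label]


def classify(nlp, statement):
    """Run a trained spaCy pipeline on one statement and describe the result."""
    doc = nlp(statement)  # doc.cats maps each label to its score
    label, score = top_label(doc.cats)
    return f"{score:.0%} {label}"
```

For scores of 0.64 for `THUMBS_DOWN` and 0.36 for `THUMBS_UP`, `classify` returns the string `64% THUMBS_DOWN`.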