
Automated Intent Classification Using Deep Learning

There have been major changes in how search engines operate that should make us question our traditional take on SEO:
Research keywords.
Write content.
Build links.
Nowadays, search engines are able to match pages even if the keywords are not present. They are also getting better at directly answering questions.
At the same time, searchers are growing more comfortable using natural language queries. I’ve even found growing evidence of new sites ranking for competitive terms without building links.
Recent research from Google even questions a fundamental content marketing framework: the buyer’s journey.
Their conclusion is that we should no longer assume visitors move along a linear path from awareness to decision. Instead, we should adapt to the unique path taken by each potential customer.
Considering all these major changes taking place, how do we adapt?
Using machine learning, of course!
Automate everything: Machine learning can help you understand and predict intent in ways that simply aren’t possible manually.
In this article, you will learn to do just that.
This is such an important topic that I will depart from the intense coding sessions of past articles. I will keep the Python code light to make it practical for the whole SEO community.
Here is our plan of action:
We will learn how to classify text using deep learning and without writing code.
We will practice by building a classification model trained on news articles from the BBC.
We will test the model on news headlines we will scrape from Google Trends.
We will build a similar model but we will train it on a different dataset with questions grouped by their intention.
We will use Google Data Studio to pull potential questions from Google Search Console.
We will use the model to classify the questions we export from Data Studio.
We will group the questions by their intention and extract actionable insights we can use to prioritize content development efforts.
We will go over the underlying concepts that make this possible: word vectors, embeddings, and encoders/decoders.
We will build a sophisticated model that can parse not just intent but also specific actions like the ones you give to Siri and Alexa.
Finally, I’ll share some resources to learn more.
Uber Ludwig
Completing the plan I described above using deep learning generally requires writing advanced Python code.
Fortunately, Uber released a super valuable tool called Ludwig that makes it possible to build and use predictive models with incredible ease.
We will run Ludwig from within Google Colaboratory in order to use their free GPU runtime.
Training deep learning models without using GPUs can be the difference between waiting a few minutes to waiting hours.
Automated Text Classification
In order to build predictive models, we need relevant labeled data and model definitions.
Let’s practice with a simple text classification model straight from the Ludwig examples.
We are going to use a labeled dataset of BBC articles organized by category. This article should give you a sense of the level of coding we won’t have to do because we are using Ludwig.
Setting up Ludwig
Google Colab comes with TensorFlow 1.12 preinstalled. Let’s make sure we install the version Ludwig expects, and the GPU build, so we can take advantage of the GPU runtime.
Under the Runtime menu, select Change runtime type, then choose Python 3 and GPU as the hardware accelerator.
!pip install tensorflow-gpu==1.13.1
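We also need to install Ludwig itself, which does not come preinstalled in Colab. A plain pip install is enough (pin a specific version if you want to reproduce these exact results, since the library changes quickly):
!pip install ludwig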
Prepare the Dataset for Training
Download the BBC labeled dataset.
!gsutil cp gs://dataset-uploader/bbc/bbc-text.csv .
Let’s create a model definition. We will use the first one from the Ludwig examples.
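For reference, here is a rough sketch of what that definition looks like, written straight from the notebook so it gets saved as model_definition.yaml. The feature names text and category match the columns in bbc-text.csv; the encoder settings are a reasonable starting point rather than a guaranteed match for the example’s exact configuration.

model_definition = """
input_features:
    -
        name: text
        type: text
        level: word
        encoder: parallel_cnn

output_features:
    -
        name: category
        type: category
"""

# Write the definition to disk so the ludwig command below can pick it up.
with open('model_definition.yaml', 'w') as f:
    f.write(model_definition)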
Run Ludwig to Build & Evaluate the Model
!ludwig experiment \
--data_csv bbc-text.csv \
--model_definition_file model_definition.yaml
When you review Ludwig’s output, you will find that it saves you from tasks you would otherwise need to perform manually. For example, it automatically splits the dataset into training, validation, and test sets.
Training set: 1556
Validation set: 215
Test set: 454
Our training step stopped after 12 epochs. This Quora answer provides a good explanation of epochs, batches, and iterations.
Our test accuracy was only 0.70, which pales in comparison to the 0.96 achieved manually in the referenced article.
Nevertheless, this is very promising because we didn’t need any deep learning expertise and it took only a small fraction of the work. I’ll provide some direction on how to improve models in the resources section.
Visualizing the Training Process
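Ludwig can plot the learning curves it records during training. The exact flag names and the location of the statistics file vary between Ludwig versions, so treat the line below as a sketch and check !ludwig visualize --help in your runtime:
!ludwig visualize --visualization learning_curves --training_statistics results/experiment_run_0/training_statistics.json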
Let’s Test the Model with New Data
Let’s use this JavaScript snippet to scrape article titles from Google Trends to feed into the model.
After creating a pandas dataframe with the articles’ titles, we can proceed to get predictions from the trained model.
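Here is a minimal sketch of that step, assuming the Ludwig Python API available at the time (the data_df argument was renamed in later releases) and that the experiment above saved its model under results/:

import pandas as pd
from ludwig.api import LudwigModel

# Paste the titles scraped with the JavaScript snippet into this list.
titles = [
    'example headline one',
    'example headline two',
]

# The column name must match the input feature in model_definition.yaml.
test_df = pd.DataFrame({'text': titles})

# Adjust the path to wherever your experiment run saved the trained model.
model = LudwigModel.load('results/experiment_run_0/model')
predictions = model.predict(data_df=test_df)

print(predictions.head())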
Here is what the predictions look like for the top global category.
I scraped headlines from the Tech and Business sections. While the Tech section predictions were not particularly good, the Business ones showed more promise.
I’d say that the DOJ going after Google is definitely not Entertainment. Not for Google for sure!
Automated Question Classification
We are going to use the exact same process and model, but with a different dataset, which will enable us to do something more powerful: learn to classify questions by their intention.
After you log in to Kaggle and download the dataset, you can use the code below to load it into a dataframe in Colab.
This awesome dataset groups questions by the type of answer expected, using two categories: a broader one and a more specific one.
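Something along these lines does the loading; the file name and column layout are assumptions, so adjust them to whatever the Kaggle download actually contains:

import pandas as pd

# The file name is an assumption: use the actual CSV from the Kaggle download.
questions_df = pd.read_csv('Question_Classification_Dataset.csv')

# Inspect the columns: the question text plus the two label columns
# (the broader category and the more specific one).
print(questions_df.head())

# Save a copy with a simple name so we can point Ludwig at it.
questions_df.to_csv('questions.csv', index=False)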
I updated the model definition by adding a second output category, so we now get two predictions per question.
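Sketching that updated definition in the same style as before; the feature names (question for the text column, category0 and category1 for the two labels) are assumptions, so swap in the real column names printed by the dataframe above:

model_definition = """
input_features:
    -
        name: question
        type: text
        level: word
        encoder: parallel_cnn

output_features:
    -
        name: category0
        type: category
    -
        name: category1
        type: category
"""

with open('model_definition.yaml', 'w') as f:
    f.write(model_definition)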
The training process is the same. I only changed the source dataset.
!ludwig experiment \
--data_csv questions.csv \
--model_definition_file model_definition.yaml
Word Vectors and Embeddings
Before we go further, let’s go over the underlying concepts that make this kind of classification possible: word vectors and embeddings.
There is a big difference between looking up a business by its name in the physical world, and looking it up by its address or GPS coordinate.
Looking up a business by its name is the equivalent of matching up searches and pages by the keywords in their content.
Google’s physical address in NYC is 111 Eighth Avenue. In Spanish, it is 111 octava avenida. In Japanese, it is 111番街. If you are close by, you would refer to it in terms of the number of blocks and turns. You get the idea.
In other words, it is the same place, but when asked for directions, different people would refer to this place in different ways, according to their particular context.
The same happens with language: people refer to the same thing in many different ways, and computers need a universal way to refer to things that is independent of context.
Word vectors represent words as numbers. You normally take all the unique words in the source text and build a dictionary where each word gets a number.
For example, this is the equivalent of taking all the business names on Eighth Avenue and translating each one into its street number: one business is number 114 on Eighth, another is number 111, and so on.
This initial step is good for uniquely identifying words or street addresses, but it is not enough to easily calculate distances between addresses globally or to provide directions. For that, we also need to encode proximity information using absolute coordinates.
When it comes to physical addresses that is what GPS coordinates do. Google’s address in NYC has GPS coordinates 40°44′29″N 74°0′11″W, while the Pad Thai Noodle Lounge has 40°44′27″N 74°0′5″W, which are in close proximity.
Similarly, when it comes to words, that is what word embeddings do. Embeddings are essentially absolute coordinates, but with hundreds of dimensions.
Imagine word embeddings as GPS coordinates in an imaginary space where similar words are close together and different ones are far apart.
As word embeddings and GPS coordinates are simply vectors (numbers with more than one dimension), they can be operated on much like regular numbers (scalars).
In the same way that you can calculate the difference between two numbers by subtracting them, you can also calculate the difference between two vectors (their distance) using mathematical operations. The most common are cosine similarity and Euclidean distance.
Let’s bring this home, and see how word vectors and embeddings actually look in practice and how they make it easy to compare similar words.
A word like “hotel” gets represented as a long list of numbers, and that vector representation is what makes it easy to compare similar words.
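As a toy illustration, here are made-up three-dimensional vectors (real embeddings have hundreds of dimensions) and the cosine similarity calculation that scores related words as closer than unrelated ones:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors purely for illustration; real embeddings are learned from large text corpora.
hotel = np.array([0.8, 0.3, 0.1])
motel = np.array([0.7, 0.4, 0.1])
flight = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(hotel, motel))   # high: similar concepts
print(cosine_similarity(hotel, flight))  # lower: less related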
So, in summary, when we provided the training text to Ludwig, Ludwig encoded the words into vectors/embeddings, in a way that makes it easy to compute their distance/similarity.
In practice, embeddings are precomputed and stored in lookup tables, which helps speed up the training process.
Beyond Intent Classification
Now, let’s do something a bit more ambitious. Let’s build a model that can parse text and extract actions and any information needed to complete the actions.
For example: “Book a flight at 7 pm to London” should be understood not just as an intent to book a flight, but also as specifying the departure time and the destination city.
Ludwig includes one example of this type of model under the Natural Language Understanding section.
We are going to use this labeled dataset, which is specific to the travel industry. After you log in to Kaggle and download it, you can upload it to Colab as in the previous examples.
It is a zip file, so you need to unzip it.
!unzip atis-dataset-clean.zip
Here is the model definition.
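Here is a hedged sketch of that definition, following the shape of Ludwig’s natural language understanding example: one text input, a category output for the intent, and a sequence output with a tagger decoder for the slots. The feature names (utterance, intent, slots) are assumptions; match them to the headers in atis.train.csv.

model_definition = """
input_features:
    -
        name: utterance
        type: text
        level: word
        encoder: rnn
        cell_type: lstm
        bidirectional: true
        reduce_output: null

output_features:
    -
        name: intent
        type: category
        reduce_input: sum
    -
        name: slots
        type: sequence
        decoder: tagger
"""

with open('model_definition.yaml', 'w') as f:
    f.write(model_definition)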
We run Ludwig to train the model as usual.
!ludwig experiment \
--data_csv atis.train.csv \
--model_definition_file model_definition.yaml
In my run, it stopped after epoch 20 and achieved a combined test accuracy of 0.74.
Here is the code to get the predictions from the test dataset.
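A sketch of that step, under the same assumptions as before about the Ludwig Python API and the results path; the test file name is also an assumption based on the training file’s naming:

import pandas as pd
from ludwig.api import LudwigModel

# Assumed name of the test split included in the download.
test_df = pd.read_csv('atis.test.csv')

# Adjust the path to wherever this experiment run saved the trained model.
model = LudwigModel.load('results/experiment_run_0/model')
predictions = model.predict(data_df=test_df)

print(predictions.head())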
Finally, here is what the predictions look like.
Resources to Learn More
If you are not technical, this is probably the best course you can take to learn about deep learning and its possibilities: AI For Everyone.
If you have a technical background, I recommend this specialization: Deep Learning by Andrew Ng. I also completed the one from Udacity last year, but I found the NLP coverage more in-depth in the Coursera one.
The third module, Structuring Machine Learning Projects, is incredibly valuable and unique in its approach. The material in this course will provide some of the key knowledge you need to reduce the errors in your model predictions.
My inspiration to write this article came from the excellent work on keyword classification by Dan Brooks from the Aira SEO team.