[LWM] NLP: Bag-of-words

Welcome to another blog-isode of Learn with me — a weekly educational series by Gauss Algorithmic. We take cutting-edge technological concepts and break them down into bite-sized pieces for everyday business people. Today’s topic will be the second in our series on Natural Language Processing, or NLP.

 

Welcome back to Learn with me, where we break down complicated tech topics so simply that even a marketer like me can understand. 🤷‍♂️

We started our series on NLP last week, and we will continue by discussing two interconnected concepts: Bag-of-words and Text classification. Both of these concepts are used for spam filters or other forms of automated document sorting.

Let’s look at a (very real) email you might find in your spam folder.

 

 

Bag-of-words

Like we talked about last time, machine learning algorithms don’t typically process text without turning it into numerical data first. The bag-of-words method is one way to do that.

This process takes all the words in a body of text (for our example, we’ll just use the subject line of an email) and simply counts how many times each word appears. The result is a list of numbers called a vector.

 

 

Notice that a Bag-of-words vector can include words that AREN’T in the text (“hello”). This is because the absence of words can also be important when sorting, or classifying, a text.
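For the curious, here is a minimal Python sketch of that counting step. The subject line and the word list (`vocabulary`) are made up for illustration, and the function name `bag_of_words` is our own, not part of any library.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # One number per vocabulary word, in vocabulary order.
    # A Counter returns 0 for words it has never seen.
    return [counts[word] for word in vocabulary]

# Hypothetical subject line and vocabulary, invented for this sketch.
subject = "Hot singles want hot nights with you"
vocabulary = ["hello", "hot", "nights", "singles", "you"]

vector = bag_of_words(subject, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Note that “hello” gets a count of 0 because it never appears in the subject line, while “hot” gets a 2.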

Text classification

A spam filter compares the bag-of-words vector of an incoming email to the profile of a typical email of each kind. For example, the profile of a typical spam email might have many more instances of the word “sexy” than the average email you want in your inbox.

(Your inbox experience may vary, we don’t judge. 😉)
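One simple way to picture that comparison is sketched below in Python. This is a toy under loud assumptions: the example emails, the word list, and the "closest average profile wins" rule are all ours, chosen for illustration. Real spam filters generally rely on more robust statistical models and far more data.

```python
import re
from collections import Counter

# Hypothetical vocabulary, invented for this sketch.
VOCABULARY = ["hello", "hot", "meeting", "nights", "report", "sexy"]

def bag_of_words(text):
    """Turn text into a vector of word counts over VOCABULARY."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return [counts[word] for word in VOCABULARY]

def average_profile(emails):
    """Average the vectors of several emails into one 'typical' profile."""
    vectors = [bag_of_words(email) for email in emails]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def distance(a, b):
    """Straight-line (Euclidean) distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(text, spam_profile, inbox_profile):
    """Label the text by whichever profile its vector is closer to."""
    vector = bag_of_words(text)
    if distance(vector, spam_profile) < distance(vector, inbox_profile):
        return "spam"
    return "inbox"

# Tiny made-up training sets; a real filter would learn from many more.
spam_examples = ["Sexy singles want hot nights", "Hot hot hot sexy deals"]
inbox_examples = ["Hello, the meeting report is attached",
                  "Hello team, meeting moved"]

spam_profile = average_profile(spam_examples)
inbox_profile = average_profile(inbox_examples)

print(classify("Sexy hot nights await", spam_profile, inbox_profile))
print(classify("Hello, quick meeting about the report",
               spam_profile, inbox_profile))
```

The first message lands closer to the spam profile and the second closer to the inbox profile, so they are labeled "spam" and "inbox" respectively.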

 

 

Machine learning algorithms go through a huge number of emails to learn the patterns that separate emails that belong in your inbox from those that belong in your spam folder. Bag-of-words is one method they use for this kind of text classification, but there are other methods too, which we’ll talk about next week.

The result? No hoot nights for me. 🥺

 

 

How can this help my business?

Lots of different types of text exist, and machine learning algorithms can process them automatically. This is called document automation. The cost of setting it up can quickly fall below the cost of the man-hours it replaces.

A few examples:
⚖️ Sort health records into low- and high-risk
✉️ Redirect info@ emails to the right team
💫 Organize YEARS of old documents

Tell us about your business use case and we will let you know how we can use machine learning and/or document automation to save you money.
