Latent Semantic Analysis

In natural language processing, Latent Semantic Analysis (LSA) is a technique used to analyze the relationships between a collection of documents based on the words they contain. This post takes an in-depth look at LSA and shows how it can be used to sort a large collection of documents.

The Python code that was used in this post can be found here: LSA Walkthrough

To understand how Latent Semantic Analysis works, let's walk through an example of how it can be used to sort the following four (very unrealistic) documents:

After reading through these documents, intuitively, we can see that they could be sorted into two different categories. For the sake of simplicity, let’s call these categories pets and travel.

This is a rather trivial conclusion for us to draw, since the subject of the first two documents is clearly different from that of the last two, but for a computer, identifying these subject differences is not a simple task. What techniques do we have at our disposal to make it easier for a computer to recognize these kinds of differences?

Document-Term Frequency and Cosine Similarity

One common technique begins by looking at how frequently the words in one document appear in the others. Intuitively, documents that frequently use the same words have a higher probability of being similar to one another. For example, we know that on a basic level, the first two documents in our collection can be considered similar because they both use the words cats and dogs frequently. To further analyze these relationships, it is common practice to convert a collection of documents into what is known as a document-term frequency matrix (in my code and from here on out, I will refer to it as a "frequency matrix").

Each row of this matrix represents a document, and each column represents a word that appears in the collection. By letting each entry of the matrix be the frequency at which a given word appears in a given document, we have a concise way of representing the documents based on their word frequencies.
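As a rough sketch of how this matrix can be built in Python (assuming the four example documents are stored in a list of strings called documents, and using scikit-learn's CountVectorizer rather than counting by hand):

from sklearn.feature_extraction.text import CountVectorizer

# documents is assumed to be a list holding the four example documents as strings
vectorizer = CountVectorizer()

# Rows correspond to documents, columns to words, entries to raw word counts
frequency_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(frequency_matrix.toarray())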

Here is the document-term frequency matrix represented as a heat map for our example documents:

While it may not be abundantly clear, if we sift through this matrix, we can see that words like cats and dogs are frequently used by the first two documents, and that the third and fourth documents frequently use words like travel and Italy. Just based on these patterns, we can already begin to get a sense of how these documents are similar.

By treating the rows of this matrix as vectors (often referred to as document vectors), we can quantify the similarity of two documents using their document vectors' cosine similarity. Cosine similarity is simply the cosine of the angle between two vectors: if the angle between them is small, meaning the vectors point in similar directions, the cosine similarity score will be close to 1.
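Concretely, for two document vectors d1 and d2, the score is the cosine of the angle between them:

    cos(θ) = (d1 · d2) / (||d1|| ||d2||)

A score of 1 means the vectors point in exactly the same direction, while a score of 0 means they are orthogonal, i.e. the documents share no counted words.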

In the code below, I do a quick pairwise cosine similarity calculation between every document vector in our collection to get an idea of how similar each document is to the others:

from sklearn.metrics.pairwise import cosine_similarity

# Number of documents = number of rows in the frequency matrix
num_documents = frequency_matrix.shape[0]

print("Cosine Similarity: \n")
for i in range(num_documents - 1):
    for j in range(i + 1, num_documents):
        # cosine_similarity returns a 1x1 matrix for a pair of row vectors
        score = cosine_similarity(frequency_matrix[i], frequency_matrix[j])[0, 0]
        print("\t Document %d and %d: %.8f" % (i + 1, j + 1, score))
Output:

Cosine Similarity:

    Document 1 and 2: 0.30555556
    Document 1 and 3: 0.07273930
    Document 1 and 4: 0.10206207
    Document 2 and 3: 0.24246432
    Document 2 and 4: 0.13608276
    Document 3 and 4: 0.53452248

As we can see, the document pairs (1,2) and (3,4) have the highest cosine similarity scores, which is what we would expect to see from these documents, but we can also see that the document pair (2,3) has an unexpectedly high similarity score.

Stop Words and Tf-idf

The reason this pair has a high cosine similarity score is that both of these documents use the word "to" frequently. This points out a major flaw in the model: by letting raw word frequencies determine the similarity between documents, we allow certain not-very-useful words to inflate the similarity.

For example, the word "and" appears in nearly every English body of text that you can come across. If you let "and" contribute to the similarity, then almost every body of text is going to be related since almost all of them contain the word "and."

Because of this, before we build our frequency matrix, it is common practice to use a stop list to filter out conjunctions, prepositions and determiners, since they appear in the majority of documents and contribute very little information about a document's subject.

To further increase the accuracy of our cosine similarity scores, we introduce a weighting scheme known as tf-idf, which weights each word based on how frequently it appears and on how many documents it appears in. In some sense, this quantifies the "importance" of a word to a collection of documents: if a word appears frequently in only a select few documents, it is likely a defining feature of those documents and says a lot about their subject.
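As a sketch of how this step might look in code (assuming scikit-learn's TfidfVectorizer with its built-in English stop word list, which may differ slightly from the stop list used to produce the figures below):

from sklearn.feature_extraction.text import TfidfVectorizer

# stop_words='english' filters out common function words before the tf-idf weighting is applied
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

print(tfidf_vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())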

After removing our stop words and running the tf-idf weighting scheme over our collection of documents, we produce the tf-idf matrix:

After removing stop words and introducing the tf-idf weighting scheme, the similarities between the documents become much clearer. Now we are really able to pick up on the fact that, for example, words like cats, dogs and pets are prominent words that appear frequently in the first two documents.

Let's take a look at the cosine similarities between these new document vectors:
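The pairwise comparison is the same as before, just run over the tf-idf weighted vectors (a sketch, assuming the tfidf_matrix from the snippet above):

print("Cosine Similarity: \n")
for i in range(num_documents - 1):
    for j in range(i + 1, num_documents):
        # Compare the tf-idf weighted document vectors instead of the raw counts
        score = cosine_similarity(tfidf_matrix[i], tfidf_matrix[j])[0, 0]
        print("\t Document %d and %d: %.8f" % (i + 1, j + 1, score))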

Output:
            
Cosine Similarity:

    Document 1 and 2: 0.26221910
    Document 1 and 3: 0.04448158
    Document 1 and 4: 0.05626396
    Document 2 and 3: 0.14167618
    Document 2 and 4: 0.07168148
    Document 3 and 4: 0.39328159

As we can see, the similarity scores for all of the document pairs have gone down, but some have been squashed more than others. In particular, the scores for the document pairs that we know are not similar have, for the most part, been driven down below 0.2.

Singular Value Decomposition and Semantic Space

The tf-idf model coupled with cosine similarity provides us with a great approximation for the similarity between documents, but there are some limitations to this model.

First of all, this model compares documents based on the idea that documents that frequently use the same words will be similar to one another. While this works decently well for the most part, the problem with this idea is that it cannot pick up on synonyms. Suppose, for example, we added a fifth document to our collection that frequently used the words feline and canine. The tf-idf model would fail to recognize that the first two documents are similar to this new one, since neither of them uses the word feline!

The second issue with this model is that in any real-world situation, we would be using this technique to analyze thousands of documents that contain tens of thousands of distinct words, which leads to document vectors that live in a very high dimensional space. Cosine similarity is most useful when the vectors we're studying live in a relatively low dimensional space. As the dimension of our document vectors increases, the angles between them become less and less informative, and in very high dimensions cosine similarity struggles to distinguish similar documents from dissimilar ones. This is known as the curse of dimensionality.

To counteract these issues, we need to find a way to pick up on synonyms and also reduce the dimension of these document vectors. Conveniently, this can be done by taking the singular value decomposition of our tf-idf matrix and then using this decomposition to project our documents into semantic space.

Singular value decomposition provides us with a way to decompose our tf-idf matrix into a product of three smaller matrices:

Given an arbitrary transposed tf-idf matrix D, there exists a singular value decomposition of the form:

D = U S Vᵀ

where U is a unitary matrix that describes how the words in our documents are related to our features (I'll explain what I mean by this in just a second), S describes the strength of our features, and Vᵀ describes how our documents are related to our features.

What's a feature you might ask? When we initially started this example, we intuitively divided our collection of documents into two categories: pets and travel. These categories are examples of features for our collection of documents.
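Here is one way this decomposition might be computed (a sketch, assuming the tfidf_matrix from the earlier snippet and NumPy's SVD routine; the exact values will depend on the stop list and vectorizer settings):

import numpy as np

# D is the transposed tf-idf matrix: rows are words, columns are documents
D = tfidf_matrix.T.toarray()

# Thin singular value decomposition: D = U S V^T
U, S, VT = np.linalg.svd(D, full_matrices=False)

print("U shape:", U.shape)    # words x features
print("Singular values:", S)  # strength of each feature
print("VT shape:", VT.shape)  # features x documents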

Here is the U matrix for our collection of documents:

As we can see, words like cats and dogs are strongly associated with our first feature: pets! Furthermore, words like travel and Italy are strongly associated with our second feature: travel!

It's unclear to me what the third and fourth features are trying to describe. Taking a quick look at our S matrix, a diagonal matrix whose singular values describe the strength of each of our features:

Singular Values:

    1.225233566301
    1.077347205740
    0.863096120569
    0.770188803448

We can see that our first two features are the dominating features in our collection of documents.

Let's take a look at our Vᵀ matrix:

As we can see, the first two documents are highly associated with our first feature, and the third and fourth documents are highly associated with our second feature, as we would expect.

So the decomposition allows us to pick up on synonyms, but the real power of this method comes from projecting our documents into semantic space.