Reddit Community Clustering

Abstract - The purpose of this project is to analyze the relationships between communities on the website Reddit based on the unique words used within each subreddit. Latent Semantic Analysis will be used to examine these relationships and to identify similar communities.

Introduction

If you aren’t familiar with Reddit, it is a bulletin-board style website where you can share interesting links, pictures, etc. with others. To organize what you share on the website, Reddit asks you to choose a specific subreddit to post your information to. A subreddit is just a forum on the website that is dedicated to a specific topic. For example, if you wanted to share pictures from your backpacking trip along the Appalachian Trail, the most obvious subreddit to choose would be r/hiking, but there’s an even more specific subreddit, r/AppalachianTrail, that would probably be a better choice. There is a subreddit for just about anything that you can think of. Whether you’re interested in science, politics, economics, etc., there is probably a subreddit for it.

A direct result of having these different subreddit communities is that they have their own unique vocabularies. For example, posts within the hiking subreddit are more likely to contain words like “outdoors” and “nature,” while posts in a subreddit like r/math are more likely to contain words like “algebra” and “calculus.” While this may seem obvious, recognizing that each subreddit has its own unique vocabulary helps tease out the idea that there is an underlying structure to the website based on each subreddit’s vocabulary.

With this idea in mind, we can start to ask questions like:

Which subreddits have the most similar vocabularies? In other words, which subreddits are the most semantically similar/dissimilar?
Given a random sentence, can I make a prediction as to which subreddit it came from?

For this project, I will attempt to answer these questions with Latent Semantic Analysis (LSA). If you are unfamiliar with LSA, I highly suggest reading my blog post on it. I spent a good amount of time trying to flesh out the core ideas behind how it works.

The Python code for this project can be found here.

Gathering and Processing Post Titles From Reddit

Before I start gathering data for this project, I want to take a second to outline exactly how I am going about gathering and processing post titles. To gather post titles from Reddit, I decided to use the Python package PRAW (Python Reddit API Wrapper), as it provides easy access to Reddit’s API and ensures that I don’t exceed the API’s request rate limits.

With PRAW, the first step towards gathering the information we need is to get a client ID and client secret for OAuth2 authentication. This just ensures that Reddit knows who is making requests to pull data from their website and when. Check out this OAuth2 quick start guide for more information on how to set up your developer account.
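In the code that follows, these credentials are kept in a separate config module rather than hard-coded in the script. A minimal sketch of what that file might contain (the values here are placeholders, not real credentials):

#config.py -- placeholder credentials for your Reddit bot account (fill in your own)
username = "your_reddit_username"
password = "your_reddit_password"
client_id = "your_client_id"
client_secret = "your_client_secret"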

Once you have created your Reddit developer account and have your credentials for OAuth2, we can begin gathering post titles from any subreddit of our choosing with only a few lines of code:

import config
import praw

num_posts = 5
    
#Use your credentials to log into your bot's account
bot = praw.Reddit(username = config.username,
                password = config.password,
                client_id = config.client_id,
                client_secret = config.client_secret,
                user_agent = "Subreddit_Analysis")
                
#Create an object that contains information on the 'hiking' subreddit
subreddit = bot.subreddit('hiking')
                
#Print the post titles from the first x number of posts inside of the 'hiking' subreddit (can be sorted by top, hot, new, ...)
for post in subreddit.hot(limit = num_posts):
    print(post.title, "\n")
            
OUTPUT:
    
Yesterday, at Chopok Mountain, Slovakia [OC]

2,917 meters above sea level with view to the „Gschnitzer and Pflerscher Tribulaun“, Schwarze Wand - Obernberg valley, Tirol, Austria
        
Atop Grassy Ridge Bald, Pisgah National Forest, N.C./Tenn. boarder, U.S.A.
        
Hiking to Island Lake, Colorado is well worth the long drive!
        
I love it when I get to cross a beautiful footbridge on the trail. Caples Creek, Ca.         

As you can see, it is rather easy to start gathering post titles from subreddits. Now we can start the process of transforming these post titles into something that is more suitable for latent semantic analysis. The first thing that we need to do is remove all the special characters, numbers and capitalization. This can be accomplished with regular expressions, specifically, the re package in Python.

Continuing off of the code above, we can quickly implement a method to clean up the titles:

import re

def clean(post):

    #Remove the text inside parentheses and brackets (the brackets themselves are stripped later)
    re_p = re.sub(r"([\(\[]).*?([\)\]])", r"\g<1>\g<2>", post)

    #Remove links
    re_l = re.sub(r"http\S+|([\(\[]).*?([\)\]])", "", re_p)

    #Remove special characters and numbers
    re_s = re.sub(r"[^A-Za-z —]+", " ", re_l)

    #Remove capitalization
    re_c = re_s.lower()

    #Remove excess white space
    filtered_post = " ".join(re_c.split())

    return filtered_post
                  
#Print post titles from x number of posts inside of the 'hiking' subreddit (can be sorted by top, hot, new, ...)
for post in subreddit.hot(limit = num_posts):
    print(clean(post.title), "\n")                   
OUTPUT:
    
yesterday at chopok mountain slovakia

meters above sea level with view to the gschnitzer and pflerscher tribulaun schwarze wand obernberg valley tirol austria
        
hiking to island lake colorado is well worth the long drive
        
atop grassy ridge bald pisgah national forest n c tenn boarder u s a
        
i love it when i get to cross a beautiful footbridge on the trail caples creek ca        

Now that we have removed special characters, numbers and capitalization, the titles are much more suitable for LSA, but there are still a couple of issues that need to be dealt with. For example, take a look at this title:

atop grassy ridge bald pisgah national forest n c tenn boarder u s a

As you can see, our first issue is that we still have a few individual characters floating around that need to be removed. While the cleaning method did its job correctly, I failed to account for abbreviations like “U.S.A.” leaving behind individual letters separated by whitespace. Any method we feed this data into down the line will interpret these stray characters as unique words, which is obviously something we do not want, so we need a way to get rid of them. The second issue is that our titles are filled with nuisance words like “the” and “is.” Words like these appear frequently in the majority of titles but contribute little to the overall meaning a title is trying to convey, so they would just end up being extra noise in our model.

We can solve both of these problems by introducing a list of stop words to filter out these individual characters and nuisance words (if you are unfamiliar with stop words, I will again refer you to this blog post). There are many different lists of English stop words available on the internet; some will work better for certain models than others, depending on what you’re trying to accomplish. I found that, for this project, combining the nltk and sklearn stop word lists removed the majority of these nuisance words and individual characters.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction import text

#Set stop words to english (Uses both sklearn and nltk stopwords)
stop_words = set(list(text.ENGLISH_STOP_WORDS) +  list(stopwords.words("english")))

filtered_post = []
                
#Print the filtered title from x number of posts inside of the 'hiking' subreddit
for post in subreddit.hot(limit = num_posts):

    #Clean the post title and tokenize it
    post_tokens = word_tokenize(clean(post.title))

    #Remove stop words from the tokenized title
    for word in post_tokens:
        if word not in stop_words:
            filtered_post.append(word)

    print("Filtered post: %s \n" % ' '.join(filtered_post))
    filtered_post = []
OUTPUT:
    
Filtered post: yesterday chopok mountain slovakia

Filtered post: meters sea level view gschnitzer pflerscher tribulaun schwarze wand obernberg valley tirol austria
        
Filtered post: hiking island lake colorado worth long drive
        
Filtered post: atop grassy ridge bald pisgah national forest boarder a

Filtered post: love cross beautiful footbridge trail caples creek

After removing our stop words, these titles are finally at a point where I feel comfortable feeding them into our model as raw data. Let’s put this to the test and actually collect some data!

Data Collection

For this project, I decided to keep it simple and gather post titles from three different TV show fanbase subreddits: Lost, Westworld, and Doctor Who. One of the downsides to Reddit’s API is that I’m limited to gathering roughly 1000 titles per subreddit listing, which makes it hard to gather large amounts of data, but for the time being, 3000 titles should be enough data to mess around with.
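The full collection script is not reproduced here, but a minimal sketch of the gathering step might look something like the following. It reuses the clean() helper and the stop_words set from earlier; the function name gather_titles and the success counter are my own assumptions rather than the project’s exact code.

#A sketch of the gathering step (gather_titles is a hypothetical helper, not the project's exact code)
def gather_titles(bot, subreddit_name, limit=1000):

    filtered_titles = []
    attempted = 0

    #Pull the top posts from the subreddit and filter each title
    for post in bot.subreddit(subreddit_name).top(limit=limit):
        attempted += 1

        #Clean the title, tokenize it and strip out stop words
        tokens = [word for word in word_tokenize(clean(post.title)) if word not in stop_words]

        #Discard titles that end up empty after cleaning and stop word removal
        if tokens:
            filtered_titles.append(" ".join(tokens))

    print("Successfully read %d posts out of %d" % (len(filtered_titles), attempted))

    return filtered_titles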

OUTPUT:

Enter the names of the subreddits (separated by commas) that you would like to analyze: lost,westworld,doctorwho

#-------------------------------------------------------------------------------#
Gathering top 1000 post titles from lost...
#-------------------------------------------------------------------------------#

Post title: Ran into Jack today! :) (contains 34 comments)

Filtered post: ran jack today

Post title: Look who I randomly got to sit next to at the Book of Mormon in Chicago last night! (contains 28 comments)

Filtered post: look randomly got sit book mormon chicago night

...

Successfully read 995 posts out of 1000 --- 99.50%

#-------------------------------------------------------------------------------#
Gathering top 1000 post titles from westworld...
#-------------------------------------------------------------------------------#

Post title: Westworld Season 2 Trailer (Superbowl) (contains 91 comments)

Filtered post: westworld season trailer

Post title: Elon Musk: "What do you think we're living in?" (contains 72 comments)

Filtered post: elon musk think living

...

Successfully read 973 posts out of 1000 --- 97.30%

#-------------------------------------------------------------------------------#
Gathering top 1000 post titles from doctorwho...
#-------------------------------------------------------------------------------#

Post title: Like Streaming Doctor Who? Please take 2 minutes out of your day and stop ISPs from destroying net neutrality and charging you more for the streaming you love!! (contains 39 comments)

Filtered post: like streaming doctor minutes day stop isps destroying net neutrality charging streaming love

Post title: Matt Smith being a bro. (contains 105 comments)

Filtered post: matt smith bro

...

Successfully read 998 posts out of 1000 --- 99.80%

#-------------------------------------------------------------------------------#
Successfully read a total of 2966 post titles over 3 subreddits --- 98.87%
#-------------------------------------------------------------------------------#

Bundling up posts...

Successfully bundled: 2966 posts
Successfully tagged: 2966 posts

After attempting to gather 1000 titles from each subreddit, I was able to successfully read and filter 2966 of the 3000 (98.87%). Some titles were discarded either because the post had been deleted by the user or because stop word filtering removed every word in the title, leaving nothing behind.

After gathering all of the titles, I put them into a single list (which I refer to as bundling in my code) and then created a separate list of labels indicating which subreddit each title came from (which I refer to as tagging in my code).
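Conceptually, the bundling and tagging step just builds two parallel lists. A minimal sketch, assuming the hypothetical gather_titles() helper from the earlier sketch:

#A sketch of the bundling/tagging step (variable names are assumptions, not the project's exact code)
subreddit_names = ["lost", "westworld", "doctorwho"]

bundled_posts = []   #One flat list containing every filtered title
tags = []            #A parallel list recording which subreddit each title came from

for name in subreddit_names:
    titles = gather_titles(bot, name)
    bundled_posts.extend(titles)
    tags.extend([name] * len(titles))

print("Successfully bundled: %d posts" % len(bundled_posts))
print("Successfully tagged: %d posts" % len(tags))

Keeping the tag list parallel to the bundled titles is what makes it possible to group rows of the frequency and tf-idf matrices by subreddit later on.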

Data Processing

Now that we have our raw text dataset, we can start our analysis. The first thing that we need to do is convert this dataset into a frequency matrix. With sklearn, we can build a quick method to produce the frequency matrix for our dataset:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def get_frequency_matrix(raw_text):

    #Convert our raw text into a frequency matrix
    count_vectorizer = CountVectorizer(max_df=0.25, max_features=500)
    frequency_matrix = count_vectorizer.fit_transform(raw_text)

    #Get the feature names (the learned vocabulary) from our vectorizer
    feature_names = count_vectorizer.get_feature_names()
    frequency_df = pd.DataFrame(frequency_matrix.toarray(), columns=feature_names)

    return frequency_df, feature_names

In my code, I set max_df=0.25 to tell the vectorizer to remove words that appear in more than 25% of the documents. This helps further reduce the noise in the dataset by removing frequently occurring words that my earlier processing steps may have missed. I also set max_features=500 to keep the dataset relatively small so that my computer doesn’t get bogged down when it comes time to run singular value decomposition later on.

After running the data set through this method to produce the frequency matrix, I did a quick search through the matrix to find the most frequently occurring terms inside of each subreddit.
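The quick search itself is not shown in the snippets above. Here is a sketch of how it could be done, assuming frequency_df comes from get_frequency_matrix(bundled_posts) and tags is the label list from the bundling sketch (top_terms is a hypothetical helper, not the project’s exact code):

import numpy as np

def top_terms(matrix_df, tags, subreddit_name, k=10):

    #Select the rows (titles) that belong to the given subreddit
    mask = np.array(tags) == subreddit_name

    #Sum each term's column over those rows and sort the totals in descending order
    totals = matrix_df[mask].sum(axis=0).sort_values(ascending=False)

    #Return (score, word) pairs for the k highest-scoring terms
    return list(zip(totals.head(k), totals.head(k).index))

print(top_terms(frequency_df, tags, "lost"))

The same helper applied to the tf-idf DataFrame gives the cumulative tf-idf ranking shown further below.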

OUTPUT:
    
    Learning vocabulary...
    
    Building frequency matrix...
    Frequency matrix shape: 2966 by 500
    
    Top 10 most frequently occuring terms in each subreddit (frequency, word):
    
                 lost         westworld      doctorwho
    0     (264, lost)  (229, westworld)  (223, doctor)
    1      (53, time)     (109, season)   (72, tardis)
    2     (51, today)     (78, episode)    (52, today)
    3  (50, watching)        (44, like)      (42, got)
    4       (36, day)     (42, dolores)       (40, th)
    5   (34, episode)        (40, ford)      (40, new)
    6       (33, new)     (32, bernard)  (38, cosplay)
    7       (33, got)       (31, black)   (33, friend)
    8  (31, favorite)        (31, post)     (32, matt)
    9    (30, island)       (26, scene)      (32, met)

As you can see, if you’re a fan of any of these shows, you’ll recognize that many of the most frequently occurring words in each subreddit are the names of central characters. Next, we can run the frequency matrix through a similar method to produce a tf-idf matrix:

from sklearn.feature_extraction.text import TfidfTransformer

def get_tfidf_matrix(feature_names, frequency_df):

    #Convert the frequency matrix into a tf-idf matrix
    tfidf_transformer = TfidfTransformer(use_idf=True, smooth_idf=True)
    tfidf_matrix = tfidf_transformer.fit_transform(frequency_df)

    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

    return tfidf_df

From the frequency matrix, we can see which words occur most often in each subreddit, and from the tf-idf matrix, we can see which words are the most "important" to each subreddit:

OUTPUT:
    
Learning vocabulary...

Building tf-idf matrix...
Tfidf matrix shape: 2966 by 500

Top 10 highest tfidf scoring terms in each subreddit (cumulative score, word):

                             lost                      westworld                      doctorwho
0       (105.7410446885777, lost)  (95.9171964840228, westworld)    (98.12051961736533, doctor)
1         (24.237046146598, time)    (47.00682149014582, season)    (38.53775687096871, tardis)
2      (21.97703968987548, today)  (30.926452207508344, episode)     (21.93737587780408, today)
3  (21.859701992474747, watching)  (25.338218374152113, dolores)   (21.20877099295608, cosplay)
4     (18.285704938708523, locke)      (24.43600766655133, ford)        (19.04766140330174, th)
5       (17.668726319490837, day)     (19.164721422293642, like)      (17.148465608424903, got)
6    (17.488538288408737, island)   (18.21061190658453, bernard)      (16.945853180771472, new)
7       (15.863108143237097, new)     (15.566489581966671, post)    (15.07661005869906, friend)
8  (15.754216253937335, favorite)    (15.028881080571793, maeve)    (14.901575705604301, dalek)
9       (15.583894955786707, got)      (14.511512977912473, hbo)      (14.837508745561601, met)

A couple of notes on the numbers above: (1) some titles were dropped because the max_df setting in the first method filtered out every word they contained, and (2) I limited the vocabulary to the 500 most frequent words (max_features=500), since the majority of terms are not important.

Now that I have successfully built the frequency and tf-idf matrices, it’s time to reduce the dimensionality of the tf-idf matrix to produce the reduced matrix.
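The reduction step itself is not shown above. Below is a minimal sketch of how it could be done with sklearn's TruncatedSVD, assuming the tfidf_df DataFrame from earlier (the exact component count and the dropping of rows that became empty after vectorization are details of the project code that this sketch glosses over):

from sklearn.decomposition import TruncatedSVD

def get_reduced_matrix(tfidf_df, n_components=300):

    #Project the tf-idf matrix down to n_components dimensions
    svd = TruncatedSVD(n_components=n_components)
    reduced_matrix = svd.fit_transform(tfidf_df)

    print("Explained variance of first 5 components:", svd.explained_variance_ratio_[:5])
    print("Singular values:", svd.singular_values_[:5])
    print("Percent variance explained by all components: %.3f" % svd.explained_variance_ratio_.sum())

    return reduced_matrix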

OUTPUT:

Reducing dimension of data...
Successfully reduced data to 2954 posts explained by 300 features
Reduced matrix shape: 2954 by 300

#-------------------------------------------------------------------------------#

Explained variance of first 5 components: 0.01547978 0.02130377 0.02018137 0.01032878 0.01088992

Singular values: 7.83334653 7.59046825 7.41905755 5.66803483 5.4793154

Percent variance explained by all components: 0.852
