r/ApplyingToCollege: What Are Common Topics Discussed? (2024)

Wendy Wu

Data Collection and Cleansing

The network data I used came from the Reddit API. To start off, I had to create a Reddit application to acquire a client_id, a client_secret, and a user_agent. Using the PRAW library and these credentials, I gained access to the r/ApplyingToCollege subreddit and retrieved the “hot” posts. “Hot” posts are submissions that are currently popular and receiving a lot of upvotes and comments. Initially, I set the limit to 100 posts but changed it to 300 so that I would have more data to work with. In the network graph, the keywords are represented as nodes, while the edges between the nodes represent the co-occurrence of the words.
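The retrieval step can be sketched roughly as follows. This is a minimal sketch rather than the original code: the function names `collect_posts` and `fetch_hot_posts` are hypothetical, and the credentials are placeholders that must come from your own Reddit application.

```python
def collect_posts(subreddit, limit=300):
    """Return (title, body) pairs for the subreddit's current "hot" submissions.

    `subreddit` is expected to behave like a PRAW Subreddit object,
    i.e. expose .hot(limit=...) yielding objects with .title and .selftext.
    """
    return [(post.title, post.selftext) for post in subreddit.hot(limit=limit)]


def fetch_hot_posts(client_id, client_secret, user_agent, limit=300):
    """Hypothetical wrapper: authenticate with PRAW, then collect the posts."""
    import praw  # third-party; imported lazily so collect_posts has no dependency

    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )
    return collect_posts(reddit.subreddit("ApplyingToCollege"), limit=limit)
```

Calling `fetch_hot_posts(...)` with real credentials would return up to 300 title and body pairs. Link-only submissions have an empty `selftext`, so some bodies may be empty strings.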

I tokenized the submission posts and titles into individual words using NLTK's TweetTokenizer so that I could process them further. This tokenization enabled me to count the most frequently occurring words in both titles and submission posts. To ensure that the extracted terms were meaningful, I created two variables: one to remove punctuation, using the ‘string.punctuation’ set of characters, and another to remove common English stopwords, obtained from the NLTK library. Stopwords such as “the” and “and” are removed from the text data since they do not provide substantial information. This ensured that my data primarily consisted of significant terms. To inspect the results, I created a dictionary of the top 20 keywords, in which each word was a key and its frequency was the value.

[Figure: the top 20 keywords and their frequencies, as first printed]
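The counting step described above can be sketched as follows. This is an illustration under stated assumptions, not the original code: `top_keywords` and `nltk_top_keywords` are hypothetical names, and lowercasing the text before tokenizing is an assumption made so that tokens match NLTK's lowercase stopword list.

```python
import string
from collections import Counter


def top_keywords(texts, tokenize, stopwords, n=20):
    """Count tokens across texts, skipping punctuation tokens and stopwords."""
    punctuation = set(string.punctuation)
    stopwords = set(stopwords)
    counts = Counter()
    for text in texts:
        # Assumption: lowercase first so "The" and "the" count as one word.
        for token in tokenize(text.lower()):
            if token in punctuation or token in stopwords:
                continue
            counts[token] += 1
    # Keys are words, values are their frequencies.
    return dict(counts.most_common(n))


def nltk_top_keywords(texts, n=20):
    """Hypothetical wrapper using NLTK's TweetTokenizer and English stopwords."""
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import TweetTokenizer

    nltk.download("stopwords", quiet=True)  # one-time download of the corpus
    return top_keywords(
        texts, TweetTokenizer().tokenize, stopwords.words("english"), n
    )
```

Passing the tokenizer and stopword list in as arguments keeps the counting logic separate from NLTK, which makes it easy to try on a few sample strings before running it on all 300 posts.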

From the printed dictionary, the top three most frequently occurring tokens were an apostrophe, “school,” and “college.” I initially assumed that the ‘string.punctuation’ constant would remove all punctuation marks. However, I learned that although an apostrophe is a punctuation mark, it is often used as part of a word, for example to indicate possession, and it was still showing up as a token of its own. Therefore, I adjusted my code to include the apostrophe in the list of punctuation to remove. This modification ensured that the extracted data included only complete words. The revised dictionary is presented below.

[Figure: the revised top 20 keyword dictionary]
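One plausible explanation for the stray apostrophe, which the post does not confirm, is that Reddit text often contains the typographic apostrophe (’), while `string.punctuation` holds only ASCII characters and so includes just the straight one ('). The sketch below extends the punctuation set on that assumption; `EXTRA_MARKS` and `drop_punctuation` are hypothetical names.

```python
import string

# string.punctuation is ASCII-only: it has the straight apostrophe (')
# but not the curly one (’).
EXTRA_MARKS = {"’", "‘", "“", "”"}  # assumption: typographic marks seen in posts
PUNCTUATION = set(string.punctuation) | EXTRA_MARKS


def drop_punctuation(tokens):
    """Remove tokens made up entirely of punctuation characters."""
    return [t for t in tokens if not all(ch in PUNCTUATION for ch in t)]
```

Tokens that merely contain an apostrophe, such as “i’m,” are kept, because only tokens consisting entirely of punctuation are dropped.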

Having identified these common keywords, I was ready to construct a network graph. In this graph, the top 20 most frequent words served as nodes, and edges were established between pairs of words based on their co-occurrence in the text. For an edge to be created, a word pair needed to co-occur at least three times, a threshold meant to ensure that the connection was significant. I used degree centrality to identify the most interconnected node in the network, but it turned out that every node had a degree centrality of 1. Because degree centrality is the fraction of the other nodes that a node is connected to, a value of 1 for every node means that each keyword is linked to every other keyword, so no keyword is more central or connected than the rest. This is a potential concern because it is uncommon for all nodes to have a degree centrality of 1; most networks show a range of degree centralities. In the visual representation of the network graph below, you can see that all the nodes are the same size. I used the networkx and matplotlib.pyplot libraries to create and visualize the graph.

[Figure: network graph of the top 20 keywords, with all nodes drawn at the same size]
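The graph construction can be sketched as follows. This is a minimal sketch, not the original code: the function names are hypothetical, and it assumes that two keywords “co-occur” when they appear in the same post, which the write-up does not specify. The centrality helper uses the same normalisation as networkx for graphs with more than one node, namely degree divided by n - 1.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(token_lists, keywords, min_count=3):
    """Map each keyword pair to the number of posts containing both words.

    Only pairs that co-occur in at least `min_count` posts are kept.
    """
    keywords = set(keywords)
    pair_counts = Counter()
    for tokens in token_lists:
        present = sorted(keywords.intersection(tokens))
        pair_counts.update(combinations(present, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


def degree_centrality(nodes, edges):
    """Degree of each node divided by (number of nodes - 1)."""
    nodes = list(nodes)
    degree = {node: 0 for node in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    scale = 1 / (len(nodes) - 1) if len(nodes) > 1 else 1.0
    return {node: degree[node] * scale for node in nodes}


def to_networkx(nodes, edges):
    """Hypothetical helper: build the equivalent weighted networkx graph."""
    import networkx as nx  # third-party; imported lazily

    graph = nx.Graph()
    graph.add_nodes_from(nodes)
    graph.add_weighted_edges_from((a, b, n) for (a, b), n in edges.items())
    return graph
```

Under the per-post assumption, very frequent words are likely to share at least three posts with almost every other frequent word in a sample of 300, which would be one way for every node to end up with a centrality of 1. Raising `min_count` is a quick way to check whether the threshold is what makes the graph complete.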

To gain a better understanding, I decided to look deeper into the edges to analyze the results. I created a text file that stored the results of the edges.

[Figure: excerpt of the text file listing the edges]
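Saving the edges for inspection might look like the sketch below. The function name `write_edges`, the tab-separated layout, and the input format (a dictionary mapping each word pair to its co-occurrence count) are all assumptions made for illustration.

```python
def write_edges(edges, path):
    """Write one 'word_a<TAB>word_b<TAB>count' line per edge.

    `edges` maps (word_a, word_b) tuples to co-occurrence counts.
    Lines are sorted so the most frequent pairs come first.
    """
    ranked = sorted(edges.items(), key=lambda item: (-item[1], item[0]))
    with open(path, "w", encoding="utf-8") as handle:
        for (a, b), count in ranked:
            handle.write(f"{a}\t{b}\t{count}\n")
    return len(ranked)
```

Sorting by count puts the strongest connections at the top of the file, which helps when every node has the same degree and the edge weights are the only remaining signal.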

Findings and Analysis

When looking at the printed output, I observed that it still contained a lot of non-significant words, such as “one,” “get,” “want,” and “i’m.” Without additional context, these words do not reveal what topics are being discussed. However, words such as “schools,” “university,” “GPA,” and “applying” appeared frequently alongside each keyword. These terms are more informative, so they give a better sense of what a potential topic in the subreddit might be. Since the degree centrality was 1 for all keywords, I could not identify three important nodes based on that measurement alone. However, from the text file that contains the edges, the three words I’d consider significant are “university,” “GPA,” and “applying.” This suggests that when students apply to colleges, they may have the most questions about universities, their GPA, and the application process. Therefore, it would be beneficial for parents, teachers, and schools to focus on providing more information and guidance on these topics. They can teach students how to research and select the university they want to attend and explain the application process in greater detail. Additionally, they can explore ways to help students maintain a good GPA.

Limitations

One limitation of this data is that the keywords were selected from recent “hot” posts, which vary from day to day and may not represent the subreddit as a whole. Furthermore, since the text was tokenized, I only have individual words without their context, which makes it difficult to determine precisely which topics these words belong to. These limitations and uncertainties can affect the conclusions drawn from the analysis.

Conclusion

In summary, this analysis provided insights into the common keywords in the r/ApplyingToCollege subreddit. Notably, the degree centrality of every keyword was 1, indicating a fully connected network in which each keyword is linked to every other keyword. This raises the concern that there might be an issue with the code, or that the collected network data lacks the variation needed to derive meaningful insights. As mentioned previously, you would typically expect variation in degree centrality, since some nodes have more connections than others. However, in the network graph generated, all the nodes are the same size. After looking more closely at the edges, I determined that the significant terms among the top 20 keywords are “university,” “GPA,” and “applying.” These words offer insights into the subjects discussed in the subreddit and can guide parents and schools in assisting students with these aspects, better preparing them for the college application process.

You can find the code for this analysis here.
