topic modeling - freeCodeCamp.org

Topic Modeling Tutorial – How to Use SVD and NMF in Python

Bala Priya C — Tue, 21 Feb 2023 18:32:38 +0000

In the context of Natural Language Processing (NLP), topic modeling is an unsupervised learning problem whose goal is to find abstract topics in a collection of documents.

Topic Modeling answers the question: "Given a text corpus of many documents, can we find the abstract topics that the text is talking about?"

In this tutorial, you’ll:

Learn about two powerful matrix factorization techniques - Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF)
Use them to find topics in a collection of documents

By the end of this tutorial, you'll be able to build your own topic models to find topics in any piece of text.📚📑

Let's get started.

What is Topic Modeling?
TF-IDF Score Equation
Topic Modeling Using Singular Value Decomposition (SVD)
What is Truncated SVD or k-SVD?
Topic Modeling Using Non-Negative Matrix Factorization (NMF)
7 Steps to Use SVD for Topic Modeling
How to Visualize Topics as Word Clouds
How to Use NMF for Topic Modeling
SVD vs NMF – An Overview of the Differences

What is Topic Modeling?

Let's start by understanding what topic modeling is.

Suppose you're given a large text corpus containing several documents. You'd like to know the key topics that reside in the given collection of documents without reading through each document.

Topic Modeling helps you distill the information in the large text corpus into a certain number of topics. Topics are groups of words that are similar in context and are indicative of the information in the collection of documents.

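Topic models operate on a Document-Term Matrix (DTM), in which each row corresponds to a document and each column to a term. The general structure of the Document-Term Matrix for a text corpus containing M documents, and N terms in all, is shown below: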

Structure of the Document-Term Matrix

Let's parse the matrix representation:

D1, D2, ..., DM are the M documents.
T1, T2, ..., TN are the N terms
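
To make the structure concrete, here is a minimal sketch that builds a Document-Term Matrix of raw term counts in plain Python. The three-sentence corpus and the whitespace tokenizer are assumptions made for illustration; they are not part of the tutorial's dataset.

```python
from collections import Counter

# Toy corpus (illustrative only, not the tutorial's text).
docs = [
    "programming needs code",
    "code runs on a computer",
    "computer programming is fun",
]

# Tokenize by whitespace; real pipelines also strip punctuation and stop words.
tokenized = [doc.lower().split() for doc in docs]

# The N terms are the sorted unique tokens across all M documents.
terms = sorted({tok for toks in tokenized for tok in toks})

# Row i holds the count of every term in document i.
dtm = [[Counter(toks)[term] for term in terms] for toks in tokenized]

print(terms)
for row in dtm:
    print(row)
```

Each row is one document and each column is one term, which matches the M x N layout described above. Raw counts are only the simplest way to fill the cells.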

To populate the Document-Term Matrix, let’s use a widely used metric: the TF-IDF score.

TF-IDF Score Equation

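The TF-IDF score of the term Tj in the document Di is commonly written as:

TF-IDF_ij = TF_ij × log(M / df_j)

where,

TF_ij is the number of times the term Tj occurs in the document Di.
df_j is the number of documents containing the term Tj.
M is the total number of documents in the corpus.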

A term that occurs frequently in a particular document, and rarely across the entire corpus, has a higher TF-IDF score for that document.
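
As a rough sketch of how these scores could be computed by hand, the snippet below assumes the common unsmoothed form TF_ij × log(M / df_j), where M is the number of documents. The count matrix is made up for illustration.

```python
import math

# Toy document-term counts: rows are documents, columns are terms (illustrative).
terms = ["code", "computer", "fun"]
counts = [
    [2, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]

M = len(counts)  # number of documents

# df[j]: number of documents that contain term j.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]

# Score of term j in document i: TF_ij * log(M / df_j).
tfidf = [
    [tf * math.log(M / df[j]) for j, tf in enumerate(row)]
    for row in counts
]

for row in tfidf:
    print([round(score, 3) for score in row])
```

Here "code" appears in every document, so its score is zero everywhere, while "fun" appears in only one document and gets the largest weight. scikit-learn's TfidfVectorizer uses a smoothed variant of the IDF term and normalizes each row by default, so its numbers will differ from this hand computation.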

I hope you’ve now gained a cursory understanding of the DTM and the TF-IDF score. Let’s now go over the matrix factorization techniques.

Topic Modeling Using Singular Value Decomposition (SVD)

The use of Singular Value Decomposition (SVD) for topic modeling is explained in the figure below:

Singular Value Decomposition on the Document-Term Matrix D gives the following three matrices:

The left singular vector matrix U. This matrix is obtained by the eigendecomposition of the Gram matrix D.D_T, also called the document similarity matrix. The i,j-th entry of the document similarity matrix signifies how similar document i is to document j.
The matrix of singular values S. The singular values signify the relative importance of the topics.
The right singular vector matrix V_T, which is also called the term-topic matrix. The topics in the text reside along the rows of this matrix.
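
The three matrices can be inspected on a small example. This is a minimal sketch using NumPy's `np.linalg.svd`; the 4 x 3 matrix is an arbitrary stand-in for a DTM, not data from the tutorial.

```python
import numpy as np

# Toy document-term matrix: 4 documents x 3 terms (illustrative values).
D = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [2.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
])

# full_matrices=False returns the compact factorization D = U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(D, full_matrices=False)
print(U.shape, S.shape, Vt.shape)

# Multiplying the three factors reproduces D up to floating-point error.
reconstructed = U @ np.diag(S) @ Vt
print(np.allclose(reconstructed, D))

# The squared singular values are the largest eigenvalues of the Gram matrix D @ D.T.
gram_eigenvalues = np.sort(np.linalg.eigvalsh(D @ D.T))[::-1][: len(S)]
print(np.allclose(gram_eigenvalues, S ** 2))
```

NumPy returns the singular values sorted from largest to smallest, so the first row of `Vt` corresponds to the most important topic.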

If you'd like to refresh the concept of eigen decomposition, here's an excellent tutorial by Grant Sanderson from 3Blue1Brown. It explains eigenvectors and eigenvalues visually.

It's totally fine if you find the working of SVD a bit difficult to understand. 🙂 For now, you may think of SVD as a black box that operates on your Document-Term Matrix (DTM) and yields 3 matrices, U, S, and V_T. And the topics reside along the rows of the matrix V_T.

Note: Applying SVD to a Document-Term Matrix in this way is also called Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI).

What is Truncated SVD or k-SVD?

Suppose you have a text corpus of 150 documents. Would you prefer skimming through 150 different topics that describe the corpus, or would you be happy reading through 10 topics that can convey the content of the corpus?

Well, it's often helpful to fix a small number of topics that best convey the content of the text. And this is what motivates k-SVD.

Because the singular values rank the topics by importance, you can keep only the k largest singular values and the topics corresponding to them. This also cuts the computation and storage needed for a large DTM. The working of k-SVD is illustrated below:
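
A minimal NumPy sketch of the truncation step follows. The random 6 x 5 matrix stands in for a DTM and k = 2 is an arbitrary choice; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.random((6, 5))   # stand-in for a 6-document, 5-term DTM

U, S, Vt = np.linalg.svd(D, full_matrices=False)

k = 2                    # number of topics to keep
# Keep the k largest singular values and their singular vectors.
D_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# np.linalg.norm on a 2-D array defaults to the Frobenius norm.
error = np.linalg.norm(D - D_k)
# The error equals the root of the sum of squares of the discarded singular values.
expected = np.sqrt(np.sum(S[k:] ** 2))
print(np.isclose(error, expected))

topics = Vt[:k, :]       # k rows, one per retained topic
print(topics.shape)
```

The discarded singular values account for the whole approximation error, which is why dropping the smallest ones loses the least information.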

Topic Modeling Using Non-Negative Matrix Factorization (NMF)

Non-negative Matrix Factorization (NMF) works as shown below:

Non-negative Matrix Factorization acts on the Document-Term Matrix and yields the following:

The matrix W which is called the document-topic matrix. This matrix shows the distribution of the topics across the documents in the corpus.
The matrix H which is also called the term-topic matrix. This matrix captures the significance of terms across the topics.

NMF is easier to interpret as all the elements of the matrices W and H are now non-negative. So a higher score corresponds to greater relevance.

But how do we get matrices W and H?

NMF is a non-exact matrix factorization technique. This means that multiplying W and H generally gives only an approximation of the original document-term matrix V, not V itself.

The matrices W and H are initialized randomly. The algorithm then updates them iteratively, reducing the cost function at each step, until the cost stops improving or a maximum number of iterations is reached.
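
One well-known way to carry out this iteration is with multiplicative update rules, which reduce the Frobenius norm of V - W.H while keeping every entry non-negative. The sketch below is an illustration under that assumption. The function name `nmf_multiplicative` and its parameters are hypothetical, and scikit-learn's NMF class may use a different solver internally.

```python
import numpy as np

def nmf_multiplicative(V, k, n_iter=200, seed=0, eps=1e-10):
    """Approximate a non-negative matrix V as W @ H with multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))   # random non-negative initialization
    H = rng.random((k, n))
    errors = []
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H stay non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))   # Frobenius norm of the residual
    return W, H, errors

V = np.random.default_rng(1).random((6, 5))   # stand-in for a small DTM
W, H, errors = nmf_multiplicative(V, k=2)
print(W.shape, H.shape)
print(errors[0], errors[-1])
```

The recorded errors show the cost shrinking as the iterations proceed. A different random seed can lead to a different pair W, H, because the procedure only finds a local minimum.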

The cost function is the Frobenius norm of the matrix V - W.H, as shown below:

The Frobenius norm of a matrix A with m rows and n columns is given by the following equation:
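
In standard notation, the cost function and the Frobenius norm described above can be written as follows. The symbol a_ij, denoting the entry in row i and column j of A, is introduced here for illustration.

```latex
\text{cost}(W, H) = \lVert V - WH \rVert_F
\qquad\text{where}\qquad
\lVert A \rVert_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} \lvert a_{ij} \rvert^{2}}
```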

7 Steps to Use SVD for Topic Modeling

1️⃣ To use SVD to get topics, let's first get a text corpus. The following code cell contains a piece of text on computer programming.

text=["Computer programming is the process of designing and building an executable computer program to accomplish a specific computing result or to perform a specific task.",

      "Programming involves tasks such as: analysis, generating algorithms, profiling algorithms' accuracy and resource consumption, and the implementation of algorithms in a chosen programming language (commonly referred to as coding).",

      "The source program is written in one or more languages that are intelligible to programmers, rather than machine code, which is directly executed by the central processing unit.",

      "The purpose of programming is to find a sequence of instructions that will automate the performance of a task (which can be as complex as an operating system) on a computer, often for solving a given problem.",

      "Proficient programming thus often requires expertise in several different subjects, including knowledge of the application domain, specialized algorithms, and formal logic.",

      "Tasks accompanying and related to programming include: testing, debugging, source code maintenance, implementation of build systems, and management of derived artifacts, such as the machine code of computer programs.",

      "These might be considered part of the programming process, but often the term software development is used for this larger process with the term programming, implementation, or coding reserved for the actual writing of code.",

      "Software engineering combines engineering techniques with software development practices.",

    "Reverse engineering is a related process used by designers, analysts and programmers to understand and re-create/re-implement"]

The text for which you need to find topics is now ready.

2️⃣ The next step is to import the TfidfVectorizer class from scikit-learn's feature extraction module for text data:

from sklearn.feature_extraction.text import TfidfVectorizer

You'll use the TfidfVectorizer class to get the DTM populated with the TF-IDF scores for the text corpus.

3️⃣ To use Truncated SVD (k-SVD) discussed earlier, you need to import the TruncatedSVD class from scikit-learn's decomposition module:

from sklearn.decomposition import TruncatedSVD

▶ Now that you've imported all the necessary modules, it's time to start your quest for topics in the text.

4️⃣ In this step, you'll instantiate a TfidfVectorizer object. Let's call it vectorizer.

vectorizer = TfidfVectorizer(stop_words='english', smooth_idf=True)
# under the hood: lowercasing, removing special characters, removing stop words
# toarray() returns a NumPy array; newer scikit-learn releases no longer accept the np.matrix that todense() returns
input_matrix = vectorizer.fit_transform(text).toarray()

So far, you've:

☑ collected the text,
☑ imported the necessary modules, and
☑ obtained the input DTM.

Now you'll proceed with using SVD to obtain topics.

5️⃣ You'll now use the TruncatedSVD class that you imported in step 3️⃣.

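svd_modeling = TruncatedSVD(n_components=4, algorithm='randomized', n_iter=100, random_state=122)
svd_modeling.fit(input_matrix)
components = svd_modeling.components_  # shape: (number of topics, number of terms)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = vectorizer.get_feature_names_out()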

6️⃣ Let’s write a function that gets the topics for us.

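def get_topics(components, n_top_words=7):
  # Build the list locally so that repeated calls (for example, on the NMF matrix H) don't pile up old topics
  topic_word_list = []
  for i, comp in enumerate(components):
    # Pair each term with its weight in this topic and keep the highest-weighted terms
    terms_comp = zip(vocab, comp)
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:n_top_words]
    topic = " ".join(t[0] for t in sorted_terms)
    topic_word_list.append(topic)
    print(f"Topic {i+1}: \n  {topic}\n")
  return topic_word_list

topic_word_list = get_topics(components)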

7️⃣ And it's time to view the topics, and see if they make sense. When you call the get_topics() function with the components obtained from SVD as the argument, you'll get a list of topics, and the top words in each of those topics.

Topic 1: 
  code programming process software term computer engineering

Topic 2: 
  engineering software development combines practices techniques used

Topic 3: 
  code machine source central directly executed intelligible

Topic 4: 
  computer specific task automate complex given instructions

And you have your topics in just 7 steps. Do the topics look good?

How to Visualize Topics as Word Clouds

In the previous section, you printed out the topics, and made sense of the topics using the top words in each topic.

Another popular visualization method for topics is the word cloud. In a word cloud, the terms in a particular topic are displayed in terms of their relative significance. The most important word has the largest font size, and so on.

!pip install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
for i in range(4):
  wc = WordCloud(width=1000, height=600, margin=3,  prefer_horizontal=0.7,scale=1,background_color='black', relative_scaling=0).generate(topic_word_list[i])
  plt.imshow(wc)
  plt.title(f"Topic{i+1}")
  plt.axis("off")
  plt.show()

The word clouds for topics 1 through 4 are shown in the image grid below:

Topic Clouds from SVD

As you can see, the font size of each word indicates its relative importance in a topic. These word clouds are also called topic clouds.
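
The topic strings used above carry the order of the top words but not their weights. If you'd like the font sizes to follow the weights themselves, one option is to build a word-to-weight dictionary for each topic. The helper `topic_frequencies` and the example values below are hypothetical and shown only as a sketch.

```python
def topic_frequencies(vocab, component, top_n=7):
    """Map the top_n highest-weighted terms of one topic to their weights."""
    ranked = sorted(zip(vocab, component), key=lambda pair: pair[1], reverse=True)
    # Keep only positive weights: word-cloud frequencies should be non-negative.
    return {term: float(weight) for term, weight in ranked[:top_n] if weight > 0}

# Hypothetical vocabulary and one topic row, standing in for vocab and components[0].
example_vocab = ["code", "machine", "source", "task", "logic"]
example_component = [0.62, 0.35, 0.31, -0.05, 0.02]

freqs = topic_frequencies(example_vocab, example_component, top_n=3)
print(freqs)   # {'code': 0.62, 'machine': 0.35, 'source': 0.31}
```

A dictionary like this can be passed to the wordcloud package's `generate_from_frequencies()` method in place of calling `generate()` on a string.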

How to Use NMF for Topic Modeling

In this section, you'll run through the same steps as in SVD. You need to first import the NMF class from scikit-learn's decomposition module.

from sklearn.decomposition import NMF
NMF_model = NMF(n_components=4, random_state=1)
W = NMF_model.fit_transform(input_matrix)
H = NMF_model.components_

And then you may call the get_topics() function on the matrix H to get the topics.

Topic 1: 
  code machine source central directly executed intelligible

Topic 2: 
  engineering software process development used term combines

Topic 3: 
  algorithms programming application different domain expertise formal

Topic 4: 
  computer specific task programming automate complex given

Topic Clouds from NMF

For the given piece of text, you can see that both SVD and NMF give similar topic clouds.
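
The document-topic matrix W is useful as well: the largest entry in each row points to the topic that dominates that document. Here's a small self-contained sketch of that idea. The four-sentence corpus, the choice of two topics, and the `init` and `max_iter` settings are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only).
docs = [
    "machine code is executed by the processing unit",
    "source code is compiled to machine code",
    "software engineering uses engineering techniques",
    "engineering practices guide software development",
]

dtm = TfidfVectorizer(stop_words="english").fit_transform(docs)

model = NMF(n_components=2, init="nndsvda", random_state=1, max_iter=500)
W = model.fit_transform(dtm)   # shape: (documents, topics)
H = model.components_          # shape: (topics, terms)

# For each document, the index of the topic with the largest weight.
dominant_topic = W.argmax(axis=1)
print(np.round(W, 3))
print(dominant_topic)
```

Because every entry of W is non-negative, a larger value can be read directly as a stronger association between the document and the topic.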

SVD vs NMF – An Overview of the Differences

Now, let's put together the differences between these two matrix factorization techniques for topic modeling.

SVD is an exact matrix factorization technique – you can reconstruct the input DTM from the resultant matrices.
If you choose to use k-SVD, you get the best possible rank-k approximation to the input DTM, as measured by the Frobenius norm.
NMF gives only an approximation to the input DTM, but its non-negative factors are easier to interpret, and it often captures more diverse topics than SVD.

Wrapping Up

I hope you enjoyed this tutorial. As a next step, you may spin up your own Colab notebook using the code cells from this tutorial. You only have to plug in the piece of text that you'd like to find topics for, and you'd have your topics and word clouds ready!

Thank you for reading, and happy coding!

References and Further Reading on Topic Modeling

A Code-First Approach to Natural Language Processing by fast.ai
Computational Linear Algebra by fast.ai

Cover Image: Photo by Brett Jordan on Unsplash

A Data Scientist’s Guide to Happiness: Findings From the Happy Experiences of 10,000+ Humans

freeCodeCamp — Mon, 23 Apr 2018 22:53:44 +0000

By Jordan Rohrlich

Modern life throws a lot at us. We often find ourselves struggling to manage anxiety, wrangle responsibilities, adapt to new conditions, and maintain a happy state of mind.

But happiness is a noisy space these days. Self help books, articles, blogs, and meditation apps can’t help everyone, and often increase the mental burden needed to stay content. That’s a serious problem. So, as mental health becomes increasingly vulnerable and solutions become increasingly complex, it’s important to anchor oneself to the fundamentals. That is, we need to refocus our daily lives on the everyday things that make people happy.

Data

This research dives into a handy dataset that can help shed some light on the fundamentals of happiness. HappyDB is a set of 100,000+ happy experiences gathered through Amazon Mechanical Turk from March to June of 2017. It contains the experiences and demographics of more than 10,000 contributors around the world. Interestingly, some basic text analysis methods can help us learn a lot from this data.

By understanding the emotional intensity and keyword patterns drawn from these happy experiences, HappyDB teaches us two valuable lessons.

You can check out the code for yourself on GitHub.

1. Happiness is not conditional on demographics.

This one is counterintuitive.

Most of us experience a “grass is always greener” effect with respect to happiness. Young people anticipate a happy career and family later on in life. Older folks reminisce about a time when they were young and adventurous. Bachelors yearn for companionship. Couples hope for children.

And, despite knowing this, we all think someone else is happier, or some other stage of our life will bring us more joy. Let’s take a look at the data.

Sentiment analysis weighs the emotional intensity of text. Using an R package called “Syuzhet,” I measured the sentiment of these happy experiences to determine how their intensities vary. This created a spectrum of happy experiences that could be broken down by specific demographic groups:

Sentiment of Happy Experiences (by gender, family status)

Sentiment of Happy Experiences (by age group)

Somewhat surprisingly, there’s little change in the spread of happy experiences across these gender, family, and age demographic groups. Here are the highlights:

Overall, the experiences are definitely positive. But the bottom quartile does have negative sentiment (some happy things poetically arise from discomfort and tragedy)
The distributions have high-end tails and fairly limited lower bounds — some experiences are extremely positive, and few are strikingly negative
Self-identifying females have slightly higher sentiment scores than men for most of their experiences (a 0.05–0.1 point difference)
Married parents have slightly higher sentiment scores than bachelors and childless couples for most of their experiences (a 0.05–0.1 point difference)
The quartiles of happy experiences (25th, 50th, and 75th percentiles) across age groups are virtually identical

In sum, there is no significant difference in the range of happy experiences reported by different demographics. Although women and parents tend to have marginally more happy experiences to record, the differences on the sentiment scale can’t be taken seriously — they correspond to a fraction of a fraction of a single happy word per experience recorded. That’s a minuscule difference.

This dataset, however, does not include any data fields for race, socioeconomic status, or other identity positions that may materially influence daily experiences. Future happiness research should inspect these relationships closely.

2. Happiness is determined by specific types of experiences.

It’s easy to think of happiness as a mysterious, ethereal substance that penetrates our experiences in uninterpretable ways. This view espouses a metaphysical understanding of happiness as something beyond human comprehension.

But that’s not very helpful, especially for people who rely on happy and meaningful experiences as the lifeline of their mental health.

Enter Topic Modeling. This method of text analysis (explained here; I use R’s “Mallet” package) provides a constructive approach to explaining what HappyDB’s 10,000+ participants find to be happy experiences.

By segmenting the dataset into documents of each respondent’s experiences, then running an LDA topic model to identify groups of commonly occurring keywords, we can begin to isolate distinct types of experiences that bring us happiness. The topics and related keywords can be seen below, in no particular order:

Topic Model Output from 100,922 Happy Experiences

Time with Family

Seems like a no-brainer. Words like “daughter,” “son,” “husband,” “baby,” “wife,” and “time” seem to show that lots of people reflect very positively on experiences that involve their loved ones. These experiences often involve the most commonplace of settings and derive happiness simply from company and affection.

Try spending more time with loved ones: call your mom, go to your kid’s soccer game. It may pay off more than you think.

Getting Paid

Although people don’t like thinking that money relates to happiness, their experiences sure say the opposite. Getting a paycheck, clearing a credit card balance, or giving money to a friend can make people really happy. And the sense of accomplishment and economic security that comes with it would definitely explain why.

Food

People love eating. Cooking a favorite meal, eating out with friends, or gorging on a pint of ice cream in front of the TV can all make someone happy. Good food with friends should definitely play a part in any happy lifestyle.

Sleep Time

Surprisingly, people document lots of happy experiences around sleep: cuddling up in bed, going to sleep with a furry friend, waking up to a promising new day, and so on. There’s lots to be happy about, if one takes a moment to reflect at night after a productive day, or in the motivated morning before something exciting.

Games and Competition

Humans are competitive. They love playing video games, watching sports, and doing other things that stoke their biological instinct to dominate. Play a board game with some friends or get excited about your home sports team. Chances are you’ll be happy you did.

Achievement and Education

After weeks of work, it feels great to finish big enterprises. Finishing a class, graduating from school, or launching a project can all seriously lift a person’s mood. But you can't finish a big undertaking without starting one first, so go out and start something new! Learning and doing are rewarded handsomely.

Celebrating and Birthdays

Obviously, celebrations make people happy (think birthdays, anniversaries, and friendsgivings). People enjoy finding a reason — however important or silly — to meet up with loved ones, get happy about an occasion, and do something to break up a dull weekly routine.

Mental Balance and Introspection

The act of tuning into one’s mental state seems to provide a lot of happiness in and of itself. Thinking introspectively about one’s wellbeing, head space, and happiness seems to have positive effects on those very things! Try meditating, reflecting on happy experiences, or just being aware of your mental state: it may be the very thing to help boost it.

Spending

Satisfying our material desires, of course, brings lots of people happiness. Finding good deals, finally buying that car or home, and getting something nice for oneself or a loved one all create some sort of happiness. Enjoy responsibly.

Weekend Trips

People like being off work, but they enjoy it dramatically more in good company, while doing something different. Go on a trip somewhere, have an outing nearby, or find another novel excuse for spending time with others in new scenery. The data says you certainly won’t regret it.

Reading and Music

Whether bundling up at home with a new book or discovering a song on the bus ride home, lots of people get happy through the simple act of reading or listening. Taking an hour before bed to read something new or skim through Discover Weekly is probably worth the time investment.

Decisions

Decisions also clock in as a big happiness-generating activity. It’s exciting to spend time thinking about a big change, decide to do something new, and tell people about it. It leaves a lingering mood boost for lots of folks, too. So make a change you’ve been meaning to make for a while, and commit to it!

Wrapping up

These twelve categories of experience represent the foundations of daily happiness for more than 10,000 people. Given that humans are more alike than we often give them credit for, the same can likely be said about you.

This method, like any, is imperfect. Some demographics contribute more heavily than others, which may throw curious words into some topics, or may bias the topics that are represented in the model. Textual data is messy and people also don’t think about happiness in crisply defined categories of experience.

But these two lessons offer a basic structure for understanding positivity in our everyday lives, and I think they can help remind us that happiness is never so far off as we may think.

We already know many of these happy topics to be true on some level. But we rarely recognize the power that they have over our mood, and so don’t build them into our everyday lives as readily as we should.

These categories are empirically-certified mood boosters. They’re happiness slam dunks.

So we should take what we can get. Throw out the self-help handbooks and focus on real happy experiences. You may like what you find.

If you found this article helpful, share it with a friend.

See the code for yourself on GitHub!