Topic modelling (TM) is a collective term for a family of computational algorithms that aim at “discovering the main themes that pervade a large and otherwise unstructured collection of documents” (Blei, 2012, p. 77). In less than ten years, we have witnessed a proliferation of social scientific and humanities studies applying TM to textual data. In previous work, TM analysis has been argued to be able to identify themes across large samples (Murakami et al., 2017) and to have 'high levels of substantive interpretability' (DiMaggio, Nag and Blei, 2013). Consequently, the method has in many cases been judged a plausible way of 'reading' texts (Mohr and Bogdanov, 2013). While multiple explanations of what topic models are exist (e.g. https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/, which is really good and which I encourage you to read before continuing), they are often either too technical, or too vague with regard to how the methods work and how parameter choices affect the outcome. Due to this, most applications of the methods also feature no critical reflection on their research design – from corpus construction to result interpretation. This document aims to bridge that gap by providing a rigorous but generally understandable account of topic models.
Current TM approaches belong to the class of probabilistic, generative models. In practice, this means that the algorithm contains a model that generates documents by picking words at random from a set of topics, based on probability parameters. Importantly, these probability parameters can be tuned to better match evidence (i.e., an existing collection of documents) through Bayesian probabilistic inference. The tuned parameters can then be read back from the model as a description of the topics in that collection (Blei, 2012).
More specifically, the generative model encoded in TM algorithms is as follows (see Figure 1): first, topics are modelled as bags of words, with a variable number of each individual word inside each bag (e.g., a particular topic may have many copies of words like “school”, “teacher” and “degree”, but very few copies of other words like “tree” or “kitten”, so we could assume the topic is about education). The documents in a collection are also modelled as such bags, holding all of the words in the document without regard to the order in which they appear. The TM algorithm then tries to recreate these document bags of words through the following process: first, from the set of topics covering the whole document collection, select some number of topics to which a particular document pertains, along with their proportions in the document (e.g. this document is 67% about education and 33% about environmental protection). Then, recreate the document word bag by sampling words at random from each topic bag in the thematic proportions selected previously (Blei, Ng and Jordan, 2003).
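The generative process above can be sketched in a few lines of Python. The topic bags, vocabulary and proportions here are invented purely for illustration; real models learn them from data.

```python
import random

random.seed(42)

# Two toy topic "bags": word -> how many copies of it the bag holds.
topics = {
    "education":   {"school": 40, "teacher": 30, "degree": 20, "tree": 1, "kitten": 1},
    "environment": {"tree": 35, "forest": 30, "kitten": 5, "school": 2},
}

def generate_document(topic_proportions, length):
    """Recreate a document bag of words: for each word slot, first pick a
    topic according to the document's topic proportions, then pick a word
    at random from that topic's bag."""
    words = []
    for _ in range(length):
        topic = random.choices(list(topic_proportions),
                               weights=topic_proportions.values())[0]
        bag = topics[topic]
        words.append(random.choices(list(bag), weights=bag.values())[0])
    return words

# e.g. a document that is 67% about education and 33% about the environment
doc = generate_document({"education": 0.67, "environment": 0.33}, length=20)
print(doc)
```

Because word order plays no role in the process, two documents with the same words in different orders are identical to the model.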
The two important sets of probability parameters are then 1) the proportions of each word in each topic bag and 2) the proportions in which each document deals with each topic. At the start of the TM algorithm, both are initialised randomly. Then, through multiple rounds of Bayesian inference, both sets of proportions are gradually adjusted so that the bags of words the model produces correspond as closely as possible to those derived from the training data.
At the end of the process, these two sets of parameters, changed by the training, are read back as outputs. Both the topic word proportions (taking the form of e.g. topic #13 – recipe: 4%, tomato: 1.5%, oven: 1.3%) and the document topic proportions (e.g. file1.txt – topic #13: 47%, topic #4: 28%, topic #7: 1%) can be subjected to human analysis.
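To make this loop of random initialisation, repeated inference and read-back concrete, the following is a minimal sketch of one common inference method for LDA, collapsed Gibbs sampling (the original Blei, Ng and Jordan implementation used variational inference instead; real tools are far more efficient). The tiny corpus and the K, alpha and beta values are invented for illustration only.

```python
import random
from collections import defaultdict

random.seed(0)

# A toy corpus of three tiny "documents".
docs = [
    "school teacher degree school exam".split(),
    "tree forest kitten tree river".split(),
    "teacher exam degree forest school".split(),
]
vocab = sorted({w for d in docs for w in d})
K, alpha, beta = 2, 0.1, 0.01          # number of topics and Dirichlet priors

# Counts: document-topic, topic-word, and per-topic totals.
n_dk = [[0] * K for _ in docs]
n_kw = [defaultdict(int) for _ in range(K)]
n_k = [0] * K

# Random initialisation: assign a random topic to every word occurrence.
z = []
for d, doc in enumerate(docs):
    z.append([])
    for w in doc:
        k = random.randrange(K)
        z[d].append(k)
        n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1

# Gibbs sweeps: resample every token's topic given all the other tokens.
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
            weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + beta) /
                       (n_k[t] + len(vocab) * beta) for t in range(K)]
            k = random.choices(range(K), weights=weights)[0]
            z[d][i] = k
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1

# Read back the two parameter sets as normalised proportions:
# 1) topic word proportions and 2) document topic proportions.
topic_words = [{w: (n_kw[k][w] + beta) / (n_k[k] + len(vocab) * beta)
                for w in vocab} for k in range(K)]
doc_topics = [[(n_dk[d][k] + alpha) / (len(docs[d]) + K * alpha)
               for k in range(K)] for d in range(len(docs))]
print(doc_topics)
```

Note how the outputs are exactly the two parameter sets described above: `topic_words` corresponds to listings like "topic #13 – recipe: 4%", and `doc_topics` to listings like "file1.txt – topic #13: 47%".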
A worrisome trend in many tools and studies is that topic output is commonly interpreted in isolation from the documents, based only on the five to twenty words most associated with the topic in the collection. This practice is arguably flawed, because the topics are a description of the collection of documents and do not exist in isolation from it. Thus, they also need to be interpreted not in isolation, but within the context of the documents in which those words originally appear. In addition, summarising what is in reality a distribution over all the words appearing in the collection with just the top N words can hide important and interesting information about the distribution as a whole.
Besides misrepresenting actual content, this may hide problems in model parameterisation. For example, if the number of topics or the preprocessing parameters have been set incorrectly, a generated topic may end up as an amalgam of what are in reality multiple distinct topics. Yet, particularly given the capability of the human mind to find connections everywhere, this may not be apparent from just the top five words of the amalgamated topic. In short, interpreting short word lists in isolation paves the way for misinterpreting what the identified topics actually mean and signify in the documents.
Even after due diligence in parameter setting and individual topic exploration, how well the derived topics correspond to any phenomena of interest to a researcher depends on 1) how well the collection of documents can be thought of as having been created by the model described above (each document deals with some number of topics in some proportion, topics talked about determine vocabulary used), 2) how well the definition of a topic in the above model corresponds to the phenomena of interest, and 3) various assumptions and parameters of the exact topic model variant used that further define how the topics behave.
As examples of the effect of such assumptions and parameters, consider the original Latent Dirichlet Allocation (LDA)-based topic model. Besides the obviously important number of topics, two parameters, alpha (α) and beta (β), are required for the algorithm to function. To understand these parameters, one must remember that Bayesian inference is based on a framework where prior intuitions about the phenomena can be included in the model, where they act as “synthetic additional evidence”, constraining and guiding the inference. Here, this appears as the ability to encode prior intuitions on possible 1) document topic proportions and 2) topic word proportions. In the original LDA implementation, these are defined using symmetric Dirichlet distributions whose shapes (and thus the exact intuitions) the alpha and beta parameters dictate. Taking the document topic Dirichlet prior and its parameter alpha as an example (visualised in Figure 2a), an alpha near zero states an intuition that most documents will be dominated by a single topic. An alpha much over one, on the other hand, states an intuition that most documents will talk about all the topics in equal proportions. An alpha of exactly one states that any combination of topics in a document is equally likely (known as an uninformative prior).
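The effect of alpha can be seen by sampling document topic proportions from symmetric Dirichlet distributions. A Dirichlet draw can be built with nothing more than the standard library's gamma sampler (a normalised vector of Gamma(alpha, 1) draws is a Dirichlet sample); the alpha values and topic count below are chosen only for illustration.

```python
import random

random.seed(1)

def symmetric_dirichlet(alpha, k):
    """One draw of k topic proportions from a symmetric Dirichlet(alpha)."""
    gammas = [random.gammavariate(alpha, 1) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

def mean_largest_share(alpha, k=5, draws=2000):
    """Average size of the single largest topic proportion across draws."""
    return sum(max(symmetric_dirichlet(alpha, k)) for _ in range(draws)) / draws

# alpha near zero: documents dominated by one topic (largest share near 1).
# alpha well above one: all topics in near-equal proportions (near 1/k).
# alpha of exactly one: any combination equally likely (uninformative).
for alpha in (0.1, 1.0, 10.0):
    print(alpha, round(mean_largest_share(alpha), 2))
```

Running this shows the largest topic share shrinking towards 1/k as alpha grows, matching the intuitions stated above.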
Now, while the first implementations of the LDA algorithm required the user to set these alpha and beta parameters manually, more recent variations include mechanisms to learn them automatically from the data as well. However, symmetric Dirichlet distributions also have another problem: to them, all topics are equal. Therefore, a symmetric Dirichlet distribution over document topic proportions is not well suited to modelling a selection of topics where particular topics do not behave in the same way as the others (e.g. a collection on various branches of EU policy where EU terminology is consistently present, but other topics vary widely).
The above does not mean that the topic model does not work at all for such materials, but it does mean that the resulting topics will be worse. As a consequence, traditional LDA has required a significant amount of preprocessing to remove general-language words (such as “the”, “and”, etc., commonly referred to as ‘stopwords’), as these otherwise confuse the model. On the other hand, when the requirement for topics to behave identically is loosened in asymmetric LDA, which replaces the single alpha parameter with a dedicated prevalence parameter for each topic (Wallach et al., 2009, see also Figure 2b above), the model is able to segregate these general-language words into a single, frequently appearing topic without adversely affecting the quality of the other topics.
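The same gamma construction extends to asymmetric priors: giving one topic a much larger concentration parameter than the rest encodes the intuition that this topic (standing in for an always-present general-language topic) takes a large share of nearly every document, while the remaining topics stay sparse. The parameter values below are invented purely for illustration.

```python
import random

random.seed(2)

def dirichlet(alphas):
    """One draw of topic proportions from a Dirichlet with per-topic alphas."""
    gammas = [random.gammavariate(a, 1) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

# Topic 0 stands in for an always-present general-language topic; the other
# four topics keep a small, sparsity-inducing concentration parameter.
alphas = [5.0, 0.1, 0.1, 0.1, 0.1]

draws = [dirichlet(alphas) for _ in range(2000)]
mean_shares = [sum(d[k] for d in draws) / len(draws) for k in range(len(alphas))]
print([round(m, 2) for m in mean_shares])
```

Under this prior, topic 0 dominates on average (its expected share is alpha_0 divided by the sum of the alphas) while each remaining topic is rarely present, which is exactly the behaviour a symmetric prior cannot express.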
Another assumption of the traditional Dirichlet-based LDA models is that topics appear in documents independently of each other. As it is clear that this assumption does not hold in most collections of texts in practice, extensions of LDA have been developed. The Correlated Topic Model (CTM) (Blei and Lafferty, 2006) replaces the topic proportion prior with one capable of capturing correlations between topics (e.g. that the topic “mantle isotopic crust plate earth” often appears together with the topic “fault earthquake data earthquakes images”; see also Figure 3 above, where, with the third set of parameters, the blue and red topics are strongly correlated, so they always appear in almost equal amounts, whatever those may be).
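The idea behind CTM's replacement prior (a logistic-normal distribution) can be sketched as follows; the particular variances and factor structure here are invented for illustration and are not CTM's actual inference. Topic weights are drawn from correlated Gaussians and mapped through a softmax, so two topics can systematically co-occur in the resulting proportions, something a Dirichlet prior cannot express.

```python
import math
import random

random.seed(3)

def correlated_topic_proportions():
    """One draw from a toy logistic-normal prior over three topics, where
    the first two topics share a common Gaussian factor and are therefore
    positively correlated across documents."""
    shared = random.gauss(0, 1.5)           # common factor for topics 0 and 1
    eta = [shared + random.gauss(0, 0.5),
           shared + random.gauss(0, 0.5),
           random.gauss(0, 1.0)]
    exps = [math.exp(e) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]        # softmax -> valid proportions

draws = [correlated_topic_proportions() for _ in range(2000)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Correlation between the shares of topics 0 and 1 across documents.
print(round(pearson([d[0] for d in draws], [d[1] for d in draws]), 2))
```

The measured correlation between the first two topic shares comes out clearly positive, while independent draws (as under a Dirichlet) would show none.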
While CTM improves the ability of the model to mimic the original texts and provides derived topic correlations as potentially useful output for the researcher, it has been reported (Chang et al., 2009) that the topics generated are themselves less interpretable by humans. This may arise from the fact that, by allowing the topics to correlate, they are no longer encouraged to be as independent and distinct as possible.
As a final useful extension to TM, the Structural Topic Model (STM) (Roberts et al., 2014) needs mentioning, as it adds text-external correlates to the CTM. With such correlates, it becomes possible to chart and compare, for example, how a certain topic is discussed in a collection at different times or by different groups. STM also provides heuristics for selecting an optimal number of topics to extract, countering a long-standing problem where the selection of this crucial parameter was left purely to the researcher’s intuition and to post-hoc qualitative analysis of the coherence of the topics generated (Nowlin, 2016; Quinn et al., 2010).
Blei, D.M., 2012. Probabilistic Topic Models. Commun. ACM 55, 77–84. doi:10.1145/2133806.2133826
Blei, D.M., Lafferty, J.D., 2006. Correlated Topic Models. Adv. Neural Inf. Process. Syst. 18 147–154. doi:10.1145/1143844.1143859
Blei, D.M., Ng, A., Jordan, M., 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022. doi:10.1162/jmlr.2003.3.4-5.993
Chang, J., Boyd-Graber, J., Gerrish, S., Wang, C., Blei, D.M., 2009. Reading Tea Leaves: How Humans Interpret Topic Models. Adv. Neural Inf. Process. Syst. 22, 288–296.
DiMaggio, P., Nag, M., Blei, D., 2013. Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of U.S. government arts funding. Poetics 41, 570–606. doi:10.1016/j.poetic.2013.08.004
Mohr, J.W., Bogdanov, P., 2013. Introduction-Topic models: What they are and why they matter. Poetics 41, 545–569. doi:10.1016/j.poetic.2013.10.001
Murakami, A., Thompson, P., Hunston, S., Vajn, D., 2017. “What is this corpus about?”: Using topic modelling to explore a specialised corpus. Corpora 12, 243–277. doi:10.3366/cor.2017.0118
Nowlin, M.C., 2016. Modeling Issue Definitions Using Quantitative Text Analysis. Policy Stud. J. 44, 309–331. doi:10.1111/psj.12110
Quinn, K.M., Monroe, B.L., Colaresi, M., Crespin, M.H., Radev, D.R., 2010. How to analyze political attention with minimal assumptions and costs. Am. J. Pol. Sci. 54, 209–228. doi:10.1111/j.1540-5907.2009.00427.x
Roberts, M.E., Stewart, B.M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S.K., Albertson, B., Rand, D.G., 2014. Structural Topic Models for Open-Ended Survey Responses. Am. J. Pol. Sci. 58, 1064–1082. doi:10.1111/ajps.12103
Wallach, H.M., Mimno, D., McCallum, A., 2009. Rethinking LDA: why priors matter. Proc. 22nd International Conference on Neural Information Processing Systems.