Detecting Latent Textual Bias with Topic Modeling and Sentiment Analysis (DH 2022)

Posted in Digital Humanities, News and Notes, and Percolating Ideas

Presented at the Alliance of Digital Humanities Organizations annual conference, 2022, Online.


Bias detection is an emerging area of research for digital humanists, computational linguists, and information studies scholars alike, who point to biases inherent in our algorithms, software, tools, and platforms, but we are only just beginning to examine how computational methods could be used to interrogate our primary textual sources (Noble, 2018; Al-Sarraj and Lubbad, 2018; Chen et al., 2020). This project seeks to develop a method for bias detection that can be used at the outset of a study with little initial knowledge of the corpus, requires little pre-processing, and is both beginner-friendly and language-agnostic. Word2Vec and similarity measures allow us to compare a test corpus against a comparison corpus of biased or neutral terms. This approach works especially well with contemporary texts, such as online news articles in English, but finding appropriate comparison corpora becomes increasingly difficult with historical or non-English-language sources (Patankar and Bose, 2017). Building classifiers to identify bias with feature extraction, support vector machines, decision trees, and naïve Bayes approaches works well but requires a deep understanding of the corpus and is not accessible to those who are new to computation (Al-Sarraj and Lubbad, 2018; Leavy, 2019; Manzini et al., 2019). Therefore, with the aforementioned aims in mind, I chose to use the latent Dirichlet allocation (LDA) algorithm for topic modeling to study a set of three chronicles covering the 300 years of Ottoman Algerian history, written in French by two nineteenth-century French scholars and one twentieth-century Algerian scholar (Vayssettes, 1867; Mercier, 1903; Gaïd, 1978).[1]
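The Word2Vec comparison described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the three-dimensional embedding dictionary and the biased/neutral term lists here are invented placeholders, where real vectors would come from a Word2Vec model trained on or aligned with the test corpus.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def bias_score(term, biased_terms, neutral_terms, embeddings):
    """Mean similarity to a biased lexicon minus mean similarity to a
    neutral lexicon; positive values flag the term for closer reading."""
    vec = embeddings[term]
    sim_biased = np.mean([cosine(vec, embeddings[t]) for t in biased_terms])
    sim_neutral = np.mean([cosine(vec, embeddings[t]) for t in neutral_terms])
    return sim_biased - sim_neutral

# Toy 3-dimensional embeddings, purely for illustration.
emb = {
    "usurper":  np.array([0.9, 0.1, 0.0]),
    "tyrant":   np.array([0.8, 0.2, 0.1]),
    "governor": np.array([0.1, 0.9, 0.2]),
    "ruler":    np.array([0.2, 0.8, 0.3]),
}
score = bias_score("tyrant", ["usurper"], ["governor", "ruler"], emb)
```

In this toy space "tyrant" sits closer to the biased lexicon than to the neutral one, so its score is positive; with real embeddings, such scores would only mark terms worth investigating, not prove bias.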

At just 138 documents and approximately 183,000 words, the corpus is much smaller than one normally uses for topic modeling, but its manageable size made it a promising test case for this approach. The scalar feature of the LDA algorithm was particularly interesting to examine, as the topics of each larger, more detailed model nested neatly under the model with the next-highest number of topics, creating a hierarchy.[2] Nested models of 4, 7, 11, and 20 topics provided a detailed summary of the corpus, with the 4-topic model serving as a general overview and the 20-topic model offering a glimpse into the richness of the region’s history, with topics related to some of the dominant themes of the corpus, such as “governance and succession,” as well as much more granular themes, including “illness, death, burial, and remembrance” and the “roles of women.” I then paired the topic models at these different scales with sentiment analysis of the 11- and 20-topic models and with targeted close reading, guided by topics of interest and using the concordance method to identify passages containing key terms. This combination uncovered the stories of lesser-known actors, including women, Jews, Spaniards, and the councilmen of provincial governors, as well as biases inherent in the writing of their histories (Ghasiya and Okamura, 2021).

The anti-Arab and/or anti-Turkish sentiments one might expect to observe were absent, but a latent anti-Semitic sentiment appeared in the more granular topic models that, despite my careful reading of the texts, had escaped my notice. The resulting model aids the scholar in weaving the disparate threads of these individuals’ lives into the tapestry of the region’s history, and the method may well be applied to other corpora, topics, languages, and time periods to reveal hidden biases, especially in larger collections of documents that would be impossible for a single scholar to read. I am currently testing this bias detection method with additional corpora, and this presentation will briefly report the results of those trials alongside the original study.
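The targeted close reading described above rests on a keyword-in-context concordance. A minimal pure-Python sketch of that step (the sample sentence, keyword, and window size are illustrative, not drawn from the chronicles):

```python
def concordance(text, keyword, window=4):
    """Return every occurrence of `keyword` with `window` words of
    context on either side (keyword-in-context lines)."""
    tokens = text.lower().split()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

sample = ("The governor confirmed the succession and the governor "
          "levied a tribute on the port")
hits = concordance(sample, "governor", window=2)
```

Running such a concordance for the key terms surfaced by a topic of interest gathers the passages a scholar then reads closely for sentiment and framing.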

[1] For more details on the authors, see the introductory chapter and/or the README in my GitHub repository for this project. It should also be noted that I have not included the one additional Algerian chronicle of these governors, written in Arabic, because the LDA algorithm separates topics by language first and then by significant collocations, which would not help me identify common themes that cut across three centuries, individual biographies, and sources/authors. For those interested in topic modeling a collection of Arabic documents, see Richard Nielsen’s introduction using R: Richard Nielsen, “Quantitative Text Analysis in Arabic” (Workshop, Cairo University, April 4, 2019), accessed May 6, 2021.

[2] The resulting nested or hierarchical model visualizes major and minor themes in the authors’ perceptions and presentations of Ottoman gubernatorial histories. Jo Guldi and Benjamin Williams applied a similar approach to British Parliamentary discourse to reveal previously invisible connections between speeches and political tactics, but, as this project shows, the method is also useful for text summarization and bias detection (Guldi and Williams, 2018).

Selected Bibliography

Al-Sarraj, W. F. and Lubbad, H. M. (2018). Bias Detection of Palestinian/Israeli Conflict in Western Media: A Sentiment Analysis Experimental Study. 2018 International Conference on Promising Electronic Technologies (ICPET). pp. 98–103 doi:10.1109/ICPET.2018.00024.

Chen, W.-F., Al-Khatib, K., Stein, B. and Wachsmuth, H. (2020). Detecting Media Bias in News Articles using Gaussian Bias Distributions. ArXiv:2010.10649 [Cs] (accessed 12 May 2021).

Ghasiya, P. and Okamura, K. (2021). Understanding the Middle East through the eyes of Japan’s Newspapers: A topic modelling and sentiment analysis approach. Digital Scholarship in the Humanities, fqab019 doi:10.1093/llc/fqab019 (accessed 19 June 2021).

Guldi, J. and Williams, B. (2018). Synthesis and Large-Scale Textual Corpora: A Nested Topic Model of Britain’s Debates over Landed Property in the Nineteenth Century. Current Research in Digital History, 1 (accessed 11 January 2021).

Leavy, S. (2019). Uncovering gender bias in newspaper coverage of Irish politicians using machine learning. Digital Scholarship in the Humanities, 34(1): 48–63 doi:10.1093/llc/fqy005.

Manzini, T., Lim, Y. C., Tsvetkov, Y. and Black, A. W. (2019). Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings. ArXiv:1904.04047 [Cs, Stat] (accessed 12 May 2021).

Nielsen, R. (2019). Quantitative Text Analysis in Arabic. Workshop, Cairo University, April 4, 2019 (accessed 19 May 2021).

Noble, S. U. (2018). Algorithms of Oppression: How Search Engines Reinforce Racism. New York: New York University Press (accessed 19 May 2021).

Patankar, A. A. and Bose, J. (2017). Bias Discovery in News Articles Using Word Vectors. 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). pp. 785–88 doi:10.1109/ICMLA.2017.00-62.
