The Problem
Soviet secret police archives have long been the holy grail of scholars studying the Soviet Union. The Committee for State Security (KGB) and its predecessors were notorious for widespread surveillance and repression in society. Although central police archives in Moscow remain largely classified, many post-Soviet countries have made records available to researchers. The growing availability of tools for digitizing archival records has made these archives easier to access. Tens of thousands of formerly secret files are available in electronic form, some of them in open access on the internet. Nonetheless, catalogs for these materials are inadequate or absent. Soviet police created a massive archive of information about Soviet society, but can researchers find anything in it?
The Method
KGB Lab hopes to leverage artificial intelligence tools to make police archives more accessible. “Artificial intelligence” describes a wide variety of tools, but the main ones we use are neural network models, which build computational representations of human data. Our uses of these models include:
- Optical Character Recognition: The transformation of image files into machine-readable text.
- Language Embedding: The transformation of text into a numerical representation of its meaning that allows for comparisons between texts and “semantic searching,” the ability to identify sources by meaning rather than keywords.
- Feature Classification: The use of classifiers to identify important characteristics of police files, such as the profession of an arrestee and the geographical locations that appear in a document.
- Summarization: The use of large language models to summarize an entire case file to create catalog entries.
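As a rough illustration of the semantic searching mentioned above, and not a description of KGB Lab's actual pipeline, the sketch below ranks documents by the cosine similarity between their embedding vectors and a query vector. The file names and the three-dimensional vectors are hypothetical toy values chosen by hand; a real embedding model would produce vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def semantic_search(query_vector, catalog, top_k=2):
    """Return the names of the top_k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in catalog.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Hypothetical embeddings: the first axis loosely stands for "education",
# the second for "agriculture". These are illustrative values only.
CATALOG = {
    "file_001_interrogation_of_teacher": [0.9, 0.1, 0.0],
    "file_002_grain_requisition_report": [0.1, 0.9, 0.1],
    "file_003_arrest_of_school_director": [0.8, 0.2, 0.1],
}

# A query about schools would, in this toy setup, point along the first axis.
query = [1.0, 0.0, 0.0]
results = semantic_search(query, CATALOG)
print(results)
```

In this toy setup the two education-related files rank above the grain report even though their titles share no keyword, which is the behavior that distinguishes searching by meaning from searching by keyword.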
The History of the Group
The group began in Moscow at the Higher School of Economics as the Research-Study Group for the Creation of Detailed Analytical Databases with the Use of Information Technology Methods (SDADIM). (The acronym, incidentally, alludes to the need to publish articles as a member of the group and roughly means “we will deliver [articles]”.) The group created databases of social and policing data extracted from investigation files from Ukraine. Some of the data sets that the group produced in 2018 and 2019 appear on this site.
In 2024 the College of Liberal Arts and Sciences at UF funded KGB Lab with a Spark Grant. The group included students of Russian and annotated investigation documents for use in creating machine learning classifier models.
In 2025 the Digital Humanities Laboratory in the Center for Humanities and the Public Sphere sponsored the group, which currently includes a team of graduate students and other students who are annotating interrogation documents and gathering policing data from secondary sources.