By James Fisher, Metadata Assistant for the Georgian Papers Programme at King’s College London.
Over the past few months I have been compiling lists of subject headings for indexing the Georgian Papers. This is not nearly as straightforward as it might sound. It requires a detailed knowledge of the papers themselves, a broad awareness of the trends in eighteenth-century scholarship, and a sense of how to mediate between them. What topics are present, and what will be searched? Can we divide up the subjects in the papers ‘at the joints’, in Plato’s phrase?
Below I will briefly summarise the process of developing our subject thesaurus, and then offer a couple of historical reflections.
This follows on from work already discussed by Chris Olver in a post from January 2017, which explained the aims of metadata enrichment and explored the use of natural language processing using digital tools.
What is a subject thesaurus?
A subject thesaurus is a controlled vocabulary for the purpose of indexing. It is a structured list of terms to help both indexers and searchers use consistent terms and assist users to find documents. Terms are cross-referenced to each other to form a kind of classification scheme or map of the subject. It makes the relationship between terms explicit and avoids overlap or confusion between similar concepts.
United Kingdom Archival Thesaurus (UKAT)
It was decided that GPP would use and develop an existing thesaurus: the UK Archival Thesaurus (UKAT), which is itself originally based on the UNESCO Thesaurus. The advantage of UKAT is that it allows further terms to be incorporated from other projects, such as GPP. Rather than build our own vocabulary from scratch, we can focus on refining the existing vocabulary by adding terms specifically related to the eighteenth century and the collection in the Royal Archives.
My first step was to gather a list of potential terms, focused on subject headings (names of people and organisations will be dealt with separately). I used three main sources.
- Transcripts from Georgian Papers
I used Voyant Tools to find the word frequencies in transcripts from the published correspondence of George III by John Fortescue (obtained using OCR) and the essays of George III (obtained using handwritten text recognition). Together these included over a million words. After removing unsuitable terms (e.g. ‘stop words’ such as prepositions, as well as proper nouns, and so on), we were left with a set of possible keywords reflecting the contents of these texts.
- Eighteenth-century scholarship
To collect terms used by historians studying the eighteenth century, I exported data from COPAC and ZETOC on titles used in books (approx. 5,000), journal articles (approx. 5,000) and theses (approx. 1,000) from 1997-2017 that included the term ‘eighteenth century’ in the title itself. The list required cleaning as it contained a number of duplicates (e.g. where book titles are included in review articles in journals), but again Voyant Tools enabled an analysis based on word frequency.
This provided a list of the most common subject terms that scholars have directly associated with the eighteenth century in the last couple of decades. This list was supplemented by terms extracted from manual searching of the indexes of eighteenth-century history books, especially on the military and monarchy, including the work of Flora Fraser.
- TOBIAS (the Thesaurus of British and Irish History as SKOS)
I also consulted TOBIAS, another existing historical vocabulary, for terms relevant to the eighteenth century. TOBIAS is the Royal Historical Society’s detailed vocabulary of British and Irish history, in the SKOS (Simple Knowledge Organization System) format.
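The frequency counts described for the first two sources can be sketched in a few lines of Python. This is only an illustration of the idea, not the Voyant Tools workflow I actually used: the stop-word list, the function names and the sample titles are all invented for the example, and it makes no attempt to remove proper nouns.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one is far longer.
STOP_WORDS = {"the", "of", "in", "and", "a", "to", "on"}


def dedupe_titles(titles):
    """Drop repeated titles, ignoring case and extra whitespace."""
    seen = set()
    unique = []
    for title in titles:
        key = " ".join(title.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(title)
    return unique


def word_frequencies(texts, stop_words=STOP_WORDS):
    """Count lower-cased words across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts


# Invented sample titles; the second is a duplicate of the first.
titles = [
    "Trade and Empire in the Eighteenth Century",
    "trade and empire in the eighteenth century ",
    "Women and Trade in the Eighteenth Century",
]
unique_titles = dedupe_titles(titles)
frequencies = word_frequencies(unique_titles)
```

Calling `frequencies.most_common(20)` would then give a ranked list of candidate keywords to review by hand.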
Comparing with UKAT
All these terms were initially tested to see if they matched terms already in UKAT, either as an exact match or as an indirect match to a term present in a slightly different form. Matching terms were discarded, leaving a list of potential new terms.
Altogether this generated around 1,000 new possible terms.
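The comparison can be pictured as a three-way sort of candidates into exact matches, indirect matches and new terms. The sketch below is a rough approximation under stated assumptions: the `normalise` rule (lower-casing and stripping a final "s") is a crude stand-in for the judgement involved in spotting a "slightly different form", and the term lists are placeholders rather than real UKAT entries.

```python
def normalise(term):
    """Crude normal form: lower-case, collapse spaces, strip a final 's'."""
    key = " ".join(term.lower().split())
    if key.endswith("s") and len(key) > 3:
        key = key[:-1]
    return key


def split_candidates(candidates, existing_terms):
    """Sort candidates into exact matches, indirect matches and new terms."""
    existing_exact = set(existing_terms)
    existing_normalised = {normalise(t) for t in existing_terms}
    exact, indirect, new = [], [], []
    for term in candidates:
        if term in existing_exact:
            exact.append(term)
        elif normalise(term) in existing_normalised:
            indirect.append(term)
        else:
            new.append(term)
    return exact, indirect, new


# Placeholder lists, not taken from UKAT itself.
existing = ["Ships", "Colonies", "Taxation"]
candidates = ["Ships", "ship", "Colonies", "Privy Seal"]
exact, indirect, new = split_candidates(candidates, existing)
```

Only the terms in `new` would go forward; the indirect matches are the ones that most need a human eye before being discarded.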
Correcting term forms
These terms had to be checked and adjusted to the correct form to be included in UKAT. A host of factors were considered in selecting the most appropriate term form, including:
- Terms should be nouns or noun phrases (“Virtue”, not “Virtuous”).
- Concrete nouns should be in plural form (“Ships”, not “Ship”).
- Compound terms should be broken down into simpler elements, or factorised, where possible (“British Colonies” becomes “Colonies”, “British”). However, as Voyant Tools breaks down texts into single words for ranking by frequency, it was also necessary to identify frequent compound terms to ensure the correct meaning was preserved (e.g. “Seal” may be a frequent term, but only due to its use in the compound “Privy Seal”, which must be preserved).
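One way to spot a word like “Seal” that is frequent mainly because of a compound is to count the word pairs it appears in. The sketch below is a simplified illustration rather than the method I followed: the function name and the sample text are invented, it only looks at the word immediately before the target, and it ignores sentence boundaries.

```python
import re
from collections import Counter


def compound_share(text, word):
    """For each two-word phrase ending in `word`, return the share of the
    word's occurrences that fall inside that phrase."""
    tokens = re.findall(r"[a-z]+", text.lower())
    word = word.lower()
    total = tokens.count(word)
    if total == 0:
        return {}
    pairs = Counter(
        (first, second)
        for first, second in zip(tokens, tokens[1:])
        if second == word
    )
    return {" ".join(pair): n / total for pair, n in pairs.items()}


# Invented sample text for illustration only.
sample = (
    "The Privy Seal was applied. The privy seal office sent a seal. "
    "Lord Privy Seal."
)
shares = compound_share(sample, "seal")
```

A high share for one phrase (here “privy seal”) suggests the compound, not the single word, is the term worth keeping.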
With input from the GPP Metadata Analyst, Samantha Callaghan, and the Georgian Papers Programme Manager, Patricia Methven, I also selected the preferred term from each group of synonyms. The other terms are then designated as ‘alternative terms’, which will be entered into the database but direct the user to select the preferred term (e.g. “Anger” is the preferred term for “Rage”).
We also decided whether a qualifier was needed to distinguish between the meanings of two homographs (words with the same form but different meanings). For example, “Seals (animals)” and “Seals (law)”.
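The two conventions above, alternative terms that redirect to a preferred term and qualifiers that separate homographs, can be shown with a small sketch. The function names and the dictionary layout are my own invention for illustration and do not describe how UKAT stores its data; only the Anger/Rage and Seals examples come from the text.

```python
# Alternative term -> preferred term. Only this pair comes from the text.
ALTERNATIVE_TO_PREFERRED = {"Rage": "Anger"}


def resolve(term, alternatives=ALTERNATIVE_TO_PREFERRED):
    """Return the preferred term for an alternative term, else the term itself."""
    return alternatives.get(term, term)


def qualified_forms(term, senses):
    """Return one heading per sense; qualifiers are added only for homographs."""
    if len(senses) <= 1:
        return [term]
    return [f"{term} ({sense})" for sense in senses]


preferred = resolve("Rage")
seal_headings = qualified_forms("Seals", ["animals", "law"])
ship_headings = qualified_forms("Ships", ["vessels"])
```

Here a search for “Rage” is pointed to “Anger”, while “Seals” is split into two qualified headings and “Ships”, having a single sense, is left alone.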
As an aid to selecting the most useful term form, we consulted other authoritative thesauri, such as the Library of Congress Subject Headings (LCSH).
As the potential terms were continually adjusted, repeated checks were made for matches to existing UKAT terms.
Creating thesaural relationships
The next step was to identify the prospective position of candidate terms in the hierarchical structure of UKAT and their relationships to other terms in the thesaurus. Each preferred (entry) term must have a Broader Term (BT), to specify its location in the hierarchy, and may also have Narrower Terms (NT) and Related Terms (RT). Alternative terms (AT) may only have relationships to preferred terms. These relationships help the user navigate the vocabulary and find the appropriate term.
These relationships are crucial as the terms do not have definitions, so their meaning and use are determined or implied by their position in the hierarchy.
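The rules for relationships can be expressed as simple checks on each candidate term. What follows is a minimal sketch under my own assumptions: the `Term` class, the `check_candidate` function and the sample hierarchy (placing “Anger” under “Emotions”) are hypothetical and are not drawn from UKAT's actual structure or software.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Term:
    label: str
    broader: List[str] = field(default_factory=list)   # BT
    narrower: List[str] = field(default_factory=list)  # NT
    related: List[str] = field(default_factory=list)   # RT
    preferred: Optional[str] = None  # set only for alternative terms (AT)


def check_candidate(term, thesaurus):
    """Return a list of problems found with a candidate term."""
    problems = []
    if term.preferred is not None:
        # Alternative terms may only point to a preferred term.
        if term.broader or term.narrower or term.related:
            problems.append("alternative term has BT/NT/RT relationships")
        target = thesaurus.get(term.preferred)
        if target is None or target.preferred is not None:
            problems.append("alternative term does not point to a preferred term")
        return problems
    if not term.broader:
        problems.append("preferred term has no Broader Term")
    for label in term.broader + term.narrower + term.related:
        if label not in thesaurus:
            problems.append(f"unknown term: {label}")
    return problems


# Hypothetical mini-thesaurus for illustration only.
emotions = Term("Emotions", broader=["Psychology"], narrower=["Anger"])
anger = Term("Anger", broader=["Emotions"])
rage = Term("Rage", preferred="Anger")
virtue = Term("Virtue")  # no BT assigned yet
thesaurus = {t.label: t for t in (emotions, anger, rage, virtue)}

virtue_problems = check_candidate(virtue, thesaurus)
```

In this toy example “Virtue” is flagged because it has not yet been given a Broader Term, which is exactly the kind of gap that had to be closed before a term could be submitted.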
Including terms from specialist vocabularies
In addition to the above, I have also been collecting terms from specialist vocabularies, especially medical and military terms. The Medical Heritage Library has shared a list of historical medical terms, including anatomical names and disease terms, some of which were central to eighteenth-century medical discourse but are now obsolete. I also drew terms from lists of battle names for the American War of Independence, and the French Revolutionary and Napoleonic Wars, based on those in LCSH.
We are now ready to add over 1,500 new terms to UKAT.
Finally, let me share a couple of reflections on this process.
I am trained as an historian, not an archivist, and the process of developing a subject thesaurus presented some interesting challenges.
Firstly, I found that there was a basic tension at the heart of building a thesaurus for archival materials. To put it simply, from a historical perspective we desire a flexible vocabulary, as the meanings of words change over time, and, especially in areas of intellectual and cultural history, both historians and archivists desire a vocabulary that reflects the nuanced differences between related words. From a thesaural perspective, however, we primarily desire a stable vocabulary with fixed relationships between terms, which means we prioritise clarity and consistency rather than reflecting the messy complexities of history. Archivists would point out that the need to ensure the widest possible access requires a certain rigour that individual historians do not have to deal with.
These are both valid approaches, and neither one is better or worse, but they pull in different directions. We have a few ways to mitigate this problem or reach a compromise between these divergent requirements. As described above, we can add qualifiers to capture the particular meaning we want or differentiate between similar concepts. We can also add a scope note, which can provide a short description explaining the historical usage of a term, perhaps even specifying a date range. This generally gives the thesaurus sufficient flexibility to avoid obvious errors or mischaracterisations. Yet there are limits to how well a subject thesaurus such as UKAT can fully incorporate the historical sensitivity to change over time.
Secondly, there is the difficulty in determining which terms will be ‘useful’ for indexing. This is central to developing a subject index but involves tricky judgements that can be endlessly debated.
One crucial consideration is the level of specificity, or granularity. For example, when selecting body parts, is it sufficient to list “Eyes”, or do we need to list the component elements: “Eyelids”, “Eyeballs”, “Eyelashes”, etc.? The answer obviously depends on the documents in question, along with the research subject and method. As there is no clear rule to follow, these can occasionally become knotty issues.
A related consideration is the future interests of researchers. We have tried to capture the current interests of historians, but as trends shift in popular research subjects and methodologies, what in the archives will be considered historically noteworthy? UKAT has the benefit that terms can be added in future, so the initial list of subject terms will not be a closed vocabulary. Nonetheless, I found that with every term I was about to discard as “not useful”, I was able to imagine future contexts in which it could be useful.
These issues highlight the scholarly significance of building a subject thesaurus – of carving up the archives in one way or another. Indexes shape the research process in ways that are visible and invisible. The choice of terms that we use to identify subjects is a choice of the lens through which to view the material, imposing a loose conceptual map upon the documents. The choice may have many small consequences we cannot predict.