Crowdsourcing and the Georgian Papers Programme
Mark Hedges, Director of Centre for e-Research, King’s College London
The GPP is carrying out a programme of digitisation and metadata creation for a variety of documents in the Georgian Papers collections. Although this will greatly improve discovery of and access to these important materials, the information within the documents themselves will still be locked within the image files. One approach to breaking this information out, and creating enhanced digital resources that can be used by more advanced digital humanities methods and projects, is provided by crowdsourcing.
Crowdsourcing, or ‘public humanities’, in the context of academic humanities or cultural heritage and memory institutions, may be defined as the process of leveraging public participation in or contribution to projects and activities (Dunn & Hedges 2013), in particular for gathering, processing or interpreting information in some directed manner. Making use of the transformations that the Web has brought to processes of collaboration and communication, these initiatives have harnessed public participation with a view to enhancing, augmenting or opening up cultural material, blurring the boundaries between the spaces occupied by professional and non-professional communities and transforming the relationship between cultural organisations and the wider community.
The enhancements to digitised documents that can be carried out by the public take many forms. At the most basic level, the document images can be transcribed into text – this is particularly useful in the case of handwritten documents, especially the older and unfamiliar forms of handwriting encountered in historical documents such as the majority of the material being addressed by the Georgian Papers Programme. These cannot at the moment be transformed easily into text files using automated methods such as OCR (Optical Character Recognition) – although the GPP is carrying out some promising investigations using the Transkribus software (https://transkribus.eu/) – and human judgement and interpretation are still required if sense is to be made of them.
Two examples of crowdsourcing projects in recent years that have focused on transcription of manuscript documents are Transcribe Bentham and Old Weather. The aim of Transcribe Bentham (http://blogs.ucl.ac.uk/transcribe-bentham/) was to encourage volunteers to transcribe and, more generally, engage with unpublished manuscripts by the philosopher and social reformer Jeremy Bentham. The project “invites the public to play a part in academic research and attempts to break down traditional barriers” by supporting them in the process of transcribing the documents into text marked up using the TEI (Text Encoding Initiative) XML standard, a format that has been widely adopted in the digital humanities community as the standard encoding for marking up textual data with structural and semantic information and for publishing digital scholarly editions. The volunteers have thus been contributing to an online and searchable edition of the Collected Works of Jeremy Bentham.
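To give a flavour of what TEI-style mark-up adds to a plain transcription, the following minimal sketch encodes a single revised sentence. The element names `p`, `del` and `add` are part of the TEI vocabulary, but the sentence is invented, the function name `encode_revision` is hypothetical, and nothing here is meant to represent the exact tag set or tooling used by Transcribe Bentham.

```python
import xml.etree.ElementTree as ET


def encode_revision(before: str, deleted: str, added: str, after: str) -> str:
    """Encode one sentence in which the writer struck out a word and wrote a replacement.

    The result is a TEI-style paragraph: <del> marks the struck-out text,
    <add> marks the text written in its place.
    """
    paragraph = ET.Element("p")
    paragraph.text = before              # text preceding the revision

    deletion = ET.SubElement(paragraph, "del")
    deletion.text = deleted              # what the writer crossed out

    addition = ET.SubElement(paragraph, "add")
    addition.text = added                # what the writer inserted instead
    addition.tail = after                # text following the revision

    return ET.tostring(paragraph, encoding="unicode")


# Invented sample sentence, for illustration only.
print(encode_revision("The house was ", "sold", "let", " in March."))
```

Unlike a plain-text transcription, the encoded version records both readings, so a reader or a program can choose to show the final text, the original wording, or the revision itself.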
Old Weather (https://www.oldweather.org/) was also based on the need to digitise content that was not amenable to purely automated methods, in this case logbooks of ships of the British Royal Navy. The initial aim was to transcribe the weather observations they contained, information of great significance for climate research, although as the project developed the participants transcribed a wide range of additional material from the logs, according to their own interests. In this case the text resources produced were different from those in Transcribe Bentham, being plain text without additional mark-up.
As well as prose texts – letters, journals, essays and so on – the collections include many documents with more structured information, for example ledgers documenting details of staff establishments, suppliers of foodstuffs and other produce together with the relevant transactions, as well as menus detailing meals served within the household. The unpicking of the details of these documents could prove fascinating for the public, much as did the unpicking of the naval logbooks described above. In terms of outcomes, the structured nature of the source materials means that the results could move beyond simple transcription, and include structured data sets or databases which could then be analysed to carry out research into, for example, patterns or networks of supply or employment within the household.
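As a rough illustration of how transcribed ledger lines might become an analysable data set, the sketch below stores each transaction as a record and totals spending per supplier. The record fields, the supplier names and the amounts are all hypothetical and are not taken from the Georgian Papers. The only fixed fact used is the pre-decimal currency arithmetic (12 pence to the shilling, 20 shillings to the pound).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class LedgerEntry:
    """One transcribed ledger line (hypothetical structure)."""
    supplier: str
    item: str
    pounds: int
    shillings: int
    pence: int

    def in_pence(self) -> int:
        # 20 shillings to the pound, 12 pence to the shilling.
        return (self.pounds * 20 + self.shillings) * 12 + self.pence


def spend_by_supplier(entries: Iterable[LedgerEntry]) -> Dict[str, int]:
    """Total amount paid to each supplier, in pence."""
    totals: Dict[str, int] = defaultdict(int)
    for entry in entries:
        totals[entry.supplier] += entry.in_pence()
    return dict(totals)


def format_lsd(pence: int) -> str:
    """Render a pence total as pounds, shillings and pence."""
    pounds, remainder = divmod(pence, 240)
    shillings, odd_pence = divmod(remainder, 12)
    return f"£{pounds} {shillings}s {odd_pence}d"


# Invented sample data, for illustration only.
entries = [
    LedgerEntry("A. Baker", "bread", 1, 4, 6),
    LedgerEntry("A. Baker", "flour", 0, 15, 9),
    LedgerEntry("B. Butcher", "beef", 3, 2, 0),
]

for supplier, total in spend_by_supplier(entries).items():
    print(supplier, format_lsd(total))
```

Once the entries exist as records rather than page images, the same data could feed other questions, such as which items were bought most often or how a supplier's share changed over time.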
While the digitisation of the Georgian Papers results in the material being more accessible, and searchable via metadata, it can still only be processed by a human reader. Transcription means not only that the content of the documents can be indexed and searched; the text files also become machine-processable, in the sense that other automated digital methods, such as various forms of natural language processing, can be used to extract structured information from the unstructured text, such as personal names, statements about individuals or events, and relationships. The point of this is to make dealing with text more computationally tractable – the meaning of the text becomes machine-processable, and it is no longer just a sequence of words. The resulting texts can also be further enhanced through crowdsourced tagging or mark-up, both formal and informal.
A quite different example of crowdsourcing is provided by georeferencing historical maps, that is, the process of locating historical geographical information in terms of a modern coordinate system (such as latitude and longitude). Maps from the collections could be greatly enriched by such information, as it transforms documents that can only be interpreted by a human reader (possibly with a great deal of individual effort) into resources that can be spatially searched, analysed and compared, using standard geographical technologies. Again, this is not a task that could be done by computer, and it would be prohibitively labour-intensive for an internal team tasked with metadata creation. However, as the British Library’s Georeferencer project has shown (https://www.bl.uk/projects/georeferencing), the widespread public interest in maps and mapping means that georeferencing is highly susceptible to successful crowdsourcing, and indeed public engagement in general.
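The computational step that follows a volunteer's work can be sketched quite simply. Suppose a participant has matched three points on a scanned map to known modern coordinates. A basic affine transformation can then be fitted to convert any pixel position on the scan into longitude and latitude. This is a simplified sketch under stated assumptions: the function names and control points are hypothetical, real georeferencing tools typically use more control points and more flexible transformations, and nothing here describes how the British Library's Georeferencer works internally.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def _det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def _solve3(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the 3x3 linear system a . x = b using Cramer's rule."""
    d = _det3(a)
    if abs(d) < 1e-12:
        raise ValueError("control points must not lie on a single line")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in a]
        for r in range(3):
            replaced[r][col] = b[r]
        solution.append(_det3(replaced) / d)
    return solution


def fit_affine(pixels: Sequence[Point],
               coords: Sequence[Point]) -> Callable[[float, float], Point]:
    """Fit pixel (x, y) -> (lon, lat) from exactly three control-point pairs."""
    if len(pixels) != 3 or len(coords) != 3:
        raise ValueError("exactly three control-point pairs are required")
    a = [[x, y, 1.0] for x, y in pixels]
    lon_params = _solve3(a, [lon for lon, _ in coords])
    lat_params = _solve3(a, [lat for _, lat in coords])

    def to_geo(x: float, y: float) -> Point:
        lon = lon_params[0] * x + lon_params[1] * y + lon_params[2]
        lat = lat_params[0] * x + lat_params[1] * y + lat_params[2]
        return lon, lat

    return to_geo


# Hypothetical control points: three pixel positions on a scan and the
# modern coordinates a volunteer matched them to.
to_geo = fit_affine(
    pixels=[(0, 0), (1000, 0), (0, 1000)],
    coords=[(-0.20, 51.55), (-0.10, 51.55), (-0.20, 51.50)],
)
print(to_geo(500, 500))  # approximate centre of the scan
```

The human contribution is the matching of points, which requires reading and interpreting the historical map. Once those matches exist, the arithmetic that makes the map spatially searchable is routine.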
The activities carried out in crowdsourcing are frequently made up of such independent ‘microtasks’ that are farmed out to individual participants, although they can also go beyond that to include participatory creation of more complex information objects, commenting on or discussing content, adding contextual information such as personal experiences or memories, or constructing alternative narratives and interpretations. It should be noted that, while the term ‘crowdsourcing’ is frequently used as a catch-all for such activities, the group of active contributors in any given project may in fact be relatively small.
Dunn, S., & Hedges, M. (2013). Crowd-sourcing as a component of humanities research infrastructures. International Journal of Humanities and Arts Computing, 7(1-2), 147–169.