DALME – Workflow

From busy archive to database record, the process employed by DALME is codified in a four-step workflow. The steps are: the acquisition and cataloguing of inventories; their transcription and annotation; the parsing of their contents and the generation of lexicons for the corresponding language or dialect; and the decoding of these descriptions and their re-encoding into a common ontology, so that they can be compared.

Acquisition and Cataloguing

The DALME DAM showing image collections.

DALME records begin their life in one of two forms: original, unedited archival documents, and published transcriptions or editions of original documents. Each variety is processed in mostly distinct ways, although the first step is the same for both: producing digital facsimiles. Unedited archival documents are photographed, while published transcriptions are scanned. The resulting images and metadata are loaded into our Digital Asset Management (DAM) system, which serves as a repository from which they can be linked to the rest of the DALME infrastructure. The images are then reviewed and pre-processed: duplicates are eliminated, images are cropped and enhanced if necessary, and, in the case of published sources, Optical Character Recognition (OCR) is performed to generate computer-readable text. Finally, the images are catalogued and collated into records, and additional information about each record is added to the database, such as dates, locales, persons, and foliation.
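As an illustration of the duplicate-elimination step, the following is a minimal sketch that groups images with byte-identical content by hashing them. It is not DALME's actual pre-processing code: the function name and the in-memory dictionary of images are assumptions made for the example, and a content hash only catches exact copies, not re-photographed or re-scanned pages.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def find_exact_duplicates(images: Dict[str, bytes]) -> List[List[str]]:
    """Group image names whose byte content is identical.

    `images` maps a (hypothetical) file name to the raw bytes of that image.
    """
    by_digest: Dict[str, List[str]] = defaultdict(list)
    for name, data in images.items():
        # Identical bytes always produce the same SHA-256 digest
        by_digest[hashlib.sha256(data).hexdigest()].append(name)
    # Keep only the digests shared by more than one image
    return [sorted(names) for names in by_digest.values() if len(names) > 1]
```

Each returned group lists files that could be reduced to a single copy before cataloguing.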

Transcription and Annotation

The DALME visual editor used to annotate transcriptions.

In this step original documents are transcribed and annotated, following standards typical of the process of producing digital editions. In the case of the published transcriptions, the raw output of the OCR is cleaned and formatted during this stage with the use of purpose-built scripts and regular expressions, and then annotated as well. Annotation consists of encoding aspects of the document using markup. DALME uses a subset of the P5 Guidelines produced by the Text Encoding Initiative (TEI).
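The kind of regular-expression clean-up applied to raw OCR output can be sketched as follows. The rules shown (rejoining words hyphenated across line breaks, dropping lines that hold only a page number, normalizing whitespace) are generic assumptions chosen for illustration, not the project's actual scripts.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Apply a few generic clean-up rules to raw OCR output."""
    text = raw.replace("\r\n", "\n")
    # Rejoin words hyphenated across a line break: "cami-\nsiam" -> "camisiam"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that contain nothing but a number (assumed to be page numbers)
    text = re.sub(r"(?m)^[ \t]*\d+[ \t]*$\n?", "", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Remove trailing spaces at the end of each line
    text = re.sub(r"(?m) +$", "", text)
    return text.strip()
```

In practice the rule set would be tuned to each published edition, since page furniture and typographic conventions differ from one volume to the next.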

Each record moves through a carefully monitored series of steps beginning with ingestion and proceeding to transcription, mark-up, review, and parsing. Records are typically reviewed by a second party and all records are reviewed by team members or collaborators experienced with the DALME standards.
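The sequence of stages named above can be modelled as a simple ordered enumeration. This is only a sketch: the class and function names are hypothetical, and the real system may track additional states or allow a record to move backwards (for example, when a review sends it back to mark-up).

```python
from enum import IntEnum


class Stage(IntEnum):
    """Workflow stages in the order given in the text."""
    INGESTION = 1
    TRANSCRIPTION = 2
    MARKUP = 3
    REVIEW = 4
    PARSING = 5


def advance(stage: Stage) -> Stage:
    """Return the stage that follows `stage`; the last stage has no successor."""
    if stage is Stage.PARSING:
        raise ValueError("record has already reached the final stage")
    return Stage(stage + 1)
```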

Parsing and Lexicalization

During this step, the text describing objects is tokenized and parsed. A database record is created for each object described in the text and attributes are initially suggested by an algorithm based on whether the terms are understood or not (that is, whether they are in the lexicon associated with the language of the record). In the lexicon, basic terms are automatically incorporated from language-appropriate dictionaries, and more complex words are forwarded to members of the DALME team for further research.
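A minimal sketch of this idea, assuming a toy lexicon: tokens found in the lexicon yield suggested attributes, while unknown tokens are set aside for further research. The lexicon entries, attribute labels, and function names are invented for the example and do not reflect DALME's data model or algorithm.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: surface form -> (lemma, suggested attribute type)
LEXICON: Dict[str, Tuple[str, str]] = {
    "camisiam": ("camisia", "object"),
    "albam": ("albus", "color"),
    "lineam": ("lineus", "material"),
}


def tokenize(text: str) -> List[str]:
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def suggest(text: str) -> Tuple[List[Tuple[str, str, str]], List[str]]:
    """Return (suggestions, unknown tokens) for a description.

    Each suggestion is a (token, lemma, attribute type) triple.
    """
    known: List[Tuple[str, str, str]] = []
    unknown: List[str] = []
    for token in tokenize(text):
        entry = LEXICON.get(token)
        if entry is not None:
            known.append((token, entry[0], entry[1]))
        else:
            # Not in the lexicon: set aside for human research
            unknown.append(token)
    return known, unknown
```

For the phrase "Item unam camisiam albam lineam veterem", this sketch suggests an object, a color, and a material, and sets aside "item", "unam", and "veterem" as unknown. A fuller version would filter out function words before forwarding unknown terms.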

Tokenization and parsing.

At this point, experts with subject knowledge must review the object and attribute records and decide whether the automatic suggestions are correct. Besides correcting the records, this review has the added advantage of training the algorithm, which improves the accuracy of its suggestions over time. Concurrently, the experts in our Lexicon Team research the newly added terms, consulting more obscure literature where needed, and develop the lexicon entries for them.

Semantic Decoding and Re-encoding

During this step, objects and attributes are dissociated from their original, explicit form, and re-associated with a generic concept that can assume specific forms depending on context. This is the key aspect of the DALME methodology.

To illustrate this point, let us consider the simple example of an art historian interested in exploring the use of a specific color, known in English as blood red, by using a database of material culture. It goes without saying that no search for blood red would yield results in a collection such as DALME, because the color was recorded in Latin (or a vernacular equivalent), using terms such as sanguineus or blodeus. This search would be similarly unproductive in any database of tangible things, such as a collection of museum objects or archaeological artefacts. A museum curator, for example, is likely to have recorded the color as 200C using the Pantone system. The archaeologist, more likely to rely on Munsell chips, would have entered the color as 5R 3/16. In each case, a specific system of classes and rules is being used to produce the linguistic surrogate characteristic of each domain. Crucially, these surrogates are not directly comparable.

The DALME ontology seeks to enable comparison by using and extending existing semantic schemas covering the range of concepts necessary to describe material culture. In the case of colors, we use the Getty Research Institute’s Art & Architecture Thesaurus (AAT). In our system, then, the concept of blood red is assigned a unique identification number (300310722), while specific instances of it (e.g. sanguineus, blodeus, Pantone 200C, Munsell 5R 3/16) are associated with it as terms. Thus, asked to search for blood red, our system would first translate the query to its corresponding ID, 300310722, perform a search for it, and return all associated terms. What gives our system added power is that, within the AAT, the concept 300310722 is linked to a complex hierarchy of related concepts. Concept 300310722 (blood red) is, for example, contained by 300311118 (red colors), which is in turn a category of 300131648 (chromatic colors, i.e. not white, black, or grey), and so on. A search for any of those concepts will yield any and all results matching the terms above.
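The lookup described above can be sketched in a few lines. The concept IDs, labels, and terms are those quoted in the text; the dictionaries and function names are assumptions made for illustration, and the real system would query a database and a far larger thesaurus rather than in-memory tables.

```python
from typing import Dict, List, Set

# Concept IDs and labels as quoted in the text
LABELS: Dict[str, str] = {
    "300310722": "blood red",
    "300311118": "red colors",
    "300131648": "chromatic colors",
}

# child concept -> parent concept
PARENT: Dict[str, str] = {
    "300310722": "300311118",
    "300311118": "300131648",
}

# Domain-specific terms attached to each concept
TERMS: Dict[str, List[str]] = {
    "300310722": ["sanguineus", "blodeus", "Pantone 200C", "Munsell 5R 3/16"],
}


def concept_and_descendants(concept_id: str) -> Set[str]:
    """Return the concept itself plus every concept below it in the hierarchy."""
    found = {concept_id}
    stack = [concept_id]
    while stack:
        current = stack.pop()
        for child, parent in PARENT.items():
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


def search(label: str) -> List[str]:
    """Translate a label to its concept ID and collect all terms at or below it."""
    ids = [cid for cid, name in LABELS.items() if name == label.lower()]
    if not ids:
        return []
    results: List[str] = []
    for cid in sorted(concept_and_descendants(ids[0])):
        results.extend(TERMS.get(cid, []))
    return results
```

Here a search for "blood red", "red colors", or "chromatic colors" returns the same four terms, because the broader concepts contain the narrower one. This is the behaviour that allows records from different domains to be compared.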