Jon Holmen, Christian-Emil Ore, Øyvind Eide. Unit for Digital Documentation, University of Oslo.

Museum archives: Historical documents - more than information resources

Abstract

This paper describes some methodological aspects of creating electronic versions of 200 years of paper-based documentation in the archaeological university museums in Norway. We focus on a few selected methods, on positive and negative experiences from this work, and on some aspects of the relationship between these documents, which are themselves important parts of the museums' non-material cultural heritage, and the material heritage described in them.

Background

For historical reasons, Norway is divided into five archaeological districts, each represented by a museum responsible for its district. Four of these five museums are connected to a university. In 1991 the Faculties of Arts at the four Norwegian universities initiated a large collaborative digitization project (1992-1997), called the National Documentation Project, with the aim of establishing databases and administering the retro-conversion and digitization of analog archives, books, images and other media types at the collection departments of the faculties, including the archaeological museums. The Documentation Project has been continued by the so-called Museum Project (1998-2005), which focuses on the university museums (cultural and natural history).

The Museum Project is motivated by an ambition to develop ICT-based systems that will offer users centralized and efficient access to information about the Norwegian cultural and natural heritage. With the help of common user interfaces and links between data from different fields of study, it will be possible to generate new combinations of information and new insights within the various disciplines. The Museum Project is also in charge of developing common database systems for the management of collections at all the Norwegian university museums. Ideally, these database systems should be able to handle all reference information related to artifact and specimen collections within and outside the museums, for natural history as well as cultural history.

As Norway has never had a national museum, many of the collections that would normally be found in national museums in other countries are located at the university museums in Norway. With respect to archaeology, the museums involved in the Museum Project are in charge of four of the five cultural heritage districts in Norway. Thus, by developing a common system of archive information from these four museums, and by including data imported from the fifth district, the database system becomes a central register of archaeology in Norway.

The Unit for Digital Documentation (DOK) is a permanent unit at the Faculty of Arts at the University of Oslo, established as a continuation of the core group of the Documentation Project. DOK is in charge of the continued development of the databases in the fields of lexicography, text corpora, place-name studies, medieval studies and art history, as well as the ICT part of the Museum Project.

Through international cooperation, such as in the Arena project (http://ads.ahds.ac.uk/arena/), we are also working towards cross-national systems for the exchange of information and the interconnection of national collections. The long-term goal is a globally integrated information system for archaeology. Along another dimension, the cross-thematic scope of our projects also opens the way for thematic integration across these national collections.

DOK has a unique background in the development and application of ICT-based methods in a wide range of text-based disciplines in the humanities. This text background has led us to apply and further develop methods taken from the field of scholarly text editing, even when working with museum catalogues and archives. During ten years of work, we have developed a well-tested set of guidelines for creating digital collections based on analog (mainly paper-based) items. Our efforts to match our practical work to these guidelines are the main topic of this paper.

A methodological note: Scholarly text editions

In traditional text philology the aim is to create a "best text" from a set of manuscripts (text witnesses). A scholarly text edition will usually be a printed version of the "best text", accompanied by a critical apparatus documenting the variation among the text witnesses and perhaps a set of facsimiles of (some parts of) the manuscripts. In the last 10-20 years computers have been introduced into text philology. Today a modern electronic scholarly text edition need not contain a "best text" at all. A "text edition" can be a bibliographical database, electronic facsimiles and electronic transcripts of the text witnesses, combined with a few extra search tools and explanatory texts. The transcripts are usually given markup in XML/SGML that enables hyperlinking and different views of the texts.

To the extent that text scholars are interested in texts kept in a museum, their interest does not usually extend beyond texts that are themselves first-class museum objects. They rarely consider texts describing the normal activity of the museum, such as archives, excavation reports and so on, to be of interest. In our work we have taken the opposite approach. The internal archives, reports, acquisition catalogues and other catalogues are the main sources for the development of scientific/scholarly databases for students, scholars and curators in museums. As a result of our work during the last ten years, we have refined a four-step work procedure for museum texts, developed from and inspired by our close contact with other text-based disciplines in the humanities. The first three steps correspond to the steps normally applied in an ordinary text edition. The final step, where we create a mapping from the semantic content of the text to an ontology, is not so commonly used by text editors:

  1. The first step is to register the document's metadata, such as Title, Author, Year of publication, Edition number, etc. In other words, we create a bibliographic record of the original document.
  2. The next step in the recording process is to make an image of every single page of the publication. The result of this process is an electronic facsimile, which should be kept unchanged and regarded as an important source in itself.
  3. We then create a transcript by applying OCR to the facsimiles or, in the case of handwriting, by manually transcribing the text. The transcript is then given markup indicating pages, special layout, typefaces, illustrations, inserts, etc. Provided there is no handwriting in the original text, we should in principle be able to produce, from the markup, a printed copy that is more or less identical to the original document. We also perform a kind of lemmatization, that is, adding normalized terms where these vary in the text, such as different spellings of place names (see the sketch after this list). The result is an XML text.
  4. Finally, we carry out a step that extends the basic descriptive model, as we introduce semantic coding of the content. This means that we look at the information elements in the text and try to code them in relation to a given conception or ontology. In our case the ontology is the set of entities and relations that we thought would best describe the archaeological finds and objects. The result is a fully marked-up XML text that can not only be browsed as free text, but also used as an XML-based document database. Furthermore, the markup can be used to extract information into a more traditional relational database.
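
To illustrate the lemmatization in step 3, the following minimal sketch (in Python) adds a normalized form to variant spellings of place names in a transcript. The lookup table, the PLACE tag and the norm attribute are hypothetical stand-ins for illustration only, not the project's actual encoding scheme.

# A minimal sketch of the lemmatization in step 3, assuming a hypothetical
# lookup table of spelling variants; tag and attribute names are illustrative.
import re

# Hypothetical table mapping variant spellings found in the sources
# to a normalized (lemmatized) form of the place name.
PLACE_NAME_VARIANTS = {
    "Røken": "Røyken",
    "Trondhjem": "Trondheim",
}

def lemmatize_place_names(transcript: str) -> str:
    """Wrap known variant spellings in a tag carrying the normalized form."""
    for variant, normalized in PLACE_NAME_VARIANTS.items():
        transcript = re.sub(
            rf"\b{re.escape(variant)}\b",
            f'<PLACE norm="{normalized}">{variant}</PLACE>',
            transcript,
        )
    return transcript

print(lemmatize_place_names("Found at Røken in 1959."))
# -> Found at <PLACE norm="Røyken">Røken</PLACE> in 1959.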

The acquisition catalogues

<topark_1.jpg>

Fig. 1: Process description for digitization of the Museum Catalogues.

For almost 170 years, all archaeological museums in Norway have annually published descriptions of their acquired artifacts in specially prepared acquisition catalogues. The descriptions of finds in these catalogues are quite verbose, including extensive information on the finds, the find contexts, place and time, and the finder or excavator, as well as detailed descriptions and classifications. These sets of catalogues thus serve as the main inventory descriptions for each museum.

When the Documentation Project started in 1992, it was decided that the project would create an artifact database based on the printed catalogues and a few unprinted older addenda. To what extent did the processing of these documents conform to the guidelines outlined above? We did create a bibliographical record for each catalogue, as described in step one. Unfortunately, facsimiles of the pages were not kept, due to the cost of storage in the early 1990s. The catalogues were then OCR-read and proofread against the originals, and the place names and object terms were lemmatized.

Later, the electronic versions were marked up with tags for their semantic content, in accordance with our ontology. All in all, we kept close to the ideal methodology.


<CATYEAR>
<INTRO><MNAME>UNIVERSITETETS OLDSAKSAMLING TILVEKST 
<YEAR>1989</INTRO>
[...]
<NRPAR><CATNR nrid="37267">C.37267.
<ARTEFDATA><ARTEFACT>Axe</ARTEFACT> of <MAT>iron</MAT>
from <AGE>Late Medieval time</AGE>.
<ARTEFDATA><MEAS>L: 141mm</MEAS>, <MEAS>edge W:109mm</MEAS>. 
Carpenter's axe with <FORM>specially shaped</FORM> blade to 
accommodate the fingers when it is held just "behind" the edge. 
Particularly necessary for fine work, when used at an angle 
to the edge, or as a gouge.
<SHARED>Found <FINDLOC>on the hill, about 
300m above <LOC>ÅROS KAPELL</LOC>, <FARM>SJØGLØTT</FARM>, 
<PARISHD><PARISH>ÅROS</PARISH></PARISHD>, <MUNICIPALITY>RØYKEN</MUNICIPALITY>, 
<COUNTY>BUSKERUD</COUNTY></FINDLOC>, in 1959 by <FINDER>Berge Narvik</FINDER>,
Tjernsrudveien 24, Jar, Oslo. </SHARED> </NRPAR>

Fig. 2: Fully encoded catalogue paragraph, example.

The resulting encoded catalogues were indexed by the contents of the tags and made searchable as independent texts. In addition, the contents of the tags were imported into a relational database, the find database, as the original version of the object classification. A total of approximately 20,000 catalogue pages have been processed in this way, giving us a total of 150,000 database records.
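
To make the last part of step 4 concrete, the sketch below shows how the contents of the tags could be pulled out of a well-formed XML rendering of a catalogue paragraph and inserted into a relational table. It is a minimal sketch in Python; the simplified sample record and the table and column names are illustrative assumptions, not the actual find database schema.

# A minimal sketch of importing tag contents into a relational database,
# assuming a well-formed XML rendering of the encoding shown in Fig. 2.
import sqlite3
import xml.etree.ElementTree as ET

SAMPLE = """
<NRPAR><CATNR nrid="37267">C.37267.</CATNR>
  <ARTEFDATA><ARTEFACT>Axe</ARTEFACT> of <MAT>iron</MAT>
  from <AGE>Late Medieval time</AGE>.</ARTEFDATA>
</NRPAR>
"""

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE find (catnr TEXT, artefact TEXT, material TEXT, age TEXT)")

root = ET.fromstring(SAMPLE)
con.execute(
    "INSERT INTO find VALUES (?, ?, ?, ?)",
    (
        root.find("CATNR").get("nrid"),
        root.findtext(".//ARTEFACT"),
        root.findtext(".//MAT"),
        root.findtext(".//AGE"),
    ),
)

print(con.execute("SELECT * FROM find").fetchall())
# -> [('37267', 'Axe', 'iron', 'Late Medieval time')]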

Topographical archives

<topark_3.jpg>

Fig. 3: Process description for digitization of  the Topographical Archives.

All the archaeological museums are in possession of a so-called topographical archive, spanning approximately the same period as the acquisition catalogues. A topographical archive normally contains all letters, reports and drawings concerning sites and monuments, archaeological finds and excavations, as well as photographic documentation and drawings of both the objects and the ancient monuments in the district of the museum. The archives are sorted according to a geographical-administrative hierarchy (county, municipality, farm), hence the name. The contents of the topographical archives describe in full the external activity of a museum from the day it was established and represent a very important post-excavation source of archaeological information. At two of the museums, we have digitized these topographical archives.

Again, let us take a look at how these documents were processed compared with the ideal process. First, we created an archival or bibliographical record of each document with a set of general archival data, such as the provenance of the original paper version, the type of document (photo, map, letter, report, etc.), sender (author), address, date and number of pages, as in step one.
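
A minimal sketch of such an archival record as a simple data structure is shown below; it is written in Python, and the field names and example values are illustrative assumptions rather than the actual database schema.

# A minimal sketch of the archival record created for each document in
# step one; field names are illustrative, not the actual database schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ArchiveRecord:
    provenance: str            # provenance of the original paper version
    doc_type: str              # photo, map, letter, report, ...
    sender: str                # author of the document
    address: Optional[str]     # sender's address, if given
    doc_date: Optional[str]    # date of the document, e.g. "1959-06-01"
    pages: int                 # number of pages

# Hypothetical example record:
record = ArchiveRecord(
    provenance="Topographical archive, Oslo",
    doc_type="report",
    sender="N.N.",
    address=None,
    doc_date="1959-06-01",
    pages=3,
)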

We then scanned the documents and saved them as facsimiles. Since these archives are not published books, but dozens of meters of heterogeneous documents, we considered the task of converting them into electronic text as too extensive at this point.

<topark_4.jpg>

Fig. 4: Identified information bits from an Archive Document, example.

Thus, we did not carry out any transcription but extracted the content data by reading and entering the information directly into a database. This, of course, gives us a weaker reference to the documents.

Each document was grouped within a single case of historical activity and organized in chronological sequence within that case. We also made references to all relevant farms, finds and ancient monuments occurring in the documents. In the database we created searchable indexes for the archival data and the content data, as well as for the facsimiles. A total of approximately 200,000 document pages were processed in this way.
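
A minimal sketch of how the documents could be grouped into cases, linked to farms, finds and monuments, and indexed is shown below, using Python and SQLite; the relational layout and the table and column names are illustrative assumptions, not the actual database schema.

# A minimal sketch of grouping documents into cases and building searchable
# indexes; table and column names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE doc_case (
    case_id   INTEGER PRIMARY KEY,
    title     TEXT                  -- e.g. an excavation or a survey
);
CREATE TABLE document (
    doc_id    INTEGER PRIMARY KEY,
    case_id   INTEGER REFERENCES doc_case(case_id),
    doc_date  TEXT,                 -- gives the chronological order within a case
    doc_type  TEXT,
    facsimile TEXT                  -- path to the scanned page images
);
CREATE TABLE doc_reference (        -- links to farms, finds and ancient monuments
    doc_id    INTEGER REFERENCES document(doc_id),
    ref_type  TEXT,                 -- 'farm', 'find' or 'monument'
    ref_id    TEXT
);
CREATE INDEX idx_doc_case ON document(case_id, doc_date);
CREATE INDEX idx_doc_ref ON doc_reference(ref_type, ref_id);
""")

# All documents belonging to one case, in chronological order:
rows = con.execute(
    "SELECT doc_id, doc_date, doc_type, facsimile "
    "FROM document WHERE case_id = ? ORDER BY doc_date",
    (1,),
).fetchall()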

Ontology shifts

'Ontology' was and is a philosophical term denoting the study of being. The artificial intelligence community tends to use it to denote a specification of a conceptualization, or simply a model of some part of reality. In computer science the term 'ontology' is often synonymous with 'data model'.

The archive material accumulated over the last 180 years illustrates ontological shifts in the purely philosophical sense, caused by the scholars' own view of the world and the implied view of the past. Scholars' views of the writings of their predecessors change over time, and with them what is considered important in an old report.

<topark_5.jpg>

Fig. 5: Intersection between the areas of interest of the original author and the digitizing staff.

Step four of the methodology, the encoding of the semantic content, will necessarily reflect our view of what constitutes important archaeological information. This will not necessarily coincide with the conception of the 19th-century author of the original material. The original author included information according to his conception, from which we code the subset that matches our conception or ontology.

Since the old material was created over a period of more than 180 years, it varies significantly both in which types of information are emphasized and in style.

The methodology used in converting this material gives us additional information on this aspect. Registers of persons who played an important role in the history of the museums have been generated. We also have the sequences of activities that this history is made up of, giving us an opportunity to see to what extent each individual's "style" influences the content. By connecting all of this, we will be able to describe the history of the institutions.

With respect to new material and new activities at the museums, we are creating a much more unified standard. But because a standard of this kind necessarily expresses a particular archaeological school, and will probably be changed when necessary in the future, we are developing a system in which the standards themselves, represented as form definitions/DTDs and taxonomies, are preserved and stored.

New material included at a later stage will thus refer to a specific standard that is present in the database, together with information such as author and date. One implication of this is that when the standard is upgraded, there is no need to upgrade older material that has been entered according to an old standard. One only needs to perform a mapping between the old and the new standard.
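
A minimal sketch of such a mapping is shown below, in Python; the standard identifier, the field names and the mapping table are hypothetical, and the point is only that the stored record is never modified.

# A minimal sketch of viewing a record entered under an old standard through
# a newer one, without changing the stored original; names are hypothetical.
OLD_TO_NEW = {
    "FINDLOC": "find_location",
    "AGE": "dating",
    "MAT": "material",
}

def view_as_new_standard(old_record: dict) -> dict:
    """Present a record stored under the old standard in the new vocabulary.
    The stored record itself is left unchanged."""
    return {OLD_TO_NEW.get(field, field): value
            for field, value in old_record.items()}

stored = {
    "standard": "catalogue-DTD-1992",   # hypothetical standard identifier
    "MAT": "iron",
    "AGE": "Late Medieval time",
}
print(view_as_new_standard(stored))
# -> {'standard': 'catalogue-DTD-1992', 'material': 'iron', 'dating': 'Late Medieval time'}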

This will ensure that the data is preserved in its original form and that no information is lost during upgrades. It will also give us and future users of the systems an overview of the development of archaeology as a science and reveal within which school the various reports were written.

Conclusion

Through the process of converting the original paper-based catalogues and archives at the archaeological museums into digital form, observing as far as possible our defined principles for this kind of conversion, we have created reliable relational databases of archaeological archive and artifact information. We have also created the opportunity to work directly with the source material by building an electronic text archive with powerful indexes for both the catalogues and the documents. This makes it possible to study them as individual historical objects.

Through the coding of the source material according to a comprehensive ontology we have also, in addition to "normal" database functionality, laid the groundwork for describing the history of the museums and the history of Norwegian archaeology. We have also established a system in which the principles that today's archaeologists use in their documentation work can be continually recorded, thus creating what will become the history of the future.