Revised version of paper presented at the Association for History and Computing conference, London, 1999. Published in: Marilyn Deegan and Harold Short (eds) DRH99: A Selection of Papers from Digital Resources in the Humanities 1999 (London: Office for Humanities Communication, 2000), pp. 83-102.

Retrospective Data Conversion in a Commercial Publishing Environment

The Calendar of State Papers, Colonial

Brad Scott

This article presents a case study of the editorial and production processes undertaken within a commercial publishing company to prepare an SGML data set for the retrospective conversion of an existing print resource. The data was to be published on CD-ROM targeted at research libraries using an extension of commercially available software, while allowing for possible enhancements for future re-use. The article highlights some lessons learned in the process; it can also be used to illustrate the parallels between data development in an academic or library setting and within a publishing environment. It may also provide some concrete examples for ongoing debate about the theory of text markup.

* * * * * *

Originally published in print form in forty-five volumes from the mid-nineteenth century to 1994, the colonial calendar series of British archival materials focuses on two broad subject areas: North America and the West Indies from 1574 to 1739; and China, Japan, and Persia, 1513-1634. The calendars are extensively used by historians, who value the range of materials they include and their extensive indexing. However, there are a number of limitations to the print versions, which suggested to us and the Public Record Office (PRO) that the creation of an electronic edition was desirable.

In principle, an electronic edition could offer full text searching of the calendar text in a structured way so as to make retrieval of the information possible in novel ways, and could include linked images of the original documents. However, given a finite budget, one has to devise the most appropriate means of producing an intellectually coherent resource, which can have the facility to be expanded and developed in the future. This article considers the design and functionality issues for this rather specific type of source material, and illustrates the roles of the archive and publisher in this activity. It particularly covers many of the data issues that were dealt with during the project, and as such illustrates the practicalities of using encoding schemes such as that of the Text Encoding Initiative (TEI) for retrospective conversion projects within a commercial publishing environment.

1 Background to the Project

In 1994 Routledge was busy examining the potential of electronic media for our markets, and naturally enough, history-related materials figured fairly largely in our discussions. During the following year we started talking about a number of avenues with the Public Record Office in London, initially focusing on what has come to be known through the work of the Model Editions Partnership as ‘image editions’.[1] The first such title in this category, Asia: Official British Documents, 1945-1965, appeared early in 1999.[2]

Through 1996, our main focus electronically was on the Arden Shakespeare, though research into other projects continued in parallel, and the idea of doing something electronically with the calendars began to emerge as a serious prospect. As I have described elsewhere,[3] the work on Arden suggested to us and the software company with which we were working, Database Publishing Systems Ltd, that for future projects it would be prudent to develop a generic extension to DynaText, the SGML browser software we were using, that would facilitate the design and construction of new products, whether they were encyclopedias or collections of historical texts. Work on this began in 1997, and a number of titles have been produced using this route.

In parallel with this ‘Generic Browser’ work, we undertook extensive consultation with the PRO and with historians and librarians about which of the calendar series would be most appropriate for initial electronic development, and to assess likely functionality and design features that we would need to consider. Considering the international character of the market for electronic data, material relating to North America was appealing from the beginning, and the large quantity of data in the Colonial series (45 volumes, equivalent to about 65 Mb of plain, untagged text) meant that it was relatively straightforward to devise an acceptable financial model for the project. That is, with a given fixed development cost for software and core data conversion, a smaller data set would have to be priced unrealistically high per quantity of data to make it viable. The colonial material was of course also attractive due to the broad time period and range of materials it covered. By the end of 1998, this process and the data analysis were complete, and work began in earnest early in 1999. Even so, the project was complicated by the liquidation of the software company in the summer of 1999, though we were lucky to have considerable personal continuity with many of the people involved when the project transferred to STEP UK Ltd later in the year.

2 The Colonial Calendar

The Calendar of State Papers, Colonial series is one of many such collections initiated in the nineteenth century. They were designed to contain summaries and abstracts of documents which at the time it was thought would render it unnecessary for most purposes to consult the originals, and the materials were collected together in the printed edition in chronological order to facilitate their use. Calendaring of documents was already underway at the State Paper Office when the Public Record Office was created in 1838, and when the two offices were merged in 1854 the latter continued and extended the publishing programme, producing the first of the colonial series in 1860. Most of the volumes were published by 1939, with the remaining five appearing up until 1994. With rising production costs, many calendar series came to an end, and, with the death of the last editor, that of the colonial documents was also terminated. Historically then, the series extends to 1739, on the eve of the War of the Austrian Succession, known in North America as King George’s War; thereafter, the volume of relevant colonial North American material at the PRO mushrooms, and it is conceivable that the series could never have been extended much further in its traditional print form. With the advent of digital technologies, an alternative now exists, and it is possible to imagine some form of sustainable model whereby the documents could be made available, enabling access to the materials much as was originally conceived for the printed calendars.

The Calendar includes documents in over 40,000 transcripts and extended abstracts. This material covers a wide range of subject areas, including: correspondence to and from colonial governors; orders and grants from central government to local administration; information from the localities and information about the slave trade and piracy. The texts include details relating to agriculture; boundary disputes; administration; reports of conferences with Native Americans; plantations; immigration; land grants; legislation; industries such as ship building and fisheries; relations with French, Spanish, Dutch, including intercepted letters; trade; privateering; war; and reports of court cases.

These are justly valuable sources, but there are a number of problems with the print calendars. Some material was missed out in the original series and has never been calendared, and, though the indexes are good for personal and place names, they are weak for searches on more general topics, such as on specific trades, social historical themes, etc. There are also numerous addenda, which means that much data is not situated correctly in the strictly chronological structure of the Calendar, and though some of the very early volumes included material on China, Japan, and Persia, this was not continued for the bulk of the series. For about half the volumes, the document references are in an old form, which is no longer in use at the PRO, where almost all of the original documents reside, which means that it can be difficult to locate the originals. In addition, the physical size and number of the volumes are often an impediment to use, the introductions within the volumes are considerably out of date, and a small number of documents were inadvertently omitted when the Calendar was originally created.

Given all these considerations, Routledge and the PRO together took the view that an electronic corpus of the series should remedy as many of these shortcomings as possible. Consequently, the main tasks we identified as necessary for rendering the Calendar in a sensible form were:

  1. To omit the volumes relating to Asia, East Indies and Persia, which makes the content more uniform;
  2. To include the indexes and merge them together to make a single reference source;
  3. To omit the existing introductions, but add a new shorter introduction, including a short broad historical overview, and administrative history;
  4. To update the document references to their modern form; and
  5. To add the extra material that was never included in the Calendar.

Furthermore, we also took the decision not to include images of the original documents. With our experience on other projects, and from the feedback received from historians, we felt that adding images would unnecessarily complicate the project, and that it would be a useful resource even without them. Containing abstracts and transcripts of some 44,000 documents, the colonial calendar series draws on well over 100,000 pages of original manuscript material. This alone rendered it at least twice the size of Asia: Official British Documents, which we were working on at the time, and the effort and expense involved in identifying, capturing and relating the image files to each document were not felt to be justifiable for the present. Moreover, as we envisaged the title on CD-ROM, the images would have required over 7 Gb of disk space, which would have been a deterrent to users. However, as we were also thinking of this resource in the longer term, we could imagine a time when it could be available on the web, perhaps linked to the PRO’s own finding aids, at which point document images could also be added, if cost and demand justified it.

In the rest of this article I will describe the technical and organisational consequences of the most significant of the design decisions in the context of the required software functionality, data capture and processing.

3 General requirements and overview of the data

Extract showing the original printed version

Figure 1: The print edition [4]

Figure 1 shows a typical entry in the Calendar, in this case relating to Massachusetts. It has a main item number which is unique to the volume in which it was published, and there is one sub-item, which is an attached document. The first sentence of the main item is a short description of the content; in this case it simply records that it is an order in council, but may also include details such as the names of the sender and recipient of a letter, ‘Minutes of the Council of Georgia’ etc. There are a number of editorial additions, generally given in italic, and the source reference for the entire item is given in square brackets at the end. In addition, each item and many sub-items have a date (usually Old Style, and the original editors of the individual volumes adopted the convention of beginning the year on 1st January) or place of origin noted in the margin. The text itself is, in this instance, a summary of the document; however, many of the documents include fairly extensive transcripts of the originals.

If we compare this entry with one of its original sources, in this case from one of the bound volumes of Colonial Papers from 1685 [5], we see that the Calendar entry is an accurate but succinct summary of the original three-page text; it does not, however, include some of the fine detail of the manuscript, such as the remark in the petition that Robert Orchard’s father “left both his life & Estate (in the late unhappy Civill War) in the service of his then Maty”. Many items in the Calendar comprise much fuller transcripts, though they rarely include information on authorial amendments; neither do they include marginal notes or salutations. In short, as is usual with calendar series, it reflects a view of an edition of historical materials that differs from other documentary editions. Furthermore, though Figure 1 is a typical item, because the series was prepared by a number of editors over a 130-year period there are some notable differences in style and in typographic representation of the material across the printed volumes, which have some implications for any attempt to create a completely generic data capture specification for all the volumes. In particular, volume 1 is especially irregular in its structure.

Given such a body of data, which is in many ways structurally consistent, there are on the face of it a number of options open to someone thinking about preparing an electronic version of the corpus. In planning this, foremost in our minds was a view of what most users would want to do with any resource. Full text searching limited by date range is the most obvious requirement, and users who know the PRO’s collection may also want to search by document reference number. Users may also want to limit their searches to the content of the ‘title’ or ‘heading’, that is the first sentence of the item, and as many existing publications refer to the Calendar series by volume and item number or page, we needed to support the identification of material so referred.

Figure 2 shows how the data appeared in the finished product, which is largely unchanged from our original goals. The following sections detail some of the data capture and conversion issues involved in the path from our initial goals to the final implementation.

How it looked in the DynaText application

Figure 2: Document display in final application

4 Markup

We used TEI as our base for the data markup, and adopted a small number of the extensions to it made by the Model Editions Partnership (MEP)[6]. There were also a number of instances where our data was processed differently, either to support the searching or other aspects of the software functionality, or to accommodate components such as the PRO source references. Figure 3 shows the data for the example above in its final form.

<surrogate id="V12-E000204" itemno="190" volno="12" pagestart="43"
pageend="43" volrange="1685-1688" type="vol" startdate="16850513"
enddate="16850513">
<dateline>
<date>May 13, 1685</date>
<place>Whitehall</place>
</dateline>
<head>Order of the King in Council</head>
<docbody>
<p>Referring the petition of Robert Orchard to Lords of Trade and
Plantations, to consider an instruction thereon to be given to Colonel
Kirke, who is going Governor to New England. <hi
rend="ITALIC">Signed</hi>, Phi. Lloyd. <num
type="fraction" value="1/2">&frac12;</num> <hi
rend="ITALIC">p. Annexed</hi>,</p>
</docbody>
<enclosure id="V12-E000205" itemno="190 i" volno="12"
pagestart="43" pageend="43" volrange="1685-1688" type="vol">
<head>Petition of Robert Orchard</head>
<docbody>
<p>Recounting the story of his grievance against the Government
of Massachusetts, and the report of the Lords of 3 November 1683
(<hi rend="ITALIC">see above</hi> No. <xref
xrefid="V11-E001572">1352</xref>) delaying redress till the Charter of
Massachusetts should have been vacated. The Charter being now void,
prays for relief. <hi rend="ITALIC">Copy.</hi> 2 <hi
rend="ITALIC">pp. The whole endorsed.</hi> Recd. 5 May. Read
18 May 85. </p>
</docbody>
</enclosure>
<sourcenote><pro.reference lettercode="CO" classnumber="1"
piecenumber="57">
<class.code>
<letter.code>CO</letter.code>
<class.number>1</class.number>
</class.code>
<piece.number>57</piece.number>
<folio>Nos. 123, 123 I., </folio></pro.reference> and
(order only) <pro.reference lettercode="CO" classnumber="5"
piecenumber="904">
<class.code>
<letter.code>CO</letter.code>
<class.number>5</class.number>
</class.code>
<piece.number>904</piece.number>
<folio>p. 354</folio></pro.reference></sourcenote>
</surrogate>

Figure 3: Final data

In my reading of them, the TEI and MEP guidelines are primarily valuable for descriptions of historical documents where the project is initiating its work from the original sources. For the Calendar project we realised that we were not simply encoding this data to represent the printed books as it made most sense for the data to be corrected, inserted and moved. After all, the print edition is itself only a secondary witness to the documents, and did not have any special status that required it to be encoded as an object in its own right. The MEP guidelines were useful as they accommodate the creation of editions containing abstracts, either in addition to or instead of full transcriptions and provide the <surrogate> element for this. For the retrospective conversion of the Calendar, <surrogate> was the ideal container element for each numbered item since the entire corpus is of document surrogates in some form or other. Indeed, <surrogate> is suitably ambiguous given the context of the numerous editorial and typographic layers between the primary source and the captured, converted and cleaned data.

Looking at Figure 3, we can note the following:

  • we chose to include the metadata attributes in the <surrogate> element. Though there are <place> and <date> elements within it, the relevant attributes are not attached to them.
  • item numbers from the print volumes were retained, but were not given the primacy they had. This decision was taken once we had agreed to move material from the Addenda to its correct position within the chronological sequence. As this would generate non-sequential item numbers and did not respect the integrity of the print volumes, it made most sense to add the bibliographic information to each item, detailing from which volume or addendum the material was derived, and the page(s) it appeared on. All of this information is automatically generated in the conversion and added to the attribute values in the <surrogate> element. This data is used to support the retrieval of items where users have a reference to the print source. It is also displayed in the final application using stylesheets. This approach meant that we could easily insert items that previously appeared in Addenda, yet still locate them given old citations; explicitly, the type attribute is used to indicate whether the item was originally located within the body of the volume, an addendum, an appendix, or whether it was a new, previously uncalendared, item.
  • the <dateline> element from TEI contains the date and place information of each item, but there are issues here, as it is not always obvious in the Calendar what the relationship is between the original manuscript and the printed text, and what editorial intervention there has been. Furthermore, the conversion process itself added the implied year within the <date> element, so it is far from transparent in the digital data what the provenance of the <dateline> data is.
  • <dateRange> has not been used; instead, attribute values of startdate and enddate are included in the container element for each document item, and support the date searching.
  • <docbody> also contains all the additional editorial information about page length, endorsements, docketings etc, which may not be ideal within the TEI/MEP guidelines, though it was the only practical solution given the format of the Calendar edition.
  • the <head> element comprises the first sentence of each item. Using stylesheets, this ‘title sequence’ can be rendered differently from the main text, in this case in red, with the result that users can more clearly see what the documents are than was possible in the print editions, where the same typographic representation of the words helped to render key information relatively invisible. Users can also limit their searches to this element, and using standard query expressions can search for, say, letters to or from a given person or body even though this has not been explicitly marked up in the data.
  • The text noting an internal cross reference to an item in a preceding volume has been changed to remove its book-related character.
  • All the data components within the PRO source reference have been explicitly encoded; most are also stored as attribute values within the container <pro.reference> element to facilitate the searching. Through this markup, users can limit their searches to known document classes, or simply retrieve all items from, say, a given journal of the Board of Trade and Plantations.
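As a minimal sketch of how the attributes on <surrogate> can support the two kinds of retrieval described above (date-range searching and look-up by print citation), the following Python fragment reads the attributes from the start tag in Figure 3. It is an illustration only and not the DynaText implementation; the function names and the regular-expression attribute reader are assumptions introduced here.

```python
import re

# Hypothetical helper: matches name="value" pairs inside an SGML start tag.
ATTR = re.compile(r'(\w+)="([^"]*)"')

def surrogate_attrs(start_tag):
    """Return the attributes of a <surrogate ...> start tag as a dict."""
    return dict(ATTR.findall(start_tag))

def in_date_range(attrs, start, end):
    """True if the item's startdate..enddate span overlaps start..end.

    Dates are YYYYMMDD strings, so plain string comparison orders them.
    """
    return attrs["startdate"] <= end and attrs["enddate"] >= start

def matches_citation(attrs, volno, itemno):
    """True if this is the item cited by print volume and item number."""
    return attrs["volno"] == volno and attrs["itemno"] == itemno

# Start tag taken from Figure 3.
tag = ('<surrogate id="V12-E000204" itemno="190" volno="12" pagestart="43" '
       'pageend="43" volrange="1685-1688" type="vol" startdate="16850513" '
       'enddate="16850513">')
attrs = surrogate_attrs(tag)
```

Storing the dates as eight-digit strings on the container element means that a range search needs no knowledge of how the date is worded inside <dateline>.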

5 Data capture

The previous section detailed the form of the finished data, but how was this created? In getting the data captured we clearly needed a system that would allow us to process the data to support the core functionality within a DynaText application. The easiest solution was for us to send photocopies of all of it to a keying company in Bombay to capture it in SGML, using a minimal markup based on typographic features, which we could then enrich and convert to a more TEI-like form. This was sensible as many keying companies charge per thousand characters of delivered data, and so it was felt to be prudent to keep the tagging to a minimum. And, as we were sure that, despite extensive data analysis, there would be data modelling, conversion and correction issues that we were not aware of (as some of these things only become apparent when you actually have the electronic data), it made most sense to have control of the data processing in the UK.

It was not possible to send a complete set of the physical volumes to the keying company; the set we worked from was a copy from the PRO that contained a number of handwritten amendments and corrections, which we wanted to incorporate into the data. To facilitate this, both Routledge and the data capture company needed a copy of the text, so multiple photocopies had to be made. Though this was outsourced and the copies produced were of a fairly consistently high quality, there were several issues that subsequently had to be dealt with. As the keying company worked through the data it became clear that several pages were missing from the photocopies; this necessitated repeated visits back to the PRO to locate the material. Furthermore, because many of the volumes were very large with tight bindings, some pages had not photocopied well, with much of the text disappearing into the gutter. Despite our requests that any oddities be reported to us during the capture process, this was only discovered when the data was delivered to us and we found an extremely large number of sections commented out as ‘illegible’.

Figure 4 shows the data as received back from the capture process. Its markup closely reflects the typographic arrangement of the printed text. [7]

<it><d>May 13.
<pl>Whitehall.
<n><b>190.</b> <p>Order of the King in Council.
Referring the petition of
Robert Orchard to Lords of Trade and Plantations, to consider an
instruction thereon to be given to Colonel Kirke, who is going
Governor to New England. <i>Signed</i>, Phi. Lloyd.
<fr>1/2</fr> <i>p. Annexed</i>,
<sit><n>190. I. <p>Petition of Robert Orchard.
Recounting the story of
his grievance against the Government of Massachusetts,
and the report of the Lords of 3 November 1683 (<i>see
preceding volume, No.</i> 1352) delaying redress till the
Charter of Massachusetts should have been vacated. The
Charter being now void, prays for relief. <i>Copy.</i> 2
<i>pp.
The whole endorsed.</i> Recd. 5 May. Read 18 May 85.
[<i>Col. Papers, Vol. LV., Nos.</i> 123, 123 I.,
<i>and</i> (<i>order only</i>) <i>Col.
Entry Bk., Vol. LXI., p.</i> 354.]

Figure 4: Data as captured

The main things to point out here are that the year information has to be inferred from previous entries, the first sentence is not explicitly tagged, empty elements detailing page breaks in the original print edition were dotted throughout the material, and the document reference is not identified as such. In addition, we used tag minimisation extensively, and marked anything that looked like a fraction (using <fr> tags).
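The inference of the year from previous entries can be sketched as follows. This is a hedged illustration in Python and not the conversion script used on the project: it assumes that a marginal date sometimes carries an explicit four-digit year (for example "1685. May 11.", a format assumed here for the sake of the example) and that later entries such as "May 13." inherit it. The function name is hypothetical.

```python
import re

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Optional four-digit year, then a month name (possibly abbreviated), then a day.
DATE = re.compile(r"(?:(\d{4})\.?\s+)?([A-Z][a-z]{2})[a-z]*\.?\s+(\d{1,2})")

def expand_dates(marginal_dates):
    """Turn marginal dates into YYYYMMDD strings, carrying the year forward.

    Returns None for any entry that cannot be resolved, so that it can be
    written to a log and checked by hand.
    """
    year = None
    resolved = []
    for text in marginal_dates:
        match = DATE.search(text)
        if not match:
            resolved.append(None)
            continue
        if match.group(1):
            year = int(match.group(1))  # an explicit year resets the context
        month = MONTHS.get(match.group(2))
        if year is None or month is None:
            resolved.append(None)
            continue
        resolved.append(f"{year:04d}{month:02d}{int(match.group(3)):02d}")
    return resolved
```

Because every later entry depends on the last explicit year, a single misread year propagates to all following items, which is one reason the generated dates needed checking by eye.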

6 Data conversion issues

6.1 Initial checks

On receiving the captured data, after a series of trials STEP UK converted it to something similar to that in Figure 3, which was the version used for internal and external checking. This version primarily differed in terms of element names and some of the fine detail of the attributes, though had the same essential structure. In addition to the converted data, STEP also supplied us with a number of log files which abstracted the key areas of the data that initial trials had identified as potentially error-prone. Using these, we ran some in-house checks on some of them to make sure, for instance, that all the items and pages were there and in sequence, and to identify any numbered items with no dates. The latter were often implied by context; it was very rare for there to be no clear date for an item.
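A check of this kind might look like the following Python sketch, which assumes that the item numbers and generated start dates have already been extracted from a log into a list of pairs. The data layout and the function name are assumptions made for illustration; the log formats actually supplied by STEP are not described here.

```python
def check_items(items):
    """Report gaps in the item sequence and items without a date.

    items: list of (item_number, startdate_or_None) pairs in document order
    for a single volume.
    """
    problems = []
    expected = None
    for number, startdate in items:
        if expected is not None and number != expected:
            problems.append(f"expected item {expected}, found {number}")
        if not startdate:
            problems.append(f"item {number} has no date")
        expected = number + 1
    return problems
```

For example, `check_items([(189, "16850511"), (190, "16850513"), (192, None)])` reports both the missing item 191 and the undated item 192.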

6.2 Freelance work

Once these quick assessments were complete, the data was sent out to a small number of freelance SGML editors to undertake the more predictable, time-consuming checking. This was not a completely straightforward process, though; it took two or three months of conversion and log-generation trials before we were reasonably happy with both the processing and the explanatory documentation for the freelancers. Even then, the resulting documentation was not ideal, comprising as it did 24 pages of explanation and detailed examples of the wide range of oddities that we had found during our early work with the data. In practice, it was difficult to write such documentation clearly and still give adequate context of the project and its aims so that the freelancers would be able to make informed decisions when cleaning the data. Armed with this documentation, the freelance editors undertook the following checks [8]:

  • One log contained the contents of <head> plus the following 30 characters for each volume; this was to check that the conversion process had correctly captured the entire first sentence, as irregular abbreviations and some other features truncated the contents. In addition, sometimes the <head> element erroneously included place and date information, and a tiny number had no descriptive first sentence at all; these were subsequently devised, either by the freelance editor, or one of the in-house team. Finally, some of the descriptive first sentences were of the form “The same to Lord Sunderland”, where information was implied from the previous item; a sensible space-saving device in a print edition, but inappropriate in digital form. All of these were expanded to their full form, which process was largely done manually, since trials had found inconsistencies across the volumes that rendered any automatic rule-based processing unreliable.
  • The contents of the <place> element were checked to make sure such things as dates and, very occasionally, sources hadn’t got caught there in the capture and conversion.
  • The <date> elements were examined, primarily to make sure that the correct year had been added by the processing, since often it could only be implied from the printed page; documents spanning a date range in some of the early volumes in the series were particularly prone to generating date errors in subsequent items. Furthermore, it was vital that anything that looked like a date in Old Style/New Style form was closely examined, as the auto-generated startdate and enddate attributes in the <surrogate> were always wrong; however, some such dates were in fact real date ranges, rather than simply calendrical variants of the same date.
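The first of these logs, pairing each descriptive first sentence with the 30 characters that follow it, can be approximated with the Python sketch below. It assumes the first sentence sits in a <head> element as in Figure 3; the function name, the separator and the size of the look-ahead window are choices made here for illustration, not a record of the log format used on the project.

```python
import re

HEAD = re.compile(r"<head>(.*?)</head>", re.DOTALL)
# Also matches a tag that is cut off at the end of the look-ahead window.
TAG = re.compile(r"<[^>]*(?:>|$)")

def head_log(sgml, context=30, window=300):
    """List each <head> with the next `context` characters of running text."""
    lines = []
    for match in HEAD.finditer(sgml):
        head = " ".join(TAG.sub(" ", match.group(1)).split())
        # Strip markup from a window after </head>, then normalise spacing.
        following = TAG.sub(" ", sgml[match.end():match.end() + window])
        following = " ".join(following.split())[:context]
        lines.append(f"{head} | {following}")
    return lines

sample = ("<head>Order of the King in Council</head>\n<docbody>\n"
          "<p>Referring the petition of Robert Orchard to Lords of Trade</p>")
log = head_log(sample)
```

Reading down such a log, a truncated first sentence shows up as a heading that stops at an abbreviation while the text after the separator clearly continues it.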

These checks took over 400 hours, equivalent to about 2.7 hours per Mb of tagged data. However, as the following sections illustrate, there was a considerable number of other tasks that needed to be performed. Invariably, these not only required a sophisticated understanding of the data and the project, but (apart from the work on the source references) were each of a scale that made it very difficult to hand them over to freelance data checkers; by the time one had understood the nature of the issue to be resolved, prepared a comprehensive briefing, and the freelancer had mastered the task, it was certainly quicker for the lead members of the team to have done the work themselves. Such work probably tied up the two main resources solidly for three months each.

6.3 Document sources

In parallel with this work, the data detailing the PRO source references was also cleaned. The first eighteen volumes included source information in the form shown in Figure 1, which is no longer in use at the PRO. To improve the utility of the digital edition, it was clearly necessary to convert these to their modern form. These were extracted from the main SGML data and a look-up table created using Excel, so the new versions could be dropped back in. Like everything else, this was not without its difficulties; though most sources could be identified as the last piece of information within an item within square brackets, this was not universally true, resulting in a number of sources remaining undiscovered and unconverted until relatively late in the process.
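The look-up step can be sketched in Python as below. The two table entries are the only ones that can be read off Figures 3 and 4 (the old-style references in the captured data and their modern equivalents in the final data); the dictionary stands in for the Excel table, and the regular expression and function name are assumptions that cover only these two old-style series names.

```python
import re

# Stand-in for the look-up table: (old series name, roman volume number)
# mapped to (letter code, class number, piece number).
LOOKUP = {
    ("Col. Papers", "LV"): ("CO", "1", "57"),
    ("Col. Entry Bk.", "LXI"): ("CO", "5", "904"),
}

# Illustrative pattern only: recognises the two series named above.
OLD_REF = re.compile(r"(Col\. (?:Papers|Entry Bk\.)), Vol\. ([IVXLC]+)\.")

def modernise(source_text):
    """Return the modern reference for each old-style reference found.

    References missing from the table come back as None so that they can
    be listed for manual resolution.
    """
    return [LOOKUP.get((series, volume))
            for series, volume in OLD_REF.findall(source_text)]

old = ("Col. Papers, Vol. LV., Nos. 123, 123 I., and (order only) "
       "Col. Entry Bk., Vol. LXI., p. 354.")
new = modernise(old)
```

A pattern-based extraction like this only finds sources that follow the expected form, which is consistent with the experience that irregularly placed sources stayed undiscovered until late in the process.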

There were a number of other issues relating to the source data that we also encountered. Volumes 19 to 45 used the modern reference form, which could generally be processed automatically, even in those instances where parts of the document reference number were implied. This was itself testimony to the incredible consistency of the original edition. Though the conversion script worked perfectly for the vast majority of these cases, the rare irregular use of punctuation in the original print edition sometimes caused errors. Such mistakes were picked up by outputting the contents of <sourcenote> minus markup to a log, from which any irregularities could be identified and manually corrected in the data [9].
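Writing out the contents of <sourcenote> without markup is straightforward to sketch. The Python fragment below is an assumption about how such a log could be produced, not the script that was used; the function name is hypothetical and the sample is the second reference from Figure 3.

```python
import re

SOURCENOTE = re.compile(r"<sourcenote>(.*?)</sourcenote>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")

def sourcenote_log(sgml):
    """Return the text of every <sourcenote>, with tags removed."""
    return [" ".join(TAG.sub(" ", body).split())
            for body in SOURCENOTE.findall(sgml)]

sample = ('<sourcenote><pro.reference lettercode="CO" classnumber="5" '
          'piecenumber="904">\n<class.code>\n<letter.code>CO</letter.code>\n'
          '<class.number>5</class.number>\n</class.code>\n'
          '<piece.number>904</piece.number>\n'
          '<folio>p. 354</folio></pro.reference></sourcenote>')
entries = sourcenote_log(sample)
```

With the markup stripped, each reference collapses to a short, regular string, so an entry with stray punctuation or a missing component stands out when the log is scanned or sorted.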

Once the bulk of the processing was complete, we checked for those instances where a <surrogate> did not have a <sourcenote>. In our final target DTD <sourcenote> was mandatory, yet about 200 items did not have a tagged source. On closer examination this was for the following reasons:

  • The source reference was implied from the previous (or following) item;
  • A source in the old style had been completely missed in the conversion process; or
  • It had been omitted in the print edition.

Many of these were resolved in-house, but some needed expert advice from the PRO. However, even after this process, it was not possible to identify the source of every single item in the Calendar; of the 44,000 <surrogate>s, seven remained unidentified [10].

Finally, we created a log and count of all class codes in the data (eg CO 5 in the example above) and the PRO generated a look-up table of the corresponding class description. This was an invaluable check, as it allowed us to find instances where there were typographic errors in the original data. For instance, seeing that one document was apparently from a class which was described as ‘Colonial Office: Johore: Sessional Papers’ suggested that this was an error as such material would not conceivably be included in the colonial North American collection. On inspection, it turned out that what was in the print edition as ‘CO 715/5’ should have been ‘CO 5/715’; most of these problems were resolved fairly easily by comparison with documents in close proximity in the calendar sequence [11].
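A count of class codes can be sketched in Python from the attributes on <pro.reference>. In the project the suspect classes were spotted by reading the PRO's class descriptions; the rarity threshold below is a different, purely illustrative heuristic, and the function names are assumptions. The piece numbers in the sample are made up, apart from the CO 715/5 case described above.

```python
import re
from collections import Counter

REF = re.compile(
    r'<pro\.reference\s+lettercode="([^"]+)"\s+classnumber="([^"]+)"')

def class_code_counts(sgml):
    """Count each class code (letter code plus class number) in the data."""
    return Counter(f"{letter} {number}" for letter, number in REF.findall(sgml))

def rare_codes(counts, threshold=1):
    """Class codes seen no more than `threshold` times, for manual review."""
    return sorted(code for code, n in counts.items() if n <= threshold)

sample = (
    '<pro.reference lettercode="CO" classnumber="5" piecenumber="904">'
    '<pro.reference lettercode="CO" classnumber="5" piecenumber="905">'
    '<pro.reference lettercode="CO" classnumber="5"\npiecenumber="906">'
    '<pro.reference lettercode="CO" classnumber="715" piecenumber="5">'
)
counts = class_code_counts(sample)
suspects = rare_codes(counts)
```

Here the transposed reference surfaces as the single occurrence of class CO 715 among the CO 5 references.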

6.4 Fractions

In the original printed volumes, a number of data items were represented typographically by fraction-like displays. Many of these were real fractions (eg 1/4, 1/2 etc), but a sizeable proportion were related to dates; they often represented Old and New Style dates (eg 20/30 January, or 1723/24). To make sure that no irretrievable errors were introduced into the data, all such items were captured by the keyboarders within a <fr> element, which was subsequently processed to <num>. The conversion process needed to be very sensitive to context; for example, the year 1701/2 may well have been captured as ‘170<fr>1/2</fr>’, which accurately reflected the typographic representation on the page, but which in the digital edition would best be displayed as ‘1701/2’ rather than with a fraction character, as in ‘170½’ [12].
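The context sensitivity described here can be illustrated with a simplified sketch. The three rules below (three preceding digits means a split year; a small table of true fractions; everything else left as a slash pair) are assumptions chosen for the example, and the output is plain display text rather than the <num> markup used in the project.

```python
import re

# Optional run of digits immediately before the captured <fr> element.
FR = re.compile(r"(\d*)<fr>(\d+)/(\d+)</fr>")
# Small illustrative table of true fractions and their single characters.
VULGAR = {("1", "2"): "\u00bd", ("1", "4"): "\u00bc", ("3", "4"): "\u00be"}


def resolve_fraction(match):
    prefix, top, bottom = match.groups()
    if len(prefix) == 3:
        # e.g. 170<fr>1/2</fr>: treat as a dual year, shown as 1701/2.
        return f"{prefix}{top}/{bottom}"
    if (top, bottom) in VULGAR:
        # A genuine fraction, possibly after a whole number.
        return prefix + VULGAR[(top, bottom)]
    # Anything else (e.g. 20/30 January) stays as a slash pair.
    return f"{prefix}{top}/{bottom}"


def display_text(sgml: str) -> str:
    return FR.sub(resolve_fraction, sgml)


if __name__ == "__main__":
    print(display_text("170<fr>1/2</fr>"))
    print(display_text("<fr>20/30</fr> January"))
    print(display_text("<fr>1/4</fr> part"))
```

Real data would need more rules than this, and a log of every decision so that doubtful cases could be reviewed by hand.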

6.5 Handwritten amendments

The set of volumes that we used to create the digital edition contained numerous annotations made by PRO staff over the years. Most simply implemented the printed errata and corrigenda, though a small number were new corrections or additions of items that had been overlooked in the creation of the Calendar. All of these were captured by the keyboarders, individually checked for accuracy and positioning, and tagged using <corr>.

6.6 Tables

Within the print series there were several hundred occasions where data was presented in tabular form. The simplest of these were rendered in SGML, but most were too complex and would have required considerable time and effort to achieve satisfactory results, either due to the difficulties of drawing up reliable data capture rules, or because of some of the typographic layout features. Consequently, some 200 tables were treated as images, with their text content contained within a <figure> element, which was then indexed for searching purposes.

6.7 Volume 1

As detailed above, there were many instances of slight inconsistencies across the entire series of print volumes, which was not surprising given the long period over which they were created. In fact, what was most remarkable was how consistent the material actually was. However, the first volume in the series created considerable problems for us, as it was too far removed from the basic model for any of the conversion scripts to run effectively without a large amount of manual intervention in the data. For example, in later volumes every item had a unique number within the volume, but this was not the case with volume 1: its organisation followed that of the main original manuscript volumes from which the material was sourced, with additional material added from other document series. One consequence of this was that the contents of most of the <sourcenote> elements had to be inferred and a separate script written solely for this volume. In addition, other sources were irregularly positioned, and it was impractical to do anything other than examine and clean each one manually. Such data editing required a very sophisticated knowledge of both the data and the required model, which meant that one key member of the project team was unexpectedly tied up working on the data for a couple of weeks [13].

6.8 Additional Material

Many of the volumes contained addenda which comprised items that had been overlooked in the preparation of earlier volumes, and there was a complete file which had never been calendared. The latter was prepared by the PRO, and these additional data chunks were treated as above and added to their correct chronological position; sometimes this required minor changes of the text (eg ‘above’ to ‘below’), which were simply silently emended. Furthermore, volumes 43 to 45 included appendices, which sat rather oddly in the structure of the rest of the series. These were taken into the data; appendices 1 and 3 were converted to new items (ie <surrogate>s), and appendix 2 (the lists of Trustees for Georgia) was added to the index, under Georgia. To facilitate the identification of these items, a type attribute was added to the <surrogate> element, with values of ‘addendum’, ‘new-addendum’ or ‘appendix’.

6.9 Cross References

Within the corpus there were a huge number of cross references created by the original editors. These are especially important for following threads through what is a large and complex body of documents, though there are some features of these references that should be borne in mind; because the material was originally published in chronological sequence, the cross references almost always only point backwards in time, except in the case where the referent appeared in the same volume as the referring item. Even so, there was some degree of stylistic consistency in how such cross references were marked in the volumes, and several thousand of these were automatically linked.

Unfortunately, the editors adopted a number of other cross reference-like citations which proved to be less easy to automate. Those that referred to previous volumes were identified by searching for a limited number of text strings in the main data, though each needed manual resolution and some minor changes to the referring text (eg changing the words ‘preceding volume’ to ‘above’ as in the main examples above). There were also a large number of implicit cross references, especially within the <head> element, which were not processed as links. These included those with text such as ‘Abstract of preceding’, ‘mentioned above’ etc. To have made all of these into hypertext links would have required each one to be examined manually and its correct target and id identified, which had serious budgetary implications. As the targets were largely in close proximity, we took the view that users could easily navigate to them themselves.

This was also a solution we had to adopt where cross references took the form ‘Duplicate of petition Feb. 2, q.v.’ or ‘Relating to the seizure of the Society sloop. v. Dec. 15, 1709’. There are perhaps six thousand or more such cross references within the entire Calendar, and there was no way we could programmatically add an <xref> to any of them. To understand any of them required interpreting their context, and even where there was a string which could be captured, such as ‘v. Dec. 15, 1709’, there were multiple <surrogate>s with those dates, so a precise linking could only be done manually. In practice, as users could open multiple views of the data within the DynaText application, conduct reasonably quick date searches to identify the target, and even make their own hypertext links, on balance this was felt to be an acceptable compromise.
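The ambiguity problem can be made concrete with a small sketch. It captures a citation of the form ‘v. Dec. 15, 1709’ and looks up every item with that date; the item identifiers, the date index and the month table are invented for illustration, and the pattern covers only this one citation style.

```python
import re

MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "Jun": 6, "Jul": 7,
          "Aug": 8, "Sep": 9, "Sept": 9, "Oct": 10, "Nov": 11, "Dec": 12}
# Matches only the style 'v. Dec. 15, 1709'; other styles need other rules.
CITATION = re.compile(r"\bv\. ([A-Z][a-z]{2,3})\. (\d{1,2}), (\d{4})")


def candidate_targets(text, ids_by_date):
    """Return all item ids sharing the cited date (possibly more than one)."""
    match = CITATION.search(text)
    if not match or match.group(1) not in MONTHS:
        return []
    month, day, year = match.groups()
    return ids_by_date.get((int(year), MONTHS[month], int(day)), [])


if __name__ == "__main__":
    # Hypothetical date index: two items share the same date.
    index = {(1709, 12, 15): ["item-a", "item-b"]}
    note = "Relating to the seizure of the Society sloop. v. Dec. 15, 1709"
    print(candidate_targets(note, index))
```

Whenever more than one candidate is returned, as in the usage example, a link cannot be chosen without reading the documents, which is the point made above.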

6.10 Index

Early on in our discussions about the project we thought that we would not need to include the indexes from the print edition, as full text searching would render them redundant. However, on closer examination, it became clear that they could serve a useful function, especially if they were merged together into a single master index; they contain information additional to that in the main body of text, they largely standardise spellings of place and personal names, and enable the user to look up people by title or administrative function, which may not be explicit in the actual document. Furthermore, with the exception of volume 1 and some very special cases, all index referents were to item numbers, which meant that in principle they could easily be automatically hyperlinked to the relevant document in the main text. In practice, this was not quite how it worked out.

Figure 5 shows the tagging scheme as received from the data capture company. There were in fact seven different levels of index heading, and the heading level to which something was assigned was not consistent across all volumes. In addition, the nature of the indexing itself changed to some extent over the course of the series, so there were some inevitable inconsistencies. This in itself forced us to abandon the plan to merge all the indexes together in one big alpha-sorted list; it was something that would have needed too much manual intervention to be remotely successful, and research on usage of other digital projects which include back of the book indexes suggested that such resources were only rarely used.

<ix1><p><b>Acadia,</b> 1226, 1666, 1898.
<ix2><p>bounds and description of, p. 504, 1645.
<ix2><p>a small part of Nova Scotia, and contains
the Provinces of Alexandria and Caledonia, 1877.
<ix2><p>the King's title to, 1598-1600.
<ix2><p>claims of right and title to, 44, 111-114,
163, 210-212, 224-227, 240-242, 1809.
<ix2><p>representation of Sir Lewis and John
Kirke concerning, 210.
<ix2><p>trade with, 1660.

Figure 5: Raw index data
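To show what automatic processing of data like that in Figure 5 involves, the sketch below splits each raw entry into its level, heading, item numbers and page references. It assumes that hyphenated locators are ranges of consecutive item numbers and that locators begin at the first comma followed by a number or ‘p.’; the function name and both assumptions are illustrative only.

```python
import re

# One entry runs from its <ixN><p> marker to the next marker or end of data.
ENTRY = re.compile(r"<ix(\d)><p>(.*?)(?=<ix\d>|\Z)", re.DOTALL)
LOCATOR = re.compile(r"(p\.\s*)?(\d+)(?:-(\d+))?")
FIRST_LOCATOR = re.compile(r",\s*(?=(?:p\.\s*)?\d)")


def parse_index(raw):
    entries = []
    for level, body in ENTRY.findall(raw):
        # Drop bold tags and rejoin entries broken across lines.
        text = " ".join(re.sub(r"</?b>", "", body).split())
        split = FIRST_LOCATOR.search(text)
        heading = text[:split.start()] if split else text.rstrip(".")
        items, pages = [], []
        if split:
            for page_mark, first, last in LOCATOR.findall(text[split.end():]):
                if page_mark:
                    pages.append(int(first))  # page reference, not an item
                else:
                    items.extend(range(int(first), int(last or first) + 1))
        entries.append({"level": int(level), "heading": heading,
                        "items": items, "pages": pages})
    return entries


if __name__ == "__main__":
    raw = ("<ix1><p><b>Acadia,</b> 1226, 1666, 1898.\n"
           "<ix2><p>bounds and description of, p. 504, 1645.\n"
           "<ix2><p>claims of right and title to, 44, 111-114,\n163.")
    for entry in parse_index(raw):
        print(entry)
```

Page references are kept apart from item numbers here because only the latter map directly onto items in the main text.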

As it was, trying to apply a single conversion script across all volumes was problematic because the raw data contained some inconsistencies in their capture of index levels, especially across column and page breaks, which meant that every volume needed some manual intervention. Furthermore, simply reproducing the style of the print edition was completely inadequate for electronic delivery:

  • In the print edition long index entries have continuation running heads to help you locate where you are in the structure; in the electronic text this basic map of the document was achieved in two ways. First, the table of contents on the left of the screen gave the basic location, and, second, all implied index heading levels were explicitly added.
  • If the user searches the index and simply jumps between search hits, it is not immediately obvious where in the chronological sequence the search results are. For this reason year identifiers were added after each hypertext link; these were picked up programmatically from the main document text as each index was processed.
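The two adjustments listed above can be sketched as follows. The separator between heading levels, the bracketed year format and the year used in the usage example are all invented for illustration; the article does not specify how the final application rendered them.

```python
def explicit_headings(entries):
    """Turn (level, heading) pairs in print order into full heading paths."""
    stack, paths = [], []
    for level, heading in entries:
        del stack[level - 1:]  # drop headings at this depth or deeper
        stack.append(heading)
        paths.append(" > ".join(stack))
    return paths


def link_label(item, year_by_item):
    """Append a year identifier to an item number when the year is known."""
    year = year_by_item.get(item)
    return f"{item} [{year}]" if year is not None else str(item)


if __name__ == "__main__":
    entries = [(1, "Acadia"), (2, "bounds and description of"),
               (2, "trade with")]
    for path in explicit_headings(entries):
        print(path)
    # The year below is a made-up value, used only to show the format.
    print(link_label(1660, {1660: 1632}))
```

In practice the year table would be built from the main document text as each index was processed, as described in the second point above.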

Figure 6 illustrates how the index data is displayed in the final software application, showing these features. Even so, there is a large amount of manual clean-up that could still be undertaken for any future re-use of the data; this project discovered that any decision to retrospectively digitise and merge complex print indexes should not be taken lightly. Though the indexes contain useful additional material and are important finding aids, we are keen to assess how much they are actually used in the final application [14].

[Image: index display]

Figure 6: Display of index data in final application

7 Concluding Discussion

This article has illustrated some of the issues facing a publisher in designing and building an electronic resource which utilises markup guidelines for the retrospective conversion of an existing nineteenth-century print edition of historical materials. I have outlined how our ideas about functionality were mapped to the appropriate work that needed to be done on the data, while keeping manual intervention to a minimum. Some of the details of lessons learned during this process are mentioned in the text above, but the main issues were:

  • Despite the conversion scripts being fairly good across the bulk of the corpus, the gross irregularities in volume 1 were a major problem. This simply underlines the importance of doing extensive data analysis on all the data and not assuming that a given model will necessarily apply across a large body of material. Even so, it is difficult to identify all exceptions to a data model when analysing such a large amount of paper.
  • The highly detailed planning and trialling processes were vital for the smooth running of the project; though there were issues that were not foreseen, the close match between the initial project design and the final application is testament to the time spent on the planning. Allowing for the hiatus in the work on the project over the summer of 1999, the actual development probably took less than nine months. Such an approach also means that the expensive development phase is compressed into a relatively short period just before publication, minimising cash flow problems.
  • The planning phase resulted in a set of clearly defined aims and objectives, which were a valuable benchmark when compromises were needed.
  • It is impossible to document too much. During the development phase we made considerable efforts to record exactly what we were doing and why, as well as to note issues that could be considered for data enhancements in the future. This has been invaluable in helping us review the project and plan for the future. Even so, some of the rationale and background to very early design and functionality decisions was not adequately documented, which meant that it was at times difficult for new members of the project team to be fully acquainted with the complete context of the project, which in turn impeded their ability to make independent decisions about specific technical issues.
  • During the project we tried never to lose sight of the fact that the processes we were creating had to be scaleable, and also measurable, so that future projects could benefit from our experience. This also meant planning for future possible spin-offs.
  • Even lightly tagged retrospective conversion projects can require extensive documentation and training for freelance checkers; if checking processes are not easily describable and relatively few in number these sorts of projects can quickly become economically difficult to justify on more than a one-off basis.

Finally, it is possible that a web version of this data may be made available at some point in the future. If so, there are certainly some ways in which we would want to enhance the data. Rather than treat the index as a separate item, it could make most sense to add the index terms as metadata for each item to support searching using a single search query. In addition, the data could be processed so that, for a given document one could see a list of other items containing hypertext links to it, thereby making it easier to follow the threads throughout the corpus. For a publisher, such an approach to data offers us the opportunity of allowing a data set to grow organically over time, and to add further enrichment in response to user needs once the main body of users of this material, that is, those outside of humanities computing, develop a more sophisticated understanding of the utility of these sorts of resources. It also means that cash flow can be more manageably controlled, in what is not yet a risk-free environment.
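The second enhancement, listing for each document the other items that link to it, amounts to inverting the table of cross references. A minimal sketch, with invented item identifiers and an assumed input shape (a mapping from each item to the items it refers to):

```python
from collections import defaultdict


def inbound_links(outbound):
    """Invert {source: [targets]} into {target: [sources]}."""
    inbound = defaultdict(list)
    for source, targets in outbound.items():
        for target in targets:
            inbound[target].append(source)
    # Sort so that the list shown to the user is stable.
    return {target: sorted(sources) for target, sources in inbound.items()}


if __name__ == "__main__":
    # Hypothetical cross references between three items.
    xrefs = {"item-3": ["item-1"], "item-2": ["item-1"], "item-1": []}
    print(inbound_links(xrefs))
```

Because the original cross references almost always point backwards in time, such a reverse table would let a reader follow a thread forwards from an early document.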

Acknowledgement

Thanks to Mandy Banton at the Public Record Office for information on the organisational and publishing history of the Calendars.

Notes and References

  1. The concept of ‘image editions’ has been developed by the Model Editions Partnership. See the MEP home page at <http://mep.cla.sc.edu/MEP-Home.HTM>, and see also David R. Chesnutt’s paper at ACH/ALLC97, ‘The Model Editions Partnership–Towards a National Database’, <http://www.qucis.queensu.ca/achallc97/papers/p036.html>. One of the MEP projects is described in Cathy Moran Hajo and Esther Katz (1998) ‘The Margaret Sanger Papers Project: A Documentary Edition in the Digital Age.’ Connect, Spring issue, <http://www.nyu.edu/acf/pubs/connect/spring98/HumSangerSp98.html>
  2. Brad Scott (1998) ‘Creating an Image Edition of Historical Material: Asia: Official British Documents, 1945-1965’; paper given at DRH98, and to appear in [Marilyn Deegan et al. (eds) (1999) Digital Resources for the Humanities 1998: Selected Papers from DRH98, Glasgow University (London)].
  3. This is described in Adrian Driscoll and Brad Scott (1998) ‘Electronic Publishing at Routledge: An Overview and Case Studies.’ Computers and the Humanities 32: 257-70.
  4. From J.W. Fortescue (ed) (1899) Calendar of State Papers, Colonial Series, America and West Indies, 1685-1688 [volume 12] (London: HMSO) p. 43.
  5. Public Record Office, CO 1/57, Nos. 123 and 123i (ff. 302-303).
  6. For the Model Editions Partnership, see <http://mep.cla.sc.edu/>, and especially, David R. Chesnutt et al. (1999) ‘Markup Guidelines for Documentary Editions’ <http://mep.cla.sc.edu/MepGuide.html>.
  7. Project documentation: Rachel Nott, keyers.dtd 26th May 1999.
  8. Project documentation: Brad Scott ‘A Guide to Data Correction’ 19th October 1999.
  9. Project documentation: Brad Scott ‘Processing PRO references in <sourcenote>’ 24th September 1999.
  10. Project documentation: Brad Scott ‘State Papers, Colonial: Source Queries’ 11th April 2000.
  11. Project documentation: Brad Scott ‘Source Corrections’ 17th May 2000.
  12. Project documentation: Brad Scott ‘Post-Processing Fractions in State Papers Colonial’ 25th February 2000.
  13. Project documentation: Brad Scott ‘State Papers Colonial Markup, Conversion and Display, part 3 - Volume 1’ 16th November 1999.
  14. Details of how the index processing was worked out are given in the project documentation: (a) Brad Scott and Savita Bailur ‘State Papers Colonial Markup, Conversion and Display, part 2 - Index’ (6th August 1999); (b) the major revision to the former, Brad Scott ‘State Papers Colonial Markup, Conversion and Display, part 2 - Index [Document 2]’ (7th January 2000); and (c) Brad Scott ‘State Papers Colonial: Index, Metadata and Searching: Notes for a future (web) edition’ (7th April 2000 and 26th May 2000).
