About the Collection

Creating the Digital Collection

Prior to creating the digital collection, the selection of the diarists to include was completed and the biographical sketches were written. Before the launch of the scanning phase, we also spent considerable time meeting as a team and discussing the approaches we wanted to take, applying “Project Planning and Management” principles to this specific project. In this phase we reached conclusions about the project's scope, its schedule, and the resources that would be available. We also determined what would and would not be included in the project, an exercise generally called “Is/Is Not” in the team process.

We used as our model the Overland Trails digital publication entitled Trails of Hope: Overland Diaries and Letters, 1846–1869. That project, in which nearly 10,000 images were created (7,000 of which were trail diaries and letters), taught us a number of lessons. We hoped to carry those lessons into this much larger and even more complex project.

Project Goals

  • We wanted the researcher to have, in part, the experience of being at the archives and touching and feeling the original diary, an experience that very few people might otherwise have.
  • At the same time we wanted to take advantage of the great opportunities that technology brings not only in accessibility but also in “searchability.”

Based on the lessons learned and the overall project objectives, we made the following decisions relative to the project.

  • The original diary pages would be scanned in color, including covers.
  • A transcription, with XML encoding, would be prepared.
  • The guidelines for transcription established by the L. Tom Perry Special Collections would be used in order to retain the features of the original diary page, i.e., line length, spelling, punctuation, insertions, underlines, deletions, etc.
  • The software that would deliver the finished diary and transcription would allow the researcher to see the original diary page and the transcription side by side.
  • We would provide a PDF of the transcription, which would retain the original diary features listed above, for ease in printing and downloading.
  • A brief biographical sketch of each diarist would be completed prior to scanning.
  • Essays would be written on missionary work in general and for each of the large geographic regions.
  • Reading lists would be prepared for each of the regions and for the overall missionary experience.
  • A web page would be designed to lead the researcher to the material with a variety of interfaces and a link would be made to each diary through the online catalog.

Digital Image Preparation

With over 63,000 pages to be scanned and transcribed, the team determined early that we would work by geographical region to create milestones that would not only provide a sense of accomplishment but also allow the team to put segments of this large collection on the web site. The following steps were then taken to prepare the diaries for scanning.

  • Spreadsheets of the selected diarists were created for each geographic region, drawn from the larger spreadsheets compiled at the time each missionary diary in the manuscript collections was reviewed.
  • The Pacific region was selected for the beginning of this project because of the richness of this collection, and because we knew that there were many individuals eagerly awaiting a collection from this region.
  • The team leader then began the preparation of the diaries, first for the Hawaii/Sandwich Islands area, by:
    • Pulling the diaries of one diarist at a time, so that no mix up of collections would occur.
    • Reviewing the selection made by the students and evaluating the rating and the physical condition of the diaries. At times decisions were changed and some diaries were removed from the overall project and others added.
    • Writing page numbers in pencil on each diary page when page numbers were either missing from the original diary or incorrectly entered. This facilitated scanning, transcription, and delivery of the digital images.
    • Determining a process whereby inserts related to the mission were scanned and page numbering created, i.e., 22A, etc. Inserts unrelated to the mission experience were generally not scanned.
    • Determining how much of an individual diary volume would be scanned. At the beginning of the preparation and scanning process, we decided to scan a diarist's volume in its entirety even if only 20 of its 200 pages were relevant to the mission experience. Given the size of this digital collection, however, we later decided to be more selective, and in many cases we have chosen not to scan the entire volume but simply to make a note at the transcription phase regarding what has been left out.
    • Flagging each volume that went into the Digital Laboratory. The flag identified the diarist, the call number, volume (i.e., 1 of 5), and the region. It also left room for specific notes for the scanner and the transcriber.

This process proved to be more time consuming than originally expected and resulted in several changes to the diarists included in the collection, and even to individual volumes that we chose not to scan at all or to scan only partially.

Digital Imaging

All of the work in this phase is done by student employees. The diary pages were scanned at 100% scale, in 24-bit color at 400 dpi, using the Zeutschel OS 10000 book scanner and Omni Scan 10.x software. We were anxious not to repeat the experience we had in the Overland Trails diary project, where the diaries were scanned on flatbeds. Because the Zeutschel is a camera, capturing the image from above, we felt that the original diaries would be protected.

To protect the spine, we use a cradle; the diary lies face up, and a large piece of glass comes down on the page to hold it flat while the camera captures the image as described above. A bound volume is scanned one page at a time; an unbound diary is generally scanned two pages at a time, and the pages are then separated in the cropping phase.

Cropping or Post Scanning:

  • Adjust color.
  • Make the text more readable by adjusting contrasts of dark and light.
  • Crop back to the original page, removing the extraneous edges captured in the scanning (a scripted sketch of these adjustments follows this list).
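
These adjustments are performed interactively by student employees in imaging software. Purely as an illustration, the following sketch shows how the same three steps could be scripted, assuming the Python Pillow library; the crop box, color, and contrast values are hypothetical, not project settings.

```python
from PIL import Image, ImageEnhance

def post_scan(in_path, out_path, crop_box, color=1.05, contrast=1.2):
    """Apply the three post-scanning steps to one page image."""
    img = Image.open(in_path)
    img = ImageEnhance.Color(img).enhance(color)        # adjust color
    img = ImageEnhance.Contrast(img).enhance(contrast)  # adjust dark/light contrast for readability
    img = img.crop(crop_box)                            # crop back to the original page edges
    img.save(out_path)

# Hypothetical file names and crop box, for illustration only.
post_scan("raw_scan.tif", "page_022.tif", crop_box=(120, 80, 3400, 4700))
```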

Quality Control:

  • Verify that all of the pages that should be included are present.
  • Check that blank pages within the body of the diary have not been scanned (a rough automated check is sketched after this list).
  • Confirm that all of the inserts are in order and have actually been scanned.
  • Re-check cropping.
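
Quality control was likewise a manual review. As a rough illustration only, a script along the following lines could flag two of the problems listed above, assuming the page files carry sequential numbers in their names and that Pillow is available; the brightness threshold used to guess at a blank page is an arbitrary assumption.

```python
import re
from pathlib import Path
from PIL import Image, ImageStat

def check_volume(folder):
    """Flag missing page numbers and suspiciously blank scans in one volume folder."""
    pages = sorted(Path(folder).glob("*.tif"))
    numbers = [int(re.search(r"(\d+)", p.stem).group(1)) for p in pages]
    missing = sorted(set(range(min(numbers), max(numbers) + 1)) - set(numbers))
    if missing:
        print("Pages apparently not scanned:", missing)
    for p in pages:
        gray = Image.open(p).convert("L")
        if ImageStat.Stat(gray).mean[0] > 250:  # nearly all-white image
            print("Possible blank page:", p.name)

check_volume("MD_0001/vol_1")  # hypothetical folder name
```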

Renaming, Burning, and Cataloging:

  • Each page within a volume is a separate file and needs to be renamed. The file naming convention includes the page number, the original manuscript call number, and the volume number, with the file extension .tif (a renaming sketch follows this list).
  • A production note is created, and Excel and CSV copies of it are burned.
  • Metadata is added to the Excel spreadsheet created for each geographic region.
  • All of the volumes written by a given diarist are combined into folders sized for DVDs (4.38 GB folders) with the volume naming scheme of MD_####.
  • Two archival copies are burned as Tagged Image File Format (TIFF) at 400 dpi and verified in iView (internal cataloging software).
  • Each archival DVD, with its individual diarist's volumes, is cataloged in iView.
  • The Excel spreadsheet tracking form is updated for each volume at each step of the process outlined above.
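
As noted in the first item above, each page file is renamed to the project convention. A minimal sketch of how that renaming could be scripted is shown below; the ordering of call number, volume, and page number, and the separators between them, are assumptions for illustration, since the published convention lists only the components.

```python
from pathlib import Path

def rename_pages(folder, call_number, volume):
    """Rename sequentially scanned pages to <call number>_<volume>_<page>.tif."""
    for page, scan in enumerate(sorted(Path(folder).glob("*.tif")), start=1):
        new_name = f"{call_number}_v{volume:02d}_p{page:04d}.tif"  # e.g. MSS1234_v01_p0001.tif
        scan.rename(scan.with_name(new_name))

# Hypothetical call number and folder, for illustration only.
rename_pages("scans/vol_1", call_number="MSS1234", volume=1)
```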

Creating Transcription Disks:

  • A transcription disk is created by opening the DVD TIFFs in Photoshop and converting the files to 150 dpi JPEGs (a conversion sketch follows this list).
  • The Production Note is added as a CSV document.
  • Two copies of each disk are burned and labeled with the region and the name of the disk. The naming pattern matches that of the TIFF, with the exception of adding a T at the end of the number (e.g., MD_####T).
  • An Excel translation table, which includes the transcription disk number and the diarist's name and volume number, is added in case individual pages need to be reloaded onto the transcription server.
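
The project produced the transcription copies in Photoshop. Purely as an illustration of the same conversion, the sketch below downsamples the 400 dpi archival TIFFs to 150 dpi JPEGs using Pillow; the JPEG quality setting and the folder names are assumptions.

```python
from pathlib import Path
from PIL import Image

ARCHIVAL_DPI = 400   # resolution of the archival TIFFs (from the imaging section above)
TARGET_DPI = 150     # resolution used for the transcription copies

def make_transcription_copy(tiff_path, out_dir):
    """Downsample one archival TIFF and save it as a JPEG for the transcription disk."""
    img = Image.open(tiff_path)
    scale = TARGET_DPI / ARCHIVAL_DPI
    img = img.resize((int(img.width * scale), int(img.height * scale)))
    out = Path(out_dir) / (Path(tiff_path).stem + ".jpg")
    img.save(out, "JPEG", dpi=(TARGET_DPI, TARGET_DPI), quality=85)

out_dir = Path("MD_0001T")   # hypothetical transcription-disk folder
out_dir.mkdir(exist_ok=True)
for tif in Path("MD_0001").glob("*.tif"):
    make_transcription_copy(tif, out_dir)
```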

Final Steps:

  • The archival TIFFs are filed.
  • The transcription disks are given to the project manager, who arranges for them to be loaded from a library computer using the Database Management application of iTranscribe. The files are loaded onto a server at iArchives, the developer of the transcription software.

Transcription and XML Encoding

During preparation and scanning of the diaries, the team and the library discussed the best technical method for the transcription. As stated earlier, we wanted to use Extensible Markup Language (XML), specifically TEI Lite, for both structural encoding (i.e., line length, dates, insertions, deletions) and name encoding (i.e., people, places, and organizations).

Although we knew that our display and delivery software (CONTENTdm) did not yet support the structural XML, we were unsure whether this software would always be the one we used, or whether at some future date CONTENTdm would support the structural XML. We also knew from previous experience with the Overland Trails collection that we could utilize the XML structural encoding in the PDF printing version.

Transcription

In planning the transcription portion of the Missionary Diaries project, we decided that the diary transcription guidelines would be based on traditional archival standards. This was done so that the transcribed text would represent the original as closely as possible in the following ways:

  • the same line endings
  • margin notes
  • misspellings
  • insertions and cross-outs
  • spacing and punctuation

In addition, because of the age and condition of many of the diaries, end notes would be needed in the transcribed text to identify faded ink, torn pages, water damage, illegible handwriting, pages or inserts that we did not scan, etc. A brief sketch of what such an encoded page might look like follows.
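
The fragment below, parsed with Python's standard library, uses common TEI Lite elements for line breaks, deletions, insertions, tagged names, and an end note. The specific elements, attributes, and wording are illustrative assumptions, not taken from an actual diary; the project's exact tag set was defined by its transcription guidelines.

```python
import xml.etree.ElementTree as ET

# Illustrative TEI Lite-style fragment (hypothetical content).
sample = """
<p>
  <lb/>Sailed this <del>morening</del><add>morning</add> with
  <lb/>Elder <name type="person">Brown</name> for <name type="place">Lahaina</name>.
  <note type="endnote">Lower half of the page is torn; several words are illegible.</note>
</p>
"""

page = ET.fromstring(sample)
people = [n.text for n in page.iter("name") if n.get("type") == "person"]
places = [n.text for n in page.iter("name") if n.get("type") == "place"]
print("Tagged persons:", people)   # ['Brown']
print("Tagged places:", places)    # ['Lahaina']
```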

Encoding

The transcription/encoding experience of Overland Trails, a similar project that used WordPerfect and XML TEI Lite to transcribe and encode text, proved to be more time consuming than originally planned. We wanted to make this encoding process more transparent, so that the transcribers would not really need to understand XML. Therefore, for this and similar future projects, it was proposed that software be developed for the text production process.

The Brigham Young University Library contracted with iArchives, a local archival software development company, for this work. They were asked to:

  • Develop the software for each phase of the text production
  • House the transcribed text files on their server at their location during the text production phases
  • Provide the encoding and authorities module for key persons, places, and organizations
  • Produce the final converted text output files.

A requirements document was drafted and given to iArchives to begin their software development. This specification included the traditional archival standards on which we were basing the structural and name/place/organization encoding guidelines, as well as the XML TEI Lite requirements for the converted text output files that would be loaded into our presentation software, CONTENTdm.

The phases for text production included:

  • Loading the digitized page image files into the transcription software (including document-level metadata contained in a spreadsheet)
  • Text transcribing and encoding
  • Authorizing/normalizing the encoded names, places, and organizations
  • Establishing levels of security

There were inevitable problems in the development of iTranscribe, fueled by the fact that almost no one in the library had experience in software development or the time to devote to it that was needed. But because of our need to move forward with the transcription and encoding (due to gift funding), we found ourselves working as though the software were in production rather than in test. Both the library and iArchives had to work through the problems of previously transcribed text being lost when a new software version was released, and to come to grips with insufficient back-up procedures at iArchives.

Open communication between our project team and the iArchives project leaders and software developers became essential as changes had to be made to the software to accommodate complex text functions, such as search and replace and table creation, which had not been considered to any great extent in the original specification document. Over the course of the year, software corrections and enhancements were made, and the transcription/encoding software evolved into a worthwhile and productive tool. The authority/name normalization module of iTranscribe is discussed below in the “Metadata Creation” section. Had a more detailed specification document been provided up front, some (but not all) of these problems might have been avoided.

Metadata Creation

The Mormon Missionary Diary Project Team set as its overarching goal the application of a controlled vocabulary, with some authority control on individual names, places, and organizations, in order to provide the widest possible discovery for the researcher. The modifications to this goal that became part of the project are outlined below.

To accomplish this goal, full USMARC (subsequently referred to as MARC) parent records, which already existed for most of the diarists, were used for each diarist. These records are searchable in the online catalog at Brigham Young University, in RLIN, and in OCLC. Each MARC record contains a URL that links directly to the digital representation of the object created in CONTENTdm.

Plan of Work

  • As a part of the development of the iTranscribe software package by iArchives, a module was specified in the requirements document to follow the transcription and review phases listed above. Its purpose is to review and resolve new names added by the transcribers and reviewers and, when appropriate, to apply library Name Authority Control (NACO) and Subject Authority Control (SACO).
    • Authorizers can add, remove, and rename entries on the authorities lists.
    • Authorizers can use standard search and replace features.
    • The authorities lists are stored on a server at iArchives rather than locally.
    • The authorities lists are created for each separate geographic region, mitigating some of the downloading of these lists from the iArchives server.
    The authorities/normalization phase software was the most complex to design and develop. Consequently, we see the need to continue to work with iArchives in a partnership mode to resolve some of the issues in this module of the program. Examples of the features we would like to see include: the ability to download a complete list of unidentified terms, so that the spreadsheet for name normalization could be eliminated; the ability to search across diaries; increased speed in downloading these large lists; and possibly local hosting of these lists.
  • With CONTENTdm as the delivery software, crosswalk links were developed for the selected metadata fields, and these fields were mapped to Dublin Core and to MARC for the diaries. The crosswalks provided the links between the metadata fields, the MARC tags, and the Dublin Core elements. The crosswalks also include information regarding which fields are searchable, displayable, and under authority control (a sketch of such a mapping appears after this list).
  • The URL to the digital publication for the diaries was added to the MARC records, allowing online user access. The MARC records were contributed to RLIN and OCLC, allowing access directly from those online union catalogs.
  • The architecture of CONTENTdm requires that every image have metadata attached. Each page of the diary is considered an image. Utilizing XML previously created through the iTranscribe software, selected place names, personal names, and organizations were identified and placed under authority control. Early in this XML mark-up process, the team included authority control as an integral part of their work when tagging terms. Although the results allow incredibly deep searching within the diaries, there is a high price to pay for this (see benchmarking costs below).
    • Each page of the diary transcription, with its XML mark-up, was programmatically loaded into the metadata fields by extracting selected tags and importing them into designated fields (see the sketch after this list). Minimal clean-up of the authority-controlled fields was necessary, as the iTranscribe software allowed for easy global name changes and identified where given terms were located. The global changes occurred within each diary volume, but not across all of the volumes.
    • Because of the Overland Trails diary experience, we decided not to run every single XML tagged place, name or organization through the NACO or SACO programs of the Library of Congress. Rather, an attempt was made to normalize the spelling of personal and geographic names.
    • Some searching of geographic names was done for verification of spelling in the Lee Library authority file; Library of Congress authority files via RLIN; the Geographic Names Information System (GNIS); the Getty Thesaurus of Geographic Names; the National Imagery and Mapping Agency (NIMA); and the Internet.
    • No effort was made to establish the names of family members or other persons mentioned in the diaries unless they appeared in the mission portions of the diaries. For example, if a missionary was writing to a family member at home, no attempt was made to tag or authorize that name. We tagged names only if they were part of the mission experience in the mission area, along with the names of major leaders of the Church of Jesus Christ of Latter-day Saints.
    • iArchives, through their iTranscribe software, outputs an authority list, in CONTENTdm format, of the linked or authorized terms for personal names, geographic names, and organizations that can be loaded as the controlled vocabulary for CONTENTdm.
    • Authority changes needed after the diaries were loaded onto the CONTENTdm server were made both in CONTENTdm and in the XML TEI Lite files stored on a server housed at BYU.
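
Purely as an illustration of the page-level extraction and crosswalk described above, the sketch below pulls tagged names out of a page's TEI-style XML and places them in metadata fields, alongside a fragment of a field-to-Dublin Core/MARC mapping. The field names, element names, and MARC tags shown are assumptions for illustration; the project's actual crosswalk and extraction program were developed with iArchives.

```python
import xml.etree.ElementTree as ET

# Illustrative crosswalk fragment: local metadata field -> (Dublin Core element, MARC tag)
CROSSWALK = {
    "Personal Names":   ("dc:subject", "600"),
    "Geographic Names": ("dc:coverage", "651"),
    "Organizations":    ("dc:subject", "610"),
    "Full Text":        ("dc:description", "520"),
}

def page_metadata(tei_xml):
    """Extract selected tagged terms from one page's mark-up into metadata fields."""
    page = ET.fromstring(tei_xml)
    def names(kind):
        return sorted({n.text for n in page.iter("name") if n.get("type") == kind})
    return {
        "Personal Names":   "; ".join(names("person")),
        "Geographic Names": "; ".join(names("place")),
        "Organizations":    "; ".join(names("org")),
        "Full Text":        " ".join("".join(page.itertext()).split()),
    }

record = page_metadata("<p><name type='place'>Lahaina</name> branch organized today.</p>")
print(record["Geographic Names"])    # Lahaina
print(CROSSWALK["Geographic Names"]) # ('dc:coverage', '651')
```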

Loading Files from iTranscribe to CONTENTdm

Because iArchives wrote the software program for iTranscribe, we felt it would be more efficient to pay them to write the load program from iTranscribe to CONTENTdm, as well as the XSL style sheet that would produce the PDF version of the transcription and make use of all of the XML structural encoding.

We loaded five individual volumes on the test server several times to tweak various features such as: the inclusion of all of the metadata fields on both the document-level metadata and the page-level metadata; order of the metadata fields; and whether certain fields should be hidden from the public, such as the full-text on the page-level metadata.

After several test loads all of these concerns were addressed and it was obvious that the iTranscribe software interfaced well with CONTENTdm.

Per Page Cost Estimates for Digitizing, Transcribing, Review, and Authorities Using iTranscribe

Scope

Costs do NOT include planning, document preparation, supervision, web design (all of these functions are done by full-time staff who have other projects and responsibilities in addition to this project), equipment costs, final loading into CONTENTdm, and maintenance.

Average transcription page size: 184 words/page (23 lines x 8 words/line)
Average number of Authorities tags: 6 terms/page
Hourly rate for Digitizing: $9.50
Hourly rate for Transcriber/Reviewer: $8.00
Hourly rate for student Authorizer: $10.00

Student Productivity Scenarios

(The figure at the end of each line is the minimum cost per page; a worked calculation follows the scenarios.)

1. Students only (with Authorities):
     • Average student Digitizing: 42 pages/hr. (1.44 min/page) $.23
     • Average student Transcription: 3 pages/hr. (20 min/page, includes tagging) $2.68
     • Average student Review: 6 pages/hr. (10 min/page, no tagging) $1.34
     • Average Authorities time: 3.25 pages/hr. (18.5 min/page) $3.08
     Total: $7.33

2. Students only (no Authorities):
     • Average student Digitizing: 42 pages/hr. (1.44 min/page) $.23
     • Average student Transcription: 4 pages/hr. (15 min/page, includes tagging) $2.00
     • Average student Review: 6 pages/hr. (10 min/page, no tagging) $1.34
     Total: $3.57
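
The per-page figures above follow from dividing each hourly rate by the corresponding pages-per-hour throughput. A small worked version of that arithmetic is sketched below; because each line of the table was rounded to the nearest cent, the computed totals differ from the published ones by a few cents.

```python
# Rates and throughputs are the figures listed in this section.
def cost_per_page(hourly_rate, pages_per_hour):
    return hourly_rate / pages_per_hour

with_authorities = (
    cost_per_page(9.50, 42)       # digitizing    ~$0.23
    + cost_per_page(8.00, 3)      # transcription ~$2.67
    + cost_per_page(8.00, 6)      # review        ~$1.33
    + cost_per_page(10.00, 3.25)  # authorities   ~$3.08
)
without_authorities = (
    cost_per_page(9.50, 42)       # digitizing
    + cost_per_page(8.00, 4)      # transcription (no tagging, so faster)
    + cost_per_page(8.00, 6)      # review
)
print(f"Students only, with authorities:    ${with_authorities:.2f}/page")    # ~$7.30
print(f"Students only, without authorities: ${without_authorities:.2f}/page") # ~$3.56
```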

Lessons Learned

  1. The scope of this project was too large from the beginning. We would have been better served, and would have seen results sooner, if we had chosen a smaller scope of approximately 10,000 pages of diaries.
  2. We should have done a proof of concept sooner in the project.
  3. The software development of iTranscribe, a program which has proven useful to us in this project, did create numerous production problems in the short run. We learned several lessons from this development process: one or more individuals in the library with experience in software development, along with the subject expert for this collection, needed to be devoted to the development process; we had to expect delays, frustrations, and even loss of work in the development phase; and we should not have expected to be in production mode while the software was being developed.
  4. Because of the software development, we were unable to benchmark the costs of this project until well into it (almost two years). We now have these cost estimates, which naturally will fluctuate slightly as we become more proficient and depending upon who is working on the project.
  5. As you progress in a project of this size, you learn you have to be willing to change the schedule and the scope. For example, we were initially planning on tagging all names, places, and organizations. Our first adjustment was pulling back to mission specific names, places, and organizations. Our last adjustment was to complete one region—the Pacific—with tagging of names, places, and organizations, and not tag any names, places or organizations in the remainder of the regions.