Processing raw data extracted from an Authority Record’s view page

The NCTR’s online archive lists its Authority Records in a series of web pages that link to the view page of each Authority Record. We extracted data from the HTML underlying these view pages, but the raw data required considerable processing before they could be used.

First, we needed to sort hundreds of values of different types that had been entered haphazardly in catch-all fields named “Creator of”, “Creator of 2”, “Creator of 3”, …, “Creator of 13”. Second, many entries in these “Creator of n” fields were tagged with labels nearly identical to the names of other (primary) fields (Table 1). Every one of these tagged entries was matched by the entry in the corresponding primary field, so we considered them redundant and ultimately discarded them.

Table 1. Names of primary fields and the related labels of entries in “Creator of n” fields (NCTR online archive, accessed October 2022).
Name of primary field                   Related label in a “Creator of n” field
Parallel Forms of Name                  Parallel form(s) of name
Other Forms of Name                     Other form(s) of name
History                                 History
Sources                                 Sources
Functions Occupations and Activities    Functions, occupations and activities
Places                                  Places
Subject Access Points                   Subject Access Points
Place Access Points                     Place Access Points
Internal Structures Genealogy           Internal structures/genealogy
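
The redundancy check described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure we actually ran: it assumes each record has already been parsed into a dictionary mapping field names to strings, and that a tagged entry has the form "Label: value". The names drop_redundant, split_label and normalize are hypothetical. The label-to-field mapping is taken from Table 1.

```python
import re

# Label found in a "Creator of n" entry -> name of the primary field (Table 1).
LABEL_TO_PRIMARY = {
    "Parallel form(s) of name": "Parallel Forms of Name",
    "Other form(s) of name": "Other Forms of Name",
    "History": "History",
    "Sources": "Sources",
    "Functions, occupations and activities": "Functions Occupations and Activities",
    "Places": "Places",
    "Subject Access Points": "Subject Access Points",
    "Place Access Points": "Place Access Points",
    "Internal structures/genealogy": "Internal Structures Genealogy",
}

# Matches "Creator of", "Creator of 2", ..., "Creator of 13".
CREATOR_OF = re.compile(r"^Creator of( \d+)?$")


def normalize(text):
    """Collapse whitespace and case so trivial differences do not block a match."""
    return " ".join(text.split()).casefold()


def split_label(entry):
    """Assumed entry format 'Label: value'; return (label, value) or (None, entry)."""
    label, sep, value = entry.partition(":")
    if sep and label.strip() in LABEL_TO_PRIMARY:
        return label.strip(), value.strip()
    return None, entry


def drop_redundant(record):
    """Drop tagged 'Creator of n' entries that duplicate their primary field.

    Returns (cleaned_record, unmatched), where unmatched lists the tagged
    fields whose value did NOT match the primary field; those are kept.
    """
    cleaned, unmatched = {}, []
    for field, value in record.items():
        if CREATOR_OF.match(field):
            label, body = split_label(value)
            if label is not None:
                primary = record.get(LABEL_TO_PRIMARY[label], "")
                if normalize(primary) == normalize(body):
                    continue  # redundant copy of the primary field: discard
                unmatched.append(field)
        cleaned[field] = value
    return cleaned, unmatched
```

Returning the unmatched fields separately makes it possible to confirm that the list is empty before anything is discarded, which is the condition reported above for the NCTR data.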

Finally, many of the remaining entries in “Creator of n” fields referenced multiple target entities. Although each such entry gives the total number of targets, it names no more than ten of them. We had to look elsewhere for a possible solution to this problem.
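
Identifying the affected entries is straightforward once the reported total and the named targets are available side by side. The sketch below is illustrative only and assumes each entry has already been parsed into a dictionary with the hypothetical keys "field", "total" (the reported number of targets) and "names" (the targets actually listed). It does not reflect the archive's real markup.

```python
def find_truncated(entries):
    """Return (field, missing_count) for entries that name fewer targets than they report."""
    flagged = []
    for entry in entries:
        missing = entry["total"] - len(entry["names"])
        if missing > 0:
            flagged.append((entry["field"], missing))
    return flagged
```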