Processing raw data extracted from an Authority Record’s view page

The NCTR’s online archive lists its Authority Records across a series of web pages, which link to a view page for each record. We extracted a good deal of data from the HTML underlying these view pages, but these data required considerable processing before they could be used.

First, we needed to sort hundreds of values of different types that had been entered haphazardly in catch-all fields named “Creator of”, “Creator of 2”, “Creator of 3”, …, “Creator of 13”. Second, many entries in these “Creator of n” fields were tagged with labels that were nearly identical to the names of other (primary) fields (Table 1). Every one of these tagged entries matched the entry in the corresponding primary field, so we considered them redundant and ultimately discarded them.

Table 1. Names of primary fields and the related labels of entries in “Creator of n” fields (NCTR online archive, accessed October 2022).
Name of primary field	Related label in a “Creator of n” field
Parallel Forms of Name	Parallel form(s) of name
Other Forms of Name	Other form(s) of name
History	History
Sources	Sources
Functions Occupations and Activities	Functions, occupations and activities
Places	Places
Subject Access Points	Subject Access Points
Place Access Points	Place Access Points
Internal Structures Genealogy	Internal structures/genealogy
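
A minimal Python sketch of such a redundancy check follows. It is an illustration under stated assumptions, not the code used for this work: it assumes each record is a dictionary of field names to strings and that a tagged entry has the form “label: content”. The function names normalise and split_creator_entries, and the record layout, are hypothetical. Only the field names and labels come from Table 1.

```python
from __future__ import annotations

import re

# Primary field names, taken from Table 1.
PRIMARY_FIELDS = [
    "Parallel Forms of Name",
    "Other Forms of Name",
    "History",
    "Sources",
    "Functions Occupations and Activities",
    "Places",
    "Subject Access Points",
    "Place Access Points",
    "Internal Structures Genealogy",
]


def normalise(label: str) -> str:
    """Lower-case a label and drop punctuation so near-identical names compare equal."""
    label = label.lower().replace("(s)", "s")
    label = re.sub(r"[^a-z0-9]+", " ", label)
    return " ".join(label.split())


NORMALISED_PRIMARY = {normalise(name): name for name in PRIMARY_FIELDS}


def split_creator_entries(record: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (kept, discarded) entries from the "Creator of n" fields of one record.

    ASSUMPTION: a tagged entry looks like "label: content"; the real markup may differ.
    """
    kept, discarded = [], []
    for field_name, value in record.items():
        if not re.fullmatch(r"Creator of( \d+)?", field_name):
            continue
        label, sep, content = value.partition(":")
        primary = NORMALISED_PRIMARY.get(normalise(label)) if sep else None
        # Discard only when the tagged entry duplicates the primary field's value.
        if primary is not None and record.get(primary, "").strip() == content.strip():
            discarded.append(value)
        else:
            kept.append(value)
    return kept, discarded
```

Comparing the content as well as the label means an entry is dropped only when the primary field really does hold the same value. A tagged entry with no matching primary value is kept for manual review.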

Finally, many of the remaining entries in “Creator of n” fields referenced multiple target entities. Although each of these entries gives the total number of targets, it names no more than ten of them. We had to look elsewhere for a possible solution to this problem.
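
The incomplete entries can at least be flagged automatically. The sketch below is hypothetical: it assumes each entry has already been parsed into a stated total and a list of named targets, and the names CreatorEntry and missing_targets are ours, not part of the archive.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CreatorEntry:
    """ASSUMED parsed form of one "Creator of n" entry."""
    total_targets: int                      # total stated in the entry
    named_targets: list[str] = field(default_factory=list)  # targets actually listed


def missing_targets(entry: CreatorEntry) -> int:
    """Number of targets the entry counts but does not name (never negative)."""
    return max(entry.total_targets - len(entry.named_targets), 0)


def truncated_entries(entries: list[CreatorEntry]) -> list[CreatorEntry]:
    """Entries whose named targets fall short of the stated total."""
    return [e for e in entries if missing_targets(e) > 0]
```

The entries returned by truncated_entries are the ones whose remaining targets must be recovered from another source.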