Metadata (part 2): Controlled Vocabulary and Metadata

By Cherryl Schauder, Keith Grove

In the previous issue of Connections we examined metadata and its relationship to MARC and Dublin Core. In this article another aspect of metadata is considered: controlled vocabulary for subject searching.

When catalogue users search for a book or videorecording on a particular subject, they are hoping that the computer will match the topic word(s) they have entered with the same word(s) in the description of the item created by the cataloguer. Computer systems usually offer a range of searching approaches. One can search every word in a set of fields of a catalogue record (for example, the author, title, notes and subject heading fields) for the topic word(s) one is seeking. This is often referred to as a keyword or free-text search. Alternatively, one may limit the search to just the title field, or to just the subject heading(s) field.
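The difference between a keyword search and a field-limited search can be sketched in a few lines of Python. This is only an illustration: the records, the field names and the search() function are invented for the example and do not describe any particular library system.

```python
import re

# Hypothetical catalogue records; field names and contents are illustrative only.
RECORDS = [
    {"author": "Smith, Jane", "title": "Volcanoes of the Pacific",
     "notes": "Includes index.", "subjects": ["Volcanoes", "Geology"]},
    {"author": "Brown, Tom", "title": "Island journeys",
     "notes": "Describes volcanoes and reefs visited by the author.",
     "subjects": ["Voyages and travels"]},
]

ALL_FIELDS = ("author", "title", "notes", "subjects")


def field_words(record, field):
    """Return the lower-cased words found in one field of a record."""
    value = record[field]
    text = " ".join(value) if isinstance(value, list) else value
    return re.findall(r"[a-z]+", text.lower())


def search(records, word, fields=ALL_FIELDS):
    """Return titles of records containing the word in any of the given fields."""
    word = word.lower()
    hits = []
    for record in records:
        if any(word in field_words(record, field) for field in fields):
            hits.append(record["title"])
    return hits


keyword_hits = search(RECORDS, "volcanoes")                       # all fields
subject_hits = search(RECORDS, "volcanoes", fields=("subjects",))  # one field
```

Here the keyword search finds both records, because the second one mentions volcanoes in its notes field, while the search limited to the subject heading field finds only the first.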

When we search in the subject heading(s) field we understand that we are trying to create a match on word(s) assigned by a cataloguer who has spent time examining the item being catalogued, deciding what the most useful retrieval concept(s) might be for the given user group, and then translating these into words or phrases from a list of 'allowed' or authorised words and phrases called a controlled vocabulary, list of subject headings or thesaurus. A particular idea or concept can be expressed in many different ways. 'See' references are thus inserted (or automatically mapped) into the catalogue to guide the user to relevant resources via a range of alternative words or phrases. Proper names are frequently too numerous to list comprehensively in a thesaurus; thus the cataloguer may be required to supply these, often using special rules to achieve consistency in the choice and form of a particular name.
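The way 'See' references lead from alternative words to an authorised heading can be pictured as a simple lookup table. The sketch below is a minimal illustration only: the headings, the references and the resolve_heading() function are made up for the example and are not taken from any published list of subject headings.

```python
# Hypothetical authorised headings (the controlled vocabulary).
AUTHORISED = {"Motion pictures", "Automobiles"}

# Hypothetical 'See' references: alternative term -> authorised heading.
SEE_REFERENCES = {
    "films": "Motion pictures",
    "movies": "Motion pictures",
    "cinema": "Motion pictures",
    "cars": "Automobiles",
    "motor cars": "Automobiles",
}


def resolve_heading(term):
    """Return the authorised heading for a user's term, or None if unknown."""
    cleaned = term.strip().lower()
    # The term may already be an authorised heading.
    for heading in AUTHORISED:
        if heading.lower() == cleaned:
            return heading
    # Otherwise follow a 'See' reference, if one exists.
    return SEE_REFERENCES.get(cleaned)
```

With this table, resolve_heading("Movies") and resolve_heading("cinema") both lead to 'Motion pictures', while a term with no entry returns None, which a real catalogue might handle by falling back on a keyword search.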

Controlled vocabulary subject words are often contrasted with 'uncontrolled' or 'natural language' words appearing in the title, contents, or summary note in the catalogue record. Full-text databases store the text of entire documents that can be searched in keyword (or free-text) mode. There is extensive ongoing debate in the library and information science literature about whether controlled vocabulary or natural language systems give the best retrieval performance. The intellectual effort of subject cataloguing with controlled vocabularies is a time-consuming and therefore expensive process. The maintenance of the vocabulary itself is labour-intensive as it needs to be updated and modified on an ongoing basis. The aim of replacing the human indexer with a computer is ever appealing. Even the task of abstracting or preparing summaries (which usually does not involve consulting a controlled vocabulary) can be undertaken by a software package which scans the contents of large documents and automatically produces abstracts of them.

However, despite the searching power of present-day computer systems, there seems to be considerable consensus at present that a combination of both controlled and natural language vocabularies achieves good, reliable searching. The controlled vocabulary might be viewed as a kind of insurance policy, ensuring a reassuring level of predictability in searching.

As far back as the 1950s library and information professionals began to seek ways of mechanising or automating the subject indexing process. Many experiments have been undertaken with increasingly powerful computer software systems to explore the issue. One of the problems with this kind of research is that the sample databases used in the studies are usually much smaller than those in 'real life'. Beginning in the 1960s Gerard Salton researched and published many articles and books on the subject of machine processing of text. Salton obtained good results using automatic indexing and ranking techniques, but it has taken nearly twenty years for computer hardware and software to develop to the current stage where the methods he pioneered are being widely used in the search engines we know today.

Automatic indexing systems are able, for example, to utilise lists of 'stop' words (common words not useful in retrieval, e.g. 'because', 'and', 'the', etc); to automatically stem words so that 'hous' would be the root or stem of 'housing', 'houses', 'housed' etc; to generate pairs of adjacent stemmed words within a sentence; and to display documents which the system has calculated to be relevant to a query in a ranked order from most to least relevant to the query. The algorithms on which this ranking is based vary from system to system but usually involve such methods as counting the number of times a search/query word appears in the document, the position of the word in the document, the extent of the match between a range of words in the query against those in the document, etc.
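The steps just described can be sketched in Python. This is a deliberately crude illustration, not the method of any particular system: the stop list, the suffix-stripping rule, the sample documents and all function names are assumptions made for the example, and the ranking uses only the simplest of the methods mentioned (counting occurrences of query words).

```python
import re

# Illustrative stop list; real systems use much longer ones.
STOP_WORDS = {"because", "and", "the", "a", "of", "in", "is", "are", "for", "to"}
# Crude suffix stripping, tried in this order; real stemmers are more elaborate.
SUFFIXES = ("ing", "es", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lower-case the text, drop stop words and stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def adjacent_pairs(sentence):
    """Pairs of stemmed words that are adjacent once stop words are removed."""
    terms = index_terms(sentence)
    return list(zip(terms, terms[1:]))


def rank(documents, query):
    """Rank documents by how often the stemmed query words occur in them."""
    query_terms = set(index_terms(query))
    scored = []
    for name, text in documents.items():
        terms = index_terms(text)
        score = sum(terms.count(t) for t in query_terms)
        if score > 0:
            scored.append((score, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Hypothetical documents, invented for the example.
DOCS = {
    "doc1": "Housing policy and the cost of houses",
    "doc2": "Families housed in temporary shelters",
    "doc3": "The history of gardening",
}
ranking = rank(DOCS, "houses")
```

Because 'housing', 'houses' and 'housed' all reduce to 'hous', a query for 'houses' retrieves doc1 (two occurrences) ahead of doc2 (one occurrence) and ignores doc3.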


World Wide Web search engines such as AltaVista use computer programs to automatically move through Web addresses, titles/headers and certain numbers of words on Web pages, collecting addresses and words, and placing them in a text index. The search engines then apply one or more algorithms to rank the relevance of sites to a search query.
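The 'text index' built by such programs can be pictured as a table leading from each word to the addresses of the pages on which it was found. The sketch below assumes the pages have already been collected; the addresses, page texts, the max_words limit and the function names are all invented for the illustration and do not describe how any actual search engine works.

```python
import re
from collections import defaultdict

# Hypothetical collected pages: Web address -> text found on the page.
PAGES = {
    "http://example.org/a": "Controlled vocabulary and subject headings",
    "http://example.org/b": "Subject searching on the Web",
}


def build_index(pages, max_words=50):
    """Map each word to the set of addresses whose first max_words contain it."""
    index = defaultdict(set)
    for address, text in pages.items():
        words = re.findall(r"[a-z]+", text.lower())[:max_words]
        for word in words:
            index[word].add(address)
    return index


def lookup(index, word):
    """Return the addresses indexed under a word, in alphabetical order."""
    return sorted(index.get(word.lower(), set()))


index = build_index(PAGES)
subject_pages = lookup(index, "Subject")
```

A query word is then answered by consulting the index rather than by rereading every page; a ranking step would follow to order the addresses returned.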

The retrieval of material published on the Web is viewed by many as a difficult issue. The size and rate of expansion of the Web has highlighted the shortcomings of even these sophisticated indexing and searching approaches. The question now is how do we index the Web in a way that ensures effective, reliable retrieval? How do we narrow down the hundreds of unwanted 'hits' yielded by the average search-engine query on the Web?

USMARC provides a standard set of fields with which to label documents. Metadata standards such as Dublin Core provide similar functionality for documents on the Web. The Dublin Core standard and others like it (or based on it) are being hotly debated in research articles and Internet discussions. As mentioned in the previous article, librarians use the Anglo-American Cataloguing Rules and various available lists of subject headings and thesauri in the MARC record. Metadata researchers are asking questions such as: how should a Web document be defined for indexing purposes (Web documents, with their many hypertext links, may require indexing at the level of individual objects, e.g. a logo within a letter, an advertising jingle, a single Web page or a vast set of pages); what kinds of fields are relevant to particular kinds of documents; can there be one overarching set of fields for all documents; how prescriptive should the rules be in relation to each field; and what controlled vocabularies should be used in relation to given user groups? There are no easy answers, and each information community needs to develop strategies for its own specific needs while keeping open as many options as possible for interacting globally.



Cherryl Schauder

National Coordinator, Cataloguing and Metadata


Keith Grove

Manager, Information Services