Term 3 2000
Internet Resource Management: The Role and Development of Metadata
This article is the final in our series on metadata. Kylie Hassan's straightforward approach will be of particular interest to library staff who want a concise explanation of metadata's relevance to their library and their users. This essential background reading raises many important issues for discussion, and you may wish to pass the article on to other staff in your school, particularly if you are currently setting up websites and intranets.
Since the early 1990s the number of searchable files available via the Internet has exploded at almost unimaginable rates, resulting in millions of websites and literally billions of web pages. The ease with which we can electronically publish a document and make it universally available is astounding.
Therefore one of the fundamental questions of this information age is: how do people find - and how do we as information professionals ensure that they find - the information they are seeking in this global environment?
So, what is metadata? Throughout the literature, 'metadata' is essentially defined as data about data. Any data associated with, but distinguished from, an information object can be called metadata. Milstead & Feldman (1999) provide a useful definition in stating that metadata acts as a surrogate for a larger whole, highlighting those characteristics of a work that enable the user to understand its contents as well as its purpose, source and conditions for use. Many aspects of metadata are already familiar to us. One of the most obvious examples is the library catalogue. The characteristics of items listed in a library catalogue such as title, author and subject are all pieces of metadata that can be used to find books and retrieve them from library shelves.
In relation to the WWW, metadata refers to a set of attributes used to facilitate the identification, description and location of Internet-based resources. Metadata allows us to capture information about each web page and relate what a resource is, what it is about and where it can be found. As Ng (1996) imparts, metadata can provide information about the whole resource or only parts of it, and supports the effective use of information from creation through long term use. By using metadata to provide simple and consistent descriptions of Internet resources, we will continue to be able to find, understand and maintain access to web-based information.
Although metadata is most commonly associated with web-based resources, the concept itself is not new. The term 'metadata' was coined by Jack Myers in 1969 to describe datasets effectively, and first began to appear in the literature in the early 1980s in relation to database management systems.
Problems with the Internet
With regard to the WWW, there are many current problems for which the use of metadata may provide solutions. Perhaps the most important problem, as mentioned above, is the huge growth in information published on the Internet. The sheer number and variety of resources often threaten to overwhelm the user.
Iannella & Waugh (1997, online) remark that the ability to find and retrieve relevant material has decreased as the quantity of information on the WWW has risen. Some information has become so difficult to locate it is effectively unavailable. A simple method is needed to retrieve and provide access to resources that are of interest to a particular user.
Searching the Internet has traditionally been accomplished using tools such as web crawlers. These software robots roam the WWW collecting millions of hypertext document links. The information from each HTML (hypertext markup language) web page is automatically formatted into full text keyword databases. These databases of harvested links can be queried using search engines. A search engine is an interactive interface that allows the user to find information by matching query terms to words stored in the database. If the terms do not match, the document will not be retrieved regardless of how relevant it is to the subject of the enquiry.
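The keyword matching described above can be sketched in a few lines of Python. The sample pages, file names and queries are invented for illustration; the point is that a query term retrieves a document only if that exact word appears in it:

```python
def build_index(pages):
    """Map each word to the set of page identifiers containing it."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, query):
    """Return pages containing every query term (exact matches only)."""
    results = None
    for term in query.lower().split():
        hits = index.get(term, set())
        results = hits if results is None else results & hits
    return results if results else set()

pages = {
    "a.html": "skiing holiday in the alps",
    "b.html": "metadata for web resources",
}
index = build_index(pages)
print(search(index, "skiing"))    # {'a.html'}
print(search(index, "vacation"))  # empty: 'vacation' never matches 'holiday'
```

Note that page a.html is plainly relevant to a query for 'vacation', yet it is never retrieved, because the word itself does not occur in the text.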
However, as the Internet has grown, this type of searching has proven less and less effective. Search engines frequently produce extensive lists of results, of which few hits correspond to what the query was intended to find. This causes frustration, as the user is required to spend time examining a large number of irrelevant citations to find the few applicable documents. In response to these problems, the W3C (World Wide Web Consortium) (1999) remarks that a significant part of the web is missing. It lacks the type of labels, cataloguing data or descriptive information that would allow web pages to be properly searched and processed by a computer. Also absent are features such as standard vocabularies and authority control mechanisms that make traditional bibliographic tools useful. In addition, search engines have no way of indexing non-textual multimedia objects such as the image, audio, video and executable program files that populate the web.
As a result, one of the primary reasons for developing metadata is to facilitate and improve information retrieval, helping to close the gap between user expectations of the Internet and the current reality of searching. According to Efthimiadis & Carlyle (1997), metadata can enhance the probability that a resource will be retrieved, allow users to discriminate amongst similar resources and preserve the intellectual content of resources over time. Metadata will also improve the recall and precision of searching by using the same standardised term for every occurrence of a subject. If metadata is properly applied, a document could be retrieved even if it never uses the controlled term within its text.
Iannella & Waugh (1997, online) relate that the basic model for metadata is known as an attribute type and value model. Each fact about a resource is known as an attribute or element. An element contains a 'type' that identifies the information that element should contain, and one or more 'values', which are the metadata itself. For example, if we were going to add metadata to a document called 'My Skiing Trip', the element type in this case would be 'title', and the value would be 'My Skiing Trip'. Throughout the literature there appear to be three major groupings of elements. Marsh (1997) defines these as elements relating to the content of a resource, elements relating to the intellectual property of a resource, and elements that describe the instantiation of resources. Content elements relate to the title, subject and description of an item, while intellectual property elements relate to the author, publisher and permissions for re-use. Lastly, instantiation elements describe the type and format of the resource, as well as where and how that resource is stored.
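As a rough sketch, the attribute type and value model can be represented as a simple set of type/value pairs, here grouped according to Marsh's three categories. The values are invented for illustration, extending the article's 'My Skiing Trip' example:

```python
# Each key is an element type; each value is the metadata itself.
metadata = {
    # content elements
    "title": "My Skiing Trip",
    "subject": "skiing",
    "description": "A diary of a week's skiing holiday",
    # intellectual property elements
    "creator": "A. Author",
    "rights": "Copyright reserved",
    # instantiation elements
    "format": "text/html",
    "identifier": "http://example.org/skiing.html",
}

for element_type, value in metadata.items():
    print(f"{element_type}: {value}")
```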
Placement of the elements
Once the decision has been made as to which metadata elements to use, it becomes necessary to work out where they will be placed. There are essentially two schools of thought regarding this issue. Some believe metadata should accompany the resource it describes, while others maintain it should be separate and linked to the resource through other means.
Perhaps the easiest way to deploy metadata is to embed the elements within the document they describe. One advantage of integrating data and metadata is that no additional system must be in place to use it. Weibel (1997) states that once metadata becomes an integral part of the resource, it can be easily harvested and manipulated by web-indexing agents. Search engines such as AltaVista look for meta tags when indexing websites, and summarise a document based on the information they find there.
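As a sketch of how an indexing agent might harvest embedded metadata, the following uses Python's standard HTML parser to collect name/content pairs from meta tags. The page content is invented for illustration:

```python
from html.parser import HTMLParser

class MetaTagHarvester(HTMLParser):
    """Collect name/content pairs from <meta> tags in a page."""
    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.metadata[attrs["name"]] = attrs["content"]

page = """<html><head>
<title>My Skiing Trip</title>
<meta name="description" content="A diary of a skiing holiday">
<meta name="keywords" content="skiing, travel, alps">
</head><body>...</body></html>"""

harvester = MetaTagHarvester()
harvester.feed(page)
print(harvester.metadata["description"])  # A diary of a skiing holiday
```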
The effectiveness of embedded metadata depends largely on which web-based syntax is used, and as a result, a variety of markup languages have been proposed to encode metadata elements. Iannella & Waugh (1997, online) point out that HTML meta tags are fast becoming the de facto standard, as they are widely used and easy to include in the header fields of HTML files. However, Boeri & Hensel (1998) dispute the use of HTML, claiming that it was defined as a presentation language and not as a method to structure document information.
Due to problems with HTML, projects have been conducted to trial the use of XML (extensible markup language). XML is a subset of SGML (standard generalized markup language), which provides the basis for the encoding language used on the WWW. Unlike HTML, XML allows locally defined meta tags to be created as they are needed. This structure provides increased flexibility and specificity, and may allow users to conduct fielded searches of web documents in much the same way as we now search library catalogues (Boeri & Hensel, 1998).
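A brief sketch of this flexibility, using Python's standard XML library: locally defined elements can be created as needed, and a record can then be queried field by field, much as one might search the author index of a catalogue. The element names and values are invented for illustration:

```python
import xml.etree.ElementTree as ET

record = ET.fromstring("""
<resource>
  <title>My Skiing Trip</title>
  <author>A. Author</author>
  <resort>Thredbo</resort>  <!-- a locally defined element -->
</resource>
""")

# A fielded search: query only the 'author' field, ignoring other text.
print(record.findtext("author"))  # A. Author
print(record.findtext("resort"))  # Thredbo
```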
The alternative to embedding metadata within the original document is to store metadata labels separately from the resources they describe. This method is more in keeping with the model of the traditional library catalogue. Marsh (1997) advocates this technique, claiming that the meaning of metadata elements is not affected by whether or not the element is embedded in the resource that it describes.
Who should assign metadata?
A subsequent point of contention relates to who should be involved in indexing electronic documents and assigning their metadata elements.
One emerging trend has been to encourage authors to describe their own resources through the local generation of metadata. Ideally this descriptive information would be provided at the time the document is created or shortly after. Fietzer (1998) mentions that author-generated metadata will help to build greater semantic coherence and result in more effective indexing processes. Ng (1996) believes entries are likely to be more reliable than those created in other ways, as the author is best positioned to know what their particular document is about.
Many web authors are interested in working with metadata to improve the ranking of their site with various search engines. However, metadata may not accurately reflect the contents of a site if practices such as 'spamming' are adopted by authors trying for high hit rates. Milstead & Feldman (1999) also warn that individuals creating metadata for their own resources may have little understanding of the finer points of description, and may be unaware of the importance of their work to the overall information retrieval process.
This situation may be improved through the implementation of tools for creating and managing web-based metadata. Often these tools include templates stipulating the minimum number of elements an author must enter in order to adequately describe an information resource.
The second approach to assigning metadata involves the use of third party indexers to describe Internet sites. With this scenario metadata records are created and stored separately from the resource. They refer to the resource but are not actually embedded within the resource itself.
This concept closely emulates the cataloguing and indexing activities conducted by the library profession. As mentioned earlier, libraries have a long history of producing metadata in order to describe and facilitate the retrieval of print-based information resources. Each time we add a MARC (machine-readable cataloguing) record to our catalogues, we are entering a standardised set of metadata about an object.
As the Internet grew in size, the library world developed projects that attempted to catalogue web resources in traditional ways. One of the first attempts to provide structured access to Internet sites was the OCLC Internet Resource Project in 1991. The aim of this project was to determine whether the USMARC format and AACR2 (Anglo-American cataloguing rules) could be used to index Internet sites. Research found these tools could be used with only slight modification, leading to the creation of the MARC 856 field to carry the URL (uniform resource locator) of a web page. This enabled users to access remote electronic resources directly from the library catalogue. Another well-known project is Cyberstacks, which attempted to classify the Internet according to the Library of Congress classification scheme.
However, despite initial success it quickly became obvious that traditional cataloguing methods were not the answer to locating information on the Internet. As Weibel (1995, online) relates, the massive amount of information requiring organisation is more than professional cataloguers and indexers can manage using existing methods. While formal library standards such as MARC provide richness in description, they are time consuming to create and maintain. Staff require extensive training and specialised software to design records that conform to recognised standards. MARC is ineffective in an environment where information is complex and constantly changing, and for the high level of Internet ephemera that does not warrant detailed cataloguing.
As a result we must take into account the nature of the Internet and consider the most appropriate methods to organise this broad spectrum of resources. Lange & Winkler (1997) believe it is the principles of librarianship and the strengths of cataloguing that will be carried into the digital world, although not in their traditional time consuming format.
To simplify the implementation of metadata, various standards have been developed. Some are quite basic in their description, while others are complex and information rich. Standardisation initiatives are concerned with determining a common structure for the format and content of metadata elements. Each metadata standard should define the types of information to be described, what each element means, and the syntactic rules for individual element sets. See Appendix One for a list of significant metadata standards.
One of the biggest impediments to the development of metadata is the sheer number of different metadata formats. There is no one standard for the creation of metadata, and people are free to develop schemes for use within any discipline. This has resulted in many disparate systems, often with a high degree of overlap.
However, the value of metadata is limited if there is no agreement on which elements to use or what their contents should be. Cathro (1997) remarks that improved access to information resources will only be achieved by reaching a consensus on an international set of metadata elements, with the corresponding commitment to adopt them. Yet it seems unrealistic to believe that one standard will be adopted by all players in the electronic arena. The Internet is a decentralised initiative with no governing body, so the best we could hope for is a move towards a smaller number of standards with core element sets that are applicable to the widest possible audience.
Various organisations around the world have sought to regulate and control the development of metadata. The International Organisation for Standardisation (ISO) has set up a metadata working group to take responsibility for the specification, management and exchange of metadata. The American National Standards Institute (ANSI) has also formed a committee to develop a model for metadata representation and to investigate the use of registries to standardise metadata in specific domains (Milstead & Feldman, 1999). Taking a different focus, the W3C serves as a registration facility and development ground for a variety of metadata initiatives.
Dublin Core metadata element set
One of the most renowned metadata standards is known as the Dublin Core Metadata Element Set. The Dublin Core grew out of a series of workshops designed to develop and promote metadata elements to facilitate resource discovery on the WWW.
The first workshop was convened in Dublin, Ohio, by OCLC in March 1995. The aim was to achieve consensus across a spectrum of international stakeholders to develop a simple method for describing a wide range of information resources, and to promote interoperability between resource discovery tools. Broadly speaking that consensus was achieved, resulting in a metadata standard with 13 descriptive elements. These elements are optional and repeatable, and are stored in the head HTML tags of the resource they describe. Most of the elements have commonly understood semantics enabling them to be applied and understood by many different users.
The second workshop was held in Warwick, England, in April 1996. Here decisions were made regarding the specific syntax of the elements, and impediments to the deployment of the Dublin Core model were identified. The second workshop also led to the development of the Warwick Framework. This framework adopted the realistic view that no single metadata standard could accommodate the needs of all communities, and provides a conceptual model for many different varieties of metadata to coexist (Brady, 1997).
The third workshop, run in September 1996 by OCLC and CNI (Coalition for Networked Information), discussed the requirements for describing graphical images such as photographs, slides and video clips. It was agreed to expand the basic element set to 15 elements, adding a descriptive element for the content of visual resources, and also a rights management statement. The final 15 elements are: title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage and rights.
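The 15 elements can be rendered as embedded meta tags using the common 'DC.' naming convention for Dublin Core in HTML. The sketch below generates such tags; the sample values are invented, and a real record would supply only the elements that apply (all elements are optional and repeatable):

```python
# The 15 Dublin Core elements agreed at the third workshop.
DUBLIN_CORE = ("title", "creator", "subject", "description", "publisher",
               "contributor", "date", "type", "format", "identifier",
               "source", "language", "relation", "coverage", "rights")

# A sample record: only the elements this resource actually needs.
sample = {"title": "My Skiing Trip", "creator": "A. Author",
          "date": "2000-07-01", "language": "en"}

for element in DUBLIN_CORE:
    if element in sample:
        print(f'<meta name="DC.{element}" content="{sample[element]}">')
```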
The fourth workshop was held in Canberra, Australia, in March 1997 as a joint venture between OCLC and the National Library of Australia. This meeting dealt with keeping the Dublin Core standard relatively simple, while still supporting the needs of users requiring a precise searching mechanism. It was suggested that the metadata set be extended to include the use of scheme, type or language qualifiers as they add richness to the semantic value of the metadata (Cathro, 1997). These optional substructures have since become known as the 'Canberra Qualifiers'.
Scheme qualifiers interpret the value of the metadata based on existing external standards, while type qualifiers refine the meaning of the data element itself. An example of a scheme qualifier would be the application of a controlled vocabulary such as LCSH (Library of Congress subject headings) within a metadata element called subject. Cathro (1997) mentions that the decision to implement qualifiers led to the minimalist-structuralist debate. Minimalists believe every element should contain unqualified free text, while structuralists argue for the right to use schemes, which increase the level of detail and improve search precision.
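As a small illustration, the same subject element might appear unqualified and then qualified with an LCSH scheme. The exact attribute syntax varied between early proposals, so the tags below are indicative only, and the heading text is invented:

```python
# Minimalist position: unqualified free text in the subject element.
unqualified = '<meta name="DC.subject" content="skiing">'

# Structuralist position: the value is drawn from a named external
# scheme (here LCSH), so software can interpret it consistently.
qualified = '<meta name="DC.subject" scheme="LCSH" content="Skis and skiing">'

print(unqualified)
print(qualified)
```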
The most recent metadata workshop was held in Washington DC in November 1998. To broaden the perspective, representatives from external metadata groups were invited to participate. The main focus of this gathering was to identify unresolved issues and assign them to formal working groups for resolution. One of the major issues considered was the implementation of controlled vocabularies (enumerated lists) for elements such as resource type and format to promote interoperability.
Given that the Dublin Core metadata set has no formal status as a standard, the rate of take-up and the amount of interest it has generated have been remarkable. There has been widespread interest in adopting it as a standard and progressing its development. See Appendix Two for a list of experimental projects that deploy the Dublin Core standard.
Although less publicised than the Dublin Core effort, PICS (platform for Internet content selection) is another metadata standard aimed at describing the content of Internet sites. PICS was originally conceived as a filtering mechanism to prevent access to certain sites according to a given set of criteria. However, as Armstrong (1997) points out, the mechanisms to restrict and to gain access to Internet sites are two sides of the same coin, and therefore PICS could be used to enhance subject searching and resource retrieval. It would be possible to search by subject in the normal way, and then filter out sites that do not match the criteria. Such filters could be set by information specialists or implemented on a search-by-search basis. Should search engines adopt these mechanisms, end users will gain extremely powerful access tools.
Interoperability between metadata sets
It is unlikely there will ever be agreement on a single metadata scheme, as evidenced by the coexistence of many independently maintained metadata formats. Therefore, an ongoing subject of research is the relationship between different metadata schemes. Of particular note, the W3C has developed a concept known as the Resource Description Framework (RDF). The RDF is a metadata architecture for the WWW that will support the interaction of a wide variety of resource description models.
A framework for the transmission of metadata is necessary, as although metadata standards specify the elements to describe items, they don't specify a transfer syntax. As Chilvers (1998) relates, RDF uses XML to create a modular infrastructure that provides containers to aggregate packages of similar data types. A concept known as XML namespace is used to transfer metadata and prevent collision where two elements have the same names.
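A sketch of how an XML namespace prevents such a collision, using Python's standard XML library: two elements both named 'title' coexist because each is addressed through its full namespace. The record and the local namespace URI are invented; the Dublin Core URI follows the usual convention:

```python
import xml.etree.ElementTree as ET

record = ET.fromstring("""
<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:local="http://example.org/local/">
  <dc:title>My Skiing Trip</dc:title>
  <local:title>Staff copy - do not lend</local:title>
</record>
""")

# Each 'title' is addressed by its full namespace, so there is no clash.
print(record.findtext("{http://purl.org/dc/elements/1.1/}title"))
print(record.findtext("{http://example.org/local/}title"))
```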
Further advances in this area include the concept of mapping and the development of crosswalks and metadata registries. One way to reconcile different description models is to map between related record sets. This creation of crosswalks aids the integration of metadata schemes by providing users with a single query model.
Chilvers (1998) points out that formal metadata registries are necessary to describe the semantics, structure and transport syntax of a metadata element set. Registries enable developers to list the authoritative version of metadata schemes and the specific elements defined within them.
In conclusion, it can be seen that many organisations are attempting to improve the retrieval of resources on the Internet through influencing how the web is indexed. A number of metadata standards have been proposed, together with the technological framework to support them. Ideally there would be a single metadata scheme applicable in all situations, but as this is unlikely to eventuate, one of the major challenges for the future will be to understand and integrate the different metadata schemes.
In terms of the library sphere, we must seek to redefine the concept of a library in the electronic age and how we might hope to select, manage and provide access to information resources. Initiatives such as metadata have a direct impact on service provision, as without the means of identifying and describing resources, we cannot present them to our clients.
Appendixes and references
The appendixes and references for this article may be located on our website at <http://www.curriculum.edu.au/scis/connect/connect.htm>. Readers may also be interested in the article 'Demystifying metadata' by Marty Lucas at <http://mappa.mundi.net/trip-m/metadata/>.
Kylie Hassan is a student in the School of Information Management and Systems at Monash University. This article was first published in Cataloguing Australia and is reprinted with permission. Copyright © ALIA 2000.