UN-CS-RAI-USAA-DB01-2004-00205
Table of Contents
Executive Summary
1 Introduction: the need for metadata
1.1 What are the benefits of metadata?
1.2 Context: metadata schemas, thesauri, ontologies and the Semantic Web
2 Problem: the situation within FIS
2.1 Metadata specifications in UN and related agencies
2.2 Peculiar FIS needs
2.3 FIS processes and tools
2.4 Two scenarios
2.4.1 Organizing a meeting
2.4.2 Leaving for a different assignment
3 Solution: an architecture for safe document interchange within FIS
3.1 Ontology
3.2 Technical architecture
3.3 Metadata tools
3.3.1 User tools
3.3.2 Local tools
3.3.3 Central tools
3.4 Work processes
3.4.1 Installation and configuration
3.4.2 Writing and modifying documents
3.4.3 Sending documents via email
3.4.4 Uploading documents on the shared directory
3.4.5 Updating the predefined settings
3.5 Two scenarios with the new tools
3.5.1 Organizing a meeting with the new tools
3.5.2 Leaving for a different assignment with the new tools
4 Conclusions and next steps
References
Appendix A: A comparison of relevant metadata schemas
Appendix B: The proposed packaging data formats
Appendix C: The proposed ontology
Document
· Identifier
· Title
· Description
· Language
o Rights
o File
o Subject
o Relation
o Coverage
o (Management)
File
· Identifier
· Version
· HashValue
· Location
· Date
· Status
· Format
· FileDescription
o Document
o Author
Person
· Identifier
· Name
· Contact
· Gender
o Organization
o Location
Organization
· Identifier
· Name
· Contact
o BelongsTo
Term
· Form
· Definition
· Reference
o RelatedTerms
Location
· Identifier
· Name
· SpatialCoordinate
· TemporalCoordinate
o RelatedLocation
o Type
(Management)
The purpose of this document is to propose paradigms, tools and data formats that can be introduced within the OCHA Field Information Support (FIS) offices to dramatically increase the effectiveness of document exchange and management at all levels of their activities. The foundation of this proposal lies in the large-scale introduction of ontologies, metadata structures and metadata tools into the everyday processes within FIS.
Metadata and ontologies are among the latest concepts that experts in information management and knowledge management use to design new techniques and tools for extracting and leveraging the data and documents flowing within large-scale organizations. Although the precise definitions and the subtler points of ontologies and metadata may be the domain of such experts, the basic ideas behind them are easy to understand and were well known long before the advent of computers and information technologies.
Their first and foremost application surely is in improving the effectiveness of information retrieval within large document collections, and it is here that we will start the discussion in the present document. Most retrieval systems match words in users' queries with words in the text of the documents in the database. Such systems are far from perfect: online library catalogs and Web search engines show a surprising lack of precision - on average 50% of the information retrieved is irrelevant - and of recall - often as little as 20% of the available relevant information is retrieved. Furthermore, limited recall is hard to identify and grasp, since the user does not know what is missing.
The main reason for missing relevant information is that there are surprisingly many different ways to describe the same idea or concept. If a document author uses one word and a searcher another, relevant materials will be missed. A query about "laptop" computers, for example, will fail to find articles about "portable" or "lightweight" or "notebook" or "palmtop" or "ThinkPad" computers, as well as any of the innumerable spelling variations or errors that might be present in the actual documents, such as lap top, lap-top or latpop. Searchers and authors alike find it very difficult to anticipate the many ways in which the same idea might be described. [1]
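The vocabulary-mismatch problem above can be sketched in a few lines of Python; the document texts and identifiers are invented for illustration:

```python
# A minimal sketch of the recall problem: exact-word matching misses
# documents that describe the same concept with a different word.
documents = {
    "doc1": "Review of the new laptop models for field staff",
    "doc2": "Procurement guidelines for notebook computers",
    "doc3": "Budget report for the Nairobi office",
}

def keyword_search(query, docs):
    """Return the ids of documents whose text contains the query word."""
    return [doc_id for doc_id, text in docs.items()
            if query.lower() in text.lower()]

# Only doc1 is found, although doc2 is about the same kind of device.
print(keyword_search("laptop", documents))  # ['doc1']
```

A searcher who happens to pick "notebook" instead would find only doc2, so recall depends entirely on guessing the author's wording.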
One possible solution would be to force document authors to use only one well-known term for each concept or idea, and to systematically rewrite existing documents using only such a restricted vocabulary. But this approach is of course unrealistic, both in the scale of the effort and in the unlikelihood that the rewritten documents would remain equivalent to the originals.
On the other hand, rather than writing or rewriting documents with new words, one could flank such (unchanged) documents with terms from the unambiguous vocabulary, using these terms to describe only those concepts that are most relevant to the document. Searches performed on these flanking terms would of course be much more precise, and recall would be almost perfect, provided that the terms are defined precisely and chosen correctly.
Besides the unambiguous rephrasing of the actual content of the documents, one should note that relevant information about the document as a whole is often not part of the document itself, or is present in the content only in an undifferentiated form. Yet this information has great value for storing, cataloguing and classifying documents. Again, rather than forcefully appending such data to the content of the document, one could flank the document with this information in a manner that allows its easy use.
To summarize: flanking documents with terms allows precise use of unambiguous data without direct intervention in, or modification of, the original content of the document. The official term for such flanking data is metadata, or data about data. In other words, metadata is data associated with objects which relieves their potential users of having to have full advance knowledge of their existence or characteristics [2]. The list of metadata elements with which a document is flanked is called a metadata schema or, when additional requirements are in place, an ontology.
Metadata and ontologies are a way of organizing data about data, or information used to retrieve information. A bibliographic record such as a library catalog card is metadata about a book; the nutritional label on a soup can is metadata about the soup. Because it uses a set of standardized fields and a "controlled vocabulary" to fill those fields, the nutritional label is a metadata model or "scheme"; because that scheme is universally accepted, it can also be considered a "metadata standard." The Dewey Decimal System and ZIP codes are other examples of metadata standards [3].
Metadata represents a crucial difference between electronic and printed documents. All the information in a paper document is displayed on its face. Not necessarily so with electronic documents. Electronic documents may be asked to carry their history with them. Paper shows what a document said or looked like; metadata may be used to tell where the document went and what it did. [4]
In fact, the information shown when browsing a Web site or performing a database search is just the tip of the information iceberg. Below the "waterline" is a larger body of often invisible metadata information about the document, its author, and its sources. Metadata helps people locate the document, assess its quality and value, control access, and evaluate its usefulness.[5]
For years, even before computers, specialized metadata has been used to describe large collections of books or documents, and applied in a wide variety of fields, such as:
· in libraries, where it is used to catalog books (i.e. assign call numbers and subject headings);
· in database publishing, where it helps users search bibliographic, contact, or transaction databases;
· in dictionary and encyclopedia publishing, where it is used to organize information on words or topics;
· in book publishing, where it is used to prepare back-of-the-book indexes and organize information in reference books;
· in digital spatial data, where it describes the background information on the geographic coverage, quality, completeness, accuracy and other appropriate characteristics of the data [6].
Today, as all kinds of information migrate to the desktop via the Internet, the specialized metadata systems developed for each of these applications are merging into a comprehensive system. Increasingly, knowledge and information owners and users need to be familiar with all of them. [7]
Data that do not have accompanying metadata are often hard to find, difficult to access, troublesome to integrate, and perplexing to understand or interpret. Furthermore, as personnel change in an organization, undocumented data may lose their value. Later workers may have little understanding of the contents and uses of an unmarked collection of digital documents, and they cannot trust the results. Lack of knowledge about other organizations' data can likewise lead to duplication of effort. It may seem burdensome to add the cost of generating metadata to the cost of data collection, but in the long run the value of the data depends on its documentation.
Several advantages in using metadata for large scale organizations are discussed in the next subsection.
The benefits of assigning structured metadata to document collections include [8], [9]:
· Reliability in searches: providing information that search engines can use to find relevant documents in large collections where text search alone brings up many irrelevant documents or lists of documents too long for users to look at.
· Workflow support: providing descriptive information so that users can tell how old a document is, who wrote it, or how to get additional information on its content. Most documents in unstructured data collections cannot tell the user whether they are 5 days old or 5 years old, or whether the content is regularly updated or left in an unfinished state.
· Data filtering: enabling data administrators to introduce greater efficiency and accuracy into their data operations, eliminating inconsistencies, redundancies and irrelevant information. In turn, data growth and change can be more easily managed because new information can be filed using the metadata index.
· Inventory support: easing the creation of a list of what information the organization holds so that the information can be managed, tracked, updated, analyzed and used efficiently.
· Inter-departmental consistency: providing the framework and many of the rules for use so that metadata can be applied consistently within large and diverse organizations. This creates an environment in which users can search for and find information without needing to know which department produced it or to which program it relates.
· Interoperability: providing a way for information resources in electronic form to communicate their existence and their nature to other electronic applications and to permit migration of information between applications.
· Long-term organizational memory: protecting an organization's investment in data, so that it is not vulnerable to losing all knowledge about its data when key employees retire or accept other jobs.
Metadata creation is typically an obligation of the data producer. If the metadata producer is not the data producer, then a good liaison needs to be developed between the two. There may even be individuals specially trained to assist with the production of metadata for the organization's data sets.
Metadata, defined, as mentioned, as data about data, is meant to flank the documents it refers to and accompany them throughout their useful life. What kind of metadata to store with the documents, and how to organize it in meaningful structures, is yet another important issue to be faced when deciding to enrich an organization's document system with additional information.
There are several collections of metadata information proposed in the literature. They are called metadata schemas, and the most relevant and widespread for on-line documents surely is the Dublin Core [10]. The Dublin Core metadata schema is meant to identify and label the most important metadata elements that a librarian might be interested in knowing when looking for an electronic document. It is composed of 15 base elements (the Simple level) plus some 30+ additional qualifiers (the Qualified level) that help in describing online resources.
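A few of the 15 Simple-level elements can be sketched as a plain Python mapping. The element names (dc:title, dc:creator, etc.) are genuine Dublin Core elements; the record values describe a hypothetical field report and are invented for illustration:

```python
# A handful of Simple-level Dublin Core elements for a hypothetical
# document; the element names come from the Dublin Core standard, the
# values are made up for this sketch.
dublin_core_record = {
    "dc:title": "Situation Report, Week 12",
    "dc:creator": "Field Information Support",
    "dc:date": "2004-11-30",
    "dc:language": "en",
    "dc:format": "application/msword",
    "dc:subject": "humanitarian coordination",
}

def describe(record):
    """Render the record as 'element: value' lines for quick inspection."""
    return "\n".join(f"{k}: {v}" for k, v in record.items())

print(describe(dublin_core_record))
```

Because every collection exposes the same element names, a search tool can ask uniformly for, say, dc:creator across repositories without knowing how each one stores its documents.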
Although the Dublin Core is by far the most widely known metadata schema in the world of libraries and online document collections, in many fields its generality is felt to be too limiting for the scope and extent of applications internal to an organization. For this reason, many metadata schemas used internally within large organizations, while building on the Dublin Core, actually extend it for purposes of local interest only, to support advanced functionalities that are felt to be relevant within the organization. Therefore, although for external access to an organizational document collection the Dublin Core is widespread and almost universal, it is rather often just a limited view of the actual set of metadata used internally.
In fact, the very term metadata schema, referring to a set of labels used to describe and classify the metadata values that accompany a document, is in itself limited and tends to be replaced by the more encompassing term ontology.
Library science, philosophy, computer science, artificial intelligence, data mining, semiotics and a number of other scientific disciplines have contributed, and are contributing, to the development of the discipline that defines and discusses metadata and ontologies.
The concept of ontology has been around for a long time in philosophy, but in recent years it has become known in the computer world as a machine-readable vocabulary of concepts over which automatic inferences can be drawn. Before arriving at ontologies, though, it is worth having a look at intermediate concepts such as subject-based classification, controlled vocabularies, taxonomies, thesauri, and facets. All of these terms have definitions and histories that predate computers and the Internet, but they have found new life and uses with their advent [11].
Subject-based classification is any form of content classification that groups objects by the subjects they are about. For instance, the use of keywords to classify papers is a subject-based classification approach. Metadata properties or fields that directly describe what the objects are about by listing discrete subjects use a subject-based classification.
Controlled vocabulary is a closed list of named subjects used for classification. In library science this is sometimes known as an indexing language. The constituents of a controlled vocabulary are terms, i.e., unambiguous names for particular concepts. A controlled vocabulary consists of terms, not directly of concepts, and in general each term is disambiguated so that it refers to a single subject (that is, there are no duplicate terms). Note that "subject" as we have used the term so far is effectively equivalent to "concept". The purpose of controlling the vocabulary is to keep authors from defining meaningless terms, or terms which are too broad or too narrow, and to prevent different authors from misspelling or choosing slightly different forms of the same term.
Taxonomy, in modern times, is used to mean a subject-based classification that arranges the terms of the controlled vocabulary into a hierarchy. The benefit of this approach is that it allows related terms to be grouped together and categorized in ways that make it easier to find the correct term to use, whether for searching or for describing an object. Taxonomies help users in describing the subjects: from the point of view of metadata there is really no difference between a simple controlled vocabulary and a taxonomy. The metadata only relates objects to subjects, whereas here we have arranged the subjects in a hierarchy. So a taxonomy describes the subjects being used for classification, but is not itself metadata.
Thesauri, according to the two ISO standards that describe them ([ISO2788] for monolingual thesauri, and [ISO5964] for multilingual thesauri), extend taxonomies to make them better able to describe the world, not only by allowing subjects to be arranged in a hierarchy, but also by allowing other statements to be made about the subjects. Terms can thus be related to one another using relations such as:
· BT (short for "broader term") refers to the term above this one in the hierarchy; that term must have a wider or less specific meaning. In practice some systems allow multiple BTs for one term, while others do not.
· NT (short for "narrower term") is the inverse relation to BT, and it is implied by it.
· USE refers to another term that is to be preferred to this term; it implies that the terms are synonymous, and that the referred term is preferred to this one in the thesaurus.
· UF (short for "use for") is the inverse relation to USE, and it is implied by it.
· RT (short for "related term") refers to a term that is related to this term without being a synonym of it or a broader/narrower term. It can also be considered a partial synonym.
· SN (short for "scope note") refers to a string attached to the term explaining its meaning within the thesaurus. This can be useful in cases where the precise meaning of the term is not obvious from context.
· TT (short for "top term") refers to the topmost ancestor of this term, found by following the BT axis until no more BT terms exist. This property is strictly speaking redundant, in the sense that it adds no information, though it may be convenient.
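The redundancy of NT and TT noted above can be made concrete in a small sketch: if only BT is stored explicitly, both NT and TT can be derived from it. The relation names are those of the ISO thesaurus standards; the example terms are invented:

```python
# A thesaurus fragment: BT ("broader term") is stored explicitly;
# NT and TT are computed from it, showing they add no information.
broader = {  # term -> its broader term (BT)
    "laptop computer": "computer",
    "desktop computer": "computer",
    "computer": "equipment",
}

def narrower_terms(term):
    """NT: invert the BT relation."""
    return sorted(t for t, bt in broader.items() if bt == term)

def top_term(term):
    """TT: follow the BT axis until no broader term exists."""
    while term in broader:
        term = broader[term]
    return term

print(narrower_terms("computer"))   # ['desktop computer', 'laptop computer']
print(top_term("laptop computer"))  # equipment
```

A real thesaurus would also carry USE/UF, RT and SN entries alongside the BT axis, but those do not follow from the hierarchy and must be asserted separately.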
One could say that taxonomies as described above are thesauri that only use the BT/NT properties to build a hierarchy, and do not make use of any of the other properties, so it could be said that every thesaurus contains a taxonomy. In short, thesauri provide a much richer vocabulary for describing the terms than taxonomies do, and so are much more powerful tools. As can be seen, using a thesaurus instead of a taxonomy would solve several practical problems in classifying objects and also in searching for them.
The term faceted classification is used when it is possible to identify a number of different aspects (called facets) into which the terms can be classified. The facets can be thought of as different axes along which documents can be described, and each facet contains any number of terms. How the terms within each facet are described varies, though in general a thesaurus-like structure is used, and usually a term is only allowed to belong to a single facet. In faceted classification the idea is to classify documents by picking one term from each facet to describe the document along all the different axes. This would then describe the document from many different perspectives. Faceted classification may seem very different from a thesaurus, but in fact faceted classification could be seen as simply a very disciplined way to construct a thesaurus as well as to use it for classification purposes. Furthermore, there exists a generalized view of faceted classification wherein each facet is generalized to the point where it becomes a general property. In this view there is little difference between faceted classification and ontologies as they are described below.
The term ontology has been applied in many different ways, but the core meaning within computer science is a model for describing the world (or a set of documents) that consists of a set of types, properties, and relationships. Ontologies represent the culmination of the above progression of terms, in the sense that all of the above are vocabulary languages for subject description.
In a taxonomy the means for subject description consist of essentially one relationship, the broader/narrower relationship used to build the hierarchy; thesauri extend this with the RT and UF/USE relationships, and the SN property, which allow them to better describe the terms; faceted classifications do not really extend the relationships, but provide a consistent and useful discipline for applying them, since they have the author of the indexing language create a set of facets and fill each with a thesaurus that does not overlap with the others.
With ontologies the language is no longer closed: the creator of the subject description language is allowed to define the language at will, defining not only a single type (the term) and a few relationships among terms (e.g., as in thesauri, the BT/NT, USE/UF, and RT relationships and the SN property), but as many types as needed, as well as their properties and relationships. Each type ends up being an independent concept, provided with a list of properties and relationships connecting its values to each other and to other concepts, and allowing complex inferences to be drawn from them.
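The difference can be sketched in a few lines: instead of a single "term" type, the modeller defines several types with their own properties and relationships. The class and property names below are invented for illustration, not taken from any standard ontology:

```python
# An ontology sketch: multiple types (Document, Person) with properties
# and a relationship between them, rather than a single 'term' type.
from dataclasses import dataclass, field

@dataclass
class Person:
    name: str

@dataclass
class Document:
    title: str
    language: str
    authors: list = field(default_factory=list)  # relationship to Person

alice = Person(name="Alice Example")
report = Document(title="Weekly Report", language="en", authors=[alice])

# A simple inference over the relationship: which documents did Alice write?
documents = [report]
by_alice = [d.title for d in documents if alice in d.authors]
print(by_alice)  # ['Weekly Report']
```

Even this toy model supports a query a thesaurus cannot express, because "author" is a relationship between two distinct types, not a link between terms.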
Ontologies help make software more efficient, adaptive, and intelligent, because they allow [12]:
· Sharing of common understanding of the structure of information among people or software
· Separation of domain knowledge from the operational knowledge
· Making domain assumptions explicit
· Reuse of domain knowledge
· Analysis of domain knowledge.
Within the World Wide Web Consortium (W3C, the non-profit organization dealing with the development and advocacy of technical standards to increase interoperability within the Internet, and the World Wide Web in particular), a new initiative dubbed the Semantic Web started a few years ago. Heavily backed by the director of the W3C itself, Tim Berners-Lee, the inventor of the Web, the Semantic Web is an ongoing effort to provide web-related software (and, in fact, any desktop software) with the features to understand documents, rather than simply display them. The idea behind the Semantic Web is to create sophisticated applications that can derive new knowledge and exhibit complex behavior (e.g. inferences, comparisons, searches, and so on) based on formalized statements about the content of documents, expressed in terms of metadata accompanying the documents themselves.
According to the vision established at the end of the nineties by the Semantic Web Working Group at the W3C (and displayed in fig. 1), the Semantic Web will be laid out in several layers, each providing more sophisticated and extensive services over the data expressed at the lower layers. At the very base there are documents, expressed in Unicode to support internationalization and disambiguate character encodings across operating systems, and structured in XML to avoid syntactical and semantic misrepresentations. Documents will be accompanied by metadata statements expressed in RDF, the first standard to be really specific to the Semantic Web: RDF (Resource Description Framework) allows metadata creators to express metadata on documents in an easy and unambiguous format, as true statements about them (e.g., John Smith is the author of document Mydoc.doc). RDF statements will be organized and governed by collections described and delimited in RDFS (RDF Schema), which allows metadata organizers to specify the number and names of the metadata elements that need to be expressed for a given class of documents. For instance, the Dublin Core, like hundreds of other metadata schemas, is defined as an RDFS schema.
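An RDF statement reduces to a (subject, predicate, object) triple. The sketch below mirrors the John Smith example from the text; the dc:creator and dc:language predicate names are real Dublin Core properties, while the file names are the hypothetical ones used above:

```python
# RDF statements reduced to their essence: (subject, predicate, object)
# triples, plus the simplest possible query over them.
triples = [
    ("Mydoc.doc", "dc:creator", "John Smith"),
    ("Mydoc.doc", "dc:language", "en"),
    ("Report.doc", "dc:creator", "Jane Doe"),
]

def objects(subject, predicate):
    """All objects asserted for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("Mydoc.doc", "dc:creator"))  # ['John Smith']
```

Real RDF identifies subjects and predicates with URIs rather than bare strings, but the triple structure, and the ability to query it uniformly, is exactly this.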
In order to support real inferences and comparisons on metadata collections, these schemas will need to be enriched with additional information, turning RDFS schemas into inter-related classes of entities with properties and relationships. These classes in fact form ontologies and are expressed in OWL (the Web Ontology Language), the next standard language in the Semantic Web stack. At the moment (end of 2004), the development of standards is complete up to the OWL layer, but new activities are being defined for the further layers, in particular to provide mechanisms for deriving logical inferences from metadata structures laid out according to OWL ontologies, and for proving the truth of such inferences in a given domain. Finally, a layer of trust preferences will allow inconsistent and contradictory metadata to coexist on the web without ill effects on the ability to draw correct inferences, and will contribute to defining the domains where truth proofs can actually be demonstrated.
fig. 1: the layered structure of the Semantic Web as proposed by the W3C[13]
The Semantic Web needs ontologies with a significant degree of structure. These need to specify descriptions for the following kinds of concepts:
· Classes (general things) in the many domains of interest
· The relationships that can exist among things
· The properties (or attributes) those things may have[14]
An ontology, in short, comprises a formal explicit description of entities (also called classes) in a domain of discourse, properties (also called slots) of each concept, describing various features and attributes of the concept, and restrictions on properties (also called facets). An ontology together with a set of individual instances of classes constitutes a knowledge base. [15]
Finally, it becomes clear that in order to provide a robust set of metadata elements for a big collection of documents, it is advisable to go beyond the selection of a well-known but general metadata schema such as the Dublin Core, and to implement an ontology that expresses all the relevant information specific to the document class; an ontology that can be made compatible with a well-known but general metadata schema such as the Dublin Core, but that does not limit itself to it, and that can be easily expressed in OWL, although it does not have to be created in that language from the beginning.
The main purpose of this work is to propose, defend and detail a process for the introduction of metadata and metadata tools for the documents being created, exchanged and published within the activities of the Field Information Support offices.
In order to perform this activity in the most appropriate way, one needs to examine the current situation of document exchange and publication within FIS, OCHA and, in general, the UN, and to put forward the proposal that has the most effective and least disruptive repercussions on the daily activities of the offices themselves. To this end, we will first examine the current situation regarding metadata and metadata specifications within UN, OCHA and FIS, and then a few specific issues that arise within FIS and its own peculiar way of managing documents.
The issue of providing OCHA documents with metadata is tackled by a number of different projects and activities within OCHA and the UN. A brief list includes:
· The UNBIS thesaurus is used in all UN projects for subject classification, and contains 7,000+ terms in six languages, with a complex and very complete hierarchical structure of relations among terms. The whole tree of terms is divided into 18 categories, which are further split into about 150 subcategories.
· The OchaOnLine subject listing is a thesaurus directly derived as the subset of the whole UNBIS thesaurus that includes some 50 terms that are meant to be all and only the relevant terms that are to be used within OCHA.
· The ReliefWeb metadata for document management includes a 9-element schema, a 9-term document type index, and a 45-term subject index, which together constitute a first, rough classification of documents being proposed for publication on the ReliefWeb site.
· The UN-ARMS record keeping initiative has introduced a 17-element metadata schema for the specification of electronic records, closely based on the Dublin Core schema.
· The archival metadata proposal by the IT section of OCHA, especially in the
· The metadata working paper (draft 1.0a) by FIS, specifying a 22-element structure, most of which is based, again, on the Dublin Core.
These experiences, although at various degrees of completion and aimed at different purposes, appear to be extremely well thought out and directly applicable to the reality of FIS documents. Furthermore, all these projects appear to be more or less compatible with each other, with minor differences that are most probably easy to reconcile. In Appendix A we compare some of these schemas side by side to identify their similarities and differences.
It should be noted, on the other hand, that all these projects are meant for the management of finished documents at their final destination: archives, record keeping systems and web sites all accept complete documents that are meant for long-term storage and consumption.
Furthermore, they only minimally consider (with the praiseworthy exception of the document from the IT section) the constraints that system architectures and user acceptance place on the number of elements and the complexity of the specification of metadata. Yet these constraints may heavily affect the efforts to use the provided vocabularies proficiently and to follow the procedures for metadata specification.
Finally, and most importantly for our purposes, all these projects assume that the process of uploading the documents onto the repository and of detailing their metadata is a specific activity to be performed with appropriate tools. In fact, most of them imply that web interfaces or Lotus Notes interfaces are used for form-filling and file uploads.
Yet in many situations, and in particular within field offices, web-based applications may not be a solution, and may cause practical problems that would concretely prevent the application from reaching any form of success at all.
When dealing with web-based forms, the network must be running, and running reasonably fast, every time a new document is uploaded. The upload operation belongs to a different environment than the editing of the document itself, and may require several minutes to complete. In practice this means that the officer has to quit the editing application (say, MS Word), connect to the Internet, possibly go through the procedures for entering a private Intranet for security reasons, launch the browser, reach a certain page, fill in the form and attach the file.
Furthermore, although much of the information (title, dates, authors, etc.) requested at upload time is actually available within the document itself, web-based applications cannot obtain it directly. Thus the officer must either key in the data himself, or keep both the editor and the form open at the same time, copying and pasting data from the editor into the web form.
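Much of the retyped information is in fact machine-readable. As one illustration of the principle (using Word's newer .docx format, which is a zip archive whose docProps/core.xml holds Dublin Core-based properties), a small script can extract title and author directly. The sketch builds a minimal stand-in archive so it is self-contained; the property values are invented:

```python
# Reading document properties straight from a .docx-style archive instead
# of asking the officer to retype them into a web form.
import io
import zipfile
import xml.etree.ElementTree as ET

CORE_XML = (
    '<cp:coreProperties '
    'xmlns:cp="http://schemas.openxmlformats.org/package/2006/'
    'metadata/core-properties" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:title>Situation Report</dc:title>'
    '<dc:creator>John Smith</dc:creator>'
    '</cp:coreProperties>'
)

# Build a minimal stand-in for a .docx file (a zip with docProps/core.xml).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("docProps/core.xml", CORE_XML)

def read_core_properties(docx_file):
    """Extract Dublin Core properties embedded in a .docx-style archive."""
    ns = {"dc": "http://purl.org/dc/elements/1.1/"}
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    return {
        "title": root.findtext("dc:title", namespaces=ns),
        "creator": root.findtext("dc:creator", namespaces=ns),
    }

buffer.seek(0)
props = read_core_properties(buffer)
print(props)  # {'title': 'Situation Report', 'creator': 'John Smith'}
```

A local tool running next to the editor could use exactly this kind of extraction to pre-fill metadata, which a browser form, isolated from the file system, cannot do.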
In the next section we will examine how practical constraints and pressures imposed by the environment and working conditions may heavily affect the acceptability of either of these proposals for FIS offices.
Field Information Support units have different and more specific problems to deal with than people in central offices. Among the differences that are worth noting are:
· Field officers prepare a large number of drafts and documents that are subject to intensive circulation but never reach a final stage meant for long-term archival or actual publication. These documents need to be described for easy retrieval and classification, even if they are not in their final form.
· Field officers usually do not have much time or patience to fill in pages and pages of long and tiresome metadata forms, and they need mechanisms for super-fast and super-easy metadata specification.
· Field officers may not have good Internet connections (they may have none at all), which may considerably slow down, or altogether preclude, on-line interactive applications such as web-based forms.
· Field officers may need to have a copy of relevant documents at hand (locally), and may need to interact with a smaller, decentralized document system that provides access only to a smaller number of relevant documents.
· Field officers may not be fluent in English (they could be native French speakers, for instance) and may have problems selecting the most appropriate term for describing their documents.
All in all, this seems to imply that full support for document management and metadata specification has to be provided at the local offices, operating disconnected from the central repository and with only sporadic updates to it.
Furthermore, the current procedures for exchanging and working on documents are peculiar to and well understood by officers, who may be unwilling to completely revolutionize their processes merely for the sake of metadata specification.
The combination of these issues leads us to conclude that no web-based architecture can be reasonably employed for the management of metadata in remote offices of FIS, but something different needs to be designed.
In particular, it appears that the peculiarities of the work processes of field officers will require a solution as painless and unobtrusive as possible. This, in turn, requires that the current work processes and tools be examined before suggesting new ones or any change to the existing ones.
Without detailing everything that is accomplished within FIS, we can observe that the processes most relevant to our purposes are connected with the writing of reports and other kinds of documents, and their exchange with other officers from FIS and/or other agencies and organizations.
Given the ad hoc nature of officers' interventions, the short timeframe for decision making and news spreading connected with their daily activities, and the sparse technological commodities available, computers and software are used in the most informal way, without proper organization or formal procedure, and with no specially built application installed and run on the officers' machines.
Formal procedures are but a minor part of the daily activities of the officer: for instance, after a number of draft versions of a report have been informally circulated and approved, the author sets out to modify and correct it so as to create the official report, which is then uploaded or sent to the central offices for auditing, publishing and final archiving. In fact, most documents exchanged on a daily basis among officers never end up as finished documents in the central archives.
Most activities in fact revolve around the use of two particularly low-tech tools, namely e-mail and shared directories on a central repository, while the web is mostly used to gather documents and information rather than as the environment for special web-based applications.
E-mail is the most frequently used means for exchanging data and documents. It is well understood by officers and easy to use, and it would be impractical at the least to ask officers in the field to give up e-mail for document exchanges: meetings are organized, minutes of previous ones are distributed, and documents and drafts are circulated all through e-mail.
E-mail messages can be divided into three main categories:
· Transport-only messages: the message itself contains no useful information, but is mostly an envelope for the attachments that come with it ("Please find attached the latest report on the situation in …").
· Mixed messages: the message contains some useful information, and there are also attachments that constitute independent documents and have traveled together with the actual message ("This is the first draft of my report on the situation. It is a follow-up of the discussion we had yesterday on the problems on the west border of …").
· Content-only messages: the message contains only textual information in the body and has no (or no relevant) attachment. Content-only messages, therefore, can and must be considered documents, and only when it is clear that they have no useful content can they be ignored.
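The three categories above can be sketched as a small classifier. This is only an illustration: the 80-character threshold for distinguishing a courtesy note from useful content is an arbitrary heuristic, not part of the proposal, and the use of Python's standard email library is simply a convenient choice.

```python
from email.message import EmailMessage

def classify(msg: EmailMessage) -> str:
    """Classify a message as 'transport-only', 'mixed' or 'content-only',
    following the three categories described above."""
    attachments = [p for p in msg.walk() if p.get_filename()]
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content().strip() if body else ""
    if attachments and len(text) < 80:   # body is a short courtesy note
        return "transport-only"
    if attachments:
        return "mixed"
    return "content-only"

# Example: an envelope message carrying a report as attachment.
msg = EmailMessage()
msg["Subject"] = "Latest report"
msg.set_content("Please find attached the latest report.")
msg.add_attachment(b"...", maintype="application",
                   subtype="msword", filename="report.doc")
print(classify(msg))  # transport-only
```

A mail agent such as the one proposed in section 3.3.2 could use this kind of triage to decide whether the body of a message must itself be archived as a document.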
E-mail messages provide an easy-to-use channel both for spreading news, calling meetings and discussing ideas, and for the diffusion and sharing of final and intermediate drafts of documents. Mailing lists are set up to allow all interested individuals to take part in the complex exchange and movement of documents and news, and new groups of addressees for special documents can be defined immediately and with no hassle, completely within the control of the sender, without the need to resort to any advanced tool or server-side application.
On the other hand, e-mail offers no control at all over the content being exchanged. People may send each other the same document, different versions of the same document, different documents with the same file name, etc., with no chance that the problems are discovered and managed by the system itself: it is left entirely to the final addressee to sort out the content and the context of the documents being exchanged over e-mail.
Shared directories are another simple mechanism for data exchange. A server, easily accessible by all interested officers, is set up within an office; everyone can log on and copy interesting documents or back up whole directories of their laptops. Thus, without any attempt at imposing or even suggesting an order for subfolders, a large number of documents in all stages of evolution (from simple notes to final versions) are put in a shared directory for everyone to access, copy or modify.
Shared directories are a rather risky environment, since they allow very limited control and protection over access to and modification of shared content. Nonetheless, experience shows that they are in fact a very common occurrence in all local situations and, as long as the users are few, trustworthy and know each other fairly well, only rarely cause loss of data or corruption of important documents.
Shared directories are one technical solution to at least three different needs of their users:
· as a location for long-term storage of lesser documents and tools that do not fit on the laptops and that are not necessary at the moment, but might as well become useful at some unforeseeable moment in the future,
· as a safe-keeping backup of very important documents and data, especially just before embarking on a potentially dangerous activity, such as switching to a new computer, a new operating system, or leaving for a long-term out-of-office mission,
· as an exchange mechanism for documents and data that are copied and extracted freely and without bureaucracy by all interested parties, so that each can access, modify and print the same copy of the document without any complex pattern of e-mail messages.
The advantages of shared directories are clear: there is no bureaucracy to speak of, they are incredibly easy to set up and use, and they allow the needs, uses and content of documents to evolve along with the situation without changes in the underlying technical mechanisms and procedures. Given these considerations, blocking users from setting up and relying on these tools would likely prove inappropriate and ultimately unsuccessful.
On the other hand, shared directories make it very difficult (in fact, impossible) to concretely manage the documents they contain, or to prevent misuse and mistakes. The presence of different copies of the same file, most probably in different stages of readiness and completion, with different file names and in the most diverse locations, has to be considered inevitable, as do files that share a name but differ in content, or directories organized by all sorts of different criteria (by owner, by location, by date, by application, etc.). In short, the daily state of a shared directory will inevitably reflect the short-term and hasty decisions of its users rather than any long-term and systematic planning effort by the data managers.
In order to better focus on the issues that arise when dealing with unstructured work processes driven by urgency, short-term needs and informal tools, let us consider two fairly common scenarios in the daily management of documents.
After having done that, she thinks it would be better if Bruce, from another agency, took part in the meeting as well, since he has just arrived and it would be a chance to meet and get to know everybody. Bruce and Carla, one of the addressees of the mailing list, thus receive the same message as two different e-mails.
Upon receiving the message, Carla copies the report into a folder on her laptop called "monthly reports" and renames it "June2004.doc", to avoid confusion with the other files in the same place. She then reads and modifies the report, subtly changing the words in some paragraphs, saves it and sends it back to Alice as a new version of the same document. Then she remembers that Bruce, who has just arrived, has the expertise needed to provide feedback on the report, and thus sends him an e-mail asking for his opinion on the document "June2004.doc".
Bruce therefore finds on his computer two files, one called "report.doc" and the other "June2004.doc", and has to print and examine both before finding out that they are actually the same document. Furthermore, there is nothing in the names or content that easily tells him which version comes first and which is a subsequent modification, who the author of the modifications is, or what the modifications are. So he decides to read both closely, and to ask for further explanations at the meeting.
As a result, Bruce has wasted time and paper printing and carefully reading two copies of the same document to find the subtle differences, and on Alice's and Carla's computers the same document will exist under two different names, with nothing to hint that they are actually the same document.
After a few months, the time comes for her to leave for a different assignment.
Furthermore, she decides that David, her substitute, will need a crash introduction to the issues that are still open, and thus creates an additional folder on the server, "Files for David", copies into it all the important files that he needs to examine closely, and renames them with meaningful names. She then sends an e-mail to David telling him that the files needed to learn about the local situation can be found on the server.
When David arrives, he logs on to the server and examines the content of the shared directory, where he immediately finds the directory "Files for David".
In order to make some sense of the files he is examining, David decides to reorganize them according to his own criteria. Thus he creates a folder called "minutes" where he stores all the files that appear to contain minutes of past meetings.
As a result, David spends days reading thousands of pages of documents.
Both scenarios presented in the previous section show common situations in offices where time, money and knowledge are lost because of faulty message exchange among willing and bright co-workers, or because of difficulties in finding the right information when it is needed.
Both scenarios would have played out differently if the users had had more information to rely on than the filenames and the folders in which the files were stored. Concretely, both would have been different if the documents had been self-descriptive, that is, if they had carried within them the information that allowed their users and readers to know about them, their authors, their history, and so on.
In this section we propose a solution that, while keeping in mind the peculiarities of the FIS offices and their specific needs, could improve the way in which documents are managed, exchanged, stored and accessed by their users. The proposal is based on the systematic adoption of a number of metadata fields through which documents are enriched so that they can provide information about themselves. The proposal is divided into three different parts: the metadata that need to be stored with the documents and the ontology they represent; the tools needed to enter and manage the metadata and to provide for the safekeeping and archival of the documents; and the processes that need to be put in place for this solution to succeed in practice.
The existing proposal within the UN, OCHA and FIS addresses the need to create a metadata schema for the enrichment of documents. In this document we propose, instead, a full-featured ontology for the same purpose. There are several reasons for this:
· A richer data model: ontologies allow the specification not just of data values for metadata fields, but of complex interrelationships among the entities that describe the document. For instance, a metadata schema could require an author name field and an author email field. An ontology, on the other hand, could require that a person entity be specified for the author relationship, of which the name is simply one property, as are the e-mail, the organization, the telephone number, and any other information that identifies and describes that person. Since author and author email are no longer entered as simple values but derive from a reference to a complete data structure, we gain increased freedom in choosing the value that identifies the person (possibly only the email, since the name would follow automatically from an inspection of the relevant data structure in an existing database), and an increased amount of information about the author that does not need to be inserted manually by the author of the metadata.
· Increased control over incomplete or wrong information: having multiple fields specify properties of the same entity (for instance, both an author name and an author email) opens up the possibility of wrong information (e.g., if the name and the email of the author do not match) or missing information (e.g., if only the name but no email was specified). A metadata schema would not be able to notice inconsistencies, and would at most refuse to proceed until all the required information was entered. A working ontology would be able to check for wrong values and actually supply the missing information, or at least flag the record as incomplete, to be filled in when the relevant data became available.
· More complex applications: besides searching on the explicitly specified values of a metadata schema, the use of an ontology would allow more sophisticated applications: consistency checking, for instance, or searching on non-explicit values. If the person entity had an organization property, we could search for all documents authored by members of that organization, even if no explicit mention of the organization was contained in the actual metadata of the documents. Cross-referencing implicit information in searches becomes possible when searches act on entities rather than bare values. This advancement has fostered a whole scientific discipline called "data mining", to which we refer for further details [16].
· Compatibility with present and future standards and tools: as mentioned in section 2.2, there has been a huge increase in Semantic Web activities, products and tools that rely on the existence of interoperable ontologies for sophisticated applications. Even if at the moment no inter-organization data exchange is foreseen, a time may come when all the data that can be drawn from documents will be required in a more sophisticated format such as the one made available through ontologies.
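The consistency-checking advantage described above can be illustrated with a minimal sketch. The directory contents and the idea of resolving an author against a personnel database are purely illustrative assumptions; the point is that a flat schema would accept any (name, email) pair, while an ontology insists that the pair denote a single known entity.

```python
from dataclasses import dataclass

@dataclass
class Person:
    # One record in a hypothetical personnel database.
    name: str
    email: str
    organization: str

DIRECTORY = [
    Person("Alice", "alice@example.org", "FIS"),
    Person("Bruce", "bruce@example.org", "Partner Agency"),
]

def resolve_author(name=None, email=None):
    """Resolve a metadata entry to a Person entity. Mismatched or
    misspelled values fail to resolve and are caught at entry time."""
    hits = [p for p in DIRECTORY
            if (name is None or p.name == name)
            and (email is None or p.email == email)]
    if not hits:
        raise ValueError("inconsistent or misspelled author metadata")
    return hits[0]

# Only the e-mail is typed in; name and organization come for free.
author = resolve_author(email="alice@example.org")
```

Passing a name and an email that belong to different people raises an error, which is exactly the inconsistency a flat metadata schema would silently accept.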
In fact, there is very little difference between employing a metadata schema and employing an ontology for a given class of documents. In both cases, the information that an author or an editor knows about a document needs to be identified and specified within the relevant tools, and the values to be specified will most often be exactly the same.
The difference lies in the underlying structure that justifies and gives meaning to the actual values, and in some cases allows for their validation and consistency checking. Going back to the previous example, both the metadata schema and the ontology will probably require the identification of an author for the document. But while the metadata schema will be satisfied by a string (the author's name), the ontology will require the correct identification of a person, the author, who can be identified through a name but is not his/her name.
In fact, there are several ways to identify a person, which can be employed fruitfully when dealing with ontologies: for instance, we may know the name or the email address, and in either case the system could identify the correct person. The system could immediately check for a misspelling of the name (no person exists with that name) or an inconsistency between name and email. Finally, it would know how to send an email to a person of whom it only knows the name.
In summary, relying on an ontology instead of a metadata schema would add to the flexibility of the system, the sophistication of the applications, the precision and correctness of the information, and the ease with which the data is entered.
As further detailed in appendix C, the proposed ontology identifies five main entities (documents, persons, organizations, places and topics) and a number of smaller ones (among which times, actions, rights, metadata, etc.). Each entity is associated with properties (strings, numbers, dates) and relationships (with other entities). So, for instance, the title of a document will be a property, while the author will be a relationship with a person, possibly identified by his/her name, which is in turn a property of the person entity.
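The distinction between properties and relationships, and the kind of search it enables, can be sketched as follows. The entity names follow the ontology of the appendix, but the fields are abbreviated and the sample data is invented for illustration; this is not the actual data model.

```python
from dataclasses import dataclass

@dataclass
class Organization:
    identifier: str
    name: str

@dataclass
class Person:
    identifier: str
    name: str                   # property (a plain string)
    organization: Organization  # relationship (another entity)

@dataclass
class Document:
    identifier: str
    title: str                  # property
    author: Person              # relationship

ocha = Organization("org-1", "OCHA")
alice = Person("pers-1", "Alice", ocha)
doc = Document("doc-1", "Monthly Report June 2004", alice)

def documents_by_organization(docs, org_name):
    """Search on a value never written into the document's own
    metadata: the organization is reached by following the author
    relationship, as an ontology-backed system could do."""
    return [d for d in docs if d.author.organization.name == org_name]

found = documents_by_organization([doc], "OCHA")
```

Note that "OCHA" appears nowhere in the document's own fields: the match is obtained entirely by traversing relationships between entities.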
Work is needed to let FIS officers deal with metadata, and to have their processes improved by its use. In particular, rather than knowledge structuring, the work really needed appears to be system-wide: a new architecture for tools and data exchange is required to allow the low-bureaucracy, informal document processes required by FIS officers to coexist with the more formal procedures foreseen within the projects described in section 3.1.
First of all, therefore, a new architecture needs to be described for the specific needs of FIS local offices. Then a set of tools (all of which may be needed for the architecture to succeed) will be proposed and discussed in the next section.
As mentioned, the architecture needs to be designed and implemented so as to impact as little as possible on current usage and processes, while at the same time providing for easy and precise management of metadata.
Behind the architecture there are a few fundamental assumptions:
· Existing work processes are only marginally affected by the management of metadata, and complete freedom is left to local offices and individual officers to carry out their work and organize jobs and tasks internally as they prefer. In particular, e-mail and shared directories need to be maintained and used as they have been in the past.
· Multiple layered storage of documents is a necessity, to let local offices keep working even in a fully disconnected manner. Automatic synchronization will be performed on the (possibly rare) occasions on which some connectivity exists with the central headquarters.
· Client-side editing of the metadata is a necessity. This means that metadata needs to be embedded within the documents, and never handled separately (except as redundant copies for faster retrieval and indexing). Users embed metadata as soon as possible (on the first saves of the document) rather than as an afterthought at the end of the useful life of the document. Authors are better equipped to add meaningful metadata than editors are.
· Officers have very little time to waste on trivial jobs. Looking through a 15-page manual on how to specify metadata, or a several-hundred-page document containing all the possible keywords that can be used to describe a document, is a task that no officer will agree to perform for long. Sooner or later they will either stop doing it or enter the first values that feel appropriate and be done with it. Massive and systematic help in specifying metadata values must therefore be designed and implemented. This requires a systematic effort to use machine memory, context awareness, document content interpretation, etc. to suggest relevant values to the user.
That said, the architecture of the system can be outlined in the following figure:
fig. 2: The proposed architecture for the document management system for FIS
A central repository exists at the headquarters.
A subset of these documents is downloaded (via a number of means, including e-mail, HTTP, or even CD-ROMs) onto the local repository, which resides on a server at the premises of the remote field office. Documents here are either copies of others stored in the central repository, or documents created locally and yet to be backed up centrally. They are accessible via a number of means, including direct file access (the shared directory).
On the machine of each field officer there are further copies of some documents, either live (i.e., currently being edited) or redundant copies of completed documents used for reference. Whenever the officer modifies one of these documents, an automatic system suggests metadata values and expects modifications or approval from the officer before the document is saved.
Whenever a document is actually composed of a single file and the data format allows it, the metadata are stored within the document itself. For instance, Microsoft Office documents, PDF documents and some other common formats allow this.
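To make the idea of embedded metadata concrete: modern Office (OOXML) documents are ZIP packages whose standard metadata lives in a part called docProps/core.xml, so it can be read with nothing but a standard library. The sketch below builds a minimal stand-in package for demonstration (a real .docx contains many more parts, and the sample values are invented).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
CP = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"

def read_core_properties(docx_bytes):
    """Extract the Dublin Core metadata embedded in an OOXML package:
    unzip it and parse docProps/core.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as z:
        root = ET.fromstring(z.read("docProps/core.xml"))
    # Strip the XML namespace from each tag, keep the text value.
    return {child.tag.split("}")[1]: child.text for child in root}

# Minimal stand-in package: only the core-properties part matters here.
core_xml = (
    '<cp:coreProperties xmlns:cp="%s" xmlns:dc="%s">'
    '<dc:title>Situation Report June 2004</dc:title>'
    '<dc:creator>Alice</dc:creator>'
    '<dc:language>en</dc:language>'
    '</cp:coreProperties>' % (CP, DC)
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("docProps/core.xml", core_xml)

meta = read_core_properties(buf.getvalue())
```

Because the metadata travels inside the file, it survives renaming and copying, which is exactly what the scenarios of section 2.4 call for.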
Whenever documents on the field officers' machines are checked in (i.e., copied to the shared directory of the local repository) or exchanged via e-mail with other officers, the central repository has a chance to see them and check them against stored copies. Furthermore, it can verify whether the documents' metadata are complete and correct.
Every once in a while, the local repository sends automatic updates of the local documents to the central repository, which stores them and becomes the authoritative source for them. Documents coming from the outside (e.g. from other UN agencies) are enriched with metadata by whoever copies them in, before being added to the local repository. A checkpoint application verifies whether this has happened correctly.
To summarize, the proposed architecture has the advantage of a low impact on existing computing and communication requirements, and it blends easily with the work procedures and daily tools in use at the moment (especially shared directories and e-mail).
Tools are divided into three categories, according to the three relevant places identified by the architecture previously described:
· User tools, installed on each individual machine and regularly used by each officer during his/her daily activities. As much as possible, these tools need to be unobtrusive and helpful, limiting as much as possible the additional workload of the officers in the specification of metadata values.
· Local tools, installed on the server holding the local repository, providing for verification, validation and communication services with the other tools. Once installed, these tools are meant to work automatically and unattended, and dispense their services with little or no training in their use.
· Central tools, installed on the central repository, providing for verification, conversion and communication services with the local repositories from which the new documents come at irregular intervals.
These tools must all be designed and implemented together, because they closely interact with each other, and must be carefully crafted to leave as few design defects and implementation bugs as possible.
The main purpose of the user tools is to allow the user to embed in the document, and review, the appropriate metadata as specified by the underlying ontology. This needs to happen, as mentioned, in the least intrusive and most helpful way, to lessen the additional workload of FIS officers.
On the other hand, the little additional effort they require must be demanded clearly and without exception of all users, regardless of their status, the haste they are experiencing, or the perceived (lack of) importance of the document. It is important to remember that metadata has to be added by the document's authors and modifiers while modifying the document, and not afterwards by some aide or assistant to the document's author.
There are but three tools that need to be installed on the user's machine, and the processes involved are similarly limited. They are all metadata editors that allow and help users in the specification of the relevant metadata values to attach to the documents.
As mentioned, there are two kinds of data formats: some allow the embedding of metadata values within the file itself, and others do not allow any additional bytes within the document. Fortunately, the most common cases (MS Word, MS Excel, PDF and e-mail documents) all allow the embedding of metadata, and the data formats that do not are considerably less frequent in real use, being substantially limited to images such as GIF or JPEG files.
Accordingly, there will be two different tools for these two cases:
· An embedded metadata editing application that runs within, or in close contact with, the document editor, and allows new metadata to be associated with the document. Tools are foreseen for MS Office documents and PDF files. Upon saving the document, a form will be shown with all the requested fields ready to be filled in. Only after the entered values have been approved will the application allow the document to be saved. In order to lessen the burden on users, the application will systematically provide meaningful suggestions for values, taken from the content of the document, from similar metadata specified in the past, and from context information provided in the configuration phase. Hopefully, in most cases the user will only have to review the proposed values, approve them and save the document, at an additional cost of but a few seconds.
· An external metadata editing application for those documents whose format does not allow embedded metadata (e.g. image files). In this case, unfortunately, the user has to remember autonomously to start the application and to specify the file to package. The application will show a form similar to the one previously mentioned and will expect the user to fill in the missing values. Of course, not having access to the content of the document, it will not be able to propose as many and as satisfying suggestions as the other editor, but it will still be able to use default pre-configured values.
In addition to these tools, a third module needs to be created to verify the correct use of e-mail in the transmission of attachments.
· A mail validation module, which works within, or in strict contact with, the mail application of the user, and checks whether messages are accompanied by attachments and, if so, whether these are correctly enriched with the appropriate metadata, prompting the user to add them if they are not.
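The validation step can be sketched as follows. The convention checked here (per-attachment "X-FIS-*" MIME headers and the two required fields) is purely illustrative and not part of the proposal; a real module would instead inspect the metadata embedded inside each attached file.

```python
from email.message import EmailMessage

REQUIRED = ("Title", "Author")  # fields the validator insists on

def unenriched_attachments(msg: EmailMessage):
    """Return the filenames of attachments that lack the required
    metadata markers, so the officer can be prompted to add them."""
    missing = []
    for part in msg.walk():
        fname = part.get_filename()
        if fname and any(part[f"X-FIS-{f}"] is None for f in REQUIRED):
            missing.append(fname)
    return missing

msg = EmailMessage()
msg.set_content("Please find attached the draft.")
msg.add_attachment(b"...", maintype="application", subtype="msword",
                   filename="report.doc")
# The attachment carries no metadata yet, so it is flagged.
flagged = unenriched_attachments(msg)
```

A message with no attachments, or whose attachments carry the required fields, passes silently, keeping the module unobtrusive for the common case.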
The main purpose of the local tools is to provide for storage of and access to documents previously enriched with metadata. The two tools we describe here will validate the metadata and organize access to and storage of the documents according to the metadata values and the ontology they express.
Since we have identified shared directories and email as the two main applications used by FIS officers, the local tools will integrate with them to provide their services behind the scenes, without breaking the normal work habits. Accordingly, the local tools will be:
· A folder monitor, i.e. a daemon application that checks for modifications to a folder and its subfolders (i.e., the shared directory) and validates their metadata content. Whenever an officer creates a new folder or changes its content by copying in a new document or removing an old one, the application notes the metadata content of the new document and validates it. Lack of necessary metadata will cause the generation of an e-mail message to the officer and the flagging of the document as incomplete. The upload of multiple copies of the same document, even at different stages of evolution, will accordingly be noted and tracked. Identification of documents and versions thereof will happen based on the metadata rather than the file name, ensuring a safer and more reliable organization. Furthermore, by a systematic use of Windows shortcuts, the monitor will create a new hierarchy of folders organized by date, topic, location, author, etc., whereby the document will be referenced and can be accessed in all these locations according to the relevant values. Since the ontology also includes confidentiality values, the monitor will autonomously grant access and/or visibility to a document only to those users that are allowed to see and/or read it.
· A mail agent, associated with an e-mail address, which stores on the shared directory all documents it receives via e-mail. By adding the agent's address to every mail message and every mailing list used within the organization, the agent will regularly receive a copy of all documents exchanged among the members of a mailing list or in private e-mail conversations, and will be able to archive them and make them accessible via the shared directory. Since the ontology also includes confidentiality values, the agent will likewise grant access and/or visibility to a document only to those users that are allowed to see and/or read it. The mail agent is thus strictly connected to, and possibly integrated with, the folder monitor.
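One core duty of the folder monitor, recognizing copies of the same document under different names (cf. "report.doc" vs "June2004.doc" in the scenario of section 2.4.1), can be sketched as a single polling pass over the shared directory. A real monitor would be an event-driven daemon and would compare the HashValue recorded in the ontology; hashing the raw file content here is a simplification.

```python
import hashlib
import os
from collections import defaultdict

def scan(shared_dir):
    """One polling pass of the folder monitor: index every file by the
    SHA-256 of its content, then report content that appears under
    more than one path, whatever the file names are."""
    by_hash = defaultdict(list)
    for root, _dirs, files in os.walk(shared_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            by_hash[digest].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

# duplicates = scan("/srv/shared")   # hypothetical mount point
```

Each group of paths returned shares identical content, so the monitor can flag them as one document rather than several, regardless of how users have named or filed the copies.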
The main purpose of the central tools is to verify that the documents being uploaded from the local offices around the world are correctly enriched with metadata and are acceptable according to the stricter archival rules put in place at the central repository. Furthermore, they act as translators and extractors of metadata from the actual documents.
· A checkpoint application, placed at the border of each repository, that checks for errors or incompleteness in the metadata and knows how to alert the owner of the document. Furthermore, it translates the metadata according to whatever system is to receive the document: whenever two different ontologies are put in contact (e.g. the FIS ontology and the UN-ARMs metadata schema), the tool will provide the information in the format and language needed by the receiving schema.
· A downloader/uploader application that helps in setting up the local repository and regularly synchronizes the local content with the central repository. The synchronization is necessary to keep the local and the central repositories aligned in their contents, and to allow the central repository to perform regular and reliable backups of the content of the local repositories, thus guaranteeing persistent and long-term safekeeping of local data.
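The translation duty of the checkpoint application can be sketched as a field mapping. The correspondence table below is an assumption for illustration only: the target names merely look Dublin-Core-like and are not the actual UN-ARMs fields; a real checkpoint would also convert values, not just rename keys.

```python
# Hypothetical correspondence between FIS ontology fields and a
# receiving schema (illustrative names, not the actual UN-ARMs ones).
FIELD_MAP = {
    "Title": "dc:title",
    "Language": "dc:language",
    "Author": "dc:creator",
}

def translate(fis_metadata: dict) -> dict:
    """Rename the fields the receiving schema understands and report
    those with no counterpart, so the document owner can be alerted
    about information that would be lost in the exchange."""
    out, untranslated = {}, []
    for key, value in fis_metadata.items():
        if key in FIELD_MAP:
            out[FIELD_MAP[key]] = value
        else:
            untranslated.append(key)
    return {"translated": out, "untranslated": untranslated}

result = translate({"Title": "June report", "Author": "Alice",
                    "HashValue": "ab12..."})
```

Fields that do not map onto the receiving schema are reported rather than silently dropped, which is the checkpoint's alerting role in a nutshell.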
The architecture and the tools illustrated in this document are designed to have a minimal impact on current work processes and habits. On the other hand, they do have some impact that needs to be considered. We have identified five such processes.
For the metadata tools to work correctly, and to suggest appropriate values whenever possible, they have to be configured correctly at the outset. Failure to install and configure the operating system, the editing applications and the metadata tools correctly will inevitably result in improper, incorrect and incomplete information, and in additional time spent making up the missing information and correcting the wrong data.
Among the configurations that need to be performed correctly, it is worth mentioning:
· The operating system must be correctly installed and configured.
· The system date must be exactly specified, including the time zone. Whenever a user moves to a different office, the location and the time zone need to be reset correctly.
· Each officer must be given a personal email address, and must be instructed not to share it with anyone else.
· Every application must be properly installed and configured for the officer who will actually use it. Using a dummy name in the installation of an Office application, for instance, will result in the same dummy name being proposed as the author of the documents.
· The address of the automatic mail agent must be added to all the relevant mailing lists.
· The user tools must be properly configured to suggest appropriate values. These include the place names, people and organizations with which each officer is in closest contact, so that these values can be given the proper prominence over the others.
· The downloader/uploader application must be properly configured according to the needs and characteristics of each local office. In particular, the frequency of automatic backup must be decided according to the network type and the availability of connectivity with the central server, but also to the number and frequency of updates in the documents stored locally, and the risk of losing local hardware due to theft, fire or other unpredictable events.
The metadata editing application will have the most visible impact on the officers. It is important that they are correctly instructed to fill in the relevant fields of the metadata form, and that they update them every time some content in the document changes.
In most cases, the system will propose a number of values for all the important fields. It is important that the officer actually reviews them and performs any appropriate change. Failure to do so makes the metadata stored in the document outdated and ultimately useless, if not downright harmful.
Having the system do its best to suggest values must not become an excuse for not verifying these suggestions and actively modifying the wrong ones. The temptation to simply accept whatever the system proposes will be strong, but each individual must be instructed to resist it and do his/her best to verify and correct the proposed values.
Every time an officer sends a relevant message via email, he/she needs to add the mail agent as one of the addressees of the message, especially when the message contains an attachment.
This address has to be specified every single time, even when the attachment has been sent around before. The mail agent will be sufficiently smart to recognize that the documents are actually the same, and will only store one copy.
Of course, all mailing lists will include this address among their members, so that when writing to a list one could even omit the mail agent. But it will always be better to include the mail agent twice than never, as the agent is capable of recognizing duplicates in both messages and attachments.
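The duplicate recognition described above can be sketched by keying stored attachments on a content hash; the class and method names here are hypothetical:

```python
import hashlib

class MailAgentStore:
    """Keeps one copy of each distinct attachment, recognized by content hash."""

    def __init__(self) -> None:
        self._by_digest: dict[str, bytes] = {}

    def store(self, attachment: bytes) -> bool:
        """Store the attachment; return False if an identical copy was already kept."""
        digest = hashlib.sha256(attachment).hexdigest()
        if digest in self._by_digest:
            return False
        self._by_digest[digest] = attachment
        return True
```

Because the key is derived from the content itself, receiving the same attachment twice (once directly, once via a mailing list) never produces two stored copies.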
The users will always be able to create new folders within the shared directory, and organize their content according to needs and time. The folder monitor will notice that new content has been added to the shared folder, and will examine it and create the additional views on the new material.
Of course the views (folders within a folder named "FIS" or some such name) will be readable but not writable: users will be expected to access the original folders to modify their content.
Whenever a user uploads documents that are not provided with the relevant metadata, the folder monitor will create an email to the user asking him/her to update the submission with the relevant metadata. It is important that the users actually do so, for the documents to be correctly managed by the system and correctly inserted in the archival system.
Uploading a document without its metadata has to be absolutely discouraged, since it would void the usefulness of the whole system.
The system is based on a number of predefined settings that are used for suggesting metadata and organizing the views on the shared folders. For instance, given an email address or a name, the metadata editor will be able to infer the person this data refers to, provided the information is available in the predefined settings.
When information about some secondary entity is not available, the system will prompt the user to provide all the relevant properties and relationships (but will not force the user to provide them). The new data will be automatically added to the main knowledge base of the whole system, ready to be uploaded to the central server.
The central server will therefore collect all new information coming from the local offices, and create new versions of the knowledge base, which will be distributed to all local offices.
Every user must, as often as possible, verify that a new knowledge base is available and download it to his/her personal computer, so that the new data is immediately available. This task cannot be performed automatically, since there is no daemon on the users' machines taking care of the verification; thus it will be up to the user to start the update process.
The two scenarios described in section 2.4 can now be discussed again under the hypothesis that the new tools are available and correctly used.
After having done that, she thinks it would be better if Bruce, from another agency, took part in the meeting as well, since he has just arrived and this could be a chance for him to meet and get to know all the people. Bruce and one of the addressees of the mailing list, Carla, receive the same message as two different e-mails.
Upon receiving the message, Carla copies the report into a folder on her laptop, called "monthly reports", and renames it "June2004.doc", to avoid confusion with the other files in the same place. She then reads and modifies the report, subtly changing the words in some paragraphs, saves it and sends it back to Alice as a new version of the same document. Then she remembers that Bruce, who has just arrived, has the expertise needed to provide feedback on the report, and thus sends an e-mail to him, asking for his opinion on the document "June2004.doc".
Therefore, Bruce finds two files on his computer, one called "report.doc" and the other "June2004.doc". A rapid check of their metadata is enough to understand that they are in fact two versions of the same document, and that June2004.doc is the more recent one and has been modified by Carla. By starting the "Compare Documents" feature of Microsoft Word, he is then able to check what has actually changed and whether the changes are worthwhile and can be included in the final version of the document.
The overall task has taken but a few seconds, and no time or effort has been wasted.
After a few months,
Furthermore, she decides that David, her substitute, will need a crash introduction to the issues that are still open, and thus she creates an additional folder on the server, "Files for David", copies there all the important files that need to be closely examined by him, and renames them with meaningful names. She then sends an e-mail to David telling him that the files he needs to learn about the local situation can be found on the server.
When David arrives, he logs on to the server and examines the content of the shared directory. He checks the "FIS" folder, finds the "addressee" subfolder, and within it the "David" folder. Only a few documents are stored there and need to be read, saving a huge amount of time.
David is not content yet, since the names of these files reflect the organization of
Yet, when examining the files in the folder, he immediately notices that they are sorted and available according to their real dates, regardless of the file names. Thus he immediately discovers the problem and proceeds to rename the files correctly.
As a result, David spends little or no time at all looking for the files or reading useless documents, and can concentrate on the relevant ones. The overall task has taken but a few minutes, and no time or effort has been wasted.
In this document we have outlined the need for, and the usefulness of, handling metadata structures in the daily processes of document exchange among FIS officers.
This document has underlined the importance of metadata in search, organization, archival and daily management of documents in a complex organization, and has presented strong arguments in favor of adopting an ontology rather than a metadata schema, so that the metadata values can assume deeper meanings and more sophisticated applications can be implemented.
Next, it has stressed the need for specialized tools, in particular for the specification of the metadata values embedded in those document types that allow such additions. The tools need to be implemented within a system architecture that allows the peculiarities of the FIS work processes, in particular those relying on shared directories and email messages, to keep on existing.
The tools need to be implemented so that they interfere as little as possible with the daily activities of FIS officers, working behind the scenes most of the time and providing useful suggestions and predefined values as systematically as possible. Yet, for the little impact that these tools may have, this document stresses how important it is that the officers fill in the relevant values correctly and as specifically as possible, as they will be vital to the correct functioning of the overall system.
This white paper must be followed by a full and detailed design of the mentioned tools, specifying in full the functionalities and the usability requirements that they need to fulfill. Upon acceptance of the design, a prototype implementation and complete user testing need to be performed. The user testing will be done with real users working in a local FIS office, and careful attention will be given to their daily interaction with the new tools, in order to improve their usability, unobtrusiveness and completeness. After the end of this phase, a full-scale implementation and deployment is most probably in order.
Appendix A: A comparison of relevant metadata schemas

In this section we provide a synoptic comparison of the metadata schemas proposed by a number of bodies relevant to the FIS organization. They are the Dublin Core, the UN-ARMS standard on recordkeeping metadata, the OCHA IT section proposal dated
Elements in bold represent first-class elements, which can be further subdivided into sub-elements or subparts of the same element. Whenever appropriate and discoverable, fields expressing the same kind of metadata have been aligned. Of course, there might be instances where fields do actually mean similar or identical things, but we could not discern this from their descriptions.
It should be noted that the OCHA/FIS proposal shows an additional organization of the fields into four super-elements, "Content Properties", "Intellectual Properties", "Instantiation Properties", and "Metadata Properties", which appears extremely useful yet dissimilar from all other proposals. This distinction has been eliminated in the synoptic view presented here, but it has been an important organizing tool in our proposal described in Appendix C.
| Dublin Core | UN-ARMS | OCHA IT Section | OCHA/FIS |
| Identifier | Identifier | Identifier | ID |
| bibliographicCitation | | | |
| | System ID | ID | |
| | Fileplan ID | Record ID | |
| | | Version | |
| | UN Record reference | | |
| | | | Filename/URL |

| Title | Title | Title | Title |
| Alternative | | Alternative | |

| Subject | Subject | Subject | Subject |
| | | OCHA classification | |
| | | | Topic |
| | | | Keywords |

| Source | | | Source |

| Description | Description | Description | Description |
| abstract | | | |
| tableOfContent | | | |

| Coverage | | Coverage | Coverage |
| Spatial | | | Spatial |
| Temporal | | | Temporal |
| | | | Box_Extent |
| | | Continent | |
| | | Region | |
| | | Country | |
| | | Admin 1 | |
| | | Admin 2 | |
| | | Admin 3 | |
| | | Admin 4 | |
| | | Place Name | |
| | | Settlement Type | |
| | | | Spatial Representation |
| | | Representation Type | Representation Type |
| | | Reference System | Reference System |
| | | Resolution Scale | Resolution scale |
| | | Resolution Cellsize | Resolution cellsize |
| | | North Limit | |
| | | East Limit | |
| | | South Limit | |
| | | West Limit | |

| | | Source | IP |
| Creator | Creator | Creator | Creator |
| Publisher | | Publisher | Publisher |
| Contributor | | Contributor | Contributor |

| Rights | | Rights | Rights |
| accessrights | | Access rights | |
| license | | Responsible | |
| rightsholder | | Copyright | |

| date | Date | Date | Date |
| created | Created | Created | created |
| available | | Published | available |
| modified | | Modified | |
| datecopyrighted | | Copyrighted | |
| dateaccepted | | | |
| datesubmitted | | | |
| issued | | | |
| valid | | | |
| | Acquired | | |
| | Declared | | |
| | Opened | | |
| | Closed | | |
| | Cut-off | | |
| | | Start | |
| | | End | |

| Provenance | | | |

| Audience | Addressee | Audience | |
| | | Addressee | |
| | | Business Area | |
| education level | | | |
| mediator | | | |

| Type | Type | Type | |
| | | Object Type | |
| | | Content Type | |

| | | Status | |

| | | Priority | |

| relation | Relation | Relation | |
| conformsTo | | | |
| hasformat | | | |
| haspart | | | |
| hasversion | | | |
| isformatof | | | |
| ispartof | | | |
| isreferencedby | | | |
| isreplacedby | | is replaced by | |
| isrequiredby | | | |
| isversionof | | | |
| references | | | |
| replaces | | replaces | |
| requires | | | |
| | | relates to | |
| | Copy | | |
| | Child | | |
| | Parent | | |
| | Reason for redaction | | |
| | Rendition | | |
| | See Also | | |
| | Paper Folder | | |

| language | Language | Language | Language |

| | Location | Location | |
| | Home location | | |
| | Current location | | |
| | | Virtual | |
| | | Physical | |

| | Security & Access | | |
| | Marking | | |
| | Descriptor | | |
| | Marking expiration | | |
| | Custodian | | |
| | Individual Access list | | |
| | Group Access list | | |
| | Previous marking | | |
| | Change date | | |
| | Disclosability | | |

| | Disposal | Disposal | |
| | Schedule ID | | |
| | Action | | |
| | Time period | | |
| | Driving event | | |
| | External event | | |
| | Disposal date | Disposal date | |
| | Authorized by | Authorized by | |
| | Comment | | |
| | Transfer Destination | | |
| | Transfer Status | | |
| | Review date | Review date | |
| | Review Comments | Review Comments | |
| | Last review date | | |
| | Reviewer details | | |
| | Reviewer comments | | |

| Format | Format | Format | |
| extent | | File size | |
| medium | | Storage format | |
| | | # words | |
| | | | Image size |
| | | width | width |
| | | height | height |
| | | resolution | |
| | | duration | |
| | | encoding | |
| | | Display size | |
| | | | DB Instantiation |
| | | | Attribute Info |
| | | | Sample size |

| | Preservation | | |

| | Function | | |
| | Function | | |
| | Activity | | |
| | Third Level | | |

| | Aggregation | | |

| | | | Metadata specific |
| | | | Stamp |
| | | | Author |
| | | | CharacterSet |
Appendix C: The proposed ontology

In this appendix we present the ontology we suggest implementing for the needs of the FIS offices.
The ontology we propose in this section draws heavily from the metadata schemas presented in Appendix A, for two reasons: on the one hand, if a schema has expressed the need for some field, most probably there is a need for its values in some application we might not be aware of; on the other, if a metadata collection needs to be converted to or from one of these schemas, it is important to provide guidance in the conversion, and thus an explicit mention of the correct use of every value must be provided. We will be careful not to make the most esoteric fields mandatory, but more or less all of them should be represented in some form or another.
A finer distinction is needed than the usual one between mandatory, optional and automatic fields. In this ontology we propose three independent constraints for the description of metadata fields: optionality, authorship and cardinality.
The optionality specifies whether a value needs to be included in the data for the metadata structure to be correct. It does not imply that the author, or indeed any human, has to explicitly determine the value and write it down in any form: in fact, in most cases, mandatory fields will be added automatically by the editor or some other applications. Possible values for optionality are:
· Mandatory: a value is necessary for the structure to be correct, and at least one value must be provided at all costs. The system will not accept metadata structures lacking the value.
· Recommended: a value always exists for the field and should be specified for the complete management of the metadata structure, but in some cases it might be much too troublesome and exacting on the system or the user to find out at the moment. The system will accept structures without such values, but will raise warnings and list the metadata structure as defective and incomplete.
· Optional: a value does not always exist for all documents, but in some cases it is necessary for correct management of the structure. The system will accept structures without such values, and some functionality may not be available for documents lacking this data.
· Accessory: a value may or may not exist for this field, but its absence does not prevent any fundamental use of the metadata structure. If the system or the user has a value to add, this field is the right place to do so, but nothing is hurt if no value is specified.
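The four optionality levels translate naturally into validation behavior: missing mandatory fields make a structure unacceptable, while missing recommended fields only raise warnings. A sketch, where the schema mapping and field names are illustrative:

```python
from enum import Enum

class Optionality(Enum):
    MANDATORY = "mandatory"
    RECOMMENDED = "recommended"
    OPTIONAL = "optional"
    ACCESSORY = "accessory"

def validate(record: dict, schema: dict[str, "Optionality"]) -> tuple[list[str], list[str]]:
    """Return (errors, warnings): missing mandatory fields are errors that make
    the structure unacceptable; missing recommended fields only raise warnings."""
    errors = [f for f, opt in schema.items()
              if opt is Optionality.MANDATORY and f not in record]
    warnings = [f for f, opt in schema.items()
                if opt is Optionality.RECOMMENDED and f not in record]
    return errors, warnings
```

Optional and accessory fields are deliberately ignored by the validator: their absence never degrades the status of the metadata structure.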
The authorship specifies who or what is in charge of providing the corresponding value for the field. It does not imply that a value must be provided, but it specifies whether it can be computed, suggested or found out and by whom. Possible values for authorship are:
· Fully automatic: the system has enough information to derive the value by its own initiative. The user may or may not be informed of such a field, but has no chance of modifying the value proposed by the system.
· Derived: the system has drawn the value from a related entity. The value by itself cannot be modified, but the related entity can be changed by the user, correspondingly modifying the derived value.
· Suggested: the system has a number of ways to propose a value (defaults, similar documents in the past, inference from content, filename, preferences, etc.). As much as possible the value will be correct, but the user is meant to check the proposed value and modify it in case it is wrong.
· Anyone: any individual or system in the chain of management of the document can add the information, which does not have to be known at save time. The author is invited to provide a meaningful value, but he/she might not be the most competent person to provide such information.
· Fully manual: the author, at save time, is the most appropriate source for a value for the field. The system cannot even suggest a value, except by providing the last entered value in a similar document, with no reliable way to tell whether it is correct or not.
As for cardinality, all fields of the proposed structure are multiple: they may be repeated as many times as needed. Depending on the type of field, the set of equally named fields may mean that all of them are to be considered correct values collectively, or that any of them can be considered a correct value individually. For instance, the author relationship needs to be considered collectively (if several authors are specified, then all of them are authors), while the identifier property individually (if several identifiers are specified, then any of them can be used as an identifier).
A set of tools proposing to impact as little as possible on the daily activities of busy individuals will try to use as many fully automatic fields as possible, and as few fully manual fields as possible, especially for non-accessory elements. By doing so, the metadata editing process using the appropriate tool should be reduced to the examination of some suggested values and the modification of the few that are incorrect, for a total of some tens of seconds overall.
The ontology we propose here contains, as mentioned, entities (or classes) that are composed of properties and associations. A property refers to a string, while an association refers to another entity in a relational way. String values must not be empty, and in some cases they are subject to constraints, such as being numbers or dates. Values for associations need to be identifiers of entities, and it is an error to specify an association to a missing entity. Both properties and associations have a label (the name of the property or association) and may have a further specifier adding detail to the actual label.
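The integrity rule on associations (no association may point to a missing entity) can be sketched as:

```python
def dangling_associations(entities: dict[str, dict],
                          associations: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Given a map of entity identifiers and a list of (source id, target id)
    associations, return those whose target entity is missing, which the
    ontology treats as an error."""
    return [(src, tgt) for src, tgt in associations if tgt not in entities]
```

A checkpoint application would refuse a metadata structure for which this list is non-empty.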
The ontology is composed of XXX main classes and XXX secondary ones. Each class is described in the following pages.
Document

Description
An abstract concept for a collection of meaningful data. A document does not exist in reality, but is related to one or more Files that provide concrete existence to the document. This is connected to the Content/Instantiation specification presented by the OCHA/FIS metadata structure.
Many properties and associations one would think necessary for a document are in fact attributed to files, whenever they may change from version to version. This in particular is true for dates and authors.
Properties
A string uniquely identifying the document. This identifier is shared by all the instances (e.g. versions) of the same document.
constraints | Mandatory, automatic, individual
format | Opaque string
specifier | ID, UN Record reference
The name given to the resource. Typically, the title will be a name by which the resource is formally known. This value is shared by all the instances (e.g. versions) of the same document.
constraints | Recommended, suggested, individual
format | Any string
specifier | Alternative
An account of the content of the document. Description may include but is not limited to: an abstract, table of contents, reference to a graphical representation of content or a free-text account of the content.
constraints | Recommended, suggested, individual
format | Long string
specifier | Abstract, table of content
The human language of the content of the document or of its metadata.
constraints | Mandatory, suggested, collective
format | Two-letter code (RFC 3066)
specifier | Content, metadata
The whole set of information related to the rights management for the document
constraints | Optional, suggested, collective
format | Determined by the specifier
specifier | AccessRights, License
Associations
A number of electronic entities that represent the instantiation of a version or variant of the document
related to | File
constraints | Recommended, suggested, collective
specifier |
The lack of a file association makes the current document a virtual document, as in the case of a network document that is not stored in any form on the system, or a document that has yet to be written.
Any number of terms related to the content of the document, as taken from a controlled vocabulary
related to | Term
constraints | Recommended, suggested, collective
specifier | Type, Classification, Keyword, Topic, Audience, Priority
The association subject collects a number of metadata fields as specified in other metadata schemas, in particular all those fields that represent qualities and characteristics of the document that are taken from different controlled vocabularies.
A direct reference to a related document.
related to | Document
constraints | Optional, suggested, collective
specifier | References, replaces, requires, ispartof
Inverse relations are reversed direct relations and belong to the related document. Thus the specification that document A is referenced by document B is transformed into an association according to which document B references document A.
The extent or scope of the content of the resource. Coverage will typically include spatial location (a place name or geographic coordinates), temporal period (a period label, date, or date range) or jurisdiction (such as a named administrative entity).
related to | Location
constraints | Optional, suggested, collective
specifier | Spatial, temporal
The whole set of information regarding security, disposal and preservation of the document. This association is directly derived from the UN-ARMS elements "Security & Access", "Disposal" and "Preservation". They will be developed should the need arise.
related to | Management
constraints | Optional, suggested, individual
specifier | Access, disposal, preservation
File

Description
The actual entity that contains a specific version or variant of a document. This is a real object with a location and concrete accessible content. This is connected to the Content/Instantiation specification presented by the OCHA/FIS metadata structure.
Properties
A string uniquely identifying the file. This identifier is specific to each single version or variant of the document.
constraints | Mandatory, automatic, individual
format | Opaque string
specifier | Record ID
A numerical value that is assigned to the file so that each version and variant can be identified uniquely. The version number has the property that if the creation date of file X of a document is subsequent to the creation date of file Y of the same document, then the version number of X is greater than the version number of Y.
constraints | Mandatory, automatic, individual
format | Opaque string
specifier |
A numerical value that is computed to uniquely identify the content of a file. This value takes into consideration all the bytes of the content to determine a unique value: files with different HashValues are different files, regardless of what the metadata associated to it says.
constraints | Mandatory, automatic, individual
format | Opaque string
specifier |
The position(s) where a copy of the file can be found. If the file is electronic, then the filename or URL needs to be specified. If it is a physical document, then the name of the physical location must be specified.
constraints | Mandatory, suggested, individual
format | Filename or URL or descriptive string (for physical locations)
specifier | Filename, URL, physical
A date associated with an event in the life cycle of the resource.
constraints | Recommended, suggested, collective
format | Date (ISO 8601: YYYY-MM-DD)
specifier | Created, modified, published, copyrighted, removed
The current evolution status of the document that the file represents
constraints | Mandatory, suggested, individual
format | "draft", "complete", "commented", "published", "disposed"
specifier |
Information about the format of the resource.
constraints | Recommended, suggested, individual
format | Determined by the specifier
specifier | StorageFormat (MIME type), Size (KB), Dimensions (height x width x resolution), Duration (seconds), Encoding (MIME type)
An account of the content of the individual files. Description may include but is not limited to: an abstract, table of contents, reference to a graphical representation of content or a free-text account of the content.
constraints | Accessory, suggested, individual
format | Long string
specifier |
Associations
The identifier of the document this file is a version or variant of.
related to | Document
constraints | Recommended, automatic, individual
specifier |
The lack of a document association makes the current file an orphan, which is an error condition that needs to be signaled to authors and document managers.
Any individual or organization who has a role in the generation of the document. Organizations can be authors of a document only if no individual can be identified as the real author.
related to | Person, Organization
constraints | Mandatory, suggested, collective
specifier | Creator, Publisher, Contributor, Rightsholder
Person

Description
An individual that is responsible for some action on documents or terms. The list of persons is meant to be provided with the system and automatically updated according to changes and needs. Users are not meant to modify it on a daily basis.
Properties
A string uniquely identifying the person.
constraints | Optional, manual, individual
format | Opaque string
specifier |
The name of the person
constraints | Optional, manual, individual
format | String
specifier | FirstName, Surname, MiddleName, MiddleInitial, Title, Suffix
An individual contact point for the person. If the person shares a contact point (e.g. an email address or a telephone) with other employees of the same organization, then such contact property should be specified for the organization only.
constraints | Optional, manual, individual
format | String
specifier | Email, telephone, address
The gender of the person. This is to allow for gender-related customization of text.
constraints | Recommended, manual, individual
format | "M", "F", "U" (for unknown)
specifier |
Associations
The identifier of the organization to which the person belongs.
related to | Organization
constraints | Recommended, manual, collective
specifier | Role
The current physical location of the person as much as it can be determined
related to | Location
constraints | Recommended, manual, collective
specifier |
Organization

Description
An organization that takes part to the FIS activities (including FIS itself). The list of organizations is meant to be provided with the system and automatically updated according to changes and needs. Users are not meant to modify it on a daily basis.
Properties
A string uniquely identifying the organization.
constraints | Optional, manual, individual
format | Opaque string
specifier |
The name of the organization
constraints | Optional, manual, individual
format | String
specifier | Short, Full, Official
A contact point for the organization. If a person shares a contact point (e.g. an email address or a telephone) with other employees of the same organization, then such contact property should be specified for the organization only.
constraints | Optional, manual, individual
format | String
specifier | Email, telephone, address
Associations
The identifier of the organization of which this organization is a substructure.
related to | Organization
constraints | Optional, manual, collective
specifier |
Term

Description
A simple word or list of words clearly and unambiguously determining a concept. Terms belong to vocabularies and relate to each other in terms of the relationships specified for thesauri (see section 1.2). The list of terms is meant to be provided with the system and automatically updated according to changes and needs. Users are not meant to modify it on a daily basis.
Properties
The word(s) of the term
constraints | Mandatory, manual, individual
format | String
specifier |
The full expansion of the concept justifying the term. It may make sense to specify it only for preferred terms (i.e., those without a USE relationship), but it is possible to specify it for non-preferred terms, too.
constraints |
Optional, manual, individual |
format |
String |
specifier |
expansion, description |
q Reference
An optional URL referencing a document where the term is introduced or explained in detail.
constraints | Accessory, manual, individual
format | URL
specifier |
Associations
o RelatedTerms
Any other term this term is related to, according to thesaurus-specific relationships.
related to | Term
constraints | Optional, manual, individual
specifier | BT, NT, USE, UF, RT, TT (see section 1.2 for details)
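The relationship codes above allow, for example, a non-preferred term to be resolved to its preferred form by following USE links. The sketch below assumes a simple in-memory representation; the class layout, the `preferred` helper and the example terms are illustrative, not part of the proposal:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Term:
    form: str                         # the word(s) of the term
    definition: Optional[str] = None  # expansion or description
    reference: Optional[str] = None   # optional URL introducing the term
    # RelatedTerms, keyed by relationship code: BT, NT, USE, UF, RT, TT
    related: Dict[str, List["Term"]] = field(default_factory=dict)

def preferred(term: Term) -> Term:
    """Follow USE links until a preferred term (one without USE) is reached."""
    while term.related.get("USE"):
        term = term.related["USE"][0]
    return term

flood = Term(form="flood", definition="An overflow of water onto normally dry land")
inundation = Term(form="inundation", related={"USE": [flood]})
print(preferred(inundation).form)  # -> flood
```

A preferred term is recognizable purely by the absence of a USE relationship, which matches the convention stated for the Definition property.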
Location
Description
The spatial or temporal characteristics of a given event, place or object. The list of locations is meant to be provided with the system and automatically updated according to changes and needs. Users are not meant to modify it on a daily basis.
Properties
q Identifier
A string uniquely identifying the location.
constraints | Mandatory, automatic, individual
format | Opaque string
specifier |
q Name
The name of the location.
constraints | Recommended, manual, individual
format | String
specifier |
q SpatialCoordinate
A value in some recognized metric for the spatial identification of the location. The SpatialCoordinate relies on an absolute metric (e.g., longitude and latitude).
constraints | Optional, manual, collective
format | String or coordinate
specifier | Coordinate
q TemporalCoordinate
A value in some recognized metric for the temporal identification of the location. The TemporalCoordinate is identified by its date or by the range in which it existed.
constraints | Optional, manual, collective
format | Date
specifier | From, To
Associations
Other locations to compare this with, for better identification
related to |
Term |
constraints |
Optional, manual, individual |
specifier |
Continent, Region, Country, AdminStructure |
o Type
The type of event, place or object being described.
related to | Term
constraints | Optional, manual, individual
specifier | Continent, Region, Country, AdminStructure
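As a sketch only, the Location entity can be represented as a record in which the SpatialCoordinate is a latitude/longitude pair, one of the "recognized metrics" the ontology allows. The class name, field names and example values below are assumptions for illustration:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

@dataclass
class Location:
    # Properties, as defined above
    identifier: str                                # opaque, unique string
    name: Optional[str] = None
    spatial: Optional[Tuple[float, float]] = None  # (latitude, longitude)
    temporal: Optional[Tuple[date, date]] = None   # specifier: From, To
    # Associations
    related_location: Optional[str] = None         # e.g. the containing country
    type: Optional[str] = None                     # kind of event, place or object

geneva = Location(identifier="loc-gva", name="Geneva",
                  spatial=(46.20, 6.15), related_location="loc-ch")
```

As with Organization, the association fields hold only identifiers, so centrally maintained location lists can evolve without touching individual records.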
(Management)
Description
The whole set of information regarding the security, disposal and preservation of the document. This entity is directly derived from the UN-ARMS elements "Security & Access", "Disposal" and "Preservation". It will be developed should the need arise.
Properties
Associations
[1] http://lsi.research.telcordia.com/lsi/papers/execsum.html
[2] http://www.ukoln.ac.uk/metadata/desire/overview/rev_pre.htm
[3] http://www.current.org/tech/tech0209metadata.html
[4] http://www.abanet.org/lpm/lpt/articles/ftr07044.html
[5] http://www.montague.com/abstracts/iceberg.html
[6] http://www.sadc-fanr.org.zw/rrsu/clrnghse/metadata1.htm
[7] http://www.montague.com/review/meta.html
[8] http://www.cio-dpi.gc.ca/im-gi/references/meta-standard/meta-standard_e.rtf
[9] http://www.fcw.com/supplements/dwkm/2002/sup-meta-08-05-02.asp
[10] http://dublincore.org/documents/usageguide/
[12] http://www.alphaworks.ibm.com/contentnr/semanticsfaqs
[13] http://www.w3.org/2001/12/semweb-fin/w3csw
[14] http://www.w3.org/TR/webont-req/
[15] http://www.alphaworks.ibm.com/contentnr/semanticsfaqs
[16] http://www.amazon.com/exec/obidos/tg/browse/-/3654