The Information Infrastructure Technology and Applications (IITA) Working
Group, the highest level of the country's National Information Infrastructure
(NII) technical committee, held an invited workshop in May 1995 to define a
research agenda for digital libraries. The shared vision is an entire net of
distributed repositories in which objects of any type and any size can be
organized and searched within and across different indexed collections.
The
ultimate goal, as described in the IITA report, is the Grand Challenge of
Digital Libraries:
"deep semantic interoperability -- the ability of a user to access, consistently and coherently, similar (though autonomously defined and managed) classes of digital objects and services, distributed across heterogeneous repositories, with federating or mediating software compensating for site-by-site variations...Achieving this will require breakthroughs in description as well as retrieval, object interchange and object retrieval protocols. Issues here include the definition and use of metadata and its capture or computation from objects (both textual and multimedia), the use of computed descriptions of objects, federation and integration of heterogeneous repositories with disparate semantics, clustering and automatic hierarchical organization of information, and algorithms for automatic rating, ranking, and evaluation of information quality, genre, and other properties.""The use of computed descriptions of (multimedia) objects" and "clustering and automatic hierarchical organization of information" present pressing scientific and engineering problems that have a significant potential impact on the US society in this era of the Internet and distributed, multimedia computing.
The traditional approach to creating classification systems and knowledge
sources in library science and classical AI is often considered top-down since
knowledge representations and formats are pre-defined by human experts or
trained librarians and the process of generating knowledge is structured and
well-defined. A complementary bottom-up approach to knowledge creation has been
suggested by researchers in machine learning, statistical analysis, and neural
networks.
General-purpose clustering algorithms have been in existence since
the 1960s. Hierarchical and non-hierarchical clustering algorithms have been
developed primarily for numeric analysis purposes. Some of these algorithms have
then been adopted in digital library applications. Several factor analysis based
techniques such as latent semantic indexing (LSI) and multi-dimensional scaling
(MDS) have also been adopted in textual analysis. More recently, neural network
based self-organizing map (SOM) textual classification systems have been
reported by Chen et al. in several digital library applications. Clustering and
classification techniques for real-life, large-scale collections are critically
needed for developing knowledge structures for next-generation digital
libraries.
Most of the more robust clustering algorithms (e.g., Voorhees
method, Ward's method) are computationally intensive, and they often are O(N^2)
or O(N^3) in complexity, where N is the number of objects to be clustered.
Digital library science is data-intensive, one in which performance measure is
often based on the amount of data (number of objects or size of collections).
The computational complexity of such algorithms has caused severe engineering
problems, especially for mid-to-large-scale applications (from millions to
billions of digital records, TBs collection). The largest calculation that we
were able to perform on a 32-node Covex Exemplar and a 32-node SGI Origin2000
(2GBs RAM and 200 GBs of disk), respectively, was based on the neural network
SOM algorithm for an Internet collection of about 100K web pages.
Many researchers have attempted to optimize clustering algorithms, especially
for sparse textual analysis applications. Dmitri and Chen developed a scaleable
SOM algorithm that took advantage of the sparseness of coordinates in document
input vectors and thereby reduced the SOM computational complexity by several
orders of magnitude. The resulting complexity of the algorithm is proportional
to the average number of non-zero coordinates in an input vector, instead of the
total number of input vector coordinates. We believe the same ``keyword
sparseness'' optimization principle observed in textual applications could be
applied in the optimization of other conventional clustering algorithms as well.
In addition, hardware advancement in recent years, especially the
development of shared memory multi-processors, has made large-scale,
data-intensive, parallel analysis a promising approach. We have recently
reported our experiment in using parallel supercomputers to analyze millions of
engineering abstracts and automatically generate engineering concept spaces
(thesauri). Implementation of algorithms developed in the 1960s and 1970s may be
feasible because of algorithm optimization and hardware advances.
Conventional approaches to addressing information overload and
interoperability problems are manual in nature, requiring human experts as
information intermediaries to create knowledge structures and/or classification
systems (e.g., the National Library of Medicine's Unified Medical Language
System, UMLS) in order to bridge vocabulary differences. As information content
and collections become even larger and more dynamic (thus rendering manual
knowledge structures more difficult to create), we believe a system-aided,
algorithmic, bottom-up approach to organizing digital content is needed.
One
of the critical scientific objectives of the digital library community for the
next several years is to:
"Develop scalable and efficient statistical and AI clustering algorithms to create classification systems based on large-scale (millions and billions) domain-specific digital library collections."To be able to automatically analyze digital objects from the order of 10^5 (100K) to the order of 10^9 (billion), a proportional 1,000 to 10,000 fold increase in computing resources is needed -- in cycles, RAM, and disk space. An intermediate phase of 10 to 100 fold increase in computing resources will enable computation for 10^6 (million) to 10^7 (10 millions) of digital objects.