In this study, the core set of WWW sites was derived
from a set of initial searches (using the DEC Alta Vista WWW search
engine [Digital Equipment Corp. 1995]). The focus of the search was geographic
information systems, earth sciences, and satellite remote
sensing. This area was chosen because of my familiarity with the
topic, and also because of the observation from the Inktomi analysis
that the most frequently referenced WWW location was the Xerox PARC
map browser (http://pubweb.parc.xerox.com/map/). To limit the initial
set of items, the search submitted to Alta Vista (using the
advanced search mode) was ``link:pubweb.parc.xerox.com/map AND
link:xtreme.gsfc.nasa.gov'', that is, a request for the set of WWW documents
containing links to both the Xerox map browser and the home page for
NASA's AVHRR (Advanced Very High Resolution Radiometer) remote sensing
projects.
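The query form quoted above can be sketched in a few lines of Python. This is only an illustration: the helper names `link_term` and `cocitation_query` are hypothetical, and stripping the `http://` prefix and trailing slash is an assumption inferred from the difference between the quoted URL and the quoted query.

```python
def link_term(url: str) -> str:
    """Turn a URL into a ``link:`` search term (hypothetical helper).

    Assumption: the scheme and trailing slash are dropped, as in the
    query quoted in the text.
    """
    target = url.strip()
    if target.startswith("http://"):
        target = target[len("http://"):]
    return "link:" + target.rstrip("/")


def cocitation_query(*urls: str) -> str:
    """Join one ``link:`` term per URL with AND, so that only documents
    linking to every target match."""
    if not urls:
        raise ValueError("at least one URL is required")
    return " AND ".join(link_term(u) for u in urls)


if __name__ == "__main__":
    print(cocitation_query("http://pubweb.parc.xerox.com/map/",
                           "xtreme.gsfc.nasa.gov"))
```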
This initial search resulted in a set of 115 WWW pages containing all or most of the elements. These were scanned and the apparently relevant pages were retrieved and stored for further analysis, yielding 43 pages in the areas of geography, GIS, Earth Sciences, and remote sensing. These included many ``bibliography'' pages from services like Yahoo, or those maintained by individuals interested in one or more of these topics. All of the links to other pages were extracted from these 43 pages and combined in a single file, which resulted in 7209 individual URLs. The URLs were sorted into alphabetical order and edited to eliminate links that occurred in fewer than 3 of the citing documents. Citations that appeared to be outside of the topical boundaries set for the study were also eliminated. The editing resulted in a set of 332 potential candidates for the ``core'' set. These were then retrieved and examined using the Netscape WWW browser, and appropriate sites were collected in a ``hotlist'', reducing the size of the core set to 125 WWW documents. This set was still considered too large, so the ``best'' sites of the set were selected (based on my own judgement, with frequent corroboration from various ``best of the Web'' awards given to some pages), reducing the final core set to the 34 sites listed in Table 1.
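The link extraction and frequency filtering described above might look roughly like the following Python sketch. It is not the procedure actually used in the study, which involved manual sorting and editing; the names `LinkCollector`, `extract_links`, and `candidate_urls` are hypothetical, and the pages are assumed to be available as HTML strings.

```python
from collections import Counter
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the distinct href targets of <a> tags in one page."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value.strip())


def extract_links(html):
    """Return the set of URLs linked from a single page."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def candidate_urls(pages, min_docs=3):
    """Return, in alphabetical order, the URLs cited by at least
    `min_docs` of the pages (each page counts a URL once)."""
    doc_counts = Counter()
    for html in pages:
        doc_counts.update(extract_links(html))
    return sorted(url for url, n in doc_counts.items() if n >= min_docs)
```

Counting each URL once per page matches the criterion in the text, which is based on the number of citing documents rather than the raw number of links. The topical screening and the final selection of ``best'' sites remain manual steps.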
Having obtained a core set of WWW sites in the area of Earth Sciences,
Remote Sensing and Geographic Information Systems, the next step was
to produce a raw cocitation matrix. This stage requires the ability
to search for ``citing documents'', that is, those with links to the
items in the core set, and also the ability to conduct the many searches
required (for any core set of size $N$, there are $N(N-1)/2$ searches
required, one for each pair of items in the core set).
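As a small illustration of the pairwise searches, the following Python sketch enumerates each unordered pair of core-set items exactly once. The function names `cocitation_pairs` and `pair_count` are hypothetical, and the closed form assumes that one search is made per unordered pair of distinct items.

```python
from itertools import combinations


def cocitation_pairs(core_set):
    """Return every unordered pair of distinct core-set items once."""
    return list(combinations(core_set, 2))


def pair_count(n):
    """Number of unordered pairs among n items: n(n-1)/2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    sites = ["site-a", "site-b", "site-c", "site-d"]  # placeholder names
    pairs = cocitation_pairs(sites)
    print(len(pairs), pair_count(len(sites)))
```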
In author and journal cocitation analysis, researchers must use the online versions of Science Citation Index, Social Science Citation Index, or Arts and Humanities Citation Index for this stage, because those databases are the only place where citation information can be found and where cocitation searching is possible (see White WHITE86A). For this study, the DEC Alta Vista search engine, with its ability to search for documents containing particular URL ``links'' to a given document, was used for the same purpose.
To carry out the many searches needed for the raw cocitation matrix, a ``web robot'' was programmed to automatically submit the searches based on an input set of URLs (from the core set) and to capture the resulting frequency information for further analysis. The robot was designed to be ``polite'' and to pause after each search, to avoid monopolizing the search service (although Alta Vista handles several million requests per day, a persistent robot might be a nuisance). It was also designed to be persistent and to retry searches that failed to complete (after another pause). The robot carried out all 544 searches representing each unique URL pair from the list of 34 core set items. The searching required about 5 hours to run.
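The control flow of such a polite, persistent robot could be sketched as follows. This is not the robot used in the study: the function `run_cocitation_searches`, the `search` callable (assumed to take a query string and return a hit count, raising `OSError` on failure), the pause length, and the retry limit are all assumptions made for illustration.

```python
import time
from itertools import combinations


def run_cocitation_searches(urls, search, pause_seconds=10.0,
                            max_attempts=3, sleep=time.sleep):
    """Run one search per unordered URL pair and collect hit counts.

    `search` is a caller-supplied function (hypothetical interface):
    it takes a query string and returns a hit count, raising OSError
    if the request fails to complete.
    """
    counts = {}
    for a, b in combinations(urls, 2):
        query = f"link:{a} AND link:{b}"
        result = None  # stays None if every attempt fails
        for _ in range(max_attempts):
            try:
                result = search(query)
                failed = False
            except OSError:
                failed = True
            # Be polite: pause after every attempt, successful or not,
            # so a retry only happens after another pause.
            sleep(pause_seconds)
            if not failed:
                break
        counts[(a, b)] = result
    return counts
```

Passing the search function and the sleep function as parameters keeps the sketch independent of any particular search service and makes it possible to exercise the retry logic without network access or real delays.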