Making available databases of academic web links to the world research community
This project was created in response to the need for research into web links, including web link mining and the creation of link metrics. It aims to provide the raw data and software that researchers need to analyse link structures without having to rely on commercial search engines or run their own web crawler. This site will contain all of the following:
Complete databases of link structures of collections of academic web sites.
Files of summary statistics about the link databases.
Software tools for researchers to extract the information that they are particularly interested in.
Descriptions of the methodologies used to crawl the web so that the information provided can be critically evaluated.
Files of information used in the web crawling process.
Slow Internet connection? We will send researchers the databases for free upon receipt of a self-addressed (unstamped) parcel containing an empty CD-case. We will pay postage and supply the CD without charge (it would be too little money to bother with anyway).
The data mining tools provided here should run on most versions of Windows. Please email us if you have any problems. Some of the programs may take a long time to run (days, if you have a slow computer and are processing the large database files). Expect a more comprehensive collection of tools soon. We apologise for the awful interfaces of the programs, but we are happy to advise researchers on which programs will be useful for the type of analysis that they are interested in.
A link to an online journal article based on this preprint is expected shortly. Additional crawling issues and techniques are discussed in the following article.
Thelwall, M. (2001) A Web Crawler Design for Data Mining, Journal of Information Science 27(5), 319-326.
For our publications, please see the Statistical Cybermetrics Research Group home page. There is a large list of related work available on the web site of the e-journal Cybermetrics. A much bigger Unix-based archive that is similar in spirit is available at http://www.archive.org/.
This project is run by the Statistical Cybermetrics Research Group at the University of Wolverhampton. We do not charge for any of the data or tools placed here: since we collected the raw data for free from the web sites covered, we feel that we have an obligation to make it available for free. The crawling is resource-intensive and time-consuming, so we are unfortunately not able to respond to requests such as "please crawl country X". If any bodies, such as national research agencies, would like to see their countries' universities included, then this will involve a charge. We would expect, but not insist, that the data resulting from such an arrangement would subsequently be made available on this site, also without charge. We are currently bidding for funding for web link mining research projects that involve crawling countries, and we expect that this site will grow as a result.
For more information, or to report errors, please email m.thelwall@wlv.ac.uk.