Focused crawls, tunneling, and digital libraries

Bergmark, Donna and Lagoze, Carl and Sbityakov, Alex (2002) Focused crawls, tunneling, and digital libraries.

Abstract

Crawling the Web to build collections of documents related to pre-specified topics became an active area of research during the late 1990s, after crawler technology had been developed for use by search engines. Web crawling is now being seriously considered as an important strategy for building large-scale digital libraries. This paper covers some of the crawl technologies that might be exploited for collection building. For example, focused crawling was developed to make such collection-building crawls more effective; its goal is a “best-first” crawl of the Web. We use powerful crawler software to implement a focused crawl, and we use tunneling to overcome some of the limitations of a pure best-first approach. Others have described tunneling as not only prioritizing links from pages according to each page’s relevance score, but also estimating the value of each link and prioritizing the links accordingly. We add to this mix by devising a tunneling focused crawling strategy that evaluates the current crawl direction on the fly to determine when to terminate a tunneling activity. Results indicate that a combination of focused crawling and tunneling could be an effective tool for building digital libraries.
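
As a rough illustration of the idea in the abstract, the following minimal Python sketch runs a best-first crawl over a small in-memory link graph and allows a limited amount of tunneling through off-topic pages. It is not the authors' implementation and does not use Mercator: the function name `focused_crawl`, the `threshold` and `max_tunnel` parameters, the toy graph, and the relevance scores are all hypothetical. In particular, the stopping rule here (a fixed cap on consecutive off-topic pages along a path) is only a stand-in for the paper's on-the-fly evaluation of crawl direction.

```python
import heapq

def focused_crawl(graph, relevance, seed, threshold=0.5, max_tunnel=2):
    """Best-first crawl with a simple tunneling rule (illustrative only).

    graph:      dict mapping page -> list of linked pages (stands in for fetching).
    relevance:  dict mapping page -> score in [0, 1] (stands in for a classifier).
    max_tunnel: hypothetical cap on consecutive off-topic pages along one path.
    """
    # Heap entries: (negated parent score, insertion order, page, off-topic run).
    frontier = [(-1.0, 0, seed, 0)]
    counter = 1
    seen = {seed}
    collected = []
    while frontier:
        _, _, page, run = heapq.heappop(frontier)
        score = relevance.get(page, 0.0)
        if score >= threshold:
            collected.append(page)
            run = 0  # back on topic, so any tunnel ends here
        else:
            run += 1
            if run > max_tunnel:
                continue  # tunnel too long: do not expand this page's links
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                # Links inherit the priority of the page they were found on.
                heapq.heappush(frontier, (-score, counter, link, run))
                counter += 1
    return collected

# Toy data (hypothetical): "c" is on topic but reachable only through
# the two off-topic pages "x" and "y".
graph = {"seed": ["a", "x"], "a": ["b"], "x": ["y"], "y": ["c"]}
relevance = {"seed": 0.9, "a": 0.8, "b": 0.7, "x": 0.1, "y": 0.2, "c": 0.9}

print(focused_crawl(graph, relevance, "seed", max_tunnel=2))
print(focused_crawl(graph, relevance, "seed", max_tunnel=0))
```

With `max_tunnel=0` the sketch behaves like a pure best-first crawl that never expands off-topic pages, so it misses page `c`; allowing a tunnel of length 2 lets the crawl pass through `x` and `y` and reach it.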

EPrint Type: Preprint
Keywords: Web Crawling, Mercator
Subjects: Digital Libraries
ID Code: 78
Deposited On: 20 July 2002
Alternative Locations: http://mercator.comm.nsdlib.org/CollectionBuilding/ECDLpaper.pdf
