
Introduction to the JASIST Special Topic Section on Web Retrieval and Mining: A Machine Learning Perspective

Chen, Hsinchun (2003) Introduction to the JASIST Special Topic Section on Web Retrieval and Mining: A Machine Learning Perspective. Journal of the American Society for Information Science & Technology 54(7), pp. 621-624.


Abstract

Research in information retrieval (IR) has advanced significantly in the past few decades. Many tasks, such as indexing and text categorization, can be performed automatically with minimal human effort. Machine learning has played an important role in such automation by learning various patterns, such as document topics, text structures, and user interests, from examples. In recent years, it has become increasingly difficult to search for useful information on the World Wide Web because of its large size and unstructured nature. Useful information and resources are often hidden in the Web. While machine learning has been successfully applied to traditional IR systems, applying these algorithms to the Web poses new challenges because of its large size, link structure, diversity in content and languages, and dynamic nature. On the other hand, these characteristics of the Web also provide interesting patterns and knowledge that are not present in traditional information retrieval systems.

EPrint Type: Journal Article (Paginated)
Keywords: National Science Digital Library, NSDL, Artificial Intelligence Lab, AI Lab, Information Retrieval, Machine Learning
Subjects: Web Mining; World Wide Web
ID Code: 415
Deposited On: 16 August 2004
Alternative Locations: http://ai.bpa.arizona.edu/go/papers.html


dLIST, an open access archive for the Information Sciences, is supported by the School of Information Resources and Library Science and the Learning Technologies Center, University of Arizona. Established in 2002, dLIST has a global Advisory Board and is a part of the Information Technology & Society Research Lab.