
A Scalable Self-organizing Map Algorithm for Textual Classification: A Neural Network Approach to Thesaurus Generation

Roussinov, Dmitri G. and Chen, Hsinchun (1998) A Scalable Self-organizing Map Algorithm for Textual Classification: A Neural Network Approach to Thesaurus Generation. Communication and Cognition in Artificial Intelligence Journal 15(1-2), pp. 81-111.


Abstract

The rapid proliferation of textual and multimedia online databases, digital libraries, Internet servers, and intranet services has turned researchers' and practitioners' dream of creating an information-rich society into a nightmare of info-gluts. Many researchers believe that turning an info-glut into a useful digital library requires automated techniques for organizing and categorizing large-scale information. This paper presents research in which we sought to develop a scalable textual classification and categorization system based on Kohonen's self-organizing feature map (SOM) algorithm. We show how self-organization can be used for automatic thesaurus generation. Our proposed data structure and algorithm take advantage of the sparsity of coordinates in the document input vectors and reduce the SOM computational complexity by several orders of magnitude. The proposed Scalable SOM (SSOM) algorithm makes large-scale textual categorization tasks a possibility. Algorithmic intuition and the mathematical foundation of our research are presented in detail. We also describe three benchmarking experiments that examine the algorithm's performance at various scales: classification of electronic meeting comments, Internet homepages, and the Compendex collection.
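
As a rough illustration of why sparse document vectors can cut the cost of SOM training, the sketch below shows one well-known trick for the winner-search step only. It is not the SSOM data structure or algorithm from the paper; the function names (`sparse_sq_distance`, `find_winner`), the dict-based document representation, and the toy map are all assumptions made for this example. The idea is that the squared distance ||x - w||^2 equals ||w||^2 - 2(x . w) + ||x||^2, so if ||w||^2 is cached per node, only the nonzero coordinates of the document x need to be visited.

```python
def sparse_sq_distance(doc, weights, weight_sq_norm):
    """Squared Euclidean distance between a sparse document and a dense node.

    doc            -- dict mapping coordinate index to its nonzero value
    weights        -- dense list of node weights
    weight_sq_norm -- cached sum of squared weights for this node
    """
    # Only the nonzero document coordinates contribute to the dot product.
    dot = sum(value * weights[i] for i, value in doc.items())
    doc_sq_norm = sum(value * value for value in doc.values())
    return weight_sq_norm - 2.0 * dot + doc_sq_norm


def find_winner(doc, nodes, node_sq_norms):
    """Return the index of the node whose weights are closest to the document."""
    return min(
        range(len(nodes)),
        key=lambda k: sparse_sq_distance(doc, nodes[k], node_sq_norms[k]),
    )


def dense_sq_distance(dense_doc, weights):
    """Reference computation that visits every coordinate."""
    return sum((x - w) ** 2 for x, w in zip(dense_doc, weights))


# Hypothetical toy map: three nodes over a six-term vocabulary.
nodes = [
    [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
]
node_sq_norms = [sum(w * w for w in node) for node in nodes]

# A document containing only terms 0 and 3.
doc = {0: 1.0, 3: 1.0}
winner = find_winner(doc, nodes, node_sq_norms)
```

With this layout the winner search costs time proportional to the number of nonzero terms in the document rather than the vocabulary size. The sketch leaves out the weight-update step and the bookkeeping needed to keep the cached norms current after each update.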

EPrint Type: Journal Article (Paginated)
Keywords: National Science Digital Library, NSDL, Artificial Intelligence Lab, AI Lab, Evaluation
Subjects: Knowledge Organization; Classification
ID Code: 460
Deposited On: 04 September 2004
Alternative Locations: http://ai.bpa.arizona.edu/go/papers.html

