Gauging Similarity with n-Grams: Language-Independent Categorization of Text

Science  10 Feb 1995:
Vol. 267, Issue 5199, pp. 843-848
DOI: 10.1126/science.267.5199.843


A language-independent means of gauging topical similarity in unrestricted text is described. The method combines information derived from n-grams (consecutive sequences of n characters) with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents. No prior information about document content or language is required. Context, as it applies to document similarity, can be accommodated by a well-defined procedure. When an existing document is used as an exemplar, the completeness and accuracy with which topically related documents are retrieved are comparable to those of the best existing systems. The results of a formal evaluation are discussed, and examples are given using documents in English and Japanese.
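The general idea of pairing character n-grams with a vector-space comparison can be illustrated with a minimal Python sketch. It is not the procedure evaluated in the article: the function names (`ngram_profile`, `cosine_similarity`, `rank_by_exemplar`), the default n-gram length of 5, the lowercasing and whitespace normalization, and the use of cosine similarity as the score are all assumptions made for illustration, and the handling of context mentioned in the abstract is not modeled here.

```python
import math
from collections import Counter


def ngram_profile(text, n=5):
    """Map each overlapping character n-gram to its relative frequency.

    Assumption: text is lowercased and runs of whitespace are collapsed;
    no tokenization or language-specific processing is applied.
    """
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse n-gram frequency vectors."""
    dot = sum(weight * q.get(gram, 0.0) for gram, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def rank_by_exemplar(exemplar, documents, n=5):
    """Sort documents from most to least similar to an exemplar document."""
    target = ngram_profile(exemplar, n)
    scored = [(cosine_similarity(target, ngram_profile(d, n)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy documents, for illustration only.
    docs = ["a cat sat on a mat", "stock markets fell sharply today"]
    for score, doc in rank_by_exemplar("the cat sat on the mat", docs, n=3):
        print(f"{score:.3f}  {doc}")
```

Because the profile is built from raw characters rather than words, the same code runs unchanged on text without word delimiters, such as Japanese, which is the property the abstract emphasizes.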