Index: src/java/org/apache/lucene/analysis/package.html =================================================================== --- src/java/org/apache/lucene/analysis/package.html (revision 546696) +++ src/java/org/apache/lucene/analysis/package.html (working copy) @@ -5,6 +5,91 @@ -API and code to convert text into indexable tokens. +

API and code to convert text into indexable/searchable tokens. Covers {@link org.apache.lucene.analysis.Analyzer} and related classes.

+

Parsing? Tokenization? Analysis!

+

+Lucene, an indexing and search library, accepts only plain text input. +

+

Parsing

+

+Applications that build their search capabilities upon Lucene may support documents in various formats - HTML, XML, PDF, Word - just to name a few. +Lucene does not care about the Parsing of these and other document formats, and it is the responsibility of the +application using Lucene to use an appropriate Parser to convert the original format into plain text, before passing that plain text to Lucene. +

+

Tokenization

+

+Plain text passed to Lucene for indexing goes through a process generally called tokenization - namely breaking the +input text into small indexing elements - Tokens. The way the input text is broken into tokens largely +determines the further search capabilities of the index into which that text is added. Sentence +beginnings and endings can be identified to provide for more accurate phrase and proximity searches. +In addition, if, for instance, the word "bikes" is replaced by the word "bike", it becomes possible to search with the query "bike" and +find both documents containing "bike" and those containing "bikes". (However, applied this simply, it is no longer possible to +distinguish between the two.) This replacement of "bikes" by "bike" is usually done by a Stemmer. +
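As a toy illustration of tokenization plus stemming, here is a plain-Java sketch (this is not the Lucene API; the class and method names are invented for illustration, and the "stemmer" is deliberately naive compared to real stemmers such as Snowball):

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveAnalysis {
    // Break input into tokens on whitespace -- a crude stand-in for a real Tokenizer.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase());
            }
        }
        return tokens;
    }

    // Naive "stemmer": strip a trailing 's' so that "bikes" and "bike"
    // both index (and match) as "bike". Real stemmers are far more careful.
    static String stem(String token) {
        return token.endsWith("s") && token.length() > 1
                ? token.substring(0, token.length() - 1)
                : token;
    }
}
```

With this sketch, both "bike" and "bikes" reduce to the same indexed term, which is exactly why a query for "bike" would match documents containing either form.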

+Clearly, simply breaking the input text into tokens is not enough - a deeper Analysis is needed, +providing for several functions, including (but not limited to): +

+

+

Core Analysis

+

+ The analysis package provides the mechanism to convert Strings and Readers into tokens that can be indexed by Lucene. There + are three main classes in the package from which all analysis processes are derived. These are: +

+

+

Hints, Tips and Traps

+

+ The synergy between {@link org.apache.lucene.analysis.Analyzer} and {@link org.apache.lucene.analysis.Tokenizer} + is sometimes confusing. To ease this confusion, some clarifications: +

+

+

Lucene Java provides a number of analysis capabilities, the most commonly used one being the {@link + org.apache.lucene.analysis.standard.StandardAnalyzer}. Many applications will have a long and industrious life with nothing more + than the StandardAnalyzer. However, there are a few other classes/packages that are worth mentioning: +

+  1. {@link org.apache.lucene.analysis.PerFieldAnalyzerWrapper} -- Most Analyzers perform the same operation on all {@link org.apache.lucene.document.Field}s. The PerFieldAnalyzerWrapper can be used to associate a different Analyzer with different {@link org.apache.lucene.document.Field}s.
+  2. The contrib/analyzers library located at the root of the Lucene distribution has a number of different Analyzer implementations to solve a variety of different problems related to searching. Many of the Analyzers are designed to analyze non-English languages.
+  3. The contrib/snowball library located at the root of the Lucene distribution has Analyzer and TokenFilter implementations for a variety of Snowball stemmers. See http://snowball.tartarus.org for more information.
+  4. There are a variety of Tokenizer and TokenFilter implementations in this package. Take a look around, chances are someone has implemented what you need.
+
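The per-field idea behind PerFieldAnalyzerWrapper can be sketched in plain Java as routing each field name to its own analyzer, with a fallback default (a conceptual stand-in only; the class and method names here are invented, not the actual Lucene API):

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual sketch: route each field name to its own analyzer, falling
// back to a default -- the idea behind PerFieldAnalyzerWrapper.
public class PerFieldRouter<A> {
    private final A defaultAnalyzer;
    private final Map<String, A> perField = new HashMap<>();

    public PerFieldRouter(A defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    public void addAnalyzer(String fieldName, A analyzer) {
        perField.put(fieldName, analyzer);
    }

    public A analyzerFor(String fieldName) {
        return perField.getOrDefault(fieldName, defaultAnalyzer);
    }
}
```

This is useful when, say, an identifier field should be indexed verbatim while body text gets full tokenization and stemming.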

+

Analysis is one of the main causes of performance degradation during indexing. Simply put, the more you analyze the slower the indexing (in most cases). + Perhaps your application would be just fine using the simple {@link org.apache.lucene.analysis.WhitespaceTokenizer} combined with a + {@link org.apache.lucene.analysis.StopFilter}.
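The effect of that cheap whitespace-plus-stop-words combination can be sketched in plain Java (a conceptual illustration of what WhitespaceTokenizer followed by StopFilter produces, not the real TokenStream-based classes; the stop word list here is an arbitrary example):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WhitespaceAndStop {
    // A small example stop set; real applications tune this for their domain.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of"));

    // Split on whitespace (the WhitespaceTokenizer idea), then drop stop
    // words (the StopFilter idea). Cheap, and often good enough.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t.toLowerCase())) {
                out.add(t);
            }
        }
        return out;
    }
}
```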

+

Implementing your own Analyzer

+

Creating your own Analyzer is straightforward. It usually involves either wrapping an existing Tokenizer and set of TokenFilters to create a new Analyzer, +or writing both the Analyzer and a Tokenizer or TokenFilter from scratch. Before pursuing this approach, you may find it worthwhile +to explore the contrib/analyzers library and/or ask on the java-user@lucene.apache.org mailing list to see if what you need already exists. +If you are still committed to creating your own Analyzer or TokenStream derivation (Tokenizer or TokenFilter), have a look at +the source code of any one of the many samples located in this package.
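The wrap-a-Tokenizer-with-TokenFilters pattern described above can be sketched in plain Java as a tokenizer followed by a chain of per-token filters (hypothetical types invented for illustration, not the actual TokenStream API, which processes tokens incrementally rather than as lists):

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Conceptual sketch of an Analyzer as a tokenizer followed by a chain of
// token filters -- the composition pattern the real Analyzer/TokenFilter
// classes implement over TokenStreams.
public class SketchAnalyzer {
    private final Function<String, List<String>> tokenizer;
    private final List<Function<String, String>> filters;

    public SketchAnalyzer(Function<String, List<String>> tokenizer,
                          List<Function<String, String>> filters) {
        this.tokenizer = tokenizer;
        this.filters = filters;
    }

    public List<String> analyze(String text) {
        List<String> tokens = tokenizer.apply(text);
        for (Function<String, String> filter : filters) {
            tokens = tokens.stream().map(filter).collect(Collectors.toList());
        }
        return tokens;
    }
}
```

For example, combining a whitespace tokenizer with a lower-casing filter yields a pipeline analogous to WhitespaceTokenizer plus LowerCaseFilter; swapping or adding filters changes the analysis without touching the tokenizer.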