Index: lucene/src/java/org/apache/lucene/analysis/package.html
===================================================================
--- lucene/src/java/org/apache/lucene/analysis/package.html	(revision 1222880)
+++ lucene/src/java/org/apache/lucene/analysis/package.html	(working copy)
@@ -39,8 +39,19 @@
 For instance, sentences beginnings and endings can be identified to provide for more accurate phrase 
 and proximity searches (though sentence identification is not provided by Lucene).
 <p>
-In some cases simply breaking the input text into tokens is not enough &ndash; a deeper <i>Analysis</i> may be needed.
-There are many post tokenization steps that can be done, including (but not limited to):
+  In some cases simply breaking the input text into tokens is not enough
+  &ndash; a deeper <i>Analysis</i> may be needed. Lucene includes both
+  pre- and post-tokenization analysis facilities.
+</p>
+<p>
+  Pre-tokenization analysis can include (but is not limited to) stripping
+  HTML markup, and transforming or removing text matching arbitrary patterns
+  or sets of fixed strings.
+</p>
+<p>
+  There are many post-tokenization steps that can be done, including 
+  (but not limited to):
+</p>
 <ul>
   <li><a href="http://en.wikipedia.org/wiki/Stemming">Stemming</a> &ndash; 
       Replacing of words by their stems. 
@@ -68,14 +79,35 @@
   <ul>
     <li>{@link org.apache.lucene.analysis.Analyzer} &ndash; An Analyzer is responsible for building a {@link org.apache.lucene.analysis.TokenStream} which can be consumed
     by the indexing and searching processes.  See below for more information on implementing your own Analyzer.</li>
-    <li>{@link org.apache.lucene.analysis.Tokenizer} &ndash; A Tokenizer is a {@link org.apache.lucene.analysis.TokenStream} and is responsible for breaking
-    up incoming text into tokens. In most cases, an Analyzer will use a Tokenizer as the first step in
-    the analysis process.</li>
+    <li>
+      {@link org.apache.lucene.analysis.CharStream
+      }/{@link org.apache.lucene.analysis.CharFilter} &ndash;
+      A CharStream adds character offset correction functionality over
+      {@link java.io.Reader}.  All Tokenizers accept a CharStream instead of 
+      Reader as input, which enables arbitrary character-based filtering
+      before tokenization. 
+      The {@link org.apache.lucene.analysis.CharStream#correctOffset} method
+      fixes offsets to account for removal or insertion of characters, so that
+      the offsets reported in the tokens match the character offsets of the
+      original Reader.
+
+      CharFilter extends CharStream to enable chaining, just as
+      {@link org.apache.lucene.analysis.TokenFilter} (see below) extends
+      {@link org.apache.lucene.analysis.TokenStream} to enable chaining. 
+    </li>
+    <li>
+      {@link org.apache.lucene.analysis.Tokenizer} &ndash; A Tokenizer is a 
+      {@link org.apache.lucene.analysis.TokenStream} and is responsible for
+      breaking up incoming text into tokens. In most cases, an Analyzer will
+      use a Tokenizer as the first step in the analysis process.  However,
+      to modify text prior to tokenization, use a CharStream subclass (see
+      above).
+    </li>
     <li>{@link org.apache.lucene.analysis.TokenFilter} &ndash; A TokenFilter is also a {@link org.apache.lucene.analysis.TokenStream} and is responsible
     for modifying tokens that have been created by the Tokenizer.  Common modifications performed by a
     TokenFilter are: deletion, stemming, synonym injection, and down casing.  Not all Analyzers require TokenFilters</li>
   </ul>
-  <b>Lucene 2.9 introduces a new TokenStream API. Please see the section "New TokenStream API" below for more details.</b>
+  <b>Lucene 2.9 introduced a new TokenStream API. Please see the section "New TokenStream API" below for more details.</b>
 </p>
 <h2>Hints, Tips and Traps</h2>
 <p>
@@ -166,11 +198,18 @@
   </ol>
 </p>
 <h2>Implementing your own Analyzer</h2>
-<p>Creating your own Analyzer is straightforward. It usually involves either wrapping an existing Tokenizer and  set of TokenFilters to create a new Analyzer
-or creating both the Analyzer and a Tokenizer or TokenFilter.  Before pursuing this approach, you may find it worthwhile
-to explore the contrib/analyzers library and/or ask on the java-user@lucene.apache.org mailing list first to see if what you need already exists.
-If you are still committed to creating your own Analyzer or TokenStream derivation (Tokenizer or TokenFilter) have a look at
-the source code of any one of the many samples located in this package.
+<p>
+  Creating your own Analyzer is straightforward. Your Analyzer can wrap
+  existing analysis components &mdash; CharFilter(s) <i>(optional)</i>, a
+  Tokenizer, and TokenFilter(s) <i>(optional)</i> &mdash; or components you
+  create, or a combination of existing and newly created components.  Before
+  pursuing this approach, you may find it worthwhile to explore the
+  contrib/analyzers library and/or ask on the 
+  <a href="http://lucene.apache.org/java/docs/mailinglists.html"
+      >java-user@lucene.apache.org mailing list</a> first to see if what you
+  need already exists. If you are still committed to creating your own
+  Analyzer, have a look at the source code of any one of the many samples
+  located in this package.
 </p>
 <p>
   The following sections discuss some aspects of implementing your own analyzer.
@@ -220,16 +259,19 @@
    that query. But also the phrase query "blue sky" would find that document.
 </p>
 <p>   
-   If this behavior does not fit the application needs,
-   a modified analyzer can be used, that would increment further the positions of
-   tokens following a removed stop word, using
+   If this behavior does not fit the application needs, a modified analyzer can
+   be used, that would increment further the positions of tokens following a
+   removed stop word, using
    {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#setPositionIncrement(int)}.
-   This can be done with something like:
+   This can be done with something like the following (note, however, that 
+   {@link org.apache.lucene.analysis.StopFilter} natively includes this 
+   capability by subclassing 
+   {@link org.apache.lucene.analysis.FilteringTokenFilter}):
    <PRE class="prettyprint">
       public TokenStream tokenStream(final String fieldName, Reader reader) {
         final TokenStream ts = someAnalyzer.tokenStream(fieldName, reader);
         TokenStream res = new TokenStream() {
-          TermAttribute termAtt = addAttribute(TermAttribute.class);
+          CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
           PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
         
           public boolean incrementToken() throws IOException {
@@ -237,7 +279,7 @@
             while (true) {
               boolean hasNext = ts.incrementToken();
               if (hasNext) {
-                if (stopWords.contains(termAtt.term())) {
+                if (stopWords.contains(termAtt.toString())) {
                   extraIncrement++; // filter this word
                   continue;
                 } 
@@ -282,22 +324,48 @@
 <h3>Attribute and AttributeSource</h3> 
 Lucene 2.9 therefore introduces a new pair of classes called {@link org.apache.lucene.util.Attribute} and
 {@link org.apache.lucene.util.AttributeSource}. An Attribute serves as a
-particular piece of information about a text token. For example, {@link org.apache.lucene.analysis.tokenattributes.TermAttribute}
+particular piece of information about a text token. For example, {@link org.apache.lucene.analysis.tokenattributes.CharTermAttribute}
  contains the term text of a token, and {@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute} contains the start and end character offsets of a token.
 An AttributeSource is a collection of Attributes with a restriction: there may be only one instance of each attribute type. TokenStream now extends AttributeSource, which
 means that one can add Attributes to a TokenStream. Since TokenFilter extends TokenStream, all filters are also
 AttributeSources.
 <p>
-	Lucene now provides six Attributes out of the box, which replace the variables the Token class has:
-	<ul>
-	  <li>{@link org.apache.lucene.analysis.tokenattributes.TermAttribute}<p>The term text of a token.</p></li>
-  	  <li>{@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute}<p>The start and end offset of token in characters.</p></li>
-	  <li>{@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute}<p>See above for detailed information about position increment.</p></li>
-	  <li>{@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute}<p>The payload that a Token can optionally have.</p></li>
-	  <li>{@link org.apache.lucene.analysis.tokenattributes.TypeAttribute}<p>The type of the token. Default is 'word'.</p></li>
-	  <li>{@link org.apache.lucene.analysis.tokenattributes.FlagsAttribute}<p>Optional flags a token can have.</p></li>
-	</ul>
+	Lucene now provides seven Attributes out of the box, which replace the
+  variables the Token class has:
 </p>
+<table>
+  <tr>
+    <td>{@link org.apache.lucene.analysis.tokenattributes.CharTermAttribute}</td>
+    <td>The term text of a token.</td>
+  </tr>
+  <tr>
+    <td>{@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute}</td>
+    <td>The start and end offset of a token in characters.</td>
+  </tr>
+  <tr>
+    <td>{@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute}</td>
+    <td>See above for detailed information about position increment.</td>
+  </tr>
+  <tr>
+    <td>{@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute}</td>
+    <td>The payload that a Token can optionally have.</td>
+  </tr>
+  <tr>
+    <td>{@link org.apache.lucene.analysis.tokenattributes.TypeAttribute}</td>
+    <td>The type of the token. Default is 'word'.</td>
+  </tr>
+  <tr>
+    <td>{@link org.apache.lucene.analysis.tokenattributes.FlagsAttribute}</td>
+    <td>Optional flags a token can have.</td>
+  </tr>
+  <tr>
+    <td>{@link org.apache.lucene.analysis.tokenattributes.KeywordAttribute}</td>
+    <td>
+      Keyword-aware TokenStreams/-Filters skip modification of tokens that
+      return true from this attribute's isKeyword() method. 
+    </td>
+  </tr>
+</table>
 <h3>Using the new TokenStream API</h3>
 There are a few important things to know in order to use the new API efficiently which are summarized here. You may want
 to walk through the example below first and come back to this section afterwards.
@@ -340,28 +408,35 @@
 utilizes the new custom attribute, and call it PartOfSpeechTaggingFilter.
 <h4>Whitespace tokenization</h4>
 <pre class="prettyprint">
-public class MyAnalyzer extends Analyzer {
+public class MyAnalyzer extends ReusableAnalyzerBase {
 
-  public TokenStream tokenStream(String fieldName, Reader reader) {
-    TokenStream stream = new WhitespaceTokenizer(reader);
-    return stream;
+  private Version matchVersion;
+  
+  public MyAnalyzer(Version matchVersion) {
+    this.matchVersion = matchVersion;
   }
+
+  @Override
+  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
+    return new TokenStreamComponents(new WhitespaceTokenizer(matchVersion, reader));
+  }
   
   public static void main(String[] args) throws IOException {
     // text to tokenize
     final String text = "This is a demo of the new TokenStream API";
     
-    MyAnalyzer analyzer = new MyAnalyzer();
+    Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
+    MyAnalyzer analyzer = new MyAnalyzer(matchVersion);
     TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
     
-    // get the TermAttribute from the TokenStream
-    TermAttribute termAtt = stream.addAttribute(TermAttribute.class);
+    // get the CharTermAttribute from the TokenStream
+    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
 
     stream.reset();
     
     // print all tokens until stream is exhausted
     while (stream.incrementToken()) {
-      System.out.println(termAtt.term());
+      System.out.println(termAtt.toString());
     }
     
     stream.end()
@@ -370,7 +445,7 @@
 }
 </pre>
 In this easy example a simple white space tokenization is performed. In main() a loop consumes the stream and
-prints the term text of the tokens by accessing the TermAttribute that the WhitespaceTokenizer provides. 
+prints the term text of the tokens by accessing the CharTermAttribute that the WhitespaceTokenizer provides. 
 Here is the output:
 <pre>
 This
@@ -384,13 +459,15 @@
 API
 </pre>
 <h4>Adding a LengthFilter</h4>
-We want to suppress all tokens that have 2 or less characters. We can do that easily by adding a LengthFilter 
-to the chain. Only the tokenStream() method in our analyzer needs to be changed:
+We want to suppress all tokens that have 2 or fewer characters. We can do that
+easily by adding a LengthFilter to the chain. Only the
+<code>createComponents()</code> method in our analyzer needs to be changed:
 <pre class="prettyprint">
-  public TokenStream tokenStream(String fieldName, Reader reader) {
-    TokenStream stream = new WhitespaceTokenizer(reader);
-    stream = new LengthFilter(stream, 3, Integer.MAX_VALUE);
-    return stream;
+  @Override
+  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
+    final Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
+    TokenStream result = new LengthFilter(source, 3, Integer.MAX_VALUE);
+    return new TokenStreamComponents(source, result);
   }
 </pre>
 Note how now only words with 3 or more characters are contained in the output:
@@ -404,51 +481,127 @@
 </pre>
 Now let's take a look how the LengthFilter is implemented (it is part of Lucene's core):
 <pre class="prettyprint">
-public final class LengthFilter extends TokenFilter {
+public final class LengthFilter extends FilteringTokenFilter {
 
-  final int min;
-  final int max;
+  private final int min;
+  private final int max;
   
-  private TermAttribute termAtt;
+  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
 
   /**
    * Build a filter that removes words that are too long or too
    * short from the text.
    */
-  public LengthFilter(TokenStream in, int min, int max)
-  {
-    super(in);
+  public LengthFilter(boolean enablePositionIncrements, TokenStream in, int min, int max) {
+    super(enablePositionIncrements, in);
     this.min = min;
     this.max = max;
-    termAtt = addAttribute(TermAttribute.class);
   }
   
   /**
-   * Returns the next input Token whose term() is the right len
+   * Build a filter that removes words that are too long or too
+   * short from the text.
+   * @deprecated Use {@link #LengthFilter(boolean, TokenStream, int, int)} instead.
    */
-  public final boolean incrementToken() throws IOException
-  {
-    assert termAtt != null;
-    // return the first non-stop word found
-    while (input.incrementToken()) {
-      int len = termAtt.termLength();
-      if (len >= min && len <= max) {
+  @Deprecated
+  public LengthFilter(TokenStream in, int min, int max) {
+    this(false, in, min, max);
+  }
+
+  @Override
+  public boolean accept() throws IOException {
+    final int len = termAtt.length();
+    return (len >= min && len <= max);
+  }
+}
+</pre>
+<p>
+  In LengthFilter, the CharTermAttribute is added and stored in the instance
+  variable <code>termAtt</code>.  Remember that there can only be a single
+  instance of CharTermAttribute in the chain, so in our example the
+  <code>addAttribute()</code> call in LengthFilter returns the
+  CharTermAttribute that the WhitespaceTokenizer already added.
+</p>
+<p>
+  The tokens are retrieved from the input stream in FilteringTokenFilter's 
+  <code>incrementToken()</code> method (see below), which calls LengthFilter's
+  <code>accept()</code> method. By looking at the term text in the
+  CharTermAttribute, the length of the term can be determined and too short or
+  too long tokens are skipped.  Note how <code>accept()</code> can efficiently
+  access the instance variable; no attribute lookup is necessary. The same is
+  true for the consumer, which can simply use local references to the 
+  Attributes.
+</p>
+<p>
+  LengthFilter extends FilteringTokenFilter; its implementation is:
+</p>
+
+<pre class="prettyprint">
+public abstract class FilteringTokenFilter extends TokenFilter {
+
+  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
+  private boolean enablePositionIncrements; // no init needed, as ctor enforces setting value!
+
+  public FilteringTokenFilter(boolean enablePositionIncrements, TokenStream input){
+    super(input);
+    this.enablePositionIncrements = enablePositionIncrements;
+  }
+
+  /** Override this method and return if the current input token should be returned by {@link #incrementToken}. */
+  protected abstract boolean accept() throws IOException;
+
+  @Override
+  public final boolean incrementToken() throws IOException {
+    if (enablePositionIncrements) {
+      int skippedPositions = 0;
+      while (input.incrementToken()) {
+        if (accept()) {
+          if (skippedPositions != 0) {
+            posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
+          }
           return true;
+        }
+        skippedPositions += posIncrAtt.getPositionIncrement();
       }
-      // note: else we ignore it but should we index each part of it?
+    } else {
+      while (input.incrementToken()) {
+        if (accept()) {
+          return true;
+        }
+      }
     }
-    // reached EOS -- return null
+    // reached EOS -- return false
     return false;
   }
+
+  /**
+   * @see #setEnablePositionIncrements(boolean)
+   */
+  public boolean getEnablePositionIncrements() {
+    return enablePositionIncrements;
+  }
+
+  /**
+   * If <code>true</code>, this TokenFilter will preserve
+   * positions of the incoming tokens (ie, accumulate and
+   * set position increments of the removed tokens).
+   * Generally, <code>true</code> is best as it does not
+   * lose information (positions of the original tokens)
+   * during indexing.
+   * 
+   * <p> When set, when a token is stopped
+   * (omitted), the position increment of the following
+   * token is incremented.
+   *
+   * <p> <b>NOTE</b>: be sure to also
+   * set {@link QueryParser#setEnablePositionIncrements} if
+   * you use QueryParser to create queries.
+   */
+  public void setEnablePositionIncrements(boolean enable) {
+    this.enablePositionIncrements = enable;
+  }
 }
 </pre>
-The TermAttribute is added in the constructor and stored in the instance variable <code>termAtt</code>.
-Remember that there can only be a single instance of TermAttribute in the chain, so in our example the 
-<code>addAttribute()</code> call in LengthFilter returns the TermAttribute that the WhitespaceTokenizer already added. The tokens
-are retrieved from the input stream in the <code>incrementToken()</code> method. By looking at the term text
-in the TermAttribute the length of the term can be determined and too short or too long tokens are skipped. 
-Note how <code>incrementToken()</code> can efficiently access the instance variable; no attribute lookup
-is neccessary. The same is true for the consumer, which can simply use local references to the Attributes.
 
 <h4>Adding a custom Attribute</h4>
 Now we're going to implement our own custom Attribute for part-of-speech tagging and call it consequently 
@@ -477,7 +630,7 @@
 
 <pre class="prettyprint">
 public final class PartOfSpeechAttributeImpl extends AttributeImpl 
-                            implements PartOfSpeechAttribute{
+                            implements PartOfSpeechAttribute {
   
   private PartOfSpeech pos = PartOfSpeech.Unknown;
   
@@ -489,14 +642,17 @@
     return pos;
   }
 
+  @Override
   public void clear() {
     pos = PartOfSpeech.Unknown;
   }
 
+  @Override
   public void copyTo(AttributeImpl target) {
     ((PartOfSpeechAttributeImpl) target).pos = pos;
   }
 
+  @Override
   public boolean equals(Object other) {
     if (other == this) {
       return true;
@@ -509,6 +665,7 @@
     return false;
   }
 
+  @Override
   public int hashCode() {
     return pos.ordinal();
   }
@@ -520,18 +677,16 @@
 that tags every word with a leading upper-case letter as a 'Noun' and all other words as 'Unknown'.
 <pre class="prettyprint">
   public static class PartOfSpeechTaggingFilter extends TokenFilter {
-    PartOfSpeechAttribute posAtt;
-    TermAttribute termAtt;
+    PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);
+    CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
     
     protected PartOfSpeechTaggingFilter(TokenStream input) {
       super(input);
-      posAtt = addAttribute(PartOfSpeechAttribute.class);
-      termAtt = addAttribute(TermAttribute.class);
     }
     
     public boolean incrementToken() throws IOException {
       if (!input.incrementToken()) {return false;}
-      posAtt.setPartOfSpeech(determinePOS(termAtt.termBuffer(), 0, termAtt.termLength()));
+      posAtt.setPartOfSpeech(determinePOS(termAtt.buffer(), 0, termAtt.length()));
       return true;
     }
     
@@ -545,16 +700,20 @@
     }
   }
 </pre>
-Just like the LengthFilter, this new filter accesses the attributes it needs in the constructor and
-stores references in instance variables. Notice how you only need to pass in the interface of the new
-Attribute and instantiating the correct class is automatically been taken care of.
-Now we need to add the filter to the chain:
+<p>
+  Just like the LengthFilter, this new filter stores references to the
+  attributes it needs in instance variables. Notice how you only need to pass
+  in the interface of the new Attribute and instantiating the correct class
+  is automatically taken care of.
+</p>
+<p>Now we need to add the filter to the chain in MyAnalyzer:</p>
 <pre class="prettyprint">
-  public TokenStream tokenStream(String fieldName, Reader reader) {
-    TokenStream stream = new WhitespaceTokenizer(reader);
-    stream = new LengthFilter(stream, 3, Integer.MAX_VALUE);
-    stream = new PartOfSpeechTaggingFilter(stream);
-    return stream;
+  @Override
+  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
+    final Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
+    TokenStream result = new LengthFilter(source, 3, Integer.MAX_VALUE);
+    result = new PartOfSpeechTaggingFilter(result);
+    return new TokenStreamComponents(source, result);
   }
 </pre>
 Now let's look at the output:
@@ -577,8 +736,8 @@
     MyAnalyzer analyzer = new MyAnalyzer();
     TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
     
-    // get the TermAttribute from the TokenStream
-    TermAttribute termAtt = stream.addAttribute(TermAttribute.class);
+    // get the CharTermAttribute from the TokenStream
+    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
     
     // get the PartOfSpeechAttribute from the TokenStream
     PartOfSpeechAttribute posAtt = stream.addAttribute(PartOfSpeechAttribute.class);
@@ -587,7 +746,7 @@
 
     // print all tokens until stream is exhausted
     while (stream.incrementToken()) {
-      System.out.println(termAtt.term() + ": " + posAtt.getPartOfSpeech());
+      System.out.println(termAtt.toString() + ": " + posAtt.getPartOfSpeech());
     }
     
     stream.end();
@@ -612,7 +771,7 @@
 as nouns if not the first word of a sentence (we know, this is still not a correct behavior, but hey, it's a good exercise). 
 As a small hint, this is how the new Attribute class could begin:
 <pre class="prettyprint">
-  public class FirstTokenOfSentenceAttributeImpl extends Attribute
+  public class FirstTokenOfSentenceAttributeImpl extends AttributeImpl
                    implements FirstTokenOfSentenceAttribute {
     
     private boolean firstToken;
@@ -625,6 +784,7 @@
       return firstToken;
     }
 
+    @Override
     public void clear() {
       firstToken = false;
     }
