Index: src/java/org/apache/lucene/analysis/package.html
===================================================================
--- src/java/org/apache/lucene/analysis/package.html	(revision 1223047)
+++ src/java/org/apache/lucene/analysis/package.html	(working copy)
@@ -39,8 +39,19 @@
 For instance, sentences beginnings and endings can be identified to provide for more accurate phrase 
 and proximity searches (though sentence identification is not provided by Lucene).
 <p>
-In some cases simply breaking the input text into tokens is not enough &ndash; a deeper <i>Analysis</i> may be needed.
-There are many post tokenization steps that can be done, including (but not limited to):
+  In some cases simply breaking the input text into tokens is not enough
+  &ndash; a deeper <i>Analysis</i> may be needed. Lucene includes both
+  pre- and post-tokenization analysis facilities.
+</p>
+<p>
+  Pre-tokenization analysis can include (but is not limited to) stripping
+  HTML markup, and transforming or removing text matching arbitrary patterns
+  or sets of fixed strings.
+</p>
+<p>
+  There are many post-tokenization steps that can be done, including 
+  (but not limited to):
+</p>
 <ul>
   <li><a href="http://en.wikipedia.org/wiki/Stemming">Stemming</a> &ndash; 
       Replacing of words by their stems. 
@@ -68,14 +79,34 @@
   <ul>
     <li>{@link org.apache.lucene.analysis.Analyzer} &ndash; An Analyzer is responsible for building a {@link org.apache.lucene.analysis.TokenStream} which can be consumed
     by the indexing and searching processes.  See below for more information on implementing your own Analyzer.</li>
-    <li>{@link org.apache.lucene.analysis.Tokenizer} &ndash; A Tokenizer is a {@link org.apache.lucene.analysis.TokenStream} and is responsible for breaking
-    up incoming text into tokens. In most cases, an Analyzer will use a Tokenizer as the first step in
-    the analysis process.</li>
+    <li>
+      {@link org.apache.lucene.analysis.CharStream}/CharFilter &ndash;
+      A CharStream adds character offset correction functionality over
+      {@link java.io.Reader}.  All Tokenizers accept a CharStream instead of 
+      Reader as input, which enables arbitrary character based filtering
+      before tokenization. 
+      The {@link org.apache.lucene.analysis.CharStream#correctOffset} method
+      fixes offsets to account for removal or insertion of characters, so that
+      the offsets reported in the tokens match the character offsets of the
+      original Reader.
+
+      CharFilter extends CharStream to enable chaining, just as
+      {@link org.apache.lucene.analysis.TokenFilter} (see below) extends
+      {@link org.apache.lucene.analysis.TokenStream} to enable chaining. 
+    </li>
+    <li>
+      {@link org.apache.lucene.analysis.Tokenizer} &ndash; A Tokenizer is a 
+      {@link org.apache.lucene.analysis.TokenStream} and is responsible for
+      breaking up incoming text into tokens. In most cases, an Analyzer will
+      use a Tokenizer as the first step in the analysis process.  However,
+      to modify text prior to tokenization, use a CharStream subclass (see
+      above).
+    </li>
     <li>{@link org.apache.lucene.analysis.TokenFilter} &ndash; A TokenFilter is also a {@link org.apache.lucene.analysis.TokenStream} and is responsible
     for modifying tokens that have been created by the Tokenizer.  Common modifications performed by a
     TokenFilter are: deletion, stemming, synonym injection, and down casing.  Not all Analyzers require TokenFilters</li>
   </ul>
-  <b>Lucene 2.9 introduces a new TokenStream API. Please see the section "New TokenStream API" below for more details.</b>
+  <b>Lucene 2.9 introduced a new TokenStream API. Please see the section "New TokenStream API" below for more details.</b>
 </p>
 <h2>Hints, Tips and Traps</h2>
 <p>
@@ -159,11 +190,18 @@
   </ol>
 </p>
 <h2>Implementing your own Analyzer</h2>
-<p>Creating your own Analyzer is straightforward. It usually involves either wrapping an existing Tokenizer and  set of TokenFilters to create a new Analyzer
-or creating both the Analyzer and a Tokenizer or TokenFilter.  Before pursuing this approach, you may find it worthwhile
-to explore the modules/analysis library and/or ask on the java-user@lucene.apache.org mailing list first to see if what you need already exists.
-If you are still committed to creating your own Analyzer or TokenStream derivation (Tokenizer or TokenFilter) have a look at
-the source code of any one of the many samples located in this package.
+<p>
+  Creating your own Analyzer is straightforward. Your Analyzer can wrap
+  existing analysis components &mdash; CharFilter(s) <i>(optional)</i>, a
+  Tokenizer, and TokenFilter(s) <i>(optional)</i> &mdash; or components you
+  create, or a combination of existing and newly created components.  Before
+  pursuing this approach, you may find it worthwhile to explore the
+  contrib/analyzers library and/or ask on the 
+  <a href="http://lucene.apache.org/java/docs/mailinglists.html"
+      >java-user@lucene.apache.org mailing list</a> first to see if what you
+  need already exists. If you are still committed to creating your own
+  Analyzer, have a look at the source code of any one of the many samples
+  located in this package.
 </p>
 <p>
   The following sections discuss some aspects of implementing your own analyzer.
@@ -213,11 +251,13 @@
    that query. But also the phrase query "blue sky" would find that document.
 </p>
 <p>   
-   If this behavior does not fit the application needs,
-   a modified analyzer can be used, that would increment further the positions of
-   tokens following a removed stop word, using
+   If this behavior does not fit the application needs, a modified analyzer can
+   be used, that would increment further the positions of tokens following a
+   removed stop word, using
    {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#setPositionIncrement(int)}.
-   This can be done with something like:
+   This can be done with something like the following (note, however, that 
+   StopFilter natively includes this capability by subclassing 
+   FilteringTokenFilter):
    <PRE class="prettyprint">
       public TokenStream tokenStream(final String fieldName, Reader reader) {
         final TokenStream ts = someAnalyzer.tokenStream(fieldName, reader);
@@ -281,16 +321,42 @@
 means that one can add Attributes to a TokenStream. Since TokenFilter extends TokenStream, all filters are also
 AttributeSources.
 <p>
-	Lucene now provides six Attributes out of the box, which replace the variables the Token class has:
-	<ul>
-	  <li>{@link org.apache.lucene.analysis.tokenattributes.CharTermAttribute}<p>The term text of a token.</p></li>
-  	  <li>{@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute}<p>The start and end offset of token in characters.</p></li>
-	  <li>{@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute}<p>See above for detailed information about position increment.</p></li>
-	  <li>{@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute}<p>The payload that a Token can optionally have.</p></li>
-	  <li>{@link org.apache.lucene.analysis.tokenattributes.TypeAttribute}<p>The type of the token. Default is 'word'.</p></li>
-	  <li>{@link org.apache.lucene.analysis.tokenattributes.FlagsAttribute}<p>Optional flags a token can have.</p></li>
-	</ul>
+	Lucene now provides seven Attributes out of the box, which replace the
+  variables the Token class had:
 </p>
+<table>
+  <tr>
+    <td>{@link org.apache.lucene.analysis.tokenattributes.CharTermAttribute}</td>
+    <td>The term text of a token.</td>
+  </tr>
+  <tr>
+    <td>{@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute}</td>
+    <td>The start and end offset of a token in characters.</td>
+  </tr>
+  <tr>
+    <td>{@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute}</td>
+    <td>See above for detailed information about position increment.</td>
+  </tr>
+  <tr>
+    <td>{@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute}</td>
+    <td>The payload that a Token can optionally have.</td>
+  </tr>
+  <tr>
+    <td>{@link org.apache.lucene.analysis.tokenattributes.TypeAttribute}</td>
+    <td>The type of the token. Default is 'word'.</td>
+  </tr>
+  <tr>
+    <td>{@link org.apache.lucene.analysis.tokenattributes.FlagsAttribute}</td>
+    <td>Optional flags a token can have.</td>
+  </tr>
+  <tr>
+    <td>{@link org.apache.lucene.analysis.tokenattributes.KeywordAttribute}</td>
+    <td>
+      Keyword-aware TokenStreams/-Filters skip modification of tokens that
+      return true from this attribute's isKeyword() method. 
+    </td>
+  </tr>
+</table>
 <h3>Using the new TokenStream API</h3>
 There are a few important things to know in order to use the new API efficiently which are summarized here. You may want
 to walk through the example below first and come back to this section afterwards.
@@ -335,16 +401,23 @@
 <pre class="prettyprint">
 public class MyAnalyzer extends Analyzer {
 
-  public TokenStream tokenStream(String fieldName, Reader reader) {
-    TokenStream stream = new WhitespaceTokenizer(reader);
-    return stream;
+  private Version matchVersion;
+  
+  public MyAnalyzer(Version matchVersion) {
+    this.matchVersion = matchVersion;
   }
   
+  {@literal @Override}
+  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
+    return new TokenStreamComponents(new WhitespaceTokenizer(matchVersion, reader));
+  }
+  
   public static void main(String[] args) throws IOException {
     // text to tokenize
     final String text = "This is a demo of the new TokenStream API";
     
-    MyAnalyzer analyzer = new MyAnalyzer();
+    Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
+    MyAnalyzer analyzer = new MyAnalyzer(matchVersion);
     TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
     
     // get the CharTermAttribute from the TokenStream
@@ -377,13 +450,15 @@
 API
 </pre>
 <h4>Adding a LengthFilter</h4>
-We want to suppress all tokens that have 2 or less characters. We can do that easily by adding a LengthFilter 
-to the chain. Only the tokenStream() method in our analyzer needs to be changed:
+We want to suppress all tokens that have 2 or fewer characters. We can do that
+easily by adding a LengthFilter to the chain. Only the
+<code>createComponents()</code> method in our analyzer needs to be changed:
 <pre class="prettyprint">
-  public TokenStream tokenStream(String fieldName, Reader reader) {
-    TokenStream stream = new WhitespaceTokenizer(reader);
-    stream = new LengthFilter(stream, 3, Integer.MAX_VALUE);
-    return stream;
+  {@literal @Override}
+  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
+    final Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
+    TokenStream result = new LengthFilter(true, source, 3, Integer.MAX_VALUE);
+    return new TokenStreamComponents(source, result);
   }
 </pre>
 Note how now only words with 3 or more characters are contained in the output:
@@ -395,53 +470,119 @@
 TokenStream
 API
 </pre>
-Now let's take a look how the LengthFilter is implemented (it is part of Lucene's core):
+Now let's take a look at how the LengthFilter is implemented:
 <pre class="prettyprint">
-public final class LengthFilter extends TokenFilter {
+public final class LengthFilter extends FilteringTokenFilter {
 
-  final int min;
-  final int max;
+  private final int min;
+  private final int max;
   
-  private CharTermAttribute termAtt;
+  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
 
   /**
    * Build a filter that removes words that are too long or too
    * short from the text.
    */
-  public LengthFilter(TokenStream in, int min, int max)
-  {
-    super(in);
+  public LengthFilter(boolean enablePositionIncrements, TokenStream in, int min, int max) {
+    super(enablePositionIncrements, in);
     this.min = min;
     this.max = max;
-    termAtt = addAttribute(CharTermAttribute.class);
   }
   
-  /**
-   * Returns the next input Token whose term() is the right len
-   */
-  public final boolean incrementToken() throws IOException
-  {
-    assert termAtt != null;
-    // return the first non-stop word found
-    while (input.incrementToken()) {
-      int len = termAtt.length();
-      if (len >= min && len <= max) {
+  {@literal @Override}
+  public boolean accept() throws IOException {
+    final int len = termAtt.length();
+    return (len >= min && len <= max);
+  }
+}
+</pre>
+<p>
+  In LengthFilter, the CharTermAttribute is added and stored in the instance
+  variable <code>termAtt</code>.  Remember that there can only be a single
+  instance of CharTermAttribute in the chain, so in our example the
+  <code>addAttribute()</code> call in LengthFilter returns the
+  CharTermAttribute that the WhitespaceTokenizer already added.
+</p>
+<p>
+  The tokens are retrieved from the input stream in FilteringTokenFilter's 
+  <code>incrementToken()</code> method (see below), which calls LengthFilter's
+  <code>accept()</code> method. By looking at the term text in the
+  CharTermAttribute, the length of the term can be determined and too short or
+  too long tokens are skipped.  Note how <code>accept()</code> can efficiently
+  access the instance variable; no attribute lookup is necessary. The same is
+  true for the consumer, which can simply use local references to the 
+  Attributes.
+</p>
+<p>
+  LengthFilter extends FilteringTokenFilter:
+</p>
+
+<pre class="prettyprint">
+public abstract class FilteringTokenFilter extends TokenFilter {
+
+  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
+  private boolean enablePositionIncrements; // no init needed, as ctor enforces setting value!
+
+  public FilteringTokenFilter(boolean enablePositionIncrements, TokenStream input){
+    super(input);
+    this.enablePositionIncrements = enablePositionIncrements;
+  }
+
+  /** Override this method and return true if the current input token should be returned by {@literal {@link #incrementToken}}. */
+  protected abstract boolean accept() throws IOException;
+
+  {@literal @Override}
+  public final boolean incrementToken() throws IOException {
+    if (enablePositionIncrements) {
+      int skippedPositions = 0;
+      while (input.incrementToken()) {
+        if (accept()) {
+          if (skippedPositions != 0) {
+            posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
+          }
           return true;
+        }
+        skippedPositions += posIncrAtt.getPositionIncrement();
       }
-      // note: else we ignore it but should we index each part of it?
+    } else {
+      while (input.incrementToken()) {
+        if (accept()) {
+          return true;
+        }
+      }
     }
-    // reached EOS -- return null
+    // reached EOS -- return false
     return false;
   }
+
+  /**
+   * {@literal @see #setEnablePositionIncrements(boolean)}
+   */
+  public boolean getEnablePositionIncrements() {
+    return enablePositionIncrements;
+  }
+
+  /**
+   * If <code>true</code>, this TokenFilter will preserve
+   * positions of the incoming tokens (ie, accumulate and
+   * set position increments of the removed tokens).
+   * Generally, <code>true</code> is best as it does not
+   * lose information (positions of the original tokens)
+   * during indexing.
+   * 
+   * <p> When set, when a token is stopped
+   * (omitted), the position increment of the following
+   * token is incremented.
+   *
+   * <p> <b>NOTE</b>: be sure to also
+   * set org.apache.lucene.queryparser.classic.QueryParser#setEnablePositionIncrements if
+   * you use QueryParser to create queries.
+   */
+  public void setEnablePositionIncrements(boolean enable) {
+    this.enablePositionIncrements = enable;
+  }
 }
 </pre>
-The CharTermAttribute is added in the constructor and stored in the instance variable <code>termAtt</code>.
-Remember that there can only be a single instance of CharTermAttribute in the chain, so in our example the 
-<code>addAttribute()</code> call in LengthFilter returns the TermAttribute that the WhitespaceTokenizer already added. The tokens
-are retrieved from the input stream in the <code>incrementToken()</code> method. By looking at the term text
-in the CharTermAttribute the length of the term can be determined and too short or too long tokens are skipped. 
-Note how <code>incrementToken()</code> can efficiently access the instance variable; no attribute lookup
-is neccessary. The same is true for the consumer, which can simply use local references to the Attributes.
 
 <h4>Adding a custom Attribute</h4>
 Now we're going to implement our own custom Attribute for part-of-speech tagging and call it consequently 
@@ -470,7 +611,7 @@
 
 <pre class="prettyprint">
 public final class PartOfSpeechAttributeImpl extends AttributeImpl 
-                            implements PartOfSpeechAttribute{
+                            implements PartOfSpeechAttribute {
   
   private PartOfSpeech pos = PartOfSpeech.Unknown;
   
@@ -482,14 +623,17 @@
     return pos;
   }
 
+  {@literal @Override}
   public void clear() {
     pos = PartOfSpeech.Unknown;
   }
 
+  {@literal @Override}
   public void copyTo(AttributeImpl target) {
     ((PartOfSpeechAttributeImpl) target).pos = pos;
   }
 
+  {@literal @Override}
   public boolean equals(Object other) {
     if (other == this) {
       return true;
@@ -502,24 +646,23 @@
     return false;
   }
 
+  {@literal @Override}
   public int hashCode() {
     return pos.ordinal();
   }
 }
 </pre>
 This is a simple Attribute implementation has only a single variable that stores the part-of-speech of a token. It extends the
-new <code>AttributeImpl</code> class and therefore implements its abstract methods <code>clear(), copyTo(), equals(), hashCode()</code>.
+<code>AttributeImpl</code> class and therefore implements its abstract methods <code>clear(), copyTo(), equals(), hashCode()</code>.
 Now we need a TokenFilter that can set this new PartOfSpeechAttribute for each token. In this example we show a very naive filter
 that tags every word with a leading upper-case letter as a 'Noun' and all other words as 'Unknown'.
 <pre class="prettyprint">
   public static class PartOfSpeechTaggingFilter extends TokenFilter {
-    PartOfSpeechAttribute posAtt;
-    CharTermAttribute termAtt;
+    PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);
+    CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
     
     protected PartOfSpeechTaggingFilter(TokenStream input) {
       super(input);
-      posAtt = addAttribute(PartOfSpeechAttribute.class);
-      termAtt = addAttribute(CharTermAttribute.class);
     }
     
     public boolean incrementToken() throws IOException {
@@ -538,16 +681,20 @@
     }
   }
 </pre>
-Just like the LengthFilter, this new filter accesses the attributes it needs in the constructor and
-stores references in instance variables. Notice how you only need to pass in the interface of the new
-Attribute and instantiating the correct class is automatically been taken care of.
-Now we need to add the filter to the chain:
+<p>
+  Just like the LengthFilter, this new filter stores references to the
+  attributes it needs in instance variables. Notice how you only need to pass
+  in the interface of the new Attribute and instantiating the correct class
+  is automatically taken care of.
+</p>
+<p>Now we need to add the filter to the chain in MyAnalyzer:</p>
 <pre class="prettyprint">
-  public TokenStream tokenStream(String fieldName, Reader reader) {
-    TokenStream stream = new WhitespaceTokenizer(reader);
-    stream = new LengthFilter(stream, 3, Integer.MAX_VALUE);
-    stream = new PartOfSpeechTaggingFilter(stream);
-    return stream;
+  {@literal @Override}
+  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
+    final Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
+    TokenStream result = new LengthFilter(true, source, 3, Integer.MAX_VALUE);
+    result = new PartOfSpeechTaggingFilter(result);
+    return new TokenStreamComponents(source, result);
   }
 </pre>
 Now let's look at the output:
@@ -605,7 +752,7 @@
 as nouns if not the first word of a sentence (we know, this is still not a correct behavior, but hey, it's a good exercise). 
 As a small hint, this is how the new Attribute class could begin:
 <pre class="prettyprint">
-  public class FirstTokenOfSentenceAttributeImpl extends Attribute
+  public class FirstTokenOfSentenceAttributeImpl extends AttributeImpl
                    implements FirstTokenOfSentenceAttribute {
     
     private boolean firstToken;
@@ -618,6 +765,7 @@
       return firstToken;
     }
 
+    {@literal @Override}
     public void clear() {
       firstToken = false;
     }
