  1. Lucene - Core
  2. LUCENE-3843

implement PositionLengthAttribute for all tokenstreams where it's appropriate

Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • None
    • 4.9, 6.0
    • None
    • None
    • New

    Description

      LUCENE-3767 introduces PositionLengthAttribute, which extends the tokenstream API
      from a sausage to a real graph.

      Currently, tokenstreams such as WordDelimiterFilter and SynonymFilter theoretically
      work at a graph level, but then serialize themselves to a sausage. For example:

      wi-fi with WDF creates:
      wi(posinc=1), fi(posinc=1), wifi(posinc=0)

      So the lossiness is that 'wifi' is simply stacked on top of 'fi', with nothing to
      say that it also covers 'wi'.

      PositionLengthAttribute fixes this by allowing a token to declare how far it "spans",
      so we don't lose any information.
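
To make the difference concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: the `Token` class and the `arcs` helper are hypothetical stand-ins that only model a term, its position increment, and its position length. The graph ordering shown (wi, wifi, fi) is an assumption about one way a fixed filter could emit the tokens, not what any existing filter does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokenGraphSketch {
    // Hypothetical stand-in for a token: term, position increment, position length.
    static final class Token {
        final String term;
        final int posInc;
        final int posLen;

        Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    // Hypothetical helper: resolve each token to an arc "from->to:term".
    // The start position is the running sum of position increments;
    // the end position is the start plus the position length.
    static List<String> arcs(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1; // position before the first token
        for (Token t : tokens) {
            pos += t.posInc;
            out.add(pos + "->" + (pos + t.posLen) + ":" + t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        // Sausage form from the description: every token implicitly spans one position.
        List<Token> sausage = Arrays.asList(
            new Token("wi", 1, 1),
            new Token("fi", 1, 1),
            new Token("wifi", 0, 1));

        // Assumed graph form: 'wifi' starts where 'wi' starts and spans two positions.
        List<Token> graph = Arrays.asList(
            new Token("wi", 1, 1),
            new Token("wifi", 0, 2),
            new Token("fi", 1, 1));

        System.out.println(arcs(sausage)); // [0->1:wi, 1->2:fi, 1->2:wifi]
        System.out.println(arcs(graph));   // [0->1:wi, 0->2:wifi, 1->2:fi]
    }
}
```

In the sausage output 'wifi' looks like an alternative to 'fi' only, while in the graph output it is an alternative path across both 'wi' and 'fi'.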

      While the indexer currently can only support sausages anyway (and for performance reasons,
      this is probably just fine!), other tokenstream consumers, such as queryparsers and
      suggesters (e.g. LUCENE-3842), can actually make use of this information for better behavior.

      So I think it's ideal if the TokenStream API doesn't reflect the lossiness of the index format,
      but instead keeps all information. After LUCENE-3767 is committed, we should fix tokenstreams
      to preserve this information for consumers that can use it.

            People

              Assignee: Unassigned
              Reporter: Robert Muir (rcmuir)
              Votes: 4
              Watchers: 10
