  1. Jackrabbit Oak
  2. OAK-1022

Add a custom Oak Lucene analyzer


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10
    • Component/s: lucene
    • Labels: None

    Description

      Following OAK-1007, where I switched to a ClassicAnalyzer, I realized that the switch introduced some subtle changes in tokenization behavior.

      For example, there's a twist if the token contains a number.
      From the ClassicTokenizer API documentation:

      Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.

      This means that a path token could be split into 2 tokens if it has no numbers:

      /parent/child => 'parent', 'child'
      

      or just one if it has numbers:

      /p12345/p23456 => '/p12345/p23456'
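
The difference between the two cases can be reproduced with a small stand-alone sketch. This is a simplified model of the rule as described in this issue, not Lucene's actual ClassicTokenizer grammar; the class name `ProductNumberRuleSketch` and the `tokenize` helper are hypothetical, and treating '/' like a hyphen is an assumption taken from the path examples above.

```java
import java.util.ArrayList;
import java.util.List;

final class ProductNumberRuleSketch {

    // Simplified model (assumption, not Lucene code): a whitespace-delimited
    // chunk that contains a digit is kept whole, like a "product number";
    // any other chunk is split on '/' and '-'.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue;
            }
            if (chunk.chars().anyMatch(Character::isDigit)) {
                tokens.add(chunk);
            } else {
                for (String part : chunk.split("[/-]")) {
                    if (!part.isEmpty()) {
                        tokens.add(part);
                    }
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("/parent/child"));  // [parent, child]
        System.out.println(tokenize("/p12345/p23456")); // [/p12345/p23456]
    }
}
```

With this model, a query for 'p12345' would not match the second path, because the index would only contain the unsplit token.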
      

      I'd also like to split alphanumeric tokens on '_' and on '.'.
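
A rough sketch of the tokenization this issue asks for, written in plain Java so it runs without Lucene. It is not the analyzer that was committed for this issue; `DesiredSplitSketch` and `tokenize` are hypothetical names, and splitting on every non-alphanumeric character (rather than only on '/', '-', '_' and '.') is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class DesiredSplitSketch {

    // Hypothetical helper: every character that is not a letter or digit
    // ends the current token, whether or not the token contains digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("/p12345/p23456")); // [p12345, p23456]
        System.out.println(tokenize("my_file.v2"));     // [my, file, v2]
    }
}
```

In a real analyzer this logic would live in a Lucene Tokenizer or TokenFilter; the loop above only illustrates the intended output.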

          People

            Assignee: Alex Deparvu (stillalex)
            Reporter: Alex Deparvu (stillalex)