Lucene - Core / LUCENE-2167

Implement StandardTokenizer with the UAX#29 Standard

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: Patch Available

      Description

      It would be really nice for StandardTokenizer to follow the UAX#29 standard as closely as we can with JFlex. Then its name would actually make sense.

      Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:

      This should be a good tokenizer for most European-language documents

      The new StandardTokenizer could then say:

      This should be a good tokenizer for most languages.

      All the English/Euro-centric handling, such as the acronym, company, and apostrophe rules, can stay with that EuropeanTokenizer, which could then be used by the European analyzers.
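
      As a rough illustration of what splitting on word boundaries looks like, here is a minimal sketch that uses the JDK's java.text.BreakIterator instead of a JFlex grammar. It is not the tokenizer proposed in this issue: the JDK's word rules are only assumed to be in the spirit of UAX#29, not an exact match, and the class name WordBreakSketch and the method tokenize are hypothetical names chosen for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakSketch {

    // Splits text at the word boundaries reported by the JDK and keeps
    // only the segments that contain at least one letter or digit, so
    // whitespace and punctuation-only segments are dropped.
    static List<String> tokenize(String text) {
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ROOT);
        boundaries.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next();
             end != BreakIterator.DONE;
             start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The quick brown fox"));
    }
}
```

      The point of the sketch is that the boundary rules carry no language-specific special cases such as acronym or company-name handling; under this proposal those would live in the separate EuropeanTokenizer.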

        Attachments

        1. LUCENE-2167.benchmark.patch
          34 kB
          Steve Rowe
        2. LUCENE-2167.benchmark.patch
          33 kB
          Steve Rowe
        3. LUCENE-2167.benchmark.patch
          31 kB
          Steve Rowe
        4. LUCENE-2167.patch
          885 kB
          Steve Rowe
        5. LUCENE-2167.patch
          831 kB
          Steve Rowe
        6. LUCENE-2167.patch
          874 kB
          Steve Rowe
        7. LUCENE-2167.patch
          887 kB
          Steve Rowe
        8. LUCENE-2167.patch
          588 kB
          Steve Rowe
        9. LUCENE-2167.patch
          529 kB
          Robert Muir
        10. LUCENE-2167.patch
          812 kB
          Robert Muir
        11. LUCENE-2167.patch
          746 kB
          Steve Rowe
        12. LUCENE-2167.patch
          859 kB
          Steve Rowe
        13. LUCENE-2167.patch
          53 kB
          Steve Rowe
        14. LUCENE-2167.patch
          50 kB
          Steve Rowe
        15. LUCENE-2167.patch
          50 kB
          Steve Rowe
        16. LUCENE-2167.patch
          49 kB
          Steve Rowe
        17. LUCENE-2167.patch
          47 kB
          Steve Rowe
        18. LUCENE-2167.patch
          46 kB
          Steve Rowe
        19. LUCENE-2167.patch
          56 kB
          Steve Rowe
        20. LUCENE-2167.patch
          56 kB
          Steve Rowe
        21. LUCENE-2167.patch
          2 kB
          Shyamal Prasad
        22. LUCENE-2167.patch
          3 kB
          Shyamal Prasad
        23. LUCENE-2167-jflex-tld-macro-gen.patch
          14 kB
          Uwe Schindler
        24. LUCENE-2167-jflex-tld-macro-gen.patch
          14 kB
          Uwe Schindler
        25. LUCENE-2167-jflex-tld-macro-gen.patch
          14 kB
          Uwe Schindler
        26. LUCENE-2167-lucene-buildhelper-maven-plugin.patch
          39 kB
          Steve Rowe
        27. standard.zip
          162 kB
          Robert Muir
        28. StandardTokenizerImpl.jflex
          14 kB
          Steve Rowe

              People

              • Assignee: Steve Rowe (steve_rowe)
              • Reporter: Shyamal Prasad (shyamal)
              • Votes: 0
              • Watchers: 2

                  Time Tracking

                  • Original Estimate: 0.5h
                  • Remaining Estimate: 0.5h
                  • Time Spent: Not Specified