  1. Lucene - Core
  2. LUCENE-2167

Implement StandardTokenizer with the UAX#29 Standard

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: Patch Available

    Description

      It would be really nice for StandardTokenizer to adhere to the UAX#29 standard as closely as we can with JFlex. Then its name would actually make sense.

      Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:

      This should be a good tokenizer for most European-language documents

      The new StandardTokenizer could then say:

      This should be a good tokenizer for most languages.

      All the English/Euro-centric handling (acronyms, company names, apostrophes) can stay with that EuropeanTokenizer, which the European analyzers could then use.
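
To illustrate the kind of language-neutral word segmentation the description asks for, here is a minimal sketch using the JDK's `java.text.BreakIterator` as a stand-in. This is an assumption for illustration only: it is not the JFlex grammar from the attached patches, `BreakIterator` is not guaranteed to match UAX#29 rule for rule, and the class name `WordBreakSketch` and the letter-or-digit filter are hypothetical choices.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakSketch {

    // Splits text at the word boundaries reported by the JDK and keeps only
    // segments that contain at least one letter or digit, so that whitespace
    // and punctuation segments are dropped.
    static List<String> tokenize(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world"));
    }
}
```

Nothing in this sketch special-cases acronyms, company names, or apostrophes; under the proposal, that kind of handling would stay in the renamed EuropeanTokenizer.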

      Attachments

        1. LUCENE-2167.benchmark.patch
          34 kB
          Steven Rowe
        2. LUCENE-2167.benchmark.patch
          33 kB
          Steven Rowe
        3. LUCENE-2167.benchmark.patch
          31 kB
          Steven Rowe
        4. LUCENE-2167.patch
          885 kB
          Steven Rowe
        5. LUCENE-2167.patch
          831 kB
          Steven Rowe
        6. LUCENE-2167.patch
          874 kB
          Steven Rowe
        7. LUCENE-2167.patch
          887 kB
          Steven Rowe
        8. LUCENE-2167.patch
          588 kB
          Steven Rowe
        9. LUCENE-2167.patch
          529 kB
          Robert Muir
        10. LUCENE-2167.patch
          812 kB
          Robert Muir
        11. LUCENE-2167.patch
          746 kB
          Steven Rowe
        12. LUCENE-2167.patch
          859 kB
          Steven Rowe
        13. LUCENE-2167.patch
          53 kB
          Steven Rowe
        14. LUCENE-2167.patch
          50 kB
          Steven Rowe
        15. LUCENE-2167.patch
          50 kB
          Steven Rowe
        16. LUCENE-2167.patch
          49 kB
          Steven Rowe
        17. LUCENE-2167.patch
          47 kB
          Steven Rowe
        18. LUCENE-2167.patch
          46 kB
          Steven Rowe
        19. LUCENE-2167.patch
          56 kB
          Steven Rowe
        20. LUCENE-2167.patch
          56 kB
          Steven Rowe
        21. LUCENE-2167.patch
          2 kB
          Shyamal Prasad
        22. LUCENE-2167.patch
          3 kB
          Shyamal Prasad
        23. LUCENE-2167-jflex-tld-macro-gen.patch
          14 kB
          Uwe Schindler
        24. LUCENE-2167-jflex-tld-macro-gen.patch
          14 kB
          Uwe Schindler
        25. LUCENE-2167-jflex-tld-macro-gen.patch
          14 kB
          Uwe Schindler
        26. LUCENE-2167-lucene-buildhelper-maven-plugin.patch
          39 kB
          Steven Rowe
        27. standard.zip
          162 kB
          Robert Muir
        28. StandardTokenizerImpl.jflex
          14 kB
          Steven Rowe

            People

              Assignee: Steven Rowe (sarowe)
              Reporter: Shyamal Prasad (shyamal)
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Original Estimate: 0.5h
                  Remaining Estimate: 0.5h
                  Time Spent: Not Specified