  1. Lucene - Core
  2. LUCENE-2167

Implement StandardTokenizer with the UAX#29 Standard

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: Patch Available

    Description

      It would be really nice for StandardTokenizer to adhere to the UAX#29 standard as closely as we can with JFlex. Then its name would actually make sense.

      Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:

      This should be a good tokenizer for most European-language documents

      The new StandardTokenizer's javadoc could then say:

      This should be a good tokenizer for most languages.

      All the English/Euro-centric handling (acronyms, company names, apostrophes) can stay with that EuropeanTokenizer, which the European analyzers could then use.
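
      To illustrate the kind of language-neutral word segmentation the description asks for, here is a minimal sketch using the JDK's java.text.BreakIterator. This is an assumption-laden stand-in: BreakIterator only approximates the UAX#29 word-break rules, it is not the JFlex grammar proposed in this issue, and the class name WordBreakSketch is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: not Lucene's StandardTokenizer.
final class WordBreakSketch {

    // Splits text at word boundaries and keeps only segments that contain
    // at least one letter or digit, dropping whitespace and punctuation.
    static List<String> tokenize(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Hello, world, 42]
        System.out.println(tokenize("Hello, world 42"));
    }
}
```

      The filter on letters and digits mirrors the general idea that a tokenizer emits word-like segments and discards the boundaries between them. A tokenizer built directly on UAX#29 would make that decision from the Unicode word-break properties rather than from this simple check.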

      Attachments

        1. LUCENE-2167.patch
          3 kB
          Shyamal Prasad
        2. LUCENE-2167.patch
          2 kB
          Shyamal Prasad
        3. LUCENE-2167.patch
          56 kB
          Steven Rowe
        4. LUCENE-2167.patch
          56 kB
          Steven Rowe
        5. LUCENE-2167.patch
          46 kB
          Steven Rowe
        6. LUCENE-2167.patch
          47 kB
          Steven Rowe
        7. LUCENE-2167.patch
          49 kB
          Steven Rowe
        8. LUCENE-2167.benchmark.patch
          31 kB
          Steven Rowe
        9. LUCENE-2167.patch
          50 kB
          Steven Rowe
        10. LUCENE-2167.patch
          50 kB
          Steven Rowe
        11. LUCENE-2167.patch
          53 kB
          Steven Rowe
        12. LUCENE-2167-lucene-buildhelper-maven-plugin.patch
          39 kB
          Steven Rowe
        13. LUCENE-2167-jflex-tld-macro-gen.patch
          14 kB
          Uwe Schindler
        14. LUCENE-2167-jflex-tld-macro-gen.patch
          14 kB
          Uwe Schindler
        15. LUCENE-2167-jflex-tld-macro-gen.patch
          14 kB
          Uwe Schindler
        16. LUCENE-2167.patch
          859 kB
          Steven Rowe
        17. LUCENE-2167.patch
          746 kB
          Steven Rowe
        18. LUCENE-2167.benchmark.patch
          33 kB
          Steven Rowe
        19. standard.zip
          162 kB
          Robert Muir
        20. LUCENE-2167.patch
          812 kB
          Robert Muir
        21. LUCENE-2167.patch
          529 kB
          Robert Muir
        22. LUCENE-2167.benchmark.patch
          34 kB
          Steven Rowe
        23. StandardTokenizerImpl.jflex
          14 kB
          Steven Rowe
        24. LUCENE-2167.patch
          588 kB
          Steven Rowe
        25. LUCENE-2167.patch
          887 kB
          Steven Rowe
        26. LUCENE-2167.patch
          874 kB
          Steven Rowe
        27. LUCENE-2167.patch
          831 kB
          Steven Rowe
        28. LUCENE-2167.patch
          885 kB
          Steven Rowe


            People

              Assignee: Steven Rowe (sarowe)
              Reporter: Shyamal Prasad (shyamal)
              Votes: 0
              Watchers: 2


                Time Tracking

                  Estimated: 0.5h
                  Remaining: 0.5h
                  Logged: Not Specified
