LUCY-196: UAX #29 tokenizer

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.3.0 (incubating)
    • Component/s: Analysis
    • Labels: None

    Description

      It would be nice to have a default tokenizer in core. A tokenizer based on the Unicode word boundaries defined in UAX #29 ("Unicode Text Segmentation") seems like a good choice. That's also how Lucene's StandardTokenizer works.

      See the following thread on lucy-dev
      http://mail-archives.apache.org/mod_mbox/incubator-lucy-dev/201111.mbox/browser

      Also see
      http://unicode.org/reports/tr29/#Word_Boundaries
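
      To illustrate the kind of behavior being requested, here is a rough sketch in Python. It is not Lucy code and does not implement UAX #29 in full: it only approximates a handful of the word-boundary rules with a regular expression (runs of letters, digits and underscores stay together; an apostrophe or period joins two letters or two digits; a comma joins two digits). The name `tokenize` is a hypothetical helper chosen for this example, and scripts without spaces, combining marks, Katakana and emoji are not handled.

```python
import re

# Simplified approximation of a few UAX #29 word-boundary rules.
# This is an illustrative assumption, not the full algorithm:
#   - \w+ keeps runs of letters, digits and "_" together
#   - "'" or "." between two letters does not break ("can't", "example.com")
#   - "'", "." or "," between two digits does not break ("3.14", "1,000")
_TOKEN = re.compile(
    r"\w+"
    r"(?:"
    r"(?:(?<=[^\W\d_])['.](?=[^\W\d_])"   # mid-letter joiner
    r"|(?<=\d)['.,](?=\d))"               # mid-number joiner
    r"\w+"
    r")*"
)


def tokenize(text):
    """Return the word-like tokens of text, dropping punctuation and spaces."""
    return _TOKEN.findall(text)


if __name__ == "__main__":
    print(tokenize("Lucy can't parse 3.14, or 1,000 items."))
```

      Segments that contain no letter or digit (spaces, stray punctuation) are simply skipped here, which is one common way a tokenizer can use word boundaries; a real implementation would follow the property tables and rules in the report linked above.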

          People

            Assignee: Nikolas Wellnhofer (nwellnhof)
            Reporter: Nikolas Wellnhofer (nwellnhof)