Lucene - Core / LUCENE-417

StandardTokenizer has problems with comma-separated values

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Environment: Operating System: other, Platform: Other
    • Bugzilla Id: 35971

    Description

      The StandardTokenizer assumes that any phrase containing a comma and at least
      one digit must be a number. We are trying to index comma-separated values of
      SAP R/3 transaction codes along with standard text. Many of these codes
      contain digits, e.g. "VA01" or "SE80". When tokenizing text that contains
      these codes, Lucene recognizes a comma-separated list of them, e.g.
      "VA01,VA02,VA03", as a number. The grammar should be modified to recognize
      numbers correctly (e.g. as containing only digits).
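
      The behaviour described above can be illustrated with a small standalone
      sketch. It does not use Lucene and does not reproduce the actual
      StandardTokenizer grammar; the class name NumRuleSketch and both regular
      expressions are hypothetical. LOOSE_NUM models the rule as the report
      describes it (comma-joined segments with at least one digit anywhere), and
      STRICT_NUM models the requested behaviour (digit groups only).

```java
import java.util.regex.Pattern;

// Hypothetical illustration only: these patterns approximate the behaviour
// described in the report, not the real StandardTokenizer grammar.
class NumRuleSketch {
    // Loose rule (assumed): alphanumeric segments joined by commas,
    // with at least one digit somewhere in the phrase.
    static final Pattern LOOSE_NUM =
        Pattern.compile("(?=.*\\d)[A-Za-z0-9]+(,[A-Za-z0-9]+)+");

    // Stricter rule (assumed): groups of digits only, optionally comma-separated.
    static final Pattern STRICT_NUM = Pattern.compile("\\d+(,\\d+)*");

    static boolean isLooseNumber(String s) {
        return LOOSE_NUM.matcher(s).matches();
    }

    static boolean isStrictNumber(String s) {
        return STRICT_NUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"VA01,VA02,VA03", "1,000,000", "SE80"};
        for (String s : samples) {
            System.out.println(s
                + " loose=" + isLooseNumber(s)
                + " strict=" + isStrictNumber(s));
        }
    }
}
```

      Under these assumptions the loose rule accepts "VA01,VA02,VA03" as a
      number, while the strict rule accepts only inputs such as "1,000,000".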

          People

            Assignee: Unassigned
            Reporter: André Wolf (smaugg@gmx.net)
            Votes: 0
            Watchers: 0
