  1. OpenNLP
  2. OPENNLP-1099

Is this a typical tokenization issue?

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • None
    • 1.8.1
    • Lemmatizer
    • None

    Description

      I am testing OpenNLP and found a significant tokenization issue involving punctuation. For example, with these inputs:

      Thank you Costco!
      i love costco!
      I love Costco!!
      FUCK IKEA.

      In all of these cases, the trailing punctuation mark is not split off, so "Costco!" and "IKEA." are each treated as a single token. This looks like a systematic problem.
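
One way this symptom can arise is a tokenizer that splits only on whitespace, which leaves trailing punctuation attached to the preceding word. The report does not say which tokenizer was used, so that is an assumption rather than a diagnosis. The sketch below is plain Java with no OpenNLP dependency; the class `TokenizeSketch` and both method names are hypothetical and are not OpenNLP's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not OpenNLP code.
class TokenizeSketch {

    // Splits on whitespace only, so "Costco!" stays a single token.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Groups runs of letters/digits into one token and emits every other
    // non-whitespace character (such as '!' or '.') as its own token.
    static List<String> punctuationAwareTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                if (!Character.isWhitespace(c)) {
                    tokens.add(String.valueOf(c));
                }
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "Thank you Costco!";
        System.out.println(whitespaceTokens(sample));       // [Thank, you, Costco!]
        System.out.println(punctuationAwareTokens(sample)); // [Thank, you, Costco, !]
    }
}
```

OpenNLP itself ships several tokenizers (`WhitespaceTokenizer`, `SimpleTokenizer`, and the model-based `TokenizerME`), so whether punctuation is split off depends on which one is in use.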

          People

            Assignee: Unassigned
            Reporter: martinmin martin
            Votes: 0
            Watchers: 2