  1. Spark
  2. SPARK-18374

Incorrect words in StopWords/english.txt

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.2.0
    • Component/s: ML
    • Labels:

      Description

      I was double-checking english.txt, the list of stop words, because it seemed to be removing valid tokens such as 'won'. I think the issue is that the english.txt list is missing the apostrophe character and every character after it. So "won't" became "won" in that list, and "wouldn't" became "wouldn".

      Here are some incorrect tokens in this list:

      won
      wouldn
      ma
      mightn
      mustn
      needn
      shan
      shouldn
      wasn
      weren
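
      A minimal sketch of the suspected cause, in plain Python rather than Spark. It assumes each contraction was cut at its apostrophe; the input list and the helper name truncate_at_apostrophe are illustrative and are not taken from Spark's code. 'ma' is left out because its original form is not stated here.

```python
# Hypothesis: each contraction lost its apostrophe and everything after it.
# The contractions below are assumed originals for the tokens listed above.
contractions = [
    "won't", "wouldn't", "mightn't", "mustn't", "needn't",
    "shan't", "shouldn't", "wasn't", "weren't",
]


def truncate_at_apostrophe(word: str) -> str:
    """Keep only the characters before the first apostrophe."""
    return word.split("'", 1)[0]


truncated = [truncate_at_apostrophe(w) for w in contractions]
print(truncated)
```

      If the hypothesis holds, the printed tokens match the incorrect entries listed above.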

      I think the ideal list should have both styles, i.e. both won't and wont should be part of english.txt, since some tokenizers might remove special characters. But 'won' obviously shouldn't be in this list.
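
      A rough sketch of the "both styles" idea, again in plain Python and not Spark's StopWordsRemover. The helper names with_both_styles and remove_stop_words, the three sample contractions, and the sample tokens are all hypothetical.

```python
def with_both_styles(contractions):
    """Return each contraction both with and without its apostrophe."""
    words = set()
    for w in contractions:
        words.add(w)                   # e.g. "won't"
        words.add(w.replace("'", ""))  # e.g. "wont"
    return words


def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is in the stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]


stop_words = with_both_styles(["won't", "wouldn't", "shouldn't"])
tokens = ["she", "won", "but", "won't", "gloat", "wont"]
filtered = remove_stop_words(tokens, stop_words)
print(filtered)
```

      With a list built this way, "won't" and "wont" are filtered out whichever way the tokenizer handles apostrophes, while the valid token "won" is kept.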

      Here's the list of Snowball English stop words:
      http://snowball.tartarus.org/algorithms/english/stop.txt

            People

            • Assignee:
              yuhaoyan yuhao yang
              Reporter:
              tenstriker nirav patel
            • Votes:
              0
              Watchers:
              5
