SPARK-3614: Filter on minimum occurrences of a term in IDF

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: MLlib

    Description

      The IDF class in MLlib does not provide a way to set a minimum number of documents in which a term must appear in the corpus. The idea is to add a cutoff variable that defines this minimum occurrence value; terms whose document frequency falls below it are ignored.

      Mathematically,
      IDF(t, D) = log( (|D| + 1) / (DF(t, D) + 1) ), for DF(t, D) >= minimumOccurrence

      where
      D is the document corpus and |D| is the total number of documents in it
      DF(t, D) is the number of documents that contain the term t
      minimumOccurrence is the minimum number of documents in the corpus in which a term must appear

      This would affect accuracy, since terms that appear in fewer than a certain number of documents have little or no importance in TF-IDF vectors.
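
      The proposal above can be illustrated with a minimal sketch in plain Python (not the MLlib implementation). The function name idf_with_min_doc_freq and the parameter min_doc_freq are hypothetical, and treating "ignored" as an IDF weight of 0.0 is an assumption; the description does not say how ignored terms should be represented.

```python
import math
from collections import Counter


def idf_with_min_doc_freq(docs, min_doc_freq=0):
    """Compute IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)) per term.

    docs is a list of tokenized documents (lists of strings).
    Terms with DF(t, D) < min_doc_freq get weight 0.0 (assumption:
    this is how "ignored" is modelled in this sketch).
    """
    num_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term at most once per document.
        doc_freq.update(set(doc))

    idf = {}
    for term, df in doc_freq.items():
        if df >= min_doc_freq:
            idf[term] = math.log((num_docs + 1) / (df + 1))
        else:
            idf[term] = 0.0
    return idf


# Toy corpus, made up purely for illustration.
docs = [
    ["spark", "mllib", "idf"],
    ["spark", "rdd"],
    ["spark", "idf", "tfidf"],
]
weights = idf_with_min_doc_freq(docs, min_doc_freq=2)
unfiltered = idf_with_min_doc_freq(docs, min_doc_freq=0)
```

      With the cutoff set to 2, terms that occur in a single document ("mllib", "rdd", "tfidf") receive zero weight, while "idf", which occurs in two of the three documents, keeps its weight of log(4/3). With the cutoff at 0, every term keeps the weight given by the formula.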

          People

            Assignee: R J Nowling (rnowling)
            Reporter: Jatinpreet Singh (jatinpreet)
            Votes: 0
            Watchers: 6