  Mahout / MAHOUT-126

Prepare document vectors from the text

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.2
    • Fix Version/s: 0.2
    • Component/s: None
    • Labels: None

    Description

      Clustering algorithms presently take document vectors as input. Generating these document vectors from text can be broken into two tasks.

      1. Create a Lucene index of the input plain-text documents.
      2. From the index, generate sparse document vectors in which each term's weight is its TF-IDF value. The Lucene index already stores the term and document frequencies, so this value is easy to compute.

      So far I have created two separate utilities, which could be invoked from another class.
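
      The two tasks above can be sketched in a few lines. This is only an illustration, not the code in the attached patches: a plain in-memory dictionary stands in for the Lucene index, the function names `build_index` and `tfidf_vectors` are hypothetical, and the weight `tf * log(N / df)` is one common textbook TF-IDF variant that may differ from the formula the patches use.

```python
import math
from collections import Counter


def build_index(docs):
    """Task 1 (stand-in for the Lucene index, an assumption for this sketch):
    collect per-document term counts and per-term document frequencies."""
    term_counts = {
        doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()
    }
    doc_freq = Counter()
    for counts in term_counts.values():
        # Each document contributes at most 1 to a term's document frequency.
        doc_freq.update(counts.keys())
    return term_counts, doc_freq


def tfidf_vectors(term_counts, doc_freq):
    """Task 2: build sparse vectors {dimension: weight} with
    weight = tf * log(N / df). Zero weights are left out to keep them sparse."""
    n_docs = len(term_counts)
    # Stable mapping from term to vector dimension.
    term_ids = {term: i for i, term in enumerate(sorted(doc_freq))}
    vectors = {}
    for doc_id, counts in term_counts.items():
        vector = {}
        for term, tf in counts.items():
            weight = tf * math.log(n_docs / doc_freq[term])
            if weight != 0.0:
                vector[term_ids[term]] = weight
        vectors[doc_id] = vector
    return term_ids, vectors


if __name__ == "__main__":
    docs = {"d1": "apple banana apple", "d2": "banana cherry"}
    term_counts, doc_freq = build_index(docs)
    term_ids, vectors = tfidf_vectors(term_counts, doc_freq)
    print(term_ids)
    print(vectors)
```

      Keeping the two steps as separate functions mirrors the two-utility split described above: the index can be built once and reused to regenerate vectors with a different weighting.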

      Attachments

        1. MAHOUT-126.patch (10 kB, Shashikant Kore)
        2. mahout-126-benson.patch (11 kB, Benson Margulies)
        3. MAHOUT-126.patch (41 kB, Grant Ingersoll)
        4. MAHOUT-126.patch (50 kB, Grant Ingersoll)
        5. MAHOUT-126.patch (41 kB, Grant Ingersoll)
        6. MAHOUT-126-no-normalization.patch (2 kB, David Leo Wright Hall)
        7. MAHOUT-126-no-normalization.patch (1 kB, David Leo Wright Hall)
        8. MAHOUT-126-TF.patch (4 kB, David Leo Wright Hall)
        9. MAHOUT-126-null-entry.patch (0.8 kB, David Leo Wright Hall)

            People

              Assignee: Grant Ingersoll (gsingers)
              Reporter: Shashikant Kore (kshashi)