MAHOUT-1178

GSOC 2013: Improve Lucene support in Mahout

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: None

    Description

      [via Ted Dunning]

      It should be possible to view a Lucene index as a matrix. This would
      require that we standardize on a way to convert documents to rows. There
      are many choices, the discussion of which should be deferred to the actual
      work on the project, but there are a few obvious constraints:

      a) it should be possible to get the same result as dumping the term vectors
      for each document, one per line, and converting that output using standard
      Mahout methods.

      b) numeric fields ought to work somehow.

      c) if there are multiple text fields, that ought to work sensibly as well.
      Two options are to dump multiple matrices or to convert the fields
      into a single row of a single matrix.

      d) it should be possible to refer back from a row of the matrix to find the
      correct document. This might be because we remember the Lucene doc number
      or because a field is named as holding a unique id.

      e) named vectors and matrices should be used if plausible.
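
      As a rough illustration of constraints (b) through (e), the sketch below
      builds sparse rows from already-tokenized documents. It uses neither the
      Lucene nor the Mahout API: plain dicts stand in for term vectors, and
      every name in it (index_to_matrix, id_field, the "field:term" column
      labels) is an assumption made for this example. It takes the second
      option from (c), folding all fields into a single row of a single matrix.

```python
from collections import Counter


def index_to_matrix(docs, id_field="id"):
    """Turn documents into sparse rows keyed by a unique id (hypothetical helper).

    docs: list of dicts mapping a field name to either a list of tokens
    (text field) or a number (numeric field). One field holds the unique id.
    Returns (dictionary, rows): column label -> column index, and
    document id -> {column index: weight}.
    """
    dictionary = {}  # column label -> column index, assigned on first sight
    rows = {}        # document id -> sparse row
    for doc in docs:
        row = {}
        for field, value in doc.items():
            if field == id_field:
                continue
            if isinstance(value, (int, float)):
                # Numeric field: one column that stores the raw value.
                col = dictionary.setdefault(field, len(dictionary))
                row[col] = float(value)
            else:
                # Text field: one column per (field, term), weighted by term count.
                for term, count in Counter(value).items():
                    col = dictionary.setdefault(f"{field}:{term}", len(dictionary))
                    row[col] = float(count)
        # Keying the row by the id field lets a row be traced back to its document.
        rows[doc[id_field]] = row
    return dictionary, rows


docs = [
    {"id": "doc-1", "title": ["lucene", "index"],
     "body": ["index", "as", "matrix", "index"], "year": 2013},
    {"id": "doc-2", "title": ["mahout"], "body": ["matrix"], "year": 2012},
]
dictionary, rows = index_to_matrix(docs)
```

      Prefixing each term with its field name keeps "index" in the title
      distinct from "index" in the body. Emitting one matrix per field, the
      other option in (c), would instead need a separate dictionary per field.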

      Attachments

        1. MAHOUT-1178.patch
          11 kB
          Gokhan Capan
        2. MAHOUT-1178-TEST.patch
          7 kB
          Gokhan Capan

          People

            Assignee: Gokhan Capan
            Reporter: Dan Filimon
            Votes: 1
            Watchers: 7

            Dates

              Created:
              Updated:
              Resolved: