Uploaded image for project: 'Mahout'
  1. Mahout
  2. MAHOUT-1292

lucene2seq should validate the 'id' field

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.9
    • Component/s: Integration
    • Labels:

      Description

      Lucene2seq creates only one sequencefile, rather than a file for each document in the index.

      Running lucene2seq on my Solr (4.3) index produces a file with a header and, it seems, the field I specified from the index, concatenated for all the documents. After running this through seq2sparse and rowid (to prepare for cvb), the resulting matrix has only one row, though it should create one row per document.

      This issue prevents, at least, data from a lucene index from being easily used as input for cvb. Lucene.vector is also currently inadequate: the keys to its sequence files are LongWriteable, and rowid will not convert only Text to IntWriteable, as is necessary for the keys in cvb.

        Attachments

        1. MAHOUT-1292.patch
          12 kB
          Frank Scholten

          Activity

            People

            • Assignee:
              smarthi Suneel Marthi
              Reporter:
              lmerkhofer Liz Merkhofer
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: