  1. Mahout
  2. MAHOUT-590

Add TSV (Tab Separated Value) input file support to SequenceFilesFromDirectory

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.4
    • Fix Version/s: 0.5
    • Component/s: classic
    • Labels: None
    • Environment: Mac OS X 10.6.6, java version "1.6.0_22"
      RHL Linux 2.6.18

    Description

      I would like to add TSV (Tab Separated Value) input file type support to SequenceFilesFromDirectory.

      Here is my real use case:

      I have 36M input records, each consisting of an ID, CONTENT, and various other attributes, and I want to convert them to sequence files so that I can cluster the records by the term vectors of CONTENT. The problem is that I cannot create 36M files under my home directory because of a quota limit of 50k files, so I was not able to convert them with the SequenceFilesFromDirectory utility. Meanwhile, the source data is in TSV format, where each line of a file contains ID\tCONTENT\t..., because that format suits Pig and most Hadoop streaming programs as both input and output. NOTE: CONTENT is up to around 2k bytes in size. I would therefore prefer that SequenceFilesFromDirectory support TSV directly, instead of taking two steps: TSV to text files, then text files to sequence files.
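
To make the use case concrete, here is a minimal Java sketch of the parsing step: each TSV line is split on tabs, and one column becomes the record key and another the record value. It is not taken from the attached patch. The class name `TsvRecords`, its methods, and the `keyColumn`/`valueColumn` parameters are hypothetical, and the step that would append each pair to a Hadoop sequence file is only indicated in a comment.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the MAHOUT-590 patch. */
final class TsvRecords {

  private TsvRecords() {}

  /**
   * Splits one TSV line on tabs and returns the (key, value) pair taken
   * from the given zero-based columns, or null if a column is missing.
   */
  static Map.Entry<String, String> parseLine(String line, int keyColumn, int valueColumn) {
    // A limit of -1 keeps trailing empty fields instead of dropping them.
    String[] fields = line.split("\t", -1);
    if (keyColumn < 0 || valueColumn < 0
        || keyColumn >= fields.length || valueColumn >= fields.length) {
      return null;
    }
    return new SimpleImmutableEntry<>(fields[keyColumn], fields[valueColumn]);
  }

  /** Parses every line and silently skips lines that lack the requested columns. */
  static List<Map.Entry<String, String>> toRecords(
      Iterable<String> lines, int keyColumn, int valueColumn) {
    List<Map.Entry<String, String>> records = new ArrayList<>();
    for (String line : lines) {
      Map.Entry<String, String> record = parseLine(line, keyColumn, valueColumn);
      if (record != null) {
        records.add(record);
      }
    }
    return records;
  }

  public static void main(String[] args) {
    // Made-up sample lines in the ID\tCONTENT\t... layout described above.
    List<String> lines = Arrays.asList(
        "doc1\tfirst record text\tattrA",
        "doc2\tsecond record text\tattrB");
    for (Map.Entry<String, String> record : toRecords(lines, 0, 1)) {
      // A real converter would append each pair to a sequence file here
      // rather than printing it.
      System.out.println(record.getKey() + " => " + record.getValue());
    }
  }
}
```

With this approach the 36M records never need to exist as individual files: a handful of large TSV files can be streamed line by line, which sidesteps the per-directory file quota.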

      I'm attaching the patch.

      Hope this makes sense to other folks.

      Attachments

        1. MAHOUT-590.patch
          23 kB
          Isabel Drost-Fromm
        2. MAHOUT-590.patch
          26 kB
          Isabel Drost-Fromm
        3. 0001-added-TSV-input-file-support.patch
          13 kB
          Shige Takeda

          People

            Assignee: Sean R. Owen (srowen)
            Reporter: Shige Takeda (smtakeda)
            Votes: 0
            Watchers: 1
