  1. Mahout
  2. MAHOUT-590

Add TSV (Tab Separated Values) input file support to SequenceFilesFromDirectory


    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.4
    • Fix Version/s: 0.5
    • Component/s: Integration
    • Labels:
    • Environment:

      Mac OS X 10.6.6, java version "1.6.0_22"
      RHL Linux 2.6.18


      I would like to add TSV (Tab Separated Value) input file type support to SequenceFilesFromDirectory.

      Here is my real use case:

      I have 36M input records, each consisting of an ID, CONTENT, and various other attributes, and I want to convert them to sequence files so that I can cluster the records by the term vectors of CONTENT. The problem is that my home directory has a quota of 50k files, so I cannot create 36M files there, and therefore cannot convert the records with the SequenceFilesFromDirectory utility. Meanwhile, the source data is in TSV format, where each line of a file has the form ID\tCONTENT\t..., because that format is convenient as input and output for Pig and most Hadoop streaming programs. NOTE: CONTENT is at most about 2k bytes. Hence I would prefer that SequenceFilesFromDirectory support TSV directly, rather than taking two steps: TSV to text files, then text files to sequence files.
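
      To illustrate the idea, here is a minimal sketch of turning ID\tCONTENT\t... lines into key/value pairs, one pair per line. It is not taken from the attached patch: the class name TsvRecords, its methods, and the column-index parameters are hypothetical and chosen only for this example.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Mahout or the attached patch. */
public final class TsvRecords {
    private TsvRecords() {}

    /**
     * Splits one TSV line and returns (key, value) taken from the given
     * zero-based columns, or null if the line has too few columns.
     */
    public static Map.Entry<String, String> parseLine(String line, int keyColumn, int valueColumn) {
        if (keyColumn < 0 || valueColumn < 0) {
            throw new IllegalArgumentException("column indexes must be non-negative");
        }
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        String[] fields = line.split("\t", -1);
        if (fields.length <= Math.max(keyColumn, valueColumn)) {
            return null;
        }
        return new SimpleImmutableEntry<>(fields[keyColumn], fields[valueColumn]);
    }

    /** Parses every line and silently skips the ones that are too short. */
    public static List<Map.Entry<String, String>> parseLines(
            Iterable<String> lines, int keyColumn, int valueColumn) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        for (String line : lines) {
            Map.Entry<String, String> record = parseLine(line, keyColumn, valueColumn);
            if (record != null) {
                records.add(record);
            }
        }
        return records;
    }
}
```

      Each resulting pair could then be appended to a Hadoop SequenceFile.Writer as a Text key and a Text value, so that many records share one output file instead of needing one input file per record. Skipping short lines is just one possible policy; failing fast on malformed input may suit other data better.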

      I'm attaching the patch.

      Hope this makes sense to other folks.


        1. MAHOUT-590.patch
          26 kB
          Isabel Drost-Fromm
        2. MAHOUT-590.patch
          23 kB
          Isabel Drost-Fromm
        3. 0001-added-TSV-input-file-support.patch
          13 kB
          Shige Takeda



            • Assignee:
              Sean Owen (srowen)
            • Reporter:
              Shige Takeda (smtakeda)
            • Votes:
              0
            • Watchers:
              1


              • Created: