  Mahout / MAHOUT-590

Add TSV (Tab Separated Value) input file support to SequenceFilesFromDirectory

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.4
    • Fix Version/s: 0.5
    • Component/s: Integration
    • Labels:
      None
    • Environment:

      Mac OS X 10.6.6, java version "1.6.0_22"
      RHL Linux 2.6.18

      Description

      I would like to add TSV (Tab Separated Value) input file type support to SequenceFilesFromDirectory.

      Here is my real use case:

      I have 36M input records, each consisting of an ID, a CONTENT field, and various other attributes, and I want to convert them to sequence files so that I can cluster the records by term vectors of CONTENT. The problem is that I cannot create 36M files under my home directory (my quota allows at most 50k files), so I was not able to convert them to sequence files with the SequenceFilesFromDirectory utility. Meanwhile, the source data format is TSV, where each line of a file contains ID\tCONTENT\t..., which makes it convenient for Pig and most Hadoop streaming programs to process as input and output. NOTE: CONTENT is up to around 2k bytes. Hence I would prefer direct TSV support in SequenceFilesFromDirectory instead of taking two steps: TSV to text files, then text files to sequence files.
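      For illustration, the conversion boils down to something like the following minimal sketch (this is not the attached patch; the class name, file names, and column positions are hypothetical, and Text keys/values are assumed):

      import java.io.BufferedReader;
      import java.io.FileReader;
      import java.io.IOException;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;

      // Hypothetical sketch: read a local TSV file and append (ID, CONTENT)
      // pairs to a sequence file, assuming ID is column 0 and CONTENT column 1.
      public class TsvToSequenceFileSketch {
        public static void main(String[] args) throws IOException {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          SequenceFile.Writer writer = SequenceFile.createWriter(
              fs, conf, new Path("tsv_output/chunk-0"), Text.class, Text.class);
          BufferedReader in = new BufferedReader(new FileReader("input.tsv"));
          try {
            String line;
            while ((line = in.readLine()) != null) {
              String[] columns = line.split("\t", -1); // -1 keeps trailing empty columns
              writer.append(new Text(columns[0]), new Text(columns[1]));
            }
          } finally {
            in.close();
            writer.close();
          }
        }
      }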

      I'm attaching the patch.

      Hope this makes sense to other folks.

      1. 0001-added-TSV-input-file-support.patch
        13 kB
        Shige Takeda
      2. MAHOUT-590.patch
        26 kB
        Isabel Drost-Fromm
      3. MAHOUT-590.patch
        23 kB
        Isabel Drost-Fromm

        Activity

        smtakeda Shige Takeda added a comment -

        Patch is attached.

        smtakeda Shige Takeda added a comment -

        A test case I ran against the real data worked well:

        $MAHOUT_HOME/bin/mahout seqdirectory --inputType TSV --input tsv_input --output tsv_output --keyColumn 0 --valueColumn 21

        This converts the TSV files under the tsv_input directory into sequence files, taking the key from column 0 and the value from column 21, and stores them under tsv_output.

        I verified the results with this command:
        $MAHOUT_HOME/bin/mahout org.apache.mahout.utils.SequenceFileDumper --seqFile tsv_output/chunk-0
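        To double-check the output programmatically, a minimal reader along these lines should also work (a sketch assuming both key and value are Text, reading the same chunk file as above):

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        // Minimal sketch: print every (key, value) pair in the sequence file.
        public class DumpChunkSketch {
          public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Reader reader =
                new SequenceFile.Reader(fs, new Path("tsv_output/chunk-0"), conf);
            try {
              Text key = new Text();
              Text value = new Text();
              while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
              }
            } finally {
              reader.close();
            }
          }
        }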

        srowen Sean Owen added a comment -

        I'm less convinced by this patch. The tool does one thing: it writes files to HDFS as (filename, contents) pairs. This is a fairly different use case being bolted onto the same tool: extract two columns from a tab-separated file and write those to HDFS. The fact that there is now a command-line switch to select between these two separate modes kind of confirms this.

        It's fine code but sounds like it should be a separate tool. And there are an infinite number of possible file input formats one might want to transform and write to HDFS. It could be hard to justify a tool for one such use case.

        smtakeda Shige Takeda added a comment -

        Sean, thanks for the review and comments! For now this is not critical; I can live with my private branch, which is slightly different from trunk.
        Actually, the original version was a separate class, SequenceFilesFromTsv.java, which could extract multiple columns; I simplified it for review.

        Btw, I recall somebody mentioning on the mailing list that the interfaces of these tools could be improved for general cases (DB, HDFS, etc.); hopefully he comes up with a better idea that meets most needs.

        Thank you.

        lancenorskog Lance Norskog added a comment -

        I did something similar for Hadoop: https://issues.apache.org/jira/browse/MAPREDUCE-2208.

        It grabs various combinations of the input columns. Later I realized I want a date parser, and the ability to handle the Wikipedia format (a parsing sketch follows this comment):
        id value value
        id value
        id value value value

        It's an easy problem at first, but then it needs more and more features. The end result would be something modular instead of 'GroupLensDataModel' and 'Jester*' etc.

        Lance
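
        For illustration, a first pass at parsing that variable-width "id value value ..." layout is only a few lines (a hypothetical sketch, not part of MAPREDUCE-2208 or the attached patch); the hard part, as noted above, is everything that comes after:

        import java.util.Arrays;
        import java.util.List;

        final class WikiLineSketch {
          // Hypothetical sketch: split one "id value value ..." line into the id
          // (first token) and a variable-length list of values (remaining tokens).
          static void parse(String line) {
            String[] tokens = line.trim().split("\\s+");
            String id = tokens[0];
            List<String> values = Arrays.asList(tokens).subList(1, tokens.length);
            System.out.println(id + " -> " + values);
          }
        }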

        Show
        lancenorskog Lance Norskog added a comment - I did something similar for Hadoop: https://issues.apache.org/jira/browse/MAPREDUCE-2208 . It grabs various combinations of the input columns. Later I've realized I want a date parser, and the ability to handle Wikipedia format: id value value id value id value value value It's an easy problem at first, but then it needs more and more features. The end result would be something modular instead of 'GroupLensDataModel' and 'Jester*' etc. Lance
        Hide
        isabel Isabel Drost-Fromm added a comment -

        I agree the problem should be solved in a different tool. However, I think there might be a way to reduce code duplication on the user side. Attaching a heavily restructured version of the original patch for review.

        isabel Isabel Drost-Fromm added a comment -

        Please apply with git -p1 ...

        isabel Isabel Drost-Fromm added a comment -

        Please ignore the last comment. What I actually meant: please apply with patch -p1, as the patch was created with git ...
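
        For example, from the top of the source tree (using the attached file name):

        patch -p1 < MAHOUT-590.patch

        The -p1 strips the a/ and b/ path prefixes that git adds when generating the diff.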

        isabel Isabel Drost-Fromm added a comment -

        Updated version.

        isabel Isabel Drost-Fromm added a comment -

        Patch committed.

        hudson Hudson added a comment -

        Integrated in Mahout-Quality #688 (See https://hudson.apache.org/hudson/job/Mahout-Quality/688/)
        MAHOUT-590 - added support for configurable directory-/file layout for
        SequenceFilesFromDirectory job. Includes a demo for tab separated value
        formatted files.

        saku125 sakurai added a comment -

        Hi, I downloaded mahout-distribution-0.7-src.tar.gz, and after unzipping and compiling everything I can run Mahout. I want to use your patch so that I can specify TSV through "mahout seqdirectory --inputType TSV". Can you tell me how to apply the patch? I did not check out the trunk. Thank you.


          People

          • Assignee: srowen Sean Owen
          • Reporter: smtakeda Shige Takeda
          • Votes: 0
          • Watchers: 1
