MAHOUT-833: Make conversion to sequence files map-reduce

Description

      Given input that is on HDFS, the SequenceFilesFrom****.java classes should be able to do their work in parallel.

      1. MAHOUT-833-final.patch
        61 kB
        Josh Patterson
      2. MAHOUT-833.patch
        13 kB
        Josh Patterson
      3. MAHOUT-833.patch
        34 kB
        Suneel Marthi

        Activity

        Josh Patterson added a comment -

        What are the most common expectations for how input files of this type occur?

        I ask so that I can decide how best to feed pathnames to map tasks and subdivide the work.

        Depending on factors like:

        • "lots of directories, few files per directory"
        • "few directories, lots of files per dir"

        Currently the code is built around "tagging along" on the FileSystem.listStatus(...) recursive filter code path, but the MR version will have to be different.

        One approach I've kicked around: walk the directory list and hash each entry into a group, so that regardless of directory each map task gets a (generally) even number of documents to process. Out of the box, though, that doesn't try to keep all of the files in one directory in the same sequence file. Does that matter here? I want to say no, but then again, why not ask.
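The hashing idea above can be sketched in plain Java, independent of Hadoop. The class name, group count, and paths below are hypothetical, purely for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PathGrouper {

  // Assign each file path to one of numGroups buckets by hashing the full
  // path string, so every map task sees a roughly even share of documents
  // regardless of which directory each file lives in.
  public static List<List<String>> group(List<String> paths, int numGroups) {
    List<List<String>> groups = new ArrayList<>();
    for (int i = 0; i < numGroups; i++) {
      groups.add(new ArrayList<>());
    }
    for (String path : paths) {
      // Mask off the sign bit so the bucket index is always non-negative.
      int bucket = (path.hashCode() & Integer.MAX_VALUE) % numGroups;
      groups.get(bucket).add(path);
    }
    return groups;
  }

  public static void main(String[] args) {
    List<String> paths =
        Arrays.asList("/a/1.txt", "/a/2.txt", "/b/1.txt", "/b/2.txt");
    for (List<String> g : group(paths, 2)) {
      System.out.println(g);
    }
  }
}
```

Note the trade-off the comment raises: because bucketing ignores the directory component, files from one directory can land in different output sequence files.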

        Joe Prasanna Kumar added a comment -

        Josh,
        For the SequenceFilesFromDirectory, the doc comment says "Converts a directory of text documents into SequenceFiles of Specified chunkSize". so we are anywayz expecting text documents and the output format says docid => content. I am thinking that
        1) we should use a custom InputFormat which will parse the data according to specified options. For eg, we can extend the FileInputFormat and specifying isSplitable() to be false. So each file will be consumed by Mapper as 1 whole file. The map function will process the file according to the options and emit key value pairs.
        2) I guess we wont really need a Reducer.
        3) The driver will use setOutputFormatClass(SequenceFileOutputFormat.class) to write the key,values from Mapper as SequenceFile

        The same approach would go for SequenceFilesFromMailArchives where we can have
        1) A separate InputFormat class that will have a RecordReader which will split each mail message as a separate Key, Value pair for consumption by Mapper. Mapper will further parse the message according to the options and emit the proper KV pairs.
        2) I guess we wont really need a Reducer.
        3) The driver will use setOutputFormatClass(SequenceFileOutputFormat.class) to write the key,values from Mapper as SequenceFile

        Team,
        If this approach looks rite, I can submit a patch for this. Please let me know.

        Appreciate any feedbacks,
        Joe.

        Josh Patterson added a comment -

        I think for "SequenceFilesFromDirectory" with FileInputFormat you would run into the issue that each file in the directory generates a map task, and with no reducer each file would end up in a separate output sequence file, creating lots of relatively small files.

        This also has the downside of not amortizing task setup/teardown time; although the reduce side could generate the sequence files, ideally we'd like each mapper to process more files per task.

        An alternative approach:

        • On the client side (pre-MR), list files recursively using the HDFS API and write the result to a file.
        • Use NLineInputFormat against that file to split the work among multiple mappers.

        JP
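A local-filesystem analogue of the pre-MR listing step above can be sketched as follows (the real version would recurse with Hadoop's FileSystem.listStatus(); the class and method names here are hypothetical):

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class ListFilesForSplits {

  // Walk the input tree and write one absolute file path per line.
  // NLineInputFormat can then hand each mapper N of these lines, so
  // the number of mappers is decoupled from the number of input files.
  public static void writeFileList(Path inputDir, Path listFile) throws IOException {
    try (Stream<Path> walk = Files.walk(inputDir);
         PrintWriter out = new PrintWriter(Files.newBufferedWriter(listFile))) {
      walk.filter(Files::isRegularFile)
          .forEach(p -> out.println(p.toAbsolutePath()));
    }
  }
}
```

With this file as the job input, NLineInputFormat's lines-per-split setting controls how many documents each mapper processes.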

        Raphael Cendrillon added a comment -

        Is this still open? If so I could take a look.

        Grant Ingersoll added a comment -

        I think Josh was working on something, but I haven't seen a patch. I'd wait for him to chime in.

        Josh Patterson added a comment -

        If you've got a patch, throw it up. I am scheduled to try to get this out the door this week (I've been on the road a ton lately), and I don't want to keep someone else from contributing.

        There are a few more issues similar to this one; if you get a patch up first, I'll move on to one of those.

        Joe Prasanna Kumar added a comment -

        Raphael,

        For SequenceFilesFromMailArchives, I started writing a new InputFormat and RecordReader that would parse each of the mail messages and output them into a SequenceFile. It is in a very raw state, and since I am travelling for work, I won't be able to do much with it for a month. So if you have already started on it, please feel free to upload a patch. If you are interested in looking at what I've written so far, let me know and I'll clean it up a bit, add some comments, and email it to you.

        Regards,
        Joe.

        Josh Patterson added a comment -

        Well, it works in MR form, but a quick question:

        • how important is the "chunkSize" parameter in the context of MapReduce? Do we want to carry this option over, when in a map-only job you'd generally just expect a bunch of outputs from each split?

        I can add this feature, but the mechanics work a bit differently here, and I wanted to make sure it made sense.

        JP

        Josh Patterson added a comment -

        OK, I went ahead and added the chunkSize param; the output only inflates slightly beyond the parameter, owing to the prefix and the paths added to each key.

        On another note, what is the best way to kick off this job?

        1. add another command to the Mahout bash script's props file?

        2. add a flag to the existing "/bin/mahout seqdirectory" setup that would kick off the MR job instead of the serial process, something like:

        /bin/mahout seqdirectory -mr [ more options ]

        JP

        Frank Scholten added a comment -

        Josh: Perhaps you can use a -xm / --method flag to specify MR or sequential execution method. This convention is used by most clustering jobs.

        Joe Prasanna Kumar added a comment -

        Hey Josh,
        Just wondering if you already have the patch created for SequenceFilesFromDirectory and SequenceFilesFromMailArchives? If so, would you be able to upload it? I am really interested in walking through your approach.
        Joe.

        Josh Patterson added a comment -

        This patch adds a codepath for running the SequenceFilesFromDirectory.java sequential process as an MR job. I haven't had time to do the mail archives one (it's slightly different), but I'll look at it. Joe asked about my design, so I'll post this now, and in a few days I'll post a patch that includes the mail archives one.

        Josh Patterson added a comment -

        Joe,
        Quick overview of the patch as is:

        • only does the SequenceFilesFromDirectory.java codepath; does not address the mail part (yet).
        • the old codepath recurses through the dirs with fs.listStatus() and writes into a single ChunkedWriter for the sequence file
        • since the dirs can have subdirs, I had to include a function that builds a recursive list of subdirs based on the input path
        • since we had lots of small file paths, I ended up subclassing CombineFileInputFormat for MultiTextFileInputFormat
          • for a great explanation of how CombineFileInputFormat works and how it's used: http://lucene.472066.n3.nabble.com/help-on-CombineFileInputFormat-td781357.html
          • basically: each split is a bunch of small-file input paths, so each mapper gets fed a lot of files (we don't want each mapper looking at a single file, as we'd normally see with TextInputFormat)
        • the chunkSize param was a bit of a trick in MR; I thought I was going to have to do it by hand, but ended up going with "mapred.max.split.size"
        • tested on the Reuters extracted files that are used in some of the demos, since that set has around 21k smallish text files to work from
        • the JobSplitWriter started complaining about "max block locations exceeded for split", which led me to set "mapreduce.job.max.split.locations" to a very large number in the job conf
        • all of the changes are localized in the integration module in o.a.m.text
        • new vs. old MR API
          • looking at AbstractJob.prepareJob(), I can see that most of Mahout's MR jobs use the newer MR API. I tried to follow that same pattern here.
          • unfortunately, Hadoop 0.20.205 does not currently have a new-API class for CombineFileInputFormat
          • the code currently works with the old API specifically because of this issue; I'm looking at filing a JIRA with Hadoop for it
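The two split-related settings mentioned above would look roughly like this in a Hadoop job configuration XML; the values shown are illustrative assumptions, not the ones used in the patch:

```xml
<property>
  <name>mapred.max.split.size</name>
  <!-- e.g. 64 MB: caps how many small files CombineFileInputFormat packs into one split,
       which stands in for the old chunkSize parameter -->
  <value>67108864</value>
</property>
<property>
  <name>mapreduce.job.max.split.locations</name>
  <!-- raised so JobSplitWriter accepts combined splits that span many block locations -->
  <value>100000</value>
</property>
```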
        Josh Patterson added a comment -

        This patch has functionality for the MR versions of both SequenceFilesFromDirectory and SequenceFilesFromMailArchives.

        A few notes:

        • I couldn't find a place in the serial version of SequenceFilesFromMailArchives that was actually turning on block compression for the sequence files explicitly in code. This could be done via the conf files in Hadoop 0.20.205, but it wasn't being done in code, AFAIK.
        • the serial version of SequenceFilesFromMailArchives seems not to be working correctly in trunk; it passes tests, but when run on a .gz file from the ASF mail archives it reports 0 records extracted. The MR version works as intended in this patch, but I did not yet change the serial version.
        • the structure of SequenceFilesFromMailArchives (MR version) maintains as much of the same functionality / code as I could muster from the serial version. To use the FileLineIterable in the mbox parsing code, I had to add a constructor, for instance.
        • I ended up using the old MR API because I needed certain functionality that had not yet been ported as of 0.20.205
        Robin Anil added a comment -

        Josh, it's an old patch, so naturally it didn't apply. Would you be able to update it by next week? I will leave this on the 0.8 path. It's good to have in the release.

        Suneel Marthi added a comment - edited

        Interim patch, modified to be compatible with the present Mahout codebase and upgraded to use the new MR API; not done with the unit test changes yet. Will continue work on this.

        Grant Ingersoll added a comment -

        The patch seems to be missing the WholeFileRecordReader.

        Suneel Marthi added a comment -

        Uploaded patch for review: https://reviews.apache.org/r/11774/diff/2/

        Suneel Marthi added a comment -

        Final working patch for review: https://reviews.apache.org/r/11774/diff/3/

        Hudson added a comment -

        Integrated in Mahout-Quality #2100 (See https://builds.apache.org/job/Mahout-Quality/2100/)
        MAHOUT-833: Make conversion to sequence files map-reduce - Checking in, tests pass (Revision 1495863)

        Result = FAILURE
        smarthi :
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/AbstractJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/iterator/FileLineIterable.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/iterator/FileLineIterator.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/MultipleTextFileInputFormat.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromDirectory.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromDirectoryMapper.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromMailArchives.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromMailArchivesMapper.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/WholeFileRecordReader.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/SequenceFilesFromMailArchivesTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/TestSequenceFilesFromDirectory.java
        Hudson added a comment -

        Integrated in Mahout-Quality #2105 (See https://builds.apache.org/job/Mahout-Quality/2105/)
        MAHOUT-833: Make conversion to sequence files map-reduce (Revision 1496532)

        Result = SUCCESS
        smarthi :
        Files :

        • /mahout/trunk/CHANGELOG
        Hudson added a comment -

        Integrated in Mahout-Quality #2109 (See https://builds.apache.org/job/Mahout-Quality/2109/)
        MAHOUT-833: Make conversion to sequence files map-reduce (changes based on feedback from code review) (Revision 1496929)
        MAHOUT-833: Make conversion to sequence files map-reduce - (changes based on feedback from review). (Revision 1496927)

        Result = SUCCESS
        smarthi :
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/HadoopUtil.java

        smarthi :
        Files :

        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromDirectory.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromDirectoryMapper.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromMailArchives.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromMailArchivesMapper.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/SequenceFilesFromMailArchivesTest.java
        • /mahout/trunk/integration/src/test/java/org/apache/mahout/text/TestSequenceFilesFromDirectory.java
        Hudson added a comment -

        Integrated in Mahout-Quality #2131 (See https://builds.apache.org/job/Mahout-Quality/2131/)
        MAHOUT-833: removed code comment. (Revision 1500310)

        Result = SUCCESS
        smarthi :
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/common/iterator/FileLineIterable.java

          People

          • Assignee: Suneel Marthi
          • Reporter: Grant Ingersoll
          • Votes: 0
          • Watchers: 7
