Mahout / MAHOUT-249

Make WikipediaXmlSplitter able to write the chunks directly to HDFS or S3

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.2
    • Fix Version/s: 0.3
    • Component/s: Classification
    • Labels: None

      Description

      By using the Hadoop FS abstraction it should be possible to avoid writing the chunks on the local hard drive before uploading them to HDFS or S3.
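A minimal sketch of the idea behind the description: Hadoop's `FileSystem.get(URI, Configuration)` selects a filesystem implementation from the URI scheme of the target location, so one code path can serve local disk, HDFS and S3. The snippet below does not call Hadoop; it only mimics the scheme resolution with `java.net.URI`. The class name `OutputTarget`, the method `schemeOf` and the example locations are hypothetical, and falling back to `file` assumes the configured default filesystem is the local one.

```java
import java.net.URI;

class OutputTarget {
    // Returns the URI scheme of an output location.
    // Assumption: a location without a scheme means the local filesystem ("file").
    static String schemeOf(String location) {
        String scheme = URI.create(location).getScheme();
        return scheme == null ? "file" : scheme;
    }

    public static void main(String[] args) {
        String[] locations = {
            "/tmp/chunks",             // no scheme: treated as local
            "file:///tmp/chunks",      // explicit local scheme
            "hdfs://namenode/chunks",  // hypothetical HDFS location
            "s3n://bucket/chunks"      // hypothetical S3 location
        };
        for (String location : locations) {
            System.out.println(location + " -> " + schemeOf(location));
        }
    }
}
```

With Hadoop on the classpath, the resolved location would be handed to `FileSystem.get` and each chunk written through the stream returned by `FileSystem.create(Path)`, which avoids the intermediate copy on the local disk.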

      1. MAHOUT-249-2.patch
        7 kB
        Olivier Grisel
      2. MAHOUT-249-v2.patch
        5 kB
        Olivier Grisel
      3. MAHOUT-249-WikipediaXMLSplitterHDFS.patch
        5 kB
        Olivier Grisel

        Activity

        Olivier Grisel added a comment -

        New version of the patch with improved help text for the command-line options.

        Olivier Grisel added a comment -

        You are right about the extra /. It is harmless but ugly in the source code, so please remove it.

        As for the second question, I confirm that local files still work as before, either without a URL scheme or with an explicit file:/// scheme. I will attach a new patch that makes this more explicit in the help text for the command-line options.

        Sean Owen added a comment -

        Ready to commit this too. Is there an extra slash in the path you now construct? It looks like there is a "//chunk" in there. Also, does this still work with local files, and does a "file://" URL work?
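One way to avoid a doubled slash such as the "//chunk" mentioned above is to add the separator only when the output directory does not already end with one. This is a sketch under assumptions, not the code from the patch: the class `ChunkPaths`, the method `join` and the file name pattern `chunk-%04d.xml` are hypothetical.

```java
class ChunkPaths {
    // Joins an output directory and a file name with exactly one '/' between them.
    static String join(String dir, String name) {
        return dir.endsWith("/") ? dir + name : dir + "/" + name;
    }

    public static void main(String[] args) {
        // Hypothetical chunk file name; the real naming scheme is not shown in this issue.
        String name = String.format("chunk-%04d.xml", 1);
        System.out.println(join("hdfs://namenode/wikipedia/", name));
        System.out.println(join("/tmp/wikipedia", name));
    }
}
```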

        Olivier Grisel added a comment -

        New version of the patch that provides credentials for both the s3:// and s3n:// URL schemes.
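For illustration, Hadoop releases of that period read S3 credentials from the properties `fs.s3.awsAccessKeyId` / `fs.s3.awsSecretAccessKey` for s3:// and `fs.s3n.awsAccessKeyId` / `fs.s3n.awsSecretAccessKey` for s3n://. Whether the patch sets exactly these keys is not stated in this issue, so the sketch below is an assumption; the class `S3Credentials` and its method are hypothetical, and the values are placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class S3Credentials {
    // Builds the credential properties for both S3 URL schemes.
    static Map<String, String> properties(String accessKeyId, String secretKey) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String scheme : new String[] {"s3", "s3n"}) {
            props.put("fs." + scheme + ".awsAccessKeyId", accessKeyId);
            props.put("fs." + scheme + ".awsSecretAccessKey", secretKey);
        }
        return props;
    }

    public static void main(String[] args) {
        // Placeholder values; real credentials should never be hard-coded.
        Map<String, String> props = properties("MY_ACCESS_KEY_ID", "MY_SECRET_KEY");
        for (String key : props.keySet()) {
            System.out.println(key);
        }
    }
}
```

With Hadoop on the classpath, each entry could be copied into a `Configuration` via `Configuration.set(key, value)` before the filesystem is opened.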

        Olivier Grisel added a comment -

        Patch attached. Note that by default the old behaviour is preserved (chunks are created on the local FS without CRC checksums).


          People

          • Assignee: Olivier Grisel
          • Reporter: Olivier Grisel
          • Votes: 0
          • Watchers: 0
