  1. Hadoop Common
  2. HADOOP-6901

Parsing large compressed files with HADOOP-1722 spawns multiple mappers per file

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.21.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      Hadoop v0.20.2 + HADOOP-1722

    • Tags:
      HADOOP-1722 large compressed file AutoInputFormat

      Description

      This was originally discovered while using Dumbo to parse a very large (2 GB) compressed file. By default, Dumbo attempts to use AutoInputFormat as the input format.

      Here is my use case:

      I have a large (2 GB) compressed file. It is compressed with the default method, which is gzip-based and therefore unsplittable. Because of the file's size, the default implementation of AutoInputFormat treats it as splittable. As a result, the file is split into about 35 parts, and each part is assigned to a map task.

      However, since the file itself is unsplittable, each map task winds up parsing the file again from the beginning. This effectively means my job processes 35x the data and takes 35x as long to run.

      If I set "-inputformat text", this problem does not appear in Dumbo. If I manually call the streaming jar and use AutoInputFormat, this problem appears.

      Looking at the code in AutoInputFormat, it appears to use the default isSplitable() method inherited from FileInputFormat, which reports every file as splittable. I think this class should define its own isSplitable() method, similar to the one defined in the TextInputFormat class, which treats a file as splittable only if it is not compressed.
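
      The effect of the missing check can be sketched with a small standalone model. This is not Hadoop code: the class SplitCheck, its methods, the suffix list, and the 64 MB split size are all hypothetical, and the split count uses a plain ceiling division rather than Hadoop's actual split computation. In Hadoop itself, the equivalent check would presumably look up a codec for the path via CompressionCodecFactory.getCodec() and report the file as splittable only when no codec is found.

```java
import java.util.Arrays;
import java.util.List;

// Standalone illustration only; none of these names come from Hadoop.
class SplitCheck {
    // Assumed suffixes of stream-compressed files that cannot be read
    // starting from an arbitrary byte offset.
    private static final List<String> UNSPLITTABLE_SUFFIXES =
            Arrays.asList(".gz", ".deflate");

    // Proposed behavior: a file is splittable only if it is not compressed.
    static boolean isSplitable(String fileName) {
        String lower = fileName.toLowerCase();
        for (String suffix : UNSPLITTABLE_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return false;
            }
        }
        return true;
    }

    // Simplified split count: one split for an unsplittable file,
    // otherwise the file size divided by the split size, rounded up.
    static long numSplits(long fileSize, long splitSize, boolean splitable) {
        if (!splitable) {
            return 1;
        }
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long fileSize = 2L * 1024 * 1024 * 1024;  // 2 GB input file
        long splitSize = 64L * 1024 * 1024;       // assumed 64 MB split size

        // Behavior described in this report: everything is treated as splittable.
        System.out.println(numSplits(fileSize, splitSize, true));                    // prints 32

        // Behavior with a compression-aware check.
        System.out.println(numSplits(fileSize, splitSize, isSplitable("input.gz"))); // prints 1
    }
}
```

      Under these assumptions a 2 GB gzip file yields 32 splits without the check and a single split with it. That is in line with the roughly 35 map tasks reported above, each of which has to decompress the file from the beginning.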

              People

              • Assignee:
                Unassigned
                Reporter:
                tsetem Rick Weber
              • Votes:
                0
                Watchers:
                3

                  Time Tracking

                  Estimated: 24h
                  Remaining: 24h
                  Logged: Not Specified