Hadoop Map/Reduce: MAPREDUCE-1752

Implement getFileBlockLocations in HarFilesystem

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.23.0
    • Component/s: harchive
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      To efficiently run map reduce on data that has been HAR'ed, it would be great to actually implement getFileBlockLocations for a given filename.
      This way the JobTracker will have information about data locality and will schedule tasks appropriately.
      I believe the overhead introduced by doing lookups in the index files can be smaller than that of copying data over the wire.
      I will upload the patch shortly, but would love to get some feedback on this. Any ideas on how to test it are very welcome.
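
      A rough sketch of the shape this could take is below; HarEntry, lookUpInIndex and underlyingFs are illustrative placeholders, not the actual HarFileSystem internals.

      import java.io.IOException;
      import org.apache.hadoop.fs.BlockLocation;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      /** Sketch only; not the committed HarFileSystem code. */
      class HarBlockLocationSketch {
        /** Minimal stand-in for an index entry: which part file, and where in it. */
        static class HarEntry {
          final Path partFile;
          final long startInPart;
          HarEntry(Path partFile, long startInPart) {
            this.partFile = partFile;
            this.startInPart = startInPart;
          }
        }

        private final FileSystem underlyingFs;

        HarBlockLocationSketch(FileSystem underlyingFs) {
          this.underlyingFs = underlyingFs;
        }

        /** Placeholder for the masterIndex/archiveIndex lookup. */
        HarEntry lookUpInIndex(Path fileInArchive) throws IOException {
          throw new UnsupportedOperationException("index lookup omitted in this sketch");
        }

        /** Map a range of the archived file onto the part file's block locations. */
        BlockLocation[] getFileBlockLocations(FileStatus file, long start, long len)
            throws IOException {
          HarEntry entry = lookUpInIndex(file.getPath());
          FileStatus partStatus = underlyingFs.getFileStatus(entry.partFile);
          // Translate the requested range into part-file coordinates and ask the
          // underlying file system (e.g. HDFS) where those bytes actually live.
          return underlyingFs.getFileBlockLocations(
              partStatus, entry.startInPart + start, len);
        }
      }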

      Attachments

      1. MR-1752.patch (2 kB, Dmytro Molkov)
      2. MAPREDUCE-1752.3.patch (11 kB, Patrick Kling)
      3. MAPREDUCE-1752.2.patch (5 kB, Dmytro Molkov)

          Activity

          dhruba borthakur added a comment -

          Sounds like a good idea. +1

           The idea is to make the contents of a Har file work well with FileInputFormat or CombineFileInputFormat, isn't it? In that case, you can look at TestCombineFileInputFormat and see if you can extend it to test the case when the input file(s) are har files.

          Tsz Wo Nicholas Sze added a comment -

          Hi Dmytro, are you still working on this? Will you upload a patch soon?

          Dmytro Molkov added a comment -

           Attaching a general idea for the patch. I will work on the unit test for this one soon.
          Let me know if there are obvious problems with this approach.

          Tsz Wo Nicholas Sze added a comment -

           Dmytro, the patch does not compile.

          Tsz Wo Nicholas Sze added a comment -

           Also, the approach is quite expensive. It requires the following (a sketch of these calls follows the list):

          1. read masterIndex
            • fs.open(masterIndex)
            • fs.getFileStatus(masterIndex)
            • read from datanode
          2. read archiveIndex
            • fs.open(archiveIndex)
            • read from datanode
          3. fs.getFileStatus(part)
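
           In FileSystem API terms, a single uncached lookup costs roughly the calls below; the index and part paths are illustrative.

           import java.io.IOException;
           import org.apache.hadoop.fs.FSDataInputStream;
           import org.apache.hadoop.fs.FileStatus;
           import org.apache.hadoop.fs.FileSystem;
           import org.apache.hadoop.fs.Path;

           /** Sketch of the per-call cost listed above. */
           class HarLookupCostSketch {
             static void uncachedLookup(FileSystem fs, Path masterIndex,
                 Path archiveIndex, Path part) throws IOException {
               try (FSDataInputStream mi = fs.open(masterIndex);        // 1: open masterIndex
                    FSDataInputStream ai = fs.open(archiveIndex)) {     // 2: open archiveIndex
                 FileStatus miStatus = fs.getFileStatus(masterIndex);   // 1: stat masterIndex
                 // ... read both index files from datanodes here ...
                 FileStatus partStatus = fs.getFileStatus(part);        // 3: stat the part file
               }
             }
           }
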
          Dmytro Molkov added a comment -

           Nicholas, I do see that this approach is somewhat expensive. However, it gives us locality when we are running a job.
          And this time will only be added to the job setup time, right?

           I guess my approach was making it right first and then looking at the ways we can optimize it, rather than trying to hack up a fast solution right from the start.
          Do you have any other ideas that may be worth exploring?

          Rodrigo Schmidt added a comment -

          I've been following this discussion.

           I think Dmytro's idea makes a lot of sense, especially for big jobs that read from big files. In such cases, the performance gains from having local reads would easily compensate for the extra delay at setup time.

          The idea behind it is to use files stored in hadoop archives as input for mapreduce jobs. I don't think this method will be used elsewhere.

           Using har to store mapreduce files that are stable (won't change anymore) but still necessary for read queries is a huge win for namenode scalability, since it reduces the number of objects the namenode has to store in memory.

          Tsz Wo Nicholas Sze added a comment -

           > I guess my approach was making it right first and then looking at the ways we can optimize it, rather than trying to hack up a fast solution right from the start.
          > Do you have any other ideas that may be worth exploring?

           Yes, your approach totally makes sense. A potential improvement would be caching the masterIndex, archiveIndex, and all the file statuses, since a client calls getBlockLocation(..) multiple times for submitting a job.
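
           A minimal sketch of that caching idea, keyed by path; the class below is illustrative and not part of any attached patch. The parsed masterIndex and archiveIndex contents could be cached the same way.

           import java.io.IOException;
           import java.util.HashMap;
           import java.util.Map;
           import org.apache.hadoop.fs.FileStatus;
           import org.apache.hadoop.fs.FileSystem;
           import org.apache.hadoop.fs.Path;

           /** Sketch: avoid repeating getFileStatus RPCs across the many
            *  block-location calls made while a job is being submitted. */
           class HarStatusCacheSketch {
             private final FileSystem fs;
             private final Map<Path, FileStatus> statusCache =
                 new HashMap<Path, FileStatus>();

             HarStatusCacheSketch(FileSystem fs) {
               this.fs = fs;
             }

             FileStatus cachedStatus(Path p) throws IOException {
               FileStatus st = statusCache.get(p);
               if (st == null) {
                 st = fs.getFileStatus(p);   // RPC paid only on the first lookup
                 statusCache.put(p, st);
               }
               return st;
             }
           }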

          Dmytro Molkov added a comment -

          Finally got back to this JIRA.
           Attached is the patch that we tested internally and are currently using. It does add overhead at initial job submission, but it gives you locality when you run the job, which is a reasonable tradeoff.

           We were thinking of eventually taking it one step further, so that the splits created by the job client at job submission could reference the part files of the har directly. That way the only piece of infrastructure accessing the har index file would be the client, and the MR tasks would go directly after the specific offsets inside the har's part files. But this seems like another JIRA.

          Dmytro Molkov added a comment -

           I will submit the patch for Hudson testing. When someone has time, I would appreciate it if you could review it.

          Patrick Kling added a comment -

           There is something really strange about the semantics of the offsets and lengths returned by this. Consider the following part file, consisting of 3 blocks, which contains a file f starting at offset 896 with length 512:

          +---------------+
          | ...           |
          +---------------+
          0           
          
          +-----------+---+
          | ...       | f |
          +-----------+---+
          512         896
          
          +-----------+---+
          | f         |...|
          +-----------+---+
          1024        1408
          

           Calling getFileBlockLocations on this file will return 2 LocatedBlocks: b1=<offset=0, length=512>, b2=<offset=512, length=512>. This indicates that b1 contains the first 512 bytes of f, even though in fact it only contains the first 128 bytes. This is a problem when the client uses these LocatedBlocks to detect whether a portion of f has been corrupted.

          I can think of 2 possible ways of fixing this:

           1) Fix the offset of the returned blocks by subtracting hstatus.getStartIndex() (i.e., the offset of f in the part file) from the block offset. This would return b1=<offset=-384, length=512> and b2=<offset=128, length=512>, indicating to the client that the first 384 bytes of b1 are not part of f and correctly indicating the length of each block. In a way, this is similar to how FSNamesystem.getBlockLocations returns entire blocks even if the caller asks for a range that covers only part of these blocks.

          2) Fix the length on the first block returned to reflect the portion of f that is contained in this block, i.e., return b1=<offset=128, length=128>, b2=<offset=128, length=512>. This seems somewhat less clean to me but avoids negative offsets. Also, it would break the convention that all blocks of a file with the exception of the last block are the same length.

          Dmytro Molkov added a comment -

          In the second case it would of course be b1=<offset=0, length=128>.

           I personally like the second way of fixing it more, since it gives predictable offsets. For the file f, the block locations would start at offset 0 and the lengths would sum up to the total length of the file. The problem with it might be that the first block location will have a length different from the actual block length in this file.
           The way block locations are returned currently, each of them except the last one has the length of a full block and starts at an offset that is a multiple of the block length. And even when I call getBlockLocations with an offset and length different from 0, status.getLength(), I am not guaranteed to get a result where the lengths sum to the requested length and the smallest block offset equals the offset provided.

           That said, I think the second approach fits better into this system, unless having blocks of different lengths will be a problem.

          Patrick Kling added a comment -

          I have updated Dmytro's patch based on the second solution discussed above. In addition, this code also fixes the length of the last block location corresponding to the requested range. For the example above, it returns the following (as verified by one of the new test cases):

          b0 = <offset=0, length=128>
          b1 = <offset=128, length=384>
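
           The arithmetic behind those two locations can be sketched as follows; the helper below illustrates the rebase-and-trim logic of the second option and is not the code in the attached patch.

           import java.io.IOException;
           import org.apache.hadoop.fs.BlockLocation;

           /** Sketch of rebasing part-file block locations onto the archived file. */
           class HarBlockRebaseSketch {
             static BlockLocation[] rebase(BlockLocation[] partBlocks,
                 long startInPart, long fileLen) throws IOException {
               BlockLocation[] out = new BlockLocation[partBlocks.length];
               for (int i = 0; i < partBlocks.length; i++) {
                 BlockLocation b = partBlocks[i];
                 // Offset relative to the archived file; the first block clamps to 0.
                 long off = Math.max(0, b.getOffset() - startInPart);
                 // Trim the range to the bytes that actually belong to the file.
                 long end = Math.min(b.getOffset() + b.getLength() - startInPart, fileLen);
                 out[i] = new BlockLocation(b.getNames(), b.getHosts(), off, end - off);
               }
               return out;
             }

             public static void main(String[] args) throws IOException {
               // The example above: f starts at 896 in the part file and is 512 bytes
               // long; the part-file blocks covering it are <512, 512> and <1024, 512>.
               BlockLocation[] part = {
                 new BlockLocation(new String[0], new String[0], 512, 512),
                 new BlockLocation(new String[0], new String[0], 1024, 512)
               };
               for (BlockLocation b : rebase(part, 896, 512)) {
                 System.out.println("<offset=" + b.getOffset()
                     + ", length=" + b.getLength() + ">");
               }
               // Prints <offset=0, length=128> and <offset=128, length=384>.
             }
           }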

          Patrick Kling added a comment -

          Mahadev/Nicholas, could one of you please have a look at this patch?

          ant test-patch results:

               [exec] +1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.
               [exec] 
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
               [exec] 
               [exec]     +1 system test framework.  The patch passed system test framework compile.
               [exec]
          
          Dmytro Molkov added a comment -

           The patch looks good to me, +1.
           I guess the main question is: does anyone have any objections to the approach in general?

          dhruba borthakur added a comment -

          I just committed this. Thanks Dmytro and Patrick!

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #557 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/557/)
          MAPREDUCE-1752. Implement getFileBlockLocations in HarFilesystem.
          (Patrick Kling via dhruba)

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #643 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk/643/)


            People

             • Assignee: Dmytro Molkov
             • Reporter: Dmytro Molkov
             • Votes: 0
             • Watchers: 11
