Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.8.0
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels: None

      Description

      Currently, localhost is passed as the locality for each block, causing all blocks involved in a job to initially target the same node (the RM) before being moved by the scheduler to a rack-local node. This reduces parallelism for jobs with short-lived mappers.

      We should mimic Azure's implementation: a config setting fs.s3a.block.location.impersonatedhost where the user can enter the list of hostnames in the cluster to be returned by getFileBlockLocations.

      Possible optimization: for larger systems, it might be better to return N (5?) random hostnames, to avoid passing a huge array (downstream code assumes the array size is O(3), i.e. a typical replica count).
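
      A minimal sketch of what such a helper might look like (the class name, the five-host cap, and the reuse of hostnames as block "names" are illustrative assumptions, not committed code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;

/** Hypothetical helper; all names here are illustrative. */
public class ImpersonatedBlockLocations {
  /** The config key proposed above. */
  public static final String KEY = "fs.s3a.block.location.impersonatedhost";
  /** Cap on returned hosts, per the "N (5?) random hostnames" idea. */
  private static final int MAX_HOSTS = 5;

  public static BlockLocation[] locationsFor(Configuration conf,
      long start, long len) {
    // Fall back to the current behaviour if nothing is configured.
    String[] configured = conf.getTrimmedStrings(KEY, "localhost");
    List<String> hosts = new ArrayList<>(Arrays.asList(configured));
    // Shuffle so successive calls spread across the configured hosts.
    Collections.shuffle(hosts);
    List<String> subset = hosts.subList(0, Math.min(MAX_HOSTS, hosts.size()));
    String[] chosen = subset.toArray(new String[0]);
    // Reuse hostnames as "names" too; a real patch would attach ports.
    return new BlockLocation[] {
        new BlockLocation(chosen, chosen, start, len)
    };
  }
}
```

      Keeping this logic in its own class rather than inline in the filesystem is what makes the isolated unit testing and reuse discussed below possible.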

        Activity

        stevel@apache.org Steve Loughran added a comment -

        I like your idea of picking randomly; if you really wanted to be clever you'd use rack topology, but I don't see that it would gain much. This is about distribution of workload. Provided the returned values were sufficiently random, it would even out.

        Can I propose that the code for this be entirely self-contained; something that s3a can delegate to. This would allow the test to be isolated in a simple unit test, and the code to be reusable elsewhere.
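
        An isolated test along those lines might look like this (assuming the hypothetical ImpersonatedBlockLocations helper sketched under the description):

```java
import static org.junit.Assert.assertTrue;

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.junit.Test;

public class TestImpersonatedBlockLocations {
  @Test
  public void returnsOnlyConfiguredHosts() throws Exception {
    Configuration conf = new Configuration(false);
    conf.set(ImpersonatedBlockLocations.KEY, "host1,host2,host3");
    BlockLocation[] locations =
        ImpersonatedBlockLocations.locationsFor(conf, 0L, 1024L);
    for (String host : locations[0].getHosts()) {
      // Every reported host must come from the configured list.
      assertTrue(Arrays.asList("host1", "host2", "host3").contains(host));
    }
  }
}
```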

        stevel@apache.org Steve Loughran added a comment -

        Link to the azure code here: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/NativeAzureFileSystem.java#L2550-L2589
        dweeks Daniel Weeks added a comment -

        There are a couple problems with this approach:

        1) Impersonating hosts can actually cause scheduling delays, because the scheduler will try to place each task on the impersonated host for locality and delay placing it on other hosts. Since there is no real locality, the delay is unnecessary and can have a significant impact if you have lots of tasks reading from S3 on a busy cluster. Even if you pick a random node (or set of nodes), there's no guarantee it will have available capacity, so you still incur an artificial delay.
        2) This also messes with locality metrics, as some tasks will show up as node- or rack-local but really aren't.

        It might be better to return the address of the S3 endpoint so that it appears off-cluster; at that point the scheduler will not find any locality and will simply schedule on the first available node. I'm not sure whether you actually need a node in the cluster for the block location, though.

        dweeks Daniel Weeks added a comment - edited

        Actually, looking at ResourceRequest, is it possible to use 'ANY' = '*'? It looks like it is intended for exactly this purpose.

        https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ResourceRequest.java#L131
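
        For context, ANY belongs to the YARN resource-request side rather than to block locations; a minimal sketch of how an application master asks for "any host" today (the resource size and priority are illustrative):

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class AnyHostRequest {
  public static ContainerRequest anyHost() {
    // Passing null for nodes and racks lets the request relax to
    // ResourceRequest.ANY ("*"): the scheduler may use any host.
    return new ContainerRequest(
        Resource.newInstance(1024, 1), // 1 GB, 1 vcore (illustrative)
        null,                          // no preferred nodes
        null,                          // no preferred racks
        Priority.newInstance(0));
  }
}
```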

        stevel@apache.org Steve Loughran added a comment - edited

        The star symbol is certainly used in resource requests... I don't see it being used in any filesystem block location calls. I suspect that any long-lived application will be doing hostname matching on the blocks that come back for locality, and may not work with a "*" (it's not a hostname, after all), or may not treat it that well.

        Assuming the hostname was configurable: you'd give it a list, it'd return a random subset of them, and people could experiment with "*" and "offsite.example.org" to see what worked well.

        cnauroth Chris Nauroth added a comment -

        From reviewing the description, there might be a misunderstanding about the implementation in hadoop-azure.

        We should mimic Azure's implementation: a config setting fs.s3a.block.location.impersonatedhost where the user can enter the list of hostnames in the cluster to be returned by getFileBlockLocations.

        The hadoop-azure implementation currently reports the same host for every block location. That host is configurable and defaults to "localhost". I'm not aware of a way to get it to use a mix of different hostnames.

        What WASB does differently from S3A right now is that it overrides getFileBlockLocations to mimic the concept of block size, using that block size to divide a file and report multiple block locations. For something like MapReduce, that translates to multiple input splits, more map tasks, and a greater opportunity for I/O parallelism on jobs that consume a small number of very large files. S3A instead inherits the getFileBlockLocations implementation from the superclass, which always reports that a file has exactly one block location. That could mean, for example, that S3A would experience a bottleneck on a job whose input is a single very large file, because it would get only one input split.
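
        A paraphrased sketch of that WASB behaviour (simplified from the code linked above; the method and parameter names here are illustrative):

```java
import org.apache.hadoop.fs.BlockLocation;

public class FakeBlockLocations {
  static BlockLocation[] fakeLocations(long fileLen, long blockSize,
      String host, long start, long len) {
    // Indexes of the first and last synthetic "blocks" that the
    // requested range [start, start + len) touches.
    int first = (int) (start / blockSize);
    int last = (int) ((start + len - 1) / blockSize);
    BlockLocation[] locations = new BlockLocation[last - first + 1];
    String[] name = {host};
    for (int i = first; i <= last; i++) {
      long offset = i * blockSize;
      long length = Math.min(blockSize, fileLen - offset);
      // Every synthetic block reports the same single configured host.
      locations[i - first] = new BlockLocation(name, name, offset, length);
    }
    return locations;
  }
}
```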

        If use of the same host name in every block location can cause scheduling bottlenecks at the ResourceManager, then I have to assume that WASB is prone to that same problem today. Echoing Steve's comment, if any code changes done here are self-contained and reusable, then S3A, WASB and any other file system could call it to get the benefits.

        Thomas Demoor Thomas Demoor added a comment -

        Chris Nauroth: I assumed this would be used as follows: by setting fs.wasb/s3a.block.location.impersonatedhost=host1,host2,...,host50 you would impersonate the block as local on all nodes, so the scheduler would be offered options. An optimization would be to return 5 random hosts.
        Daniel Weeks: Thanks for your interest!

        • The current s3a implementation causes scheduling delays as it declares localhost as the only locality for all blocks.
        • The proposed solution can indeed be suboptimal under load but it's always an improvement over the current situation
        • I agree that using "*" would be easier and more correct, but I think Steve Loughran might have a point. Testing...
        • Returning the s3 endpoint so things are always off-rack seems like an interesting idea. Testing...

        Pieter Reuse has verified that what I originally proposed speeds things up. We are checking if any of Daniel's proposals would work out as well. Expect an update next week.

        rdblue Ryan Blue added a comment - edited

        FileInputFormat works slightly differently. First, the split size is calculated from the file's reported block size and the current min and max split sizes. Then the file is broken into N splits of that size, where N = Math.ceil(fileLength / splitSize). The block locations are then used to determine where each split is located, based on the split's starting offset.

        The result is that getFileBlockLocations can return a single location for the entire file and you'll still end up with N roughly block-sized splits. This is what enables you to get more parallelism by setting smaller split sizes, even if the resulting splits don't correspond to different blocks. In our environment, we use a 64MB S3 block size and don't see a bottleneck from one input split per file.
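
        As a worked example (the clamp mirrors FileInputFormat's computeSplitSize in essence; the file and block sizes are illustrative):

```java
public class SplitMath {
  // The clamp FileInputFormat applies (in essence) when sizing splits.
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long fileLength = 1024L << 20;   // a 1 GB input file
    long blockSize = 64L << 20;      // 64 MB reported block size
    long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
    // N = ceil(fileLength / splitSize) = 16 splits, independent of how
    // many locations getFileBlockLocations returned for the file.
    long numSplits = (fileLength + splitSize - 1) / splitSize;
    System.out.println(numSplits);   // prints 16
  }
}
```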

        Thomas Demoor Thomas Demoor added a comment -

        Chris Nauroth pointed out that the issue I originally ran into could also be avoided by setting yarn.scheduler.capacity.node-locality-delay=0. However, there are non-YARN use cases and downstream projects that would benefit from locality faking, so I think this still makes sense.

        stevel@apache.org Steve Loughran added a comment -

        Following on from this: be aware that Hive actively scans for the word "localhost" when querying block locations and interprets that as "anywhere"... it doesn't request locality in any job submissions.


          People

          • Assignee:
            Thomas Demoor
          • Reporter:
            Thomas Demoor
          • Votes:
            1
          • Watchers:
            10

            Dates

            • Created:
              Updated:
