Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-4301

2NN image transfer timeout problematic

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Duplicate
    • Affects Version/s: 3.0.0, 2.0.3-alpha
    • Fix Version/s: None
    • Component/s: namenode
    • Labels:
      None

      Description

      HDFS-1490 added a timeout on image transfer. But, it seems like the timeout is actually applying to the entirety of the image transfer operation. So, if the image or edits are large (multiple GB) or the transfer is heavily throttled, it is likely to time out repeatedly.

        Issue Links

          Activity

          Hide
          Liang Xie added a comment -

          Todd Lipcon, IMHO the current implement seems OK, we can enlarge DFS_IMAGE_TRANSFER_RATE_KEY && DFS_IMAGE_TRANSFER_TIMEOUT_KEY to relieve, right ?
          or if you have better idea, could you shed some light on it, thanks in advance.

          Show
          Liang Xie added a comment - Todd Lipcon , IMHO the current implement seems OK, we can enlarge DFS_IMAGE_TRANSFER_RATE_KEY && DFS_IMAGE_TRANSFER_TIMEOUT_KEY to relieve, right ? or if you have better idea, could you shed some light on it, thanks in advance.
          Hide
          Todd Lipcon added a comment -

          I had a few thoughts:

          1) Maybe we're just misusing the API: it seems like there should be a way to get a timeout on each individual read() call which is distinct from timing the whole transfer. Need to just look into this more carefully.

          If that fails:

          2) change the config to be "per megabyte" - eg if you expect to transfer 5MB/sec, then you can set the timeout to 200ms/MB. Alternatively we could set a "minimum rate" - eg in this case you'd set to 5MB/sec, and if we are downloading a 50MB image, it would set timeout to 10sec (+ some fixed extra time)

          Show
          Todd Lipcon added a comment - I had a few thoughts: 1) Maybe we're just misusing the API: it seems like there should be a way to get a timeout on each individual read() call which is distinct from timing the whole transfer. Need to just look into this more carefully. If that fails: 2) change the config to be "per megabyte" - eg if you expect to transfer 5MB/sec, then you can set the timeout to 200ms/MB. Alternatively we could set a "minimum rate" - eg in this case you'd set to 5MB/sec, and if we are downloading a 50MB image, it would set timeout to 10sec (+ some fixed extra time)
          Hide
          Andrew Wang added a comment -

          I poked at this some, manually setting an artificially low timeout and transfer bandwidth. The behavior I see is the SNN successfully fetching big edits from the NN (indicating that this is in fact a socket timeout) but timing out in the putimage back to the NN. This is because of the HTTP GET inside the putimage GET handler; we hit the timeout because the putimage request sees no data until the nested getimage finishes.

          The best fix here is to do away with this nested GET business and use an HTTP POST or PUT instead. I'll look into doing this more involved fix, since there are already workarounds in the meantime.

          Show
          Andrew Wang added a comment - I poked at this some, manually setting an artificially low timeout and transfer bandwidth. The behavior I see is the SNN successfully fetching big edits from the NN (indicating that this is in fact a socket timeout) but timing out in the putimage back to the NN. This is because of the HTTP GET inside the putimage GET handler; we hit the timeout because the putimage request sees no data until the nested getimage finishes. The best fix here is to do away with this nested GET business and use an HTTP POST or PUT instead. I'll look into doing this more involved fix, since there are already workarounds in the meantime.
          Hide
          Andrew Wang added a comment -

          Since HDFS-3405 was just committed, resolving this one as dupe. I think using PUT instead of GET-GET will fix all our timeout issues.

          Show
          Andrew Wang added a comment - Since HDFS-3405 was just committed, resolving this one as dupe. I think using PUT instead of GET-GET will fix all our timeout issues.

            People

            • Assignee:
              Andrew Wang
              Reporter:
              Todd Lipcon
            • Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development