Uploaded image for project: 'Hadoop Map/Reduce'
  1. Hadoop Map/Reduce
  2. MAPREDUCE-7329

HadoopPipes task may fail when linux kernel version change from 3.x to 4.x

VotersStop watchingWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

    Details

    • Hadoop Flags:
      Reviewed

      Description

      Hadoop Pipes Ping implement has a bug. Recently, we upgrade linux kernel version from 3.x to 4.x. And we find hadoop pipe task exit with connect timeout which is implemented by PingThread in HadoopPipes.cc.

      After a deep research, we finally find that current ping server won't accept ping client created socket, which may cause critical problem: 

      •  it will cause tcp accept queue full(default 50)
      •  when client close socket, server socket won't call close method, which will leave too many CLOSE_WAIT socket fd existed(default 2h), and accept queue never cleared.
      • Even worse, in 4.x linux kernel version, it will cause tcp drop packet directly which makes ping client connect time out. While In 3.x linux kernel version, when accept queue full, client can also make half connection till sync queue full (default 2048), so from client side, ping will aslo work till sync queue full. And after 3 hours, task will also exit with connect timeout exception.

      To fix this bug, we introduced a PingSocketCleaner thread, which will continuously accept ping socket connect from ping client. When socket close from client,  cleaner thread will detecte closed inputStream reading, then finally close socket from sever side.

      Refrenced by linux kernel patch: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5ea8ea2cb7

       

        Attachments

          Activity

            People

            • Assignee:
              lichaojacobs chaoli
              Reporter:
              lichaojacobs chaoli

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 7h 20m
                7h 20m

                  Issue deployment