Hadoop Common · HADOOP-286

copyFromLocal throws LeaseExpiredException

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.0
    • Fix Version/s: 0.6.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      redhat linux

      Description

      Loading local files to dfs through hadoop dfs -copyFromLocal failed due to the following exception:

      copyFromLocal: org.apache.hadoop.dfs.LeaseExpiredException: No lease on output_crawled.1.txt
      at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:414)
      at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:190)
      at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:585)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:243)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:231)

      1. Lease.patch
        0.8 kB
        Konstantin Shvachko

        Activity

        Owen O'Malley made changes -
        Component/s dfs [ 12310710 ]
        Doug Cutting made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Doug Cutting made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Doug Cutting added a comment -

        I just committed this. Thanks, Konstantin!

        Konstantin Shvachko made changes -
        Fix Version/s 0.6.0 [ 12312025 ]
        Assignee Konstantin Shvachko [ shv ]
        Status Open [ 1 ] Patch Available [ 10002 ]
        Konstantin Shvachko added a comment -

        Patch for not renewing leases when pendingCreates is empty.
        This is a scalability issue.

        Yoram Arnon added a comment -

        +1 for not requesting a lease unless a write operation is required (i.e. the patch).

        Doug Cutting made changes -
        Workflow no-reopen-closed [ 12373531 ] no-reopen-closed, patch-avail [ 12377491 ]
        Konstantin Shvachko made changes -
        Field Original Value New Value
        Attachment Lease.patch [ 12337268 ]
        Konstantin Shvachko added a comment -

        This is a very simple patch that renews leases only when pendingCreates is not empty.
        This prevents the client from sending lease renewal messages when it
        is not writing into dfs, e.g. just reading or doing local work.
        This should make the name node less busy.

        I tried to change the ipc.client.timeout from 60 secs to 20 secs.
        On my 3 node cluster everything worked fine.
        On a large cluster the timeout was changed only for the DFSClient.
        The LeaseExpiredException does not appear anymore.
        But we need more statistics on that, especially with slower networks.
        The ipc timeout is global for all ipc connections, so if we make it
        smaller there is a risk that long-lasting operations like block transfers
        will start to time out. I haven't seen this happen.
        If anybody is willing to try 20 sec ipc timeout please post the results.
        Failing early, and retrying might make things faster in general.
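The idea behind the patch can be sketched as follows. This is an illustrative stand-in, not the actual Hadoop 0.6.0 code: the class and member names (LeaseCheckerSketch, renewalsSent, startCreate, finishCreate) are invented for the example; only pendingCreates and the skip-when-empty behavior come from the discussion above.

```java
import java.util.TreeMap;

// Sketch: the client-side lease checker sends a renewal RPC only while
// some file is still open for write (pendingCreates non-empty).
class LeaseCheckerSketch {
    // pathname -> in-progress output stream (stand-in for DFSClient.pendingCreates)
    private final TreeMap<String, Object> pendingCreates = new TreeMap<>();
    int renewalsSent = 0;          // counts renewLease() RPCs that would be issued

    void startCreate(String path)  { pendingCreates.put(path, new Object()); }
    void finishCreate(String path) { pendingCreates.remove(path); }

    // Called periodically (every LEASE_PERIOD/2 = 30 sec in the real client).
    void checkLeases() {
        if (pendingCreates.isEmpty()) {
            return;                // nothing being written: skip the renewal RPC
        }
        renewalsSent++;            // here the real client calls namenode.renewLease(...)
    }
}
```

With no files open for write, checkLeases() is a no-op, so an idle or read-only client generates no renewal traffic at the namenode.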

        Konstantin Shvachko added a comment -

        It looks like the following scenario leads to this exception.
        LEASE_PERIOD = 60 sec is a global constant defining for how long a lease is issued.
        DFSClient.LeaseChecker renews this client's leases every 30 sec = LEASE_PERIOD/2.
        If renewLease() fails then the client retries every second.
        One of the most common reasons renewLease() fails is that it times out with a
        SocketTimeoutException.
        This happens when the namenode is busy, which is not unusual since we lock it for each operation.
        The socket timeout is defined by the config parameter "ipc.client.timeout", which is set to 60 sec in
        hadoop-default.xml. That means that renewLease() can last up to 60 seconds, so the lease may
        expire before the next renewal succeeds, which could be up to 90 seconds after the lease was
        created or last renewed.
        So there are 2 simple solutions to the problem:
        1) to increase LEASE_PERIOD
        2) to decrease ipc.client.timeout

        A related problem is that DFSClient sends lease renewal requests every 30 seconds
        or less, no matter what. The DFSClient has enough information to send renewal messages only if it
        really holds a lease. A simple solution would be to avoid calling renewLease() when
        DFSClient.pendingCreates is empty.
        This could substantially decrease overall net traffic for map/reduce.
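The worst-case timing described above can be spelled out in a small sketch. The constants are the defaults quoted in the comment; the class and method names are illustrative, not Hadoop identifiers:

```java
// Worst-case lease-expiry arithmetic, all values in seconds.
class LeaseTimingSketch {
    static final int LEASE_PERIOD   = 60;               // lease validity
    static final int RENEW_INTERVAL = LEASE_PERIOD / 2; // client renews every 30 sec
    static final int IPC_TIMEOUT    = 60;               // ipc.client.timeout default

    // Latest age (since the last successful renewal) the lease can reach while
    // the next renewal attempt is still blocked on a busy namenode.
    static int worstCaseRenewalAge() {
        return RENEW_INTERVAL + IPC_TIMEOUT;            // 30 + 60 = 90 sec
    }

    static boolean leaseCanExpire() {
        return worstCaseRenewalAge() > LEASE_PERIOD;    // 90 > 60: lease expires
    }
}
```

Both proposed fixes attack this inequality: raising LEASE_PERIOD enlarges the right side, while lowering ipc.client.timeout shrinks the left.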

        Runping Qi created issue -

          People

          • Assignee: Konstantin Shvachko
          • Reporter: Runping Qi
          • Votes: 1
          • Watchers: 0

            Dates

            • Created:
              Updated:
              Resolved:
