Hadoop Common / HADOOP-2669

DFS client lost lease during writing into DFS files

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels: None

      Description

      I have a program that reads a block compressed sequence file, does some processing on the records and writes the
      processed records into another block compressed sequence file.
      During execution of the program, I got the following exception:

      org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.LeaseExpiredException: No lease on xxxxx/part-00000
      at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:976)
      at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:293)
      at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)

      at org.apache.hadoop.ipc.Client.call(Client.java:482)
      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
      at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
      at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
      at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:1554)
      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:1500)
      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1626)
      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:1602)
      at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:140)
      at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:100)
      at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
      at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:39)
      at java.io.DataOutputStream.write(DataOutputStream.java:90)
      at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.writeBuffer(SequenceFile.java:1181)
      at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.sync(SequenceFile.java:1198)

      at org.apache.hadoop.io.SequenceFile$BlockCompressWriter.append(SequenceFile.java:1248)
      at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:69)
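
      The LeaseExpiredException above means the Namenode no longer held a write lease for the file when the client asked for another block. As a rough mental model (assumed semantics for illustration only; class and method names are hypothetical, not HDFS identifiers): a writer must renew its lease within a fixed period, and any addBlock-style call against a lapsed lease fails.

      ```java
      import java.util.HashMap;
      import java.util.Map;

      // Simplified, hypothetical model of Namenode-side lease tracking.
      class LeaseTable {
          private final long leasePeriodMs;
          private final Map<String, Long> lastRenewal = new HashMap<>();

          LeaseTable(long leasePeriodMs) {
              this.leasePeriodMs = leasePeriodMs;
          }

          // create() grants the initial lease; renew() extends it.
          void create(String path, long nowMs) { lastRenewal.put(path, nowMs); }

          void renew(String path, long nowMs) { lastRenewal.put(path, nowMs); }

          // Called before allocating a new block: fails if the lease lapsed,
          // mirroring the "No lease on <path>" error in the stack trace above.
          void checkLease(String path, long nowMs) {
              Long last = lastRenewal.get(path);
              if (last == null || nowMs - last > leasePeriodMs) {
                  throw new IllegalStateException("No lease on " + path);
              }
          }
      }
      ```

      Under this model, the client's background renewal thread is the only thing keeping a long-running write alive, which is why the discussion in the comments focuses on why renewals might silently stop.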

      Attachments

      1. NoLease.patch
        1 kB
        dhruba borthakur
      2. noLeaseOnFile.patch
        0.8 kB
        dhruba borthakur


          Activity

          Hairong Kuang added a comment -

          Vadim, could you open a new jira that describes the lease problem you are seeing, so we can investigate the issue? The fix for this jira should also be in 0.19.0.

          Vadim Zaliva added a comment -

          I am observing very similar (most likely the same) issue in hadoop-0.19.0.
          Has it been fixed in hadoop-0.19.0?

          Hudson added a comment -

          Integrated in Hadoop-trunk #509 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/509/ )
          dhruba borthakur added a comment -

          I just committed this.

          Abhijit Bagri added a comment -

          An OutputStream is removed from pendingCreates when close() is called on DFSDataOutputStream. Anyone who uses FileSystem.create() holds an FSDataOutputStream handle, and calling FSDataOutputStream.close() triggers DFSDataOutputStream.close(). Hence, an FSDataOutputStream.close() call may potentially bring pendingCreates to zero.

          However, in the meantime a DataStreamer thread may still be writing a block. If this takes a long time, then at some point it may face lease expiry. Am I missing something?

          dhruba borthakur added a comment -

          Thanks Lohit for reviewing it. The logging in the first patch is already in the trunk (as part of another JIRA).

          Lohit Vijayarenu added a comment -

          +1 on this. Would it be good to also include the logging provided in the initial patch, if it helps?

          dhruba borthakur added a comment -

          One problem could be that the lease renewal was not occurring within a synchronized section. It is plausible that this could cause a race on multiprocessor systems.

          dhruba borthakur added a comment -

          I agree that there could be other bugs that are causing this problem. But can the fact that DFSClient.renewLease() does not check pendingCreates.size() inside a synchronized section cause the observed behaviour?

          Tsz Wo Nicholas Sze added a comment -

          Could it be the case that namenode.renewLease(...) keeps failing for some reason (e.g. a network problem) and keeps retrying for a long time? Then it finally succeeds, but the lease has already expired.
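
          That failure mode comes down to arithmetic: if each renewal retry takes roughly R seconds and the server expires the lease E seconds after the last successful renewal, then after ceil(E/R) consecutive failed attempts the lease is already gone even if the next attempt succeeds. A tiny sketch with hypothetical numbers (Hadoop's actual lease and retry settings are not stated in this thread):

          ```java
          // Hypothetical timing helper, not Hadoop code: how many consecutive
          // failed renewals it takes before the lease has expired server-side.
          class RetryTiming {
              static int attemptsBeforeExpiry(int leasePeriodSec, int retryIntervalSec) {
                  return (int) Math.ceil((double) leasePeriodSec / retryIntervalSec);
              }
          }
          ```

          For example, with a 60-second lease and 10-second retries, six straight failures are enough; the next call can then succeed at the RPC layer and still hit LeaseExpiredException on the following addBlock.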

          dhruba borthakur added a comment -

          I see that in DFSClient.LeaseChecker the lease is renewed only if pendingCreates.size() is greater than 0, but this check is not inside a synchronized block. Insertions into and deletions from pendingCreates occur within a synchronized section. I am wondering if this could cause the client to not renew leases.
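
          The suspected race can be sketched as follows (a minimal, hypothetical illustration whose names mirror the discussion; this is not the DFSClient source): reading a collection's size outside the lock that guards its mutations has no visibility guarantee, so the checker thread may wrongly conclude there is nothing to renew.

          ```java
          import java.util.HashMap;
          import java.util.Map;

          // Simplified sketch of the pattern under discussion; not Hadoop code.
          // "pendingCreates" stands in for the map of files open for writing.
          class LeaseSketch {
              private final Map<String, Object> pendingCreates = new HashMap<>();

              synchronized void openFile(String name) {
                  pendingCreates.put(name, new Object());
              }

              synchronized void closeFile(String name) {
                  pendingCreates.remove(name);
              }

              // Buggy form: the size check runs outside any synchronized
              // section, so it can observe stale state under concurrent
              // open/close calls from other threads.
              boolean shouldRenewUnsafe() {
                  return pendingCreates.size() > 0;
              }

              // Fixed form: the check holds the same lock as the mutations,
              // so the renewal decision is consistent with in-flight creates.
              synchronized boolean shouldRenewSafe() {
                  return !pendingCreates.isEmpty();
              }
          }
          ```

          The fix direction implied here is simply moving the size check into the synchronized section, so renewals cannot be skipped on the basis of a stale view of pendingCreates.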

          Raghu Angadi added a comment -

          +1 for the log. In case of error, can we also include whether this holder has any leases at all? Right now we don't know whether something happened to this leaseholder or to the file.

          dhruba borthakur added a comment -

          This patch is not a fix for this problem but attempts to capture a more detailed log message. I would like this patch to go into 0.16.1 as well.

          Raghu Angadi added a comment -

          Thanks Pete.

          No suspects yet, but it surely looks like a bug. Could you grep for "Removing lease" around the log entries for "/tmp/1050617226/ds=2008-02-18/SomeTable/part-00015", and grep for one of its blocks, say 2853191854023133101 (to see what happened to it)? Not yet sure if the client-side failures are related.

          Pete Wyckoff added a comment -

          rException: Could not find any valid local directory for task_200802191501_0724_r_000007_0/map_94.out
          at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:313)
          at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
          at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:133)
          at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:1314)

          2008-02-20 05:17:29,319 INFO org.apache.hadoop.mapred.ReduceTask: task_200802191501_0724_r_000007_0 done copying task_200802191501_0724_m_000237_0 output from hadoopX.facebook.com..
          2008-02-20 05:17:29,319 INFO org.apache.hadoop.mapred.ReduceTask: task_200802191501_0724_r_000007_0 Copying task_200802191501_0724_m_000146_0 output from hadoopA.facebook.com..
          2008-02-20 05:17:29,322 ERROR org.apache.hadoop.mapred.ReduceTask: Map output copy failure: java.lang.NullPointerException
          at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$InMemoryOutputStream.close(InMemoryFileSystem.java:161)
          at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
          at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
          at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:332)
          at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
          at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
          at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:253)
          at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:714)
          at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:666)

          2008-02-20 05:17:29,322 INFO org.apache.hadoop.mapred.ReduceTask: task_200802191501_0724_r_000007_0 Copying task_200802191501_0724_m_000055_0 output from hadoopY.facebook.com..
          2008-02-20 05:17:29,322 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
          java.io.IOException: task_200802191501_0724_r_000007_0The reduce copier failed
          at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:254)
          at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1804)
          2008-02-20 05:17:29,324 ERROR org.apache.hadoop.mapred.ReduceTask: Map output copy failure: java.lang.NullPointerException
          at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$InMemoryOutputStream.close(InMemoryFileSystem.java:161)
          at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
          at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
          at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:332)
          at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
          at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
          at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:253)
          at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:714)
          at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:666)

          2008-02-20 05:17:29,325 INFO org.apache.hadoop.mapred.ReduceTask: task_200802191501_0724_r_000007_0 Copying task_200802191501_0724_m_000120_0 output from hadoopZ.facebook.com..

          Pete Wyckoff added a comment -

          Hi Raghu, here's the grep for the file from the Namenode log. Note that both times this happened, a reduce failed on a machine because it couldn't allocate a directory there (the machine was out of space). I'll send you the stack trace. thx, pete

          hadoopNN] logs > grep "tmp/1050617226/ds=2008-02-18/SomeTable/part-00015" hadoop-root-namenode-hadoopNN.facebook.com.log
          2008-02-20 04:04:34,434 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015 because it does not exist
          2008-02-20 04:50:43,307 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_2853191854023133101
          2008-02-20 04:58:40,735 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_2092683246371438874
          2008-02-20 05:06:40,676 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_-6864374713288003568
          2008-02-20 05:15:00,448 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_6785822354554528111
          2008-02-20 05:23:05,122 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_-1002253031864643087
          2008-02-20 05:42:58,850 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_4868356428049530206
          2008-02-20 05:51:30,124 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_4502317329036147265
          2008-02-20 05:59:41,655 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_-1028912444083540626
          2008-02-20 06:07:59,222 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_-2305314790692595413
          2008-02-20 06:15:52,835 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_6988580537244455759
          2008-02-20 06:23:58,683 INFO org.apache.hadoop.ipc.Server: IPC Server handler 28 on 9000, call addBlock(/tmp/1050617226/ds=2008-02-18/SomeTable/part-00015, DFSClient_task_200802191501_0676_r_000015_1) fr\
          om SomeIP:45453: error: org.apache.hadoop.dfs.LeaseExpiredException: No lease on /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015
          org.apache.hadoop.dfs.LeaseExpiredException: No lease on /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015
          2008-02-20 06:44:24,151 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_3760541177395872782
          2008-02-20 06:52:26,050 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_-3341039163005140779
          2008-02-20 07:00:16,378 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_6227348648842177155
          2008-02-20 07:08:08,215 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_1906369823207034882
          2008-02-20 07:16:01,895 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_2621171097816659457
          2008-02-20 07:24:33,046 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 9000, call addBlock(/tmp/1050617226/ds=2008-02-18/SomeTable/part-00015, DFSClient_task_200802191501_0676_r_000015_2) fro\
          m SomeOtherIP:48442: error: org.apache.hadoop.dfs.LeaseExpiredException: No lease on /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015
          org.apache.hadoop.dfs.LeaseExpiredException: No lease on /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015
          2008-02-20 07:47:14,155 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_-7287451503662587350
          2008-02-20 07:55:21,732 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_7927573534765844605
          2008-02-20 08:03:14,563 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_3932465634538367837
          2008-02-20 08:11:08,936 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_-3190439497258007802
          2008-02-20 08:19:01,456 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015. blk_2730573925522501767
          2008-02-20 08:27:02,055 INFO org.apache.hadoop.ipc.Server: IPC Server handler 17 on 9000, call addBlock(/tmp/1050617226/ds=2008-02-18/SomeTable/part-00015, DFSClient_task_200802191501_0676_r_000015_3) fr\
          om YetAnotherIP:51184: error: org.apache.hadoop.dfs.LeaseExpiredException: No lease on /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015
          org.apache.hadoop.dfs.LeaseExpiredException: No lease on /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015
          2008-02-20 09:26:58,886 WARN org.apache.hadoop.dfs.StateChange: DIR* NameSystem.internalReleaseCreate: attempt to release a create lock on /tmp/1050617226/ds=2008-02-18/SomeTable/part-00015 file does not\
          exist.

          Raghu Angadi added a comment -

          Ashish, are you consistently hitting this? Runping mentioned that the client node had no CPU left. Even if the client is very busy, I don't think it is expected to lose the lease.

          When this happens again, could you grep for the file name in the Namenode log and attach the result here? Another useful thing would be to grep for all the blocks assigned to this file in the Namenode log.

          Ashish Thusoo added a comment - edited

          I am hitting the same stack. Is there a workaround, or a change to some configuration parameter, that can help me avoid this?

          For me this is happening in version 0.15.3. I did not see this behavior in 0.14.4.

          Thanks,
          Ashish


            People

            • Assignee: dhruba borthakur
            • Reporter: Runping Qi
            • Votes: 0
            • Watchers: 3
