ACCUMULO-233: rfile rename failure can lead to unavailability

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: tserver
    • Labels: None

      Description

      During a minor compaction, a rename from *.rf_tmp to *.rf fails. This would be OK, except that we left a reference to *.rf in the !METADATA table. We need to make sure that if any part of the compaction fails we properly roll back to a good state. Could this be an opportunity for a FATE operation?

      20 15:03:17,033 [tabletserver.Tablet] WARN : tserver:servername Tablet !0;~;!0< failed to rename /table_info/00790_00002.rf after MinC, will retry in 60 secs...
      java.io.IOException: Call to servername/10.20.30.40:9000 failed on local exception: java.io.IOException: Too many open files
              at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
              at org.apache.hadoop.ipc.Client.call(Client.java:743)
              at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
              at $Proxy0.rename(Unknown Source)
              at sun.reflect.GeneratedMethodAccessor26.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
              at java.lang.reflect.Method.invoke(Method.java:597)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
              at $Proxy0.rename(Unknown Source)
              at org.apache.hadoop.hdfs.DFSClient.rename(DFSClient.java:556)
              at org.apache.hadoop.hdfs.DistributedFileSystem.rename(DistributedFileSystem.java:211)
              at cloudbase.server.tabletserver.Tablet$DatafileManager.bringMinorCompactionOnline(Tablet.java:748)
              at cloudbase.server.tabletserver.Tablet.minorCompact(Tablet.java:1999)
              at cloudbase.server.tabletserver.Tablet.access$3800(Tablet.java:123)
              at cloudbase.server.tabletserver.Tablet$MinorCompactionTask.run(Tablet.java:2070)
              at cloudbase.core.util.LoggingRunnable.run(LoggingRunnable.java:18)
              at cloudtrace.instrument.TraceRunnable.run(TraceRunnable.java:31)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              at java.lang.Thread.run(Thread.java:662)
      Caused by: java.io.IOException: Too many open files
              at sun.nio.ch.EPollArrayWrapper.epollCreate(Native Method)
              at sun.nio.ch.EPollArrayWrapper.<init>(EPollArrayWrapper.java:69)
              at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:52)
              at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
              at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
              at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
              at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
              at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
              at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
              at java.io.FilterInputStream.read(FilterInputStream.java:116)
              at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
              at java.io.DataInputStream.readInt(DataInputStream.java:370)
              at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
              at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
      20 15:04:17,641 [tabletserver.Tablet] WARN : tserver:servername Target map file already exist /accumulo/tables/!0/table_info/00790_00002.rf
      20 15:04:17,897 [tabletserver.FileManager] ERROR: tserver:servername Failed to open file /accumulo/tables/!0/table_info/00790_00002.rf File does not exist: /accumulo/tables/!0/table_info/00790_00002.rf
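
      To illustrate the FATE question above: a fault-tolerant operation would pair the rename/commit step with an undo that removes any half-committed state, so a failure between the two steps can always be rolled back. The following is only a minimal sketch in the spirit of FATE's call/undo pattern; MinorCompactionEnv and its methods are hypothetical stand-ins for the tablet server internals, not the actual API.

      // Hedged sketch of a FATE-style step for this commit; the env type and
      // its methods are invented for illustration.
      interface MinorCompactionEnv {
        boolean renameDataFile(String tmpPath, String dstPath) throws Exception; // e.g. an HDFS rename
        void addMetadataReference(String dstPath) throws Exception;              // add *.rf to !METADATA
        void removeMetadataReference(String dstPath) throws Exception;           // rollback path
      }

      class RenameAndCommitStep {
        // call(): rename first, then (and only then) record the file, so a
        // failed rename never leaves a dangling !METADATA reference.
        void call(MinorCompactionEnv env, String tmp, String dst) throws Exception {
          if (!env.renameDataFile(tmp, dst))
            throw new Exception("rename failed: " + tmp + " -> " + dst);
          env.addMetadataReference(dst);
        }

        // undo(): run by the framework if a later step fails; removes the
        // reference so the tablet never points at a file that does not exist.
        void undo(MinorCompactionEnv env, String dst) throws Exception {
          env.removeMetadataReference(dst);
        }
      }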
      

        Activity

        Keith Turner made changes -
        Status: Open [ 1 ] → Resolved [ 5 ]
        Resolution: Cannot Reproduce [ 5 ]
        Keith Turner added a comment -

        Not sure exactly what happened here. Maybe ACCUMULO-65? The code in 1.4.5-SNAPSHOT renames before putting in metadata and only puts it in metadata if the rename is successful.
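
        For reference, a minimal sketch of that ordering against the real org.apache.hadoop.fs.FileSystem API (rename returns false, or throws, on failure); recordInMetadata is a hypothetical placeholder for the !METADATA update:

        import java.io.IOException;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        class BringMinorCompactionOnline {
          static void bringOnline(FileSystem fs, Path tmp, Path dst) throws IOException {
            if (!fs.rename(tmp, dst)) {
              // If a retried rename finds the target already in place ("Target
              // map file already exist" in the log above), it is safe to fall
              // through and commit; otherwise fail before touching metadata.
              if (!fs.exists(dst))
                throw new IOException("rename failed: " + tmp + " -> " + dst);
            }
            recordInMetadata(dst); // hypothetical: reference *.rf only after it exists
          }

          static void recordInMetadata(Path dst) { /* placeholder */ }
        }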

        Mike Drob added a comment -

        Keith fixed this at some point, but my memory (and Google) fail to get me any more detail.

        Christopher Tubbs added a comment -

        Is this still an issue? This is a pretty old ticket, and there's no version information associated with it (which version does it apply to? which version are we planning to fix it in?)

        Eric Newton made changes -
        Assignee: Adam Fuchs [ afuchs ]
        Eric Newton made changes -
        Affects Version/s: 1.3.5 [ 12318442 ]
        Gavin made changes -
        Workflow: no-reopen-closed, patch-avail [ 12646921 ] → patch-available, re-open possible [ 12671710 ]
        Aaron Cordova added a comment -

        hrm - well, we're not seeing the 'map file already exists' message, but we are seeing 'too many open files', which kills the tablet server.

        Aaron Cordova added a comment -

        +1 for fixing this.

        What is the underlying cause of a failure to rename? Can that root cause be addressed?

        I'm seeing this phenomenon in EC2 using 200 machines, with 100 clients writing to a table with the StringSummation aggregator enabled.

        Adam Fuchs added a comment -

        Incidentally, the workaround, should this state arise, would be to create an empty (or effectively empty) rfile with the name of the missing rfile. If there are no aggregators or other interesting iterators configured on the affected table, this can be done by copying an existing rfile that is currently referenced by the given tablet. So, assuming that 00790_00001.rf is currently referenced by the affected tablet (as determined by a scan of the !METADATA table), the fix in this case would be something like:

        hadoop fs -cp /accumulo/tables/\!0/table_info/00790_0000{1,2}.rf
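
        (The shell's brace expansion makes that equivalent to hadoop fs -cp /accumulo/tables/\!0/table_info/00790_00001.rf /accumulo/tables/\!0/table_info/00790_00002.rf, i.e. the rfile still referenced by the tablet is copied under the missing name. The aggregator caveat presumably matters because duplicate key-value pairs collapse under normal versioning, whereas an aggregating iterator would combine the copies and double-count.)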
        
        Adam Fuchs created issue -

          People

    • Assignee: Unassigned
    • Reporter: Adam Fuchs
    • Votes: 0
    • Watchers: 2
