  1. HBase
  2. HBASE-25692

Failure to instantiate WALCellCodec leaks socket in replication

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0, 2.2.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 2.3.1, 2.1.4, 2.0.6, 2.1.5, 2.2.1, 2.1.6, 2.1.7, 2.2.2, 2.1.8, 2.2.3, 2.3.3, 2.1.9, 2.2.4, 2.4.0, 2.2.5, 2.2.6, 2.3.2, 2.3.4, 2.4.1, 2.4.2
    • Fix Version/s: 3.0.0-alpha-1, 2.2.7, 2.5.0, 2.4.3, 2.3.6
    • Component/s: Replication
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      I was looking at an HBase user's deployment with Danilo Perez. They had two otherwise identical clusters, one of which regularly had sockets in CLOSE_WAIT going from RegionServers to a distributed storage appliance.

      After a lot of analysis, we eventually figured out that these sockets in CLOSE_WAIT were directly related to an FSDataInputStream which we forgot to close inside of the RegionServer. The subtlety was that only one of these HBase clusters was set up to do replication (to the other cluster). The HBase cluster experiencing this problem was shipping edits to a peer, and had previously been using Phoenix. At some point, the cluster had Phoenix removed from it.

      What we found was that replication still had WALs to ship which were for Phoenix tables. Phoenix, in this version, still used the custom WALCellCodec; however, this codec class was missing from the RS classpath after the owner of the cluster removed Phoenix.

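      When we try to instantiate the Codec implementation via ReflectionUtils, we end up throwing an UnsupportedOperationException which wraps a ClassNotFoundException. However, in WALFactory, we only close the FSDataInputStream when we catch an IOException. Because UnsupportedOperationException is an unchecked exception and not an IOException, the stream is never closed on this path.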

      Thus, replication sits in a "fast" loop, trying to ship these edits, each time leaking a new socket because of the InputStream not being closed. There is an obvious workaround for this specific issue, but we should not leak this inside HBase.
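
      To illustrate the pattern described above, here is a minimal, self-contained sketch. It is not the HBase code or the committed patch: TrackedStream, initReader, openLeaky, openFixed, and the class name example.MissingCodec are all hypothetical stand-ins. It only shows why a catch block limited to IOException misses an unchecked exception thrown during reflective codec instantiation, and how closing on any failure avoids the leak.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

class WalReaderLeakSketch {
    // Number of streams opened but not yet closed (stand-in for leaked sockets).
    static final AtomicInteger OPEN_STREAMS = new AtomicInteger();

    // Hypothetical stand-in for FSDataInputStream.
    static final class TrackedStream implements Closeable {
        private boolean closed;
        TrackedStream() { OPEN_STREAMS.incrementAndGet(); }
        @Override public void close() {
            if (!closed) { closed = true; OPEN_STREAMS.decrementAndGet(); }
        }
    }

    // Mimics reflective codec lookup: a missing class surfaces as an
    // unchecked UnsupportedOperationException wrapping ClassNotFoundException.
    static Object instantiateCodec(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new UnsupportedOperationException("Unable to find " + className, e);
        }
    }

    // Hypothetical reader initialization; declared to throw IOException like real I/O code.
    static void initReader(TrackedStream in, String codecClass) throws IOException {
        instantiateCodec(codecClass);
    }

    // Leaky variant: the stream is closed only when an IOException is caught.
    static TrackedStream openLeaky(String codecClass) throws IOException {
        TrackedStream in = new TrackedStream();
        try {
            initReader(in, codecClass);
            return in;
        } catch (IOException e) {
            in.close();
            throw e;
        }
    }

    // Fixed variant: the stream is closed on unchecked failures as well.
    static TrackedStream openFixed(String codecClass) throws IOException {
        TrackedStream in = new TrackedStream();
        try {
            initReader(in, codecClass);
            return in;
        } catch (IOException | RuntimeException e) {
            in.close();
            throw e;
        }
    }

    interface Opener { TrackedStream open(String codecClass) throws IOException; }

    // Simulates the replication retry loop and reports how many streams stayed open.
    static int leakedAfterRetries(Opener opener, int attempts) {
        OPEN_STREAMS.set(0);
        for (int i = 0; i < attempts; i++) {
            try {
                opener.open("example.MissingCodec").close();
            } catch (IOException | RuntimeException e) {
                // A real caller would log the failure and retry.
            }
        }
        return OPEN_STREAMS.get();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + leakedAfterRetries(WalReaderLeakSketch::openLeaky, 5));
        System.out.println("fixed: " + leakedAfterRetries(WalReaderLeakSketch::openFixed, 5));
    }
}
```

      In this sketch each failed attempt through openLeaky leaves one stream open, which matches the one-socket-per-retry behavior we observed, while openFixed leaves none.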

      An approximate 2.1.x stack trace which led us to this is below.

      2021-03-11 18:19:20,364 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader: Failed to read stream of replication entries
      java.io.IOException: Cannot get log reader
      	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:366)
      	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:303)
      	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:291)
      	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:427)
      	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:354)
      	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:302)
      	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:293)
      	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:174)
      	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:100)
      	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:192)
      	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:138)
      Caused by: java.lang.UnsupportedOperationException: Unable to find org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec
      	at org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:47)
      	at org.apache.hadoop.hbase.regionserver.wal.WALCellCodec.create(WALCellCodec.java:106)
      	at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.getCodec(ProtobufLogReader.java:301)
      	at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initAfterCompression(ProtobufLogReader.java:311)
      	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:81)
      	at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.init(ProtobufLogReader.java:168)
      	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:321)
      	... 10 more
      Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec
      	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
      	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
      	at java.lang.Class.forName0(Native Method)
      	at java.lang.Class.forName(Class.java:264)
      	at org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:43)
      	... 16 more
      

              People

              • Assignee:
                elserj Josh Elser
                Reporter:
                elserj Josh Elser