Hadoop Map/Reduce
MAPREDUCE-3438

TestRaidNode fails because of "Too many open files"

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.22.0
    • Component/s: contrib/raid
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      TestRaidNode fails because it opens many connections.

      Attachments

      1. MAPREDUCE-3438.patch (1.0 kB, Ramkumar Vadali)

        Activity

        Konstantin Shvachko created issue -
        Konstantin Shvachko added a comment -

        This is the last failing test for 0.22. See the last several builds of Hadoop-Mapreduce-22-branch.
        The failure is caused by the following exception:

        11/11/21 01:05:26 INFO hdfs.DFSClient: Failed to connect to /127.0.0.1:45905, add to deadNodes and continue
        java.net.SocketException: Too many open files
        	at sun.nio.ch.Net.socket0(Native Method)
        	at sun.nio.ch.Net.socket(Net.java:97)
        	at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:84)
        	at sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:37)
        	at java.nio.channels.SocketChannel.open(SocketChannel.java:105)
        	at org.apache.hadoop.net.StandardSocketFactory.createSocket(StandardSocketFactory.java:63)
        	at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:702)
        	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:390)
        	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:514)
        	at java.io.DataInputStream.read(DataInputStream.java:132)
        	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:122)
        	at org.apache.hadoop.raid.RaidUtils.copyBytes(RaidUtils.java:93)
        	at org.apache.hadoop.raid.Decoder.decodeFile(Decoder.java:133)
        	at org.apache.hadoop.raid.RaidNode.unRaid(RaidNode.java:867)
        	at org.apache.hadoop.raid.RaidNode.recoverFile(RaidNode.java:333)
        	at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        	at java.lang.reflect.Method.invoke(Method.java:597)
        	at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:349)
        	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1482)
        	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1478)
        	at java.security.AccessController.doPrivileged(Native Method)
        	at javax.security.auth.Subject.doAs(Subject.java:396)
        	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1153)
        	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1476)
        

        This leads to a BlockMissingException and, in the end, the failure of TestRaidNode.testPathFilter.

        The fix is either

        1. to increase ulimit on Jenkins machines, which I did on my box and everything passed, or
        2. to scale down the test itself.
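
        As a quick way to see how close a run is to the descriptor limit, the JVM itself can report the open and maximum file-descriptor counts. This is a minimal diagnostic sketch, assuming a Unix JVM where the com.sun.management extension is available; the class name is illustrative and it is not part of TestRaidNode:

        import java.lang.management.ManagementFactory;
        import java.lang.management.OperatingSystemMXBean;
        import com.sun.management.UnixOperatingSystemMXBean;

        public class FdUsage {
          public static void main(String[] args) {
            // On Unix JVMs the OS MXBean exposes the open and maximum
            // file-descriptor counts, i.e. the limit behind "Too many open files".
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            if (os instanceof UnixOperatingSystemMXBean) {
              UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
              System.out.println("open fds: " + unixOs.getOpenFileDescriptorCount()
                  + " / max fds: " + unixOs.getMaxFileDescriptorCount());
            }
          }
        }
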
        Konstantin Boudnik added a comment -

        +1 on the first option. Jenkins slaves use the default ulimit settings, which isn't viable once you're dealing with applications at scale.

        Ramkumar Vadali added a comment -

        This patch creates and tears down the MR/DFS clusters for each iteration of the test, which cleans up the sockets hanging around in the datanodes.
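
        The attached patch is not reproduced here; as a rough illustration of the per-iteration lifecycle described above, a minimal sketch using the standard MiniDFSCluster and MiniMRCluster harnesses might look as follows. The cluster sizes, iteration count, and helper class are illustrative assumptions, not code from the patch:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.hdfs.MiniDFSCluster;
        import org.apache.hadoop.mapred.MiniMRCluster;

        public class PerIterationClusterSketch {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            for (int i = 0; i < 3; i++) {                        // iteration count is illustrative
              MiniDFSCluster dfs = null;
              MiniMRCluster mr = null;
              FileSystem fileSys = null;
              try {
                dfs = new MiniDFSCluster(conf, 3, true, null);   // 3 datanodes, formatted each time
                fileSys = dfs.getFileSystem();
                mr = new MiniMRCluster(4, fileSys.getUri().toString(), 1);
                // ... run one iteration of the raid scenario here ...
              } finally {
                // Tearing everything down between iterations releases the sockets
                // that would otherwise accumulate and trip the fd limit.
                if (mr != null) mr.shutdown();
                if (fileSys != null) fileSys.close();
                if (dfs != null) dfs.shutdown();
              }
            }
          }
        }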

        Ramkumar Vadali made changes -
        Attachment: MAPREDUCE-3438.patch [ 12505356 ]
        Konstantin Shvachko added a comment -

        Thanks, Ram. A couple of questions.

        1. Does this mean that Raid does not close files / sockets? Do we need to create a separate jira for that?
        2. Would it be possible to prevent the socket leak in the test by just closing the file system fileSys instead of restarting the entire cluster many times? Restarting substantially increases the running time of a test that is already one of the longest running.
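
        For context, the lighter-weight alternative suggested in question 2 would look roughly like the sketch below: keep the mini clusters up for the whole test and only close and re-open the client FileSystem between iterations. This is a hypothetical sketch, not code from TestRaidNode; it assumes the leaked descriptors belong to the client-side FileSystem handle, and in the real test fileSys comes from the mini cluster rather than from the configuration.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;

        public class CloseFileSystemSketch {
          static void runIteration(FileSystem fileSys) throws Exception {
            // ... one iteration of the test, reading and writing through fileSys ...
          }

          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            for (int i = 0; i < 3; i++) {                  // iteration count is illustrative
              FileSystem fileSys = FileSystem.get(conf);   // in the real test this comes from the mini cluster
              try {
                runIteration(fileSys);
              } finally {
                fileSys.close();   // releases this client's open sockets without restarting the cluster
              }
            }
          }
        }
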
        Konstantin Shvachko added a comment -

        I committed this to branch 0.22. Let's see if it helps.

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-22-branch #93 (See https://builds.apache.org/job/Hadoop-Mapreduce-22-branch/93/)
        MAPREDUCE-3438. TestRaidNode fails because of "Too many open files". Contributed by Ramkumar Vadali.

        shv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1207722
        Files:

        • /hadoop/common/branches/branch-0.22/mapreduce/CHANGES.txt
        • /hadoop/common/branches/branch-0.22/mapreduce/src/contrib/raid/src/test/org/apache/hadoop/raid/TestRaidNode.java
        Konstantin Shvachko made changes -
        Assignee: Ramkumar Vadali [ rvadali ]
        Konstantin Shvachko added a comment -

        It worked. Thank you Ramkumar.

        Konstantin Shvachko made changes -
        Status: Open [ 1 ] -> Resolved [ 5 ]
        Hadoop Flags: Reviewed [ 10343 ]
        Resolution: Fixed [ 1 ]
        Konstantin Shvachko made changes -
        Status: Resolved [ 5 ] -> Closed [ 6 ]

          People

          • Assignee: Ramkumar Vadali
          • Reporter: Konstantin Shvachko
          • Votes: 0
          • Watchers: 1
