Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.18.0
    • Fix Version/s: 0.18.0
    • Component/s: fs
    • Labels:
      None
    • Environment:

      Linux 2.6.9

    • Hadoop Flags:
      Reviewed

      Description

      TestUrlStreamHandler sets the URLStreamHandlerFactory as follows:

      FsUrlStreamHandlerFactory factory =
          new org.apache.hadoop.fs.FsUrlStreamHandlerFactory();
      java.net.URL.setURLStreamHandlerFactory(factory);
      

      After this, MiniDFSCluster seems to hang while the Datanodes try to register in setNewStorageID, specifically at

      rand = SecureRandom.getInstance("SHA1PRNG").nextInt(Integer.MAX_VALUE);
      

      jstack output shows that the main thread is stuck in RawLocalFileSystem$LocalFSFileInputStream.read

      (Attaching the jstack)
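      For reference, the reported sequence reduces to the following sketch (class name hypothetical; the MiniDFSCluster startup the test performs is elided):

          import java.net.URL;
          import java.security.SecureRandom;

          public class FactoryHangRepro {
              public static void main(String[] args) throws Exception {
                  // JVM-global; a URLStreamHandlerFactory can only be set
                  // once per JVM.
                  URL.setURLStreamHandlerFactory(
                      new org.apache.hadoop.fs.FsUrlStreamHandlerFactory());

                  // The call made during Datanode registration; with the
                  // factory installed it may block waiting on /dev/random.
                  int rand = SecureRandom.getInstance("SHA1PRNG")
                                         .nextInt(Integer.MAX_VALUE);
                  System.out.println(rand);
              }
          }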

      Attachments

      1. 3348-2nd-option.patch
        0.6 kB
        Raghu Angadi
      2. Datanode_jstack.txt
        23 kB
        Lohit Vijayarenu
      3. HADOOP-3348.patch
        1 kB
        Lohit Vijayarenu

        Activity

        Hudson added a comment -

        Integrated in Hadoop-trunk #486 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/486/ )
        Raghu Angadi added a comment -

        I just committed this. Thanks Lohit!

        Christophe Taton added a comment -

        +1, the workaround looks good to me.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12381709/HADOOP-3348.patch
        against trunk revision 654315.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2433/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2433/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2433/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2433/console

        This message is automatically generated.

        Raghu Angadi added a comment -

        +1. Patch looks fine to me. The larger issue could be discussed separately.

        Lohit Vijayarenu added a comment -

        The attached patch fixes the testcase to set the URLStreamHandlerFactory after bringing up the cluster. If we decide to track the issue of fixing how LocalFileSystem handles local files separately, then this patch fixes the failing testcase in the meantime.

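        A sketch of the reordering the patch makes (simplified; the actual change is in HADOOP-3348.patch, and package names follow the 0.18 source tree):

            import java.net.URL;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.dfs.MiniDFSCluster;
            import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;

            // Bring the cluster up first, then install the JVM-global
            // factory, so Datanode registration never reads /dev/random
            // through ChecksumFileSystem.
            Configuration conf = new Configuration();
            MiniDFSCluster cluster = new MiniDFSCluster(conf, 2, true, null);
            try {
                URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
                // ... exercise hdfs:// and file:// URLs as the test does ...
            } finally {
                cluster.shutdown();
            }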
        Raghu Angadi added a comment -

        Patch for the 2nd option. I am not sure it is the right fix for this jira, though it is a good change to have. The extra 'exists()' call could be removed by a slightly bigger patch.

        The larger questions are: for a JVM-global replacement of the "file://" handler, is LocalFileSystem appropriate? Would RawLocalFileSystem suit it better? etc.

        If we can punt on this issue by just changing the test a little, that's fine too.

        Raghu Angadi added a comment -

        Two options:
        1. FSInputChecker.read() should not do readFully(). This is harder to change since multiple users implicitly depend on the current behaviour, but in the long run it should change.
        2. ChecksumFileSystem should use the base filesystem (RawLocalFileSystem in this case) directly when there is no .crc file.

        I think #2 is simpler to do and will reduce strange surprises. It is required anyway, since even after #1 it will still try to read 512 bytes.

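        A sketch of what option #2 could look like inside ChecksumFileSystem (simplified and hypothetical; the committed change is in 3348-2nd-option.patch):

            // Sketch: if there is no checksum file, bypass FSInputChecker and
            // delegate to the underlying filesystem (RawLocalFileSystem here),
            // so a read is not forced to fill a whole checksum chunk.
            public FSDataInputStream open(Path f, int bufferSize) throws IOException {
                if (!fs.exists(getChecksumFile(f))) {
                    return fs.open(f, bufferSize);   // raw, unchecked read
                }
                return new FSDataInputStream(
                    new ChecksumFSInputChecker(this, f, bufferSize));
            }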
        Lohit Vijayarenu added a comment (edited) -

        After looking into this a bit with hints from Raghu, it looks like this is what is causing the problem.
        Once the URLStreamHandlerFactory is set in the JVM, the device file /dev/random is opened through ChecksumFileSystem.
        SecureRandom wraps the stream returned by opening /dev/random in a BufferedInputStream and calls read to get a 20-byte digest to generate the seed. This read passes through FSInputChecker.read with a buffer size of 8K; we invoke readFully(), which loops, issuing multiple reads until the 8K buffer is filled up. /dev/random cannot produce random bytes that fast, so registration of the Datanode takes forever.

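        Schematically, the blocking comes from a fill loop of this shape (illustrative, not the actual FSInputChecker code):

            import java.io.EOFException;
            import java.io.IOException;
            import java.io.InputStream;

            // Looping until the buffer is full turns "read up to n bytes"
            // into "block until 8K bytes arrive", which starves on the slow
            // /dev/random stream.
            static void readFully(InputStream in, byte[] buf) throws IOException {
                int off = 0;
                while (off < buf.length) {
                    int n = in.read(buf, off, buf.length - off); // may block
                    if (n < 0) {
                        throw new EOFException("stream ended early");
                    }
                    off += n;
                }
            }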
        Tsz Wo Nicholas Sze added a comment -

        > Besides, I noticed that the less activity (keyboard input, mouse moves, process activities, etc) there is on the machine that runs the test, the longer the test hangs.

        There are some random number generating algorithms that use these activities as input. I am surprised that you could notice the difference (if it is indeed the cause).

        Raghu Angadi added a comment -

        > For generating storageIDs, the goal is to generate unique IDs. These IDs are not for cryptographic uses. So Random may be good enough.
        That is a different issue, though it could be a workaround if we don't fix the real issue.

        Christophe Taton added a comment -

        Actually, this test probably hangs because of the use of our own file:// URL handler, but I don't yet understand what difference between the "file://" URL handling provided by Hadoop and the default (Sun) one could lead SecureRandom to work or not.
        Besides, I noticed that the less activity (keyboard input, mouse moves, process activity, etc.) there is on the machine that runs the test, the longer the test hangs.

        Tsz Wo Nicholas Sze added a comment -

        For generating storageIDs, the goal is to generate unique IDs. These IDs are not for cryptographic uses. So Random may be good enough.

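        A sketch of that alternative (hypothetical; ip and port stand in for the Datanode's values, and this is not what was committed for this jira):

            import java.util.Random;

            // java.util.Random never touches /dev/random, so it cannot block
            // on entropy; uniqueness, not unpredictability, is the goal here.
            int rand = new Random().nextInt(Integer.MAX_VALUE);
            String storageID = "DS-" + rand + "-" + ip + "-" + port + "-"
                + System.currentTimeMillis();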
        Raghu Angadi added a comment -

        It would be useful to have an explanation for why only this test triggers this problem, because the Datanode invokes SecureRandom.getInstance() every time. It is surprising to see /dev/random being read through hadoop.ChecksumFileSystem. Is that expected? There is probably an unintentional mapping from "file:///" to ChecksumFileSystem (and the question of whether ChecksumFileSystem should have succeeded anyway in this case, etc.).

        Lohit Vijayarenu added a comment -

        Christophe, thanks for the pointer. Yes, on my Linux machine, while running this program I do see that /proc/sys/kernel/random/entropy_avail returns 0. After digging a bit I learned that when this happens, any readers of /dev/random hang, which seems to be the case right now. Is it a good idea to assume this works on every system on which we try to bring up our datanodes? Are there disadvantages to using Random.nextInt(Integer.MAX_VALUE)?

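        For what it is worth, the available entropy can also be checked from Java (a quick Linux-specific sketch):

            import java.io.BufferedReader;
            import java.io.FileReader;

            // Reads the kernel entropy pool size (Linux only). When this
            // prints 0, blocking readers of /dev/random will stall.
            public class EntropyCheck {
                public static void main(String[] args) throws Exception {
                    BufferedReader r = new BufferedReader(
                        new FileReader("/proc/sys/kernel/random/entropy_avail"));
                    System.out.println("entropy_avail = " + r.readLine());
                    r.close();
                }
            }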
        Christophe Taton added a comment -

        I have been running into similar issues. After looking into this for a while, I came to the conclusion that the problem is related to the Linux random number generator; I solved it on my computer by running the rngd daemon (http://linux.die.net/man/8/rngd).

        Lohit Vijayarenu added a comment -

        After running it a few times, I see that SecureRandom.getInstance("SHA1PRNG").nextInt(Integer.MAX_VALUE) seems to hang when setURLStreamHandlerFactory(factory) has been called. The main thread, as seen from the stack trace and strace, seems to hang on readBytes on the file descriptor for /dev/random. The test does run to completion about 1 out of 5-6 times, though. Also, calling setURLStreamHandlerFactory after MiniDFSCluster is up works fine.


          People

          • Assignee:
            Lohit Vijayarenu
          • Reporter:
            Lohit Vijayarenu
          • Votes:
            0
          • Watchers:
            0
