Harmony
  1. Harmony
  2. HARMONY-5617

[drlvm] On Linux crash handler may crash itself when handling SOE condition

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 5.0M6
    • Fix Version/s: None
    • Component/s: DRLVM
    • Labels:
      None
    • Environment:
      Linux x86 and x86_64
    • Estimated Complexity:
      Advanced

      Description

      When crash handler receives a SIGSEGV signal that is caused by a stack overflow, it tries to transfer control out of signal handler (just like it does for other kind of signals). But the problem is that it tries to write memory pointed to by SP register (ESP or RSP) of the original crash context. When crash happened because of stack overflow, writing to that memory may not be possible since it is protected by a guard page. In this case crash handler crashes itself:

      Program terminated with signal 11, Segmentation fault.
      #0 0x00002aaaaac27bbf in port_set_longjump_regs (fn=0x2aaaaac2658e, regs=0x7fffbd422b90, num=3)
      at /home/gregory/work/64/trunk/working_vm/vm/port/src/thread/linux/thread_em64t.c:86
      86 sp = (void)regs->rip;
      (gdb) bt
      #0 0x00002aaaaac27bbf in port_set_longjump_regs (fn=0x2aaaaac2658e, regs=0x7fffbd422b90, num=3)
      at /home/gregory/work/64/trunk/working_vm/vm/port/src/thread/linux/thread_em64t.c:86
      #1 0x00002aaaaac26579 in general_signal_handler (signum=11, info=0x7fffbd422d70,
      context=0x7fffbd422c40)
      at /home/gregory/work/64/trunk/working_vm/vm/port/src/signals/linux/signals_common.cpp:107
      #2 <signal handler called>
      #3 0x00002aaabf5254c8 in ?? ()

      The crash happens for me on Gentoo Linux installations both on x86 and x86_64. The easiest way to reproduce it is to run StackTest from the smoke VM tests.

      1. 0001-HARMONY-5617.patch
        175 kB
        Ilya Berezhniuk
      2. 0001-HARMONY-5617.patch
        183 kB
        Ilya Berezhniuk
      3. 0001-HARMONY-5617.patch
        178 kB
        Ilya Berezhniuk
      4. H-5617-workaround.patch
        3 kB
        Ilya Berezhniuk
      5. ThreadCreation.gif
        11 kB
        Ilya Berezhniuk
      6. ThreadCreation.gif
        20 kB
        Ilya Berezhniuk
      7. ThreadManager.html.patch
        2 kB
        Ilya Berezhniuk

        Issue Links

          Activity

          Hide
          Gregory Shimansky added a comment -

          The same problem started to appear on SLES10 on ia32. It seems that it depends on small stack shifts.

          Show
          Gregory Shimansky added a comment - The same problem started to appear on SLES10 on ia32. It seems that it depends on small stack shifts.
          Hide
          Ilya Berezhniuk added a comment -

          I suggest the following patch as a workaround for the problem.
          With this patch Port SIGSEGV handler restores access to 1 (or 2) pages in the stack where signal has occured, to let VM signal handlers process it.
          The patch also corrects a bit incorrect change from HARMONY-5624 (SO can occur in native code, so corresponding checking should be divided on 2 stages).

          The patch is a workaround only. It works fine, but it makes undesired dependency between Port and VM core, and breaks crash processing for SIGSEGV signal.
          I think the issue should be left open, until proper solution is implemented.

          Actually Port should provide guard page setting/clearing by itself, and also detect when SIGSEGV is actually a stack overflow.
          Unfortunately, this will require moving another part of OS-dependent threading functions from the VM HyThread module to the Port, and also moving fast TLS access to the Port.
          It's good from design and modularity POV, but it can affect performance and needs accurate changes and testing.
          So I suggest committing this workaround and leaving the issue open for further work.

          Show
          Ilya Berezhniuk added a comment - I suggest the following patch as a workaround for the problem. With this patch Port SIGSEGV handler restores access to 1 (or 2) pages in the stack where signal has occured, to let VM signal handlers process it. The patch also corrects a bit incorrect change from HARMONY-5624 (SO can occur in native code, so corresponding checking should be divided on 2 stages). The patch is a workaround only. It works fine, but it makes undesired dependency between Port and VM core, and breaks crash processing for SIGSEGV signal. I think the issue should be left open, until proper solution is implemented. Actually Port should provide guard page setting/clearing by itself, and also detect when SIGSEGV is actually a stack overflow. Unfortunately, this will require moving another part of OS-dependent threading functions from the VM HyThread module to the Port, and also moving fast TLS access to the Port. It's good from design and modularity POV, but it can affect performance and needs accurate changes and testing. So I suggest committing this workaround and leaving the issue open for further work.
          Hide
          Gregory Shimansky added a comment -

          I applied this workaround at 640769 and indeed SOE tests now passed. The bug remains open until a better solution is implemented.

          Show
          Gregory Shimansky added a comment - I applied this workaround at 640769 and indeed SOE tests now passed. The bug remains open until a better solution is implemented.
          Hide
          Ilya Berezhniuk added a comment -

          The patch to fix the problem.

          The patch includes following major changes:

          1) OS threading was totally moved from VM's HyThread 'os_thread.c' files to static part of the Port. 'port_thread_create' now creates a thread with special wrapper used to attach the thread to porting library, i.e. set up TLS data and perform additional preparation. 'port_thread_attach' function is used to attach threads not created through 'port_thread_create'.

          2) When created or foreign thread is attached to the porting library, all the data about stack location and size and guard pages etc. are stored in TLS, instead of doing this in VM core. VM core platform-dependent code operating with guard pages/alternative stack/steck overflow detection etc. was moved to the Port.

          3) Port signal/crash handling code was corrected to use Port TLS data. Now it's possible to detect SO condition in Port itself, and deliver SO (not GPF) to VM on Linux. Also Port can restore guard page during crash processing, to rethrow signal or exception and perform operation that need to be done inside of signal/exception handler.

          4) Named shared memory is used on both Linux and Windows to synchronize data between multiple instances of Port static library - the most important thing here is TLS key which must be the same everywhere.

          Some improvements were also made in this patch:

          • hand-made assert dialog was enabled to fix the problem with attaching to debugger on Windows/x86_64
          • stack settings on Linux/ia64 were synchronized with other platforms; alternative stack now works on ia64
          • crash dump now contains code addresses

          ---> The patch has intermittent problem with SIGUSR2 handling on Linux (it's used for thread_yield_other) - I'm investigating the problem now. I hope to update the patch soon with final version.

          Show
          Ilya Berezhniuk added a comment - The patch to fix the problem. The patch includes following major changes: 1) OS threading was totally moved from VM's HyThread 'os_thread.c' files to static part of the Port. 'port_thread_create' now creates a thread with special wrapper used to attach the thread to porting library, i.e. set up TLS data and perform additional preparation. 'port_thread_attach' function is used to attach threads not created through 'port_thread_create'. 2) When created or foreign thread is attached to the porting library, all the data about stack location and size and guard pages etc. are stored in TLS, instead of doing this in VM core. VM core platform-dependent code operating with guard pages/alternative stack/steck overflow detection etc. was moved to the Port. 3) Port signal/crash handling code was corrected to use Port TLS data. Now it's possible to detect SO condition in Port itself, and deliver SO (not GPF) to VM on Linux. Also Port can restore guard page during crash processing, to rethrow signal or exception and perform operation that need to be done inside of signal/exception handler. 4) Named shared memory is used on both Linux and Windows to synchronize data between multiple instances of Port static library - the most important thing here is TLS key which must be the same everywhere. Some improvements were also made in this patch: hand-made assert dialog was enabled to fix the problem with attaching to debugger on Windows/x86_64 stack settings on Linux/ia64 were synchronized with other platforms; alternative stack now works on ia64 crash dump now contains code addresses ---> The patch has intermittent problem with SIGUSR2 handling on Linux (it's used for thread_yield_other) - I'm investigating the problem now. I hope to update the patch soon with final version.
          Hide
          Ilya Berezhniuk added a comment -

          Attaching the changes in Thread Manager documentation for suggested approach.

          Unfortunately, the problem on Linux still persists. Looks like using shared memory for sharing global data between several instances of Port static library is not a good idea. Extensive testing show a cause of problem.
          Both SysV shared memory and memory-mapped files are supposed to be used for inter-process communication. So these segments remain alive when program exits. The problem appears when system limit on shared objects is exhausted.
          So I need to find another way for sharing global data...

          Show
          Ilya Berezhniuk added a comment - Attaching the changes in Thread Manager documentation for suggested approach. Unfortunately, the problem on Linux still persists. Looks like using shared memory for sharing global data between several instances of Port static library is not a good idea. Extensive testing show a cause of problem. Both SysV shared memory and memory-mapped files are supposed to be used for inter-process communication. So these segments remain alive when program exits. The problem appears when system limit on shared objects is exhausted. So I need to find another way for sharing global data...
          Hide
          Ilya Berezhniuk added a comment -

          Here is the final patch; all VM tests pass.

          Instead of using shared memory, current version uses crash handler shared library as a unique holder of Port global data.

          Additionally, port_thread_create now checks stack size passed; if stack size is less than we need, stack size is increased to minimum size.
          Minimum size was set to (<guard pagesize> + 2*<size needed for signal processing>), i.e. minimum working stack size is equal to reserved stack size for signal processing.

          Show
          Ilya Berezhniuk added a comment - Here is the final patch; all VM tests pass. Instead of using shared memory, current version uses crash handler shared library as a unique holder of Port global data. Additionally, port_thread_create now checks stack size passed; if stack size is less than we need, stack size is increased to minimum size. Minimum size was set to (<guard pagesize> + 2*<size needed for signal processing>), i.e. minimum working stack size is equal to reserved stack size for signal processing.
          Hide
          Ilya Berezhniuk added a comment -

          Performance testing shows that the patch does not affect performance.

          Show
          Ilya Berezhniuk added a comment - Performance testing shows that the patch does not affect performance.
          Hide
          Ilya Berezhniuk added a comment -

          The updated picture for TM documentation - previous picture was converted unhappily.

          Show
          Ilya Berezhniuk added a comment - The updated picture for TM documentation - previous picture was converted unhappily.
          Hide
          Ilya Berezhniuk added a comment -

          Patch is updated to apply after r659108.

          Show
          Ilya Berezhniuk added a comment - Patch is updated to apply after r659108.
          Hide
          Ilya Berezhniuk added a comment -

          Patch is applied at r660062.

          Show
          Ilya Berezhniuk added a comment - Patch is applied at r660062.
          Hide
          Gregory Shimansky added a comment -

          Ilya, I am assigning this bug to you now that you are the committer.

          Show
          Gregory Shimansky added a comment - Ilya, I am assigning this bug to you now that you are the committer.

            People

            • Assignee:
              Ilya Berezhniuk
              Reporter:
              Gregory Shimansky
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:

                Development