Details

      Description

      Running the following command usually produces a failure within an hour:

          while [ $? -eq 0 ]; do ./run-tests.py ./custom_cluster/test_breakpad.py --exploration_strategy=exhaustive -k test_minidump_relative_path -x -s; done

      The problem is that only two of the three impalad processes write a minidump, while one of them does not. In that case there is no breakpad-related error message in its logfile.

      Creating the minidump_base_dir beforehand seems to alleviate the problem, suggesting there might be a race condition somewhere in the call to boost::filesystem::create_directories.

      This problem only seems to arise in the specific scenario where multiple impalad processes share the same minidump_base_dir, which also has to be non-existent at startup. Outside of tests it seems like it cannot occur.
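
      As a purely illustrative sketch of the suspected race (hypothetical path and structure, not the actual minidump.cc code), each daemon in the test cluster effectively does something like this during startup:

          #include <boost/filesystem.hpp>
          #include <boost/system/error_code.hpp>
          #include <iostream>

          int main() {
            // Hypothetical shared location: every daemon points at the same,
            // initially non-existent minidump_base_dir.
            const boost::filesystem::path base_dir("/tmp/minidumps/impalad");

            // All daemons reach this call at roughly the same time during
            // startup, so they race on creating the same directory tree.
            boost::system::error_code ec;
            boost::filesystem::create_directories(base_dir, ec);
            if (ec) {
              std::cerr << "create_directories failed: " << ec.message() << std::endl;
              return 1;
            }
            return 0;
          }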

          Activity

          lv Lars Volker added a comment -

          Creating the minidump_base_dir before running the test does not change the behavior, so I don't think it is boost::filesystem::create_directories. Also, catalogd and statestored don't seem to be affected. This could be an issue with multiple impalads being started at the exact same time, although I have no idea why breakpad would not work in this case. The handler registration looks fine and the logfiles indicate it succeeded, yet in very rare cases one of the three daemons does not write a minidump file. I'll need to think about this more; currently I don't know what else to check.

          lv Lars Volker added a comment -

          Creating the minidump folders for each daemon (so that create_directories() in minidump.cc is not needed at all) does not seem to help. However, making each impalad use a different folder for writing minidumps does seem to help:

          if (daemon.string() == "impalad") daemon += "_" + to_string(FLAGS_be_port);
          

          I will keep looking for the root cause of this.
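
          For context, here is a slightly fuller, compilable sketch of that experiment. The helper name, base directory, and the FLAGS_be_port stand-in are made up for illustration; only the one-liner above reflects the actual change tried in minidump.cc:

          #include <boost/filesystem.hpp>
          #include <iostream>
          #include <string>

          // Stand-in for Impala's gflags-defined backend port.
          static int FLAGS_be_port = 22000;

          // Hypothetical helper: give each impalad its own minidump subdirectory so
          // concurrently started impalads never share an output folder.
          boost::filesystem::path MinidumpDirFor(const std::string& daemon_name,
                                                 const boost::filesystem::path& base_dir) {
            boost::filesystem::path daemon(daemon_name);
            if (daemon.string() == "impalad") daemon += "_" + std::to_string(FLAGS_be_port);
            return base_dir / daemon;  // e.g. <base_dir>/impalad_22000
          }

          int main() {
            std::cout << MinidumpDirFor("impalad", "/tmp/minidumps") << std::endl;
            return 0;
          }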

          mikesbrown Michael Brown added a comment -

          I believe this happened here: http://sandbox.jenkins.cloudera.com/job/impala-umbrella-build-and-test/4858/

          mikesbrown Michael Brown added a comment -

          Also here: http://sandbox.jenkins.cloudera.com/view/Impala/view/Evergreen-asf-master/job/impala-asf-master-exhaustive-code-coverage/20/

          lv Lars Volker added a comment -

          This seems to have ceased. I'm downgrading it since it seems less urgent now.

          mikesbrown Michael Brown added a comment -

          I saw this recently again. :-/

          lv Lars Volker added a comment -

          This could be Breakpad #681. A fix is under review upstream.

          lv Lars Volker added a comment -

          I saw this again, too.

          lv Lars Volker added a comment -

          Pinged the Breakpad change to see if there's anything we can do to speed this up.

          lv Lars Volker added a comment -

          This change should remove the flakiness until Breakpad 681 has been fixed: https://gerrit.cloudera.org/#/c/6588/

          lv Lars Volker added a comment -

          I dug around some more. By logging failed attempts by breakpad to write a minidump in minidump.cc::DumpCallback(), I was able to confirm that when writing a minidump fails, its target path collides with the path of another minidump that succeeds. bharath v brought up the very good point in this code review that impala daemons are started with a one second delay in between, so their random GUID generators should actually use different seeds. However, in the logfiles of impalads that showed the error, the breakpad initialization happened within the same second, possibly because of delays when starting the first one of them:

          Log file created at: 2017/04/07 22:34:52
          Running on machine: de
          Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
          I0407 22:34:52.996690 28749 logging.cc:123] stdout will be logged to this file.
          E0407 22:34:52.996870 28749 logging.cc:124] stderr will be logged to this file.
          I0407 22:34:52.997046 28749 minidump.cc:245] Setting minidump size limit to 20971520.
          I0407 22:34:52.997105 28749 atomicops-internals-x86.cc:98] vendor GenuineIntel  family 6  model 14  sse2 1  cmpxchg16b 1
          

          Observe how the second impalad also reaches minidump.cc:245 within the same second:

          Log file created at: 2017/04/07 22:34:52
          Running on machine: de
          Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
          I0407 22:34:52.017959 28473 logging.cc:123] stdout will be logged to this file.
          E0407 22:34:52.018280 28473 logging.cc:124] stderr will be logged to this file.
          I0407 22:34:52.018563 28473 minidump.cc:245] Setting minidump size limit to 20971520.
          I0407 22:34:52.018663 28473 atomicops-internals-x86.cc:98] vendor GenuineIntel  family 6  model 14  sse2 1  cmpxchg16b 1
          

          This makes me confident we're actually seeing Breakpad #681 here.
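
          To make the failure mode concrete, here is a minimal illustration (not Breakpad's actual GUID code) of why a generator seeded with the current time at second granularity gives two processes that initialize within the same second identical "random" IDs:

          #include <cstdio>
          #include <ctime>
          #include <random>

          int main() {
            // Both impalads in the logs above reach their breakpad setup at
            // 22:34:52, so they observe the same one-second-granularity seed.
            const unsigned seed = static_cast<unsigned>(std::time(nullptr));

            std::mt19937 gen_a(seed);  // "first impalad"
            std::mt19937 gen_b(seed);  // "second impalad", started in the same second
            std::uniform_int_distribution<unsigned> dist;

            // Identical seeds yield identical sequences, so both processes derive
            // the same minidump ID and hence the same file name; one dump then
            // collides with (or overwrites) the other.
            const unsigned a1 = dist(gen_a), a2 = dist(gen_a);
            const unsigned b1 = dist(gen_b), b2 = dist(gen_b);
            std::printf("A: %08x-%08x\n", a1, a2);
            std::printf("B: %08x-%08x\n", b1, b2);
            return 0;
          }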

          lv Lars Volker added a comment -

          I'm marking this one as fixed, since the test should not be flaky with this fix. Follow-up work is tracked in IMPALA-5187.

          IMPALA-3794: Workaround for Breakpad ID conflicts

          Breakpad determines the ID of the minidump file to be written in case
          of a crash randomly, using a generator that is seeded during process
          startup with the current system time at second granularity. If two
          impalads start up within the same second, there is a chance of a name
          conflict. The one second delay between starting impalads in
          start-impala-cluster.py is not sufficient:

          I0407 22:34:52.018563 28473 minidump.cc:245] Setting minidump size limit
          to 20971520.
          I0407 22:34:52.997046 28749 minidump.cc:245] Setting minidump size limit
          to 20971520.

          When sending a signal to all of them, one process can overwrite the
          minidump of another one. This is an upstream issue and is tracked in
          Breakpad-681. I further confirmed my suspicion by tentatively giving
          each running impalad instance its own output folder and was then
          unable to reproduce the issue. However, it is a cleaner solution to
          fix the underlying issue than to change the folder locations for
          minidumps in Impala.

          Until this is fixed upstream, we can make sure that we see at least one
          minidump for the group of impalads in the test cluster. It is not a
          product defect, since we don't support running multiple impalads on a
          single host, let alone starting them all at once.

          To test this I ran the following loop for about an hour on my dev
          machine without hitting the issue:

          while [ $? -eq 0 ]; do impala-py.test
          tests/custom_cluster/test_breakpad.py --exploration_strategy=exhaustive
          -k test_minidump_relative_path -x -s; done

          Change-Id: I4ae589f6eb5cbbfb860943214edc0e6415eeb862
          Reviewed-on: http://gerrit.cloudera.org:8080/6588
          Reviewed-by: Lars Volker <lv@cloudera.com>
          Tested-by: Impala Public Jenkins


            People

            • Assignee: lv Lars Volker
            • Reporter: lv Lars Volker