IMPALA-4842: BufferedBlockMgrTest.WriteError occasionally fails with error

      Description

      I saw this error in a Jenkins run. Tim Armstrong - I’m assigning this to you thinking you might have an idea what’s going on here; feel free to find another person or assign back to me if you're swamped.

      05:35:34.665 [ RUN      ] BufferedBlockMgrTest.WriteError
      05:35:34.665 /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/runtime/buffered-block-mgr-test.cc:225: Failure
      05:35:34.665 Value of: status.ok() || status.IsCancelled()
      05:35:34.665   Actual: false
      05:35:34.665 Expected: true
      05:35:34.665 No usable scratch files: space could not be allocated in any of the configured scratch directories (--scratch_dirs). See logs for previous errors that may have caused this.
      

        Activity

        dhecht Dan Hecht added a comment -

        commit 1335af3684ec54b63a9602816b2cda776b40ce2f
        Author: Tim Armstrong <tarmstrong@cloudera.com>
        Date: Tue Feb 7 17:47:10 2017 -0800

        IMPALA-4842: BufferedBlockMgrTest.WriteError is flaky

        The test should allow Unpin() to fail with a scratch allocation error to
        handle the case where the first write fails and blacklists the scratch
        disk around the same time that the second write starts. Usually either
        the second write succeeds because it started before the first write
        failed or it fails with CANCELLED because the
        BufferedBlockMgr::is_cancelled_ flag is set. There is a small
        window for a race after the disk is blacklisted in TmpFileMgr but
        before BufferedBlockMgr::WriteComplete() is called.

        Testing:
        I was able to reproduce the problem locally by adding some delays
        to the test. I added a variant of the WriteError test that more reliably
        reproduces the bug. Ran both WriteError tests in a loop locally to try
        to flush out flakiness.

        Change-Id: I9878d7000b03a64ee06c2088a8c30e318fe1d2a3
        Reviewed-on: http://gerrit.cloudera.org:8080/5940
        Tested-by: Impala Public Jenkins
        Reviewed-by: Michael Ho <kwho@cloudera.com>
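
The relaxed check described in the commit message can be sketched as below. This is a minimal, self-contained illustration and not the code from the actual patch: the `ErrorCode` enum, the simplified `Status` class, and the helper `IsAcceptableUnpinStatus()` are hypothetical stand-ins, assumed here only to show the three outcomes the test would need to tolerate for the second write (success, cancellation, or a scratch allocation failure during the race window).

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for the error codes involved; not Impala's real enum.
enum class ErrorCode { OK, CANCELLED, SCRATCH_ALLOCATION_FAILED, GENERAL };

// Hypothetical, simplified status type modelled on the ok()/IsCancelled()
// calls that appear in the failing assertion.
class Status {
 public:
  Status() : code_(ErrorCode::OK) {}
  explicit Status(ErrorCode code, std::string msg = "")
      : code_(code), msg_(std::move(msg)) {}

  bool ok() const { return code_ == ErrorCode::OK; }
  bool IsCancelled() const { return code_ == ErrorCode::CANCELLED; }
  ErrorCode code() const { return code_; }
  const std::string& msg() const { return msg_; }

 private:
  ErrorCode code_;
  std::string msg_;
};

// Returns true if 'status' is one of the outcomes the commit message says the
// second Unpin() may legitimately produce:
//  - OK: the second write started before the first write failed.
//  - CANCELLED: the block manager's cancelled flag was already set.
//  - scratch allocation failure: the disk was already blacklisted, but the
//    write-completion callback had not yet run (the race window).
bool IsAcceptableUnpinStatus(const Status& status) {
  return status.ok() || status.IsCancelled() ||
         status.code() == ErrorCode::SCRATCH_ALLOCATION_FAILED;
}

// Small self-check showing which outcomes the relaxed condition accepts.
void CheckAcceptedOutcomes() {
  assert(IsAcceptableUnpinStatus(Status()));
  assert(IsAcceptableUnpinStatus(Status(ErrorCode::CANCELLED)));
  assert(IsAcceptableUnpinStatus(
      Status(ErrorCode::SCRATCH_ALLOCATION_FAILED, "No usable scratch files")));
  // Any other error should still fail the test.
  assert(!IsAcceptableUnpinStatus(Status(ErrorCode::GENERAL)));
}
```

In a gtest-style test, a helper like this would replace the original two-way condition, for example `EXPECT_TRUE(IsAcceptableUnpinStatus(status))`, so that unrelated errors still fail the test while the race no longer does.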

        tarmstrong Tim Armstrong added a comment -

        I have a fix posted but won't have time to see it through to completion before going on break. I'll turn this over to you Dan since it would be good to get this in to avoid build breakage.

        tarmstrong Tim Armstrong added a comment -

        I pushed out a fix here: http://gerrit.cloudera.org:8080/5940 that should solve the issue. I may not have time to get this merged before I am on holiday for a couple of weeks so I'll probably have to hand this off to someone.

        tarmstrong Tim Armstrong added a comment -

        I think the problem is that the test needs to be more permissive in the errors it accepts at that point.

        jbapple Jim Apple added a comment -

        I have now seen this with a non-ASAN build. It was in an "exhaustive" build, but it is my understanding that the test exploration strategy does not change the BE tests.

        jbapple Jim Apple added a comment -

        Tim Armstrong, no I was not confident.

        On some of the Jenkins jobs on <http://jenkins.impala.io:8080> that have failed with low disk space, I added a trap to the bash script to print df -m on exit, so that even if the machine is ephemeral and goes away, the log can help debug disk space issues.

        tarmstrong Tim Armstrong added a comment -

        Looks like a duplicate of IMPALA-4781. Jim Apple, were we confident that IMPALA-4781 was an infra issue rather than a test issue?

        I don't think this is directly connected to ASAN - I've run this test under ASAN quite a few times recently.


          People

          • Assignee:
            dhecht Dan Hecht
            Reporter:
            lv Lars Volker