IMPALA-5172: crash in tcmalloc::CentralFreeList::FetchFromOneSpans

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.8.0
    • Fix Version/s: Impala 2.9.0
    • Component/s: Backend

      Description

      This was encountered during an automated test run.

      #0  0x0000003556e328e5 in raise () from /lib64/libc.so.6
      #1  0x0000003556e340c5 in abort () from /lib64/libc.so.6
      #2  0x00007fea49b64c55 in os::abort(bool) () from /opt/toolchain/sun-jdk-64bit-1.7.0.75/jre/lib/amd64/server/libjvm.so
      #3  0x00007fea49ce6cd7 in VMError::report_and_die() () from /opt/toolchain/sun-jdk-64bit-1.7.0.75/jre/lib/amd64/server/libjvm.so
      #4  0x00007fea49b69b6f in JVM_handle_linux_signal () from /opt/toolchain/sun-jdk-64bit-1.7.0.75/jre/lib/amd64/server/libjvm.so
      #5  <signal handler called>
      #6  0x0000000001bf2253 in tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**) ()
      #7  0x0000000001bf254c in tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) ()
      #8  0x0000000001bf25f4 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) ()
      #9  0x0000000001bffcc3 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) ()
      #10 0x0000000001c0eea8 in tc_newarray ()
      #11 0x0000000000bcbd24 in allocate (this=0x7fe9ee3f6d78) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/gcc-4.9.2/include/c++/4.9.2/ext/new_allocator.h:104
      #12 allocate (this=0x7fe9ee3f6d78) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/gcc-4.9.2/include/c++/4.9.2/bits/alloc_traits.h:357
      #13 _M_allocate (this=0x7fe9ee3f6d78) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/gcc-4.9.2/include/c++/4.9.2/bits/stl_vector.h:170
      #14 std::vector<impala::TRuntimeProfileNode, std::allocator<impala::TRuntimeProfileNode> >::_M_emplace_back_aux<impala::TRuntimeProfileNode> (this=0x7fe9ee3f6d78) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/gcc-4.9.2/include/c++/4.9.2/bits/vector.tcc:412
      #15 0x0000000000bc5747 in emplace_back<impala::TRuntimeProfileNode> (this=0x93c3400, nodes=0x7fe9ee3f6d78) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/gcc-4.9.2/include/c++/4.9.2/bits/vector.tcc:101
      #16 push_back (this=0x93c3400, nodes=0x7fe9ee3f6d78) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/gcc-4.9.2/include/c++/4.9.2/bits/stl_vector.h:932
      #17 impala::RuntimeProfile::ToThrift (this=0x93c3400, nodes=0x7fe9ee3f6d78) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/util/runtime-profile.cc:758
      #18 0x0000000000bc54b2 in impala::RuntimeProfile::ToThrift (this=0x29480a00, nodes=0x7fe9ee3f6d78) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/util/runtime-profile.cc:836
      #19 0x0000000000bc54b2 in impala::RuntimeProfile::ToThrift (this=0x1ebf6300, nodes=0x7fe9ee3f6d78) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/util/runtime-profile.cc:836
      #20 0x0000000000bc54b2 in impala::RuntimeProfile::ToThrift (this=0x78d1700, nodes=0x7fe9ee3f6d78) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/util/runtime-profile.cc:836
      #21 0x0000000000bc54b2 in impala::RuntimeProfile::ToThrift (this=0x598f700, nodes=0x7fe9ee3f6d78) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/util/runtime-profile.cc:836
      #22 0x0000000000bc54b2 in impala::RuntimeProfile::ToThrift (this=0xaa04f00, nodes=0x7fe9ee3f6d78) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/util/runtime-profile.cc:836
      #23 0x0000000000bc54b2 in impala::RuntimeProfile::ToThrift (this=0x1fdd0460, nodes=0x7fe9ee3f6d78) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/util/runtime-profile.cc:836
      #24 0x0000000000bc5c1d in impala::RuntimeProfile::SerializeToArchiveString (this=) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/util/runtime-profile.cc:724
      #25 0x0000000000bc632f in impala::RuntimeProfile::SerializeToArchiveString (this=0x1fdd0460) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/util/runtime-profile.cc:718
      #26 0x0000000000abb6c4 in impala::ImpalaServer::ArchiveQuery (this=0x7bd9c00, query=...) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/service/impala-server.cc:685
      #27 0x0000000000abc6e6 in impala::ImpalaServer::UnregisterQuery (this=0x7bd9c00, query_id=) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/service/impala-server.cc:986
      #28 0x0000000000af6a42 in impala::ImpalaServer::close (this=0x7bd9c00, handle=) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/service/impala-beeswax-server.cc:236
      #29 0x0000000000d523f5 in beeswax::BeeswaxServiceProcessor::process_close (this=0x7279ae0, seqid=0, iprot=) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/generated-sources/gen-cpp/BeeswaxService.cpp:3543
      #30 0x0000000000d59c54 in beeswax::BeeswaxServiceProcessor::dispatchCall (this=) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/generated-sources/gen-cpp/BeeswaxService.cpp:2952
      #31 0x000000000080f24c in apache::thrift::TDispatchProcessor::process (this=0x7279ae0, in=..., out=..., connectionContext=0xa817d80) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/thrift-0.9.0-p8/include/thrift/TDispatchProcessor.h:121
      #32 0x0000000001b4291b in apache::thrift::server::TThreadPoolServer::Task::run() ()
      #33 0x0000000001b2a4e9 in apache::thrift::concurrency::ThreadManager::Worker::run() ()
      #34 0x00000000009fdbe9 in impala::ThriftThread::RunRunnable (this=) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/rpc/thrift-thread.cc:64
      #35 0x00000000009fe642 in operator() (function_obj_ptr=) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/boost-1.57.0-p1/include/boost/bind/mem_fn_template.hpp:280
      #36 operator()<boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<long unsigned int>*>, boost::_bi::list0> (function_obj_ptr=) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/boost-1.57.0-p1/include/boost/bind/bind.hpp:392
      #37 operator() (function_obj_ptr=) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/boost-1.57.0-p1/include/boost/bind/bind_template.hpp:20
      #38 boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<unsigned long>*>, boost::_bi::list3<boost::_bi::value<impala::ThriftThread*>, boost::_bi::value<boost::shared_ptr<apache::thrift::concurrency::Runnable> >, boost::_bi::value<impala::Promise<unsigned long>*> > >, void>::invoke (function_obj_ptr=) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/boost-1.57.0-p1/include/boost/function/function_template.hpp:153
      #39 0x0000000000be2479 in operator() (name=) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/boost-1.57.0-p1/include/boost/function/function_template.hpp:767
      #40 impala::Thread::SuperviseThread (name=) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/util/thread.cc:325
      #41 0x0000000000be2ec4 in operator()<void (*)(const std::basic_string<char>&, const std::basic_string<char>&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list0> (this=0x7de3c00) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/boost-1.57.0-p1/include/boost/bind/bind.hpp:457
      #42 operator() (this=0x7de3c00) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/boost-1.57.0-p1/include/boost/bind/bind_template.hpp:20
      #43 boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list4<boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void()> >, boost::_bi::value<impala::Promise<long int>*> > > >::run(void) (this=0x7de3c00) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/boost-1.57.0-p1/include/boost/thread/detail/thread.hpp:116
      #44 0x0000000000e4e33a in thread_proxy ()
      #45 0x0000003557207851 in start_thread () from /lib64/libpthread.so.0
      #46 0x0000003556ee894d in clone () from /lib64/libc.so.6
      Attachments

      1. impalad_node2.ERROR (12 kB, Joe McDonnell)
      2. impalad_node1.ERROR (12 kB, Joe McDonnell)
      3. IMPALA-5172.patch (2 kB, Michael Ho)


          Activity

          Joe McDonnell added a comment -

          The only code change remaining is to reenable passing the release build down to Impala-lzo from the Impala side.

          Joe McDonnell added a comment -

          commit e2f3bb2011679c965875e7e17fe4a537dddd566b
          Author: Joe McDonnell <joemcdonnell@cloudera.com>
          Date: Thu Apr 20 18:18:12 2017 -0700

          IMPALA-5172: fix incorrect cast in call to LZO decompress

          For the call to lzo1x_decompress_safe, the output buffer
          size that LZO expects is a lzo_uint*, which is a pointer
          to a 64 bit unsigned integer. Our code currently uses
          a pointer to uncompressed_len, which is a 32 bit integer.
          This is incorrect and leads to a buffer overflow for
          corrupted LZO files, because 4 extra bytes are included
          in the output buffer size seen by LZO. This leads LZO to
          consider the output buffer effectively unbounded. Any
          corruption related to the size is not caught and can
          read arbitrarily far beyond the end of the provided
          buffer.

          To fix this, the uncompressed_len is stored into a 64-bit
          integer, which is passed into lzo1x_decompress_safe.
          This also verifies that the expected uncompressed length
          is greater than zero, as a negative uncompressed length
          could lead to a buffer overrun.

          Change-Id: Id46e7cb508c9e494718c329650361396698b1180
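
          A minimal sketch, not the actual Impala-lzo patch, of the fix shape this commit describes; lzo1x_decompress_safe is the real liblzo2 API, but the wrapper function name and signature here are made up for illustration:

          #include <lzo/lzo1x.h>
          #include <cstdint>

          // Validate the 32-bit on-disk length, then hand LZO a genuine
          // 64-bit lzo_uint it can write the actual output size through.
          int DecompressLzoBlock(const uint8_t* compressed_data, int32_t compressed_len,
                                 uint8_t* block_buffer, int32_t uncompressed_len) {
            if (uncompressed_len <= 0) return -1;  // reject corrupt negative/zero sizes
            lzo_uint out_len = uncompressed_len;   // 64-bit on LP64 platforms
            int ret = lzo1x_decompress_safe(compressed_data, compressed_len,
                                            block_buffer, &out_len, nullptr);
            if (ret != LZO_E_OK) return ret;
            return out_len == static_cast<lzo_uint>(uncompressed_len) ? 0 : -1;
          }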

          Joe McDonnell added a comment -

          The commit on Apr 19th (297e23ba993d9db22287d2ac54bacdff52d5461a) fixed the Impala build. A separate change is being merged to Impala-lzo to fix the underlying issue.

          Downgrading the priority, as it is not a build blocker.

          Joe McDonnell added a comment -

          commit 297e23ba993d9db22287d2ac54bacdff52d5461a
          Author: Alex Behm <alex.behm@cloudera.com>
          Date: Wed Apr 19 10:21:05 2017 -0700

          IMPALA-5172: Always pass DEBUG build type to Impala-Lzo

          This changes CMakeLists.txt to pass the DEBUG build type
          to LZO. This effectively reverts commit
          2598e3b26449a03011ab419a4cc1171cee249427.

          Change-Id: I15f0b383ef29d8eecccb69e9adde76386c7cdac6
          Reviewed-on: http://gerrit.cloudera.org:8080/6686
          Reviewed-by: Alex Behm <alex.behm@cloudera.com>
          Reviewed-by: Dan Hecht <dhecht@cloudera.com>
          Tested-by: Impala Public Jenkins

          Joe McDonnell added a comment -

          A print statement just before the call to lzo1x_decompress_safe shows that uncompressed_len, compressed_len, etc. are all set correctly. The problem is the cast:
          int32_t uncompressed_len = 0, compressed_len = 0;
          ... read uncompressed_len ...
          int ret = lzo1x_decompress_safe(compressed_data, compressed_len, block_buffer_, reinterpret_cast<lzo_uint*>(&uncompressed_len), nullptr);

          lzo_uint is a 64-bit datatype, so when this 64-bit pointer points at the 32-bit uncompressed_len on the stack, LZO reads 4 extra bytes. As it happens, those 4 extra bytes are compressed_len: compressed_len = 5309, uncompressed_len = 20480, and (5309 << 32) + 20480 = 22801981394944. Tracing confirms this:
          "Value of reinterpret_cast<lzo_uint*>(uncompressed_len): 22801981394944"
          Oops. The output length we pass into LZO decompress is effectively unbounded (even without a corrupt file), so LZO decompresses past the end and overruns our 20480-byte buffer.

          Changing the code to pass in a 64-bit value of uncompressed_len works fine:
          int32_t uncompressed_len = 0;
          ... read uncompressed_len ...
          int64_t big_uncompressed_len = uncompressed_len;
          int ret = lzo1x_decompress_safe(compressed_data, compressed_len, block_buffer_, reinterpret_cast<lzo_uint*>(&big_uncompressed_len), nullptr);
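
          For illustration, a self-contained program reproducing the arithmetic above, assuming a little-endian 64-bit machine; the struct forces the two fields to be adjacent the way Joe observed them on the stack, and memcpy stands in for the reinterpret_cast so the 8-byte read is well-defined:

          #include <cstdint>
          #include <cstdio>
          #include <cstring>

          int main() {
            // Two adjacent 32-bit fields, mirroring the observed stack layout.
            struct { int32_t uncompressed_len; int32_t compressed_len; } s = {20480, 5309};
            // Reading the first field through a 64-bit lvalue, as the buggy
            // reinterpret_cast<lzo_uint*> did, also swallows the second field.
            uint64_t bogus;
            std::memcpy(&bogus, &s.uncompressed_len, sizeof(bogus));
            std::printf("%llu\n", (unsigned long long)bogus);  // 22801981394944 on little-endian
            return 0;
          }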

          Joe McDonnell added a comment -

          I isolated the test case down to a single LZO-encoded file that causes crashes in other work running on the system (in this case TPC-H). I slightly modified the Status message for this file to print a bit more; here is what I see on the release build:

          WARNINGS: Lzo decompression failed on file: hdfs://localhost:20500/test-warehouse/test_fuzz_alltypes_7a856c89.db/alltypes/year=2009/month=10/000000_0.lzo at offset: 5355 returned: 0 output size: 0 expected: 20480 uncompressed len: 20509

          Everything is wrong about this error message. LZO decompression returned 0, indicating success. The uncompressed len is 20509 bytes, which is larger than the buffer that we passed in (20480 bytes). The output size (which is actually the input compressed_len) is zero, which makes no sense (possibly because compressed_len was clobbered by the buffer overflow).

          On the debug build, I see this:

          WARNINGS: Lzo decompression failed on file: hdfs://localhost:20500/test-warehouse/test_fuzz_alltypes_7a856c89.db/alltypes/year=2009/month=10/000000_0.lzo at offset: 5355 returned: -5 output size: 5309 expected: 20480 uncompressed len: 20478

          This makes perfect sense, and it does not cause any crashes. I can run it constantly while TPC-H is running, and nothing crashes. The return code is -5, which is LZO_E_OUTPUT_OVERRUN: the output buffer is too small. Everything else fits.

          So, the debug build is producing genuinely different behavior from the release build, and it does not look like a timing issue.
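
          For reference, the two LZO status codes at play, assuming the stock lzoconf.h definitions:

          #include <lzo/lzoconf.h>
          #include <cstdio>

          int main() {
            // LZO_E_OK (0): what the release build wrongly returned for corrupt input.
            // LZO_E_OUTPUT_OVERRUN (-5): what the debug build correctly returned.
            std::printf("LZO_E_OK=%d LZO_E_OUTPUT_OVERRUN=%d\n", LZO_E_OK, LZO_E_OUTPUT_OVERRUN);
            return 0;
          }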

          Joe McDonnell added a comment -

          When using SCANNER_FUZZ_SEED=1492709433, this reproduces every time with this command:
          impala-py.test tests/query_test/test_scanners_fuzz.py -n 10 --verbose --workload_exploration_strategy=functional-query:exhaustive

          Joe McDonnell added a comment -

          Tim Armstrong I tried initializing all the fields (including block_buffer_), and that doesn't fix it. block_buffer_len_ is initialized to 0, so I think that protects block_buffer_. We allocate a new block_buffer_ if uncompressed_len > block_buffer_len_.

          Tim Armstrong added a comment -

          Joe McDonnell I wonder if initialising block_buffer_ to nullptr changes anything.

          Tim Armstrong added a comment -

          I did a pass over the Impala-lzo code looking for uninitialised variables and anything else suspicious. I noticed that block_buffer_ is not initialised, and it wasn't clear that it is initialised before being used on all code paths. Not sure if that's the problem, but we should just initialise it to NULL.

          Alexander Behm added a comment -

          http://gerrit.cloudera.org:8080/6686

          Alexander Behm added a comment -

          Joe McDonnell, thanks for investigating. Let's revert that commit asap.

          Joe McDonnell added a comment -

          This is caused by the change to the LZO build to pass in the build type:
          https://github.com/apache/incubator-impala/commit/2598e3b26449a03011ab419a4cc1171cee249427
          https://github.com/cloudera/impala-lzo/commit/5ce3eb6237bbae90a57fc52281e625fd2bd38d51

          The problem reproduces without IMPALA-3905. When I remove the above commits, the problem does not reproduce.

          Joe McDonnell added a comment -

          Running test_scanners_fuzz.py with text/lzo disabled runs without issue. Since this is LZO-related, I'm looking at IMPALA-3905.

          Joe McDonnell added a comment - edited

          This reproduces very quickly (<5 min, often faster) when looping test_scanners_fuzz.py with a release binary. Exact steps:

          cd ${IMPALA_HOME}
          ./bin/clean.sh
          ./buildall.sh -noclean -release -notests
          ./bin/create-test-configuration.sh
          ./bin/start-impala-cluster.py
          while impala-py.test tests/query_test/test_scanners_fuzz.py -n 10 --verbose --workload_exploration_strategy=functional-query:exhaustive; do echo yes; done

          Joe McDonnell added a comment -

          Looking at the new logs, the impalad.INFO logs for all impalads show that the test_scanners_fuzz.py test is running and trying to read corrupted LZO files. The crashes are immediately preceded by LZO decompression failures. (LZO decompression failures also show up in a clean run.)

          I'm currently running ASAN on test_scanners_fuzz.py in a loop with mem-pools disabled. I have exhaustive release private builds running to try to reproduce. Another run of impala-asf-master-exhaustive-release is underway.

          Matthew Jacobs added a comment -

          Reopening: looks like this is still happening, so Joe's fix was for a different bug.

          Joe McDonnell added a comment -

          commit fcefe47d7348f01e6a6fef700b421290a3f536b3
          Author: Joe McDonnell <joemcdonnell@cloudera.com>
          Date: Wed Apr 12 16:46:15 2017 -0700

          IMPALA-5172: Buffer overrun for Snappy decompression

          When using a preallocated buffer for decompression, a
          file corruption can lead to the expected decompressed size
          being smaller than the actual decompressed size. Since
          we use this for allocating the output buffer,
          decompression needs to be able to handle a buffer that
          is too small.

          Snappy does not properly handle a buffer that is too small
          and will overrun the buffer. This changes the code to
          check the decompressed length and return an error if
          the buffer is not large enough. It also adds a test to
          verify that this behavior is respected for other
          compression algorithms.

          Change-Id: I45b75f61e8c0ae85f9add5b13ac2b161a803d2ba
          Reviewed-on: http://gerrit.cloudera.org:8080/6625
          Reviewed-by: Dan Hecht <dhecht@cloudera.com>
          Tested-by: Dan Hecht <dhecht@cloudera.com>
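
          A minimal sketch of the guard this commit describes, assuming the public snappy C++ API (snappy::GetUncompressedLength and snappy::RawUncompress); the wrapper function is made up for illustration, not the actual patch:

          #include <snappy.h>
          #include <cstddef>

          // Refuse to decompress into a preallocated buffer that the stream's
          // own length header says is too small, instead of letting Snappy
          // overrun it.
          bool DecompressIntoFixedBuffer(const char* input, size_t input_len,
                                         char* out, size_t out_capacity) {
            size_t claimed;
            if (!snappy::GetUncompressedLength(input, input_len, &claimed)) return false;
            if (claimed > out_capacity) return false;  // would overrun: fail cleanly
            return snappy::RawUncompress(input, input_len, out);
          }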

          Joe McDonnell added a comment -

          I have an equivalent patch that avoids the decompressor issue. I will start an ASAN run with it.

          Michael Ho added a comment -

          Joe, if the problem is reproducible for you with the ASAN build, can you please give the attached patch a try to see if you can get further and find other issues?

          Michael Ho added a comment - edited

          We may as well not preallocate the buffer and let the decompressor allocate a buffer of the appropriate size in case the file is corrupted somehow.

          Michael Ho added a comment - edited

          For Snappy, we can call snappy::GetUncompressedLength() to get a good estimate of the size. Not sure whether such a function is available for other compression formats.

          int64_t SnappyDecompressor::MaxOutputLen(int64_t input_len, const uint8_t* input) {
            DCHECK(input != NULL);
            size_t result;
            if (!snappy::GetUncompressedLength(reinterpret_cast<const char*>(input),
                    input_len, &result)) {
              return -1;
            }
            return result;
          }
          
          Sailesh Mukil added a comment -

          nvm, we don't have the new value of uncompressed_size before calling ProcessBlock32().

          In any case, we need to know if the actual uncompressed buffer size will not match the uncompressed buffer size from the page header before we actually write into that buffer.

          Sailesh Mukil added a comment -

          Joe McDonnell Is this the bug? We should have this check before calling ProcessBlock32(): i.e. do this:
          https://github.com/apache/incubator-impala/blob/56e37166492b8ee155a6bae851489ace635ae085/be/src/exec/parquet-column-readers.cc#L1023
          before this:
          https://github.com/apache/incubator-impala/blob/56e37166492b8ee155a6bae851489ace635ae085/be/src/exec/parquet-column-readers.cc#L1018

          Joe McDonnell added a comment -

          A run on ASAN with mem-pools disabled found a heap buffer overflow. I'm attaching the error output; I have all the other log files in my environment.

          Dan Hecht added a comment -

          Cool. BTW, when I hit that crash above, it was after about 3 hours, and with ASAN that may take even longer. I've been looping that test again in parallel (without ASAN) since this morning and haven't yet hit another crash, so it may take a while to repro with ASAN.

          Joe McDonnell added a comment -

          I've been running test_scanners_fuzz.py with ASAN for the past hour with no errors so far. This is without mem-pools disabled. I will switch to having mem-pools disabled and get it running.

          Dan Hecht added a comment -

          What are the next steps here? One thing that might be worth trying is to loop test_scanners_fuzz.py in parallel on ASAN with mem-pools disabled.

          Dan Hecht added a comment -

          I was able to reproduce what looks like memory corruption by looping test_scanners_fuzz.py in parallel, and it crashed after about 3 hours:

          I0408 09:46:27.961217  8343 coordinator.cc:616] started 7 fragment instances for query 364ad04dbf5fcbb7:b8e4dc3400000000
          I0408 09:46:27.991331 26500 status.cc:58] File 'hdfs://localhost:20500/test-warehouse/test_fuzz_decimal_tbl_d2d846e1.db/decimal_tbl/d6=1/copy9_5d43c9e787ef43ec-17c9301a00000000_779888162_data.0.parq' has an invalid version number:
          p
          This could be due to stale metadata. Try running "refresh test_fuzz_decimal_tbl_d2d846e1.decimal_tbl".
              @          0x1207d03  impala::Status::Status()
              @          0x172c2d9  impala::HdfsParquetScanner::ProcessFooter()
              @          0x1722f0d  impala::HdfsParquetScanner::Open()
              @          0x16e1e13  impala::HdfsScanNodeBase::CreateAndOpenScanner()
              @          0x16d5436  impala::HdfsScanNode::ProcessSplit()
              @          0x16d4bbc  impala::HdfsScanNode::ScannerThread()
              @          0x16da995  boost::_mfi::mf0<>::operator()()
              @          0x16da5b8  boost::_bi::list1<>::operator()<>()
              @          0x16da149  boost::_bi::bind_t<>::operator()()
              @          0x16d9c6e  boost::detail::function::void_function_obj_invoker0<>::invoke()
              @          0x137bd38  boost::function0<>::operator()()
              @          0x162e353  impala::Thread::SuperviseThread()
              @          0x1636d2e  boost::_bi::list4<>::operator()<>()
              @          0x1636c71  boost::_bi::bind_t<>::operator()()
              @          0x1636c34  boost::detail::thread_data<>::run()
              @          0x1af26fa  thread_proxy
              @     0x7f6a3daeb184  start_thread
              @     0x7f6a3d81837d  clone
          F0408 09:46:27.922556 26465 mem-pool.cc:73] Check failed: zero_length_region_ == (0x66aa77bb) (2 vs. 1722447803)
          F0408 09:46:27.943893 26504 mem-pool.cc:73] Check failed: zero_length_region_ == (0x66aa77bb) (2 vs. 1722447803)
          F0408 09:46:27.944975 26520 mem-pool.cc:50] Check failed: zero_length_region_ == (0x66aa77bb) (2 vs. 1722447803)
          F0408 09:46:27.945159 26524 mem-pool.cc:50] Check failed: zero_length_region_ == (0x66aa77bb) (2 vs. 1722447803)
          F0408 09:46:27.949890 26527 mem-pool.cc:50] Check failed: zero_length_region_ == (0x66aa77bb) (2 vs. 1722447803)
          F0408 09:46:27.991372 26500 mem-pool.cc:73] Check failed: zero_length_region_ == (0x66aa77bb) (2 vs. 1722447803)
          
          #6  0x00000000028c501e in google::LogMessageFatal::~LogMessageFatal() ()
          #7  0x0000000001426ee9 in impala::MemPool::~MemPool (this=0xc2f0da8, __in_chrg=<optimized out>) at /home/dhecht/src/Impala/be/src/runtime/mem-pool.cc:73
          #8  0x000000000143132e in impala::RowBatch::~RowBatch (this=0xc2f0d80, __in_chrg=<optimized out>) at /home/dhecht/src/Impala/be/src/runtime/row-batch.cc:153
          #9  0x00000000013eb31d in boost::checked_delete<impala::RowBatch> (x=0xc2f0d80) at /home/dhecht/toolchain/boost-1.57.0-p1/include/boost/core/checked_delete.hpp:34
          #10 0x00000000013ea637 in boost::scoped_ptr<impala::RowBatch>::~scoped_ptr (this=0x7f69ae800ea0, __in_chrg=<optimized out>) at /home/dhecht/toolchain/boost-1.57.0-p1/include/boost/smart_ptr/scoped_ptr.hpp:82
          #11 0x00000000013ea6cb in boost::scoped_ptr<impala::RowBatch>::reset (this=0xc7da458, p=0x0) at /home/dhecht/toolchain/boost-1.57.0-p1/include/boost/smart_ptr/scoped_ptr.hpp:88
          #12 0x0000000001a64105 in impala::DataStreamSender::Channel::Teardown (this=0xc7da400, state=0xea99500) at /home/dhecht/src/Impala/be/src/runtime/data-stream-sender.cc:325
          #13 0x0000000001a66326 in impala::DataStreamSender::Close (this=0xc76f200, state=0xea99500) at /home/dhecht/src/Impala/be/src/runtime/data-stream-sender.cc:483
          #14 0x0000000001a751cd in impala::PlanFragmentExecutor::Close (this=0xcd0dad0) at /home/dhecht/src/Impala/be/src/runtime/plan-fragment-executor.cc:495
          #15 0x0000000001a6cdab in impala::FragmentInstanceState::Exec (this=0xcd0d800) at /home/dhecht/src/Impala/be/src/runtime/fragment-instance-state.cc:71
          #16 0x0000000001a783df in impala::QueryExecMgr::ExecFInstance (this=0xb2bad20, fis=0xcd0d800) at /home/dhecht/src/Impala/be/src/runtime/query-exec-mgr.cc:110
          
          Dan Hecht added a comment -

          Seems plausible this could be another manifestation of the races in IMPALA-4890/IMPALA-5143. Might be worth thinking through whether that race can lead to a double delete.

          Sailesh Mukil added a comment -

          Just chiming in: both crashes are in the coordinator, which could narrow down the possible culprits.

          Matthew Jacobs added a comment -

          Joe McDonnell can you look at this? Perhaps ask Tim Armstrong for some help.

          Tim Armstrong added a comment -

          I tried rerunning with and without the same seed under ASAN locally and didn't have any luck reproducing it.

          Dan Hecht added a comment -

          Maybe worth running with the same seed (SCANNER_FUZZ_SEED)?

          Tim Armstrong added a comment -

          This looks like TCMalloc heap corruption - my guess is that there was an invalid free (a double free or a free of a bogus memory address). I don't think the rest of the stack is likely to have any bearing on the problem - that code was probably just the unlucky victim of a bad free from somewhere else. It's hard to trace those back without ASAN.

          Could be the scanner fuzz testing catching a bug. Could be worth running that under ASAN.
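
          A minimal, Impala-unrelated sketch of the class of bug described above and how ASAN surfaces it:

          // Compile with: clang++ -fsanitize=address -g double_free.cc
          int main() {
            int* p = new int(42);
            delete p;
            delete p;  // invalid (double) free: under tcmalloc this silently
                       // corrupts the free lists; under ASAN the process
                       // aborts here with a report pinpointing both frees.
            return 0;
          }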

          Matthew Jacobs added a comment -

          Dan Hecht the first run is already gone from Jenkins. It might be hard to tell anyway, because the crash happens during the parallel tests, so many tests fail at the same time:

          ...
          [gw1] PASSED query_test/test_sort.py::TestQueryFullSort::test_multiple_mem_limits_full_output[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: parquet/none] 
          query_test/test_sort.py::TestQueryFullSort::test_sort_join[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: parquet/none] 
          [gw2] PASSED query_test/test_scanners_fuzz.py::TestScannersFuzzing::test_fuzz_alltypes[exec_option: {'mem_limit': '512m', 'abort_on_error': False, 'num_nodes': 0} | table_format: text/lzo/block] 
          [gw3] FAILED query_test/test_queries.py::TestHdfsQueries::test_top_n[exec_option: {'disable_codegen': True, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: seq/bzip/block] 
          query_test/test_queries.py::TestHdfsQueries::test_top_n[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: seq/bzip/block] 
          [gw0] FAILED query_test/test_tpch_queries.py::TestTpchQuery::test_tpch[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: text/gzip/block-TPC-H: Q2] 
          query_test/test_tpch_queries.py::TestTpchQuery::test_tpch[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: text/gzip/block-TPC-H: Q3] 
          [gw2] ERROR query_test/test_scanners_fuzz.py::TestScannersFuzzing::test_fuzz_alltypes[exec_option: {'mem_limit': '512m', 'abort_on_error': False, 'num_nodes': 0} | table_format: text/lzo/block] 
          query_test/test_scanners_fuzz.py::TestScannersFuzzing::test_fuzz_alltypes[exec_option: {'mem_limit': '512m', 'abort_on_error': False, 'num_nodes': 0} | table_format: avro/none] 
          [gw0] FAILED query_test/test_tpch_queries.py::TestTpchQuery::test_tpch[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: text/gzip/block-TPC-H: Q3] 
          query_test/test_tpch_queries.py::TestTpchQuery::test_tpch[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: text/gzip/block-TPC-H: Q4] 
          [gw3] FAILED query_test/test_queries.py::TestHdfsQueries::test_top_n[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: seq/bzip/block] 
          query_test/test_queries.py::TestHdfsQueries::test_top_n[exec_option: {'disable_codegen': True, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 100, 'batch_size': 0, 'num_nodes': 0} | table_format: seq/bzip/block] 
          [gw1] FAILED query_test/test_sort.py::TestQueryFullSort::test_sort_join[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: parquet/none] 
          query_test/test_sort.py::TestQueryFullSort::test_sort_union[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: parquet/none] 
          [gw3] FAILED query_test/test_queries.py::TestHdfsQueries::test_top_n[exec_option: {'disable_codegen': True, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 100, 'batch_size': 0, 'num_nodes': 0} | table_format: seq/bzip/block] 
          query_test/test_queries.py::TestHdfsQueries::test_top_n[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 100, 'batch_size': 0, 'num_nodes': 0} | table_format: seq/bzip/block] 
          [gw2] ERROR query_test/test_scanners_fuzz.py::TestScannersFuzzing::test_fuzz_alltypes[exec_option: {'mem_limit': '512m', 'abort_on_error': False, 'num_nodes': 0} | table_format: avro/none] 
          [gw1] FAILED query_test/test_sort.py::TestQueryFullSort::test_sort_union[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: parquet/none] 
          query_test/test_sort.py::TestQueryFullSort::test_pathological_input[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: parquet/none] 
          [gw3] FAILED query_test/test_queries.py::TestHdfsQueries::test_top_n[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 100, 'batch_size': 0, 'num_nodes': 0} | table_format: seq/bzip/block] -- closing connection to: localhost:21000
          

          One thing that's kind of interesting/funny is that there are ~30k lines of weird single-character-per-row output. E.g. here's a small snippet:

          select count(*) from (select distinct * from test_fuzz_alltypes_7a856c89.alltypes) q;
          
          MainThread: E
          r
          r
          o
          r
           
          c
          o
          n
          v
          e
          r
          t
          i
          n
          g
           
          c
          o
          l
          u
          m
          n
          :
           
          1
          0
           
          t
          o
           
          T
          I
          M
          E
          S
          T
          A
          M
          P
          
          
          E
          r
          r
          o
          r
           
          p
          a
          r
          s
          i
          n
          g
           
          r
          o
          w
          :
           
          f
          i
          l
          e
          :
           
          h
          d
          f
          s
          :
          /
          /
          l
          o
          c
          a
          l
          h
          o
          s
          t
          :
          2
          0
          5
          0
          0
          /
          t
          e
          s
          t
          -
          w
          a
          r
          e
          h
          o
          u
          s
          e
          /
          t
          e
          s
          t
          _
          f
          u
          z
          z
          _
          a
          l
          l
          t
          y
          p
          e
          s
          _
          7
          a
          8
          5
          6
          c
          8
          9
          .
          d
          b
          /
          a
          l
          l
          t
          y
          p
          e
          s
          /
          y
          e
          a
          r
          =
          2
          0
          0
          9
          /
          m
          o
          n
          t
          h
          =
          4
          /
          0
          0
          0
          0
          2
          1
          _
          0
          .
          l
          z
          o
          ,
           
          b
          e
          f
          o
          r
          e
           
          o
          f
          f
          s
          e
          t
          :
           
          5
          2
          1
          1
          
          
          ...
          Dan Hecht added a comment -

          Do we know if it was running the same test during both failures? Maybe we could figure out how to repro it more reliably.

          mjacobs Matthew Jacobs added a comment -

          Saw this again, similar but slightly different stack:

          #0  0x00000037cd4328e5 in raise () from /lib64/libc.so.6
          #1  0x00000037cd4340c5 in abort () from /lib64/libc.so.6
          #2  0x00007f905de19c55 in os::abort(bool) () from /opt/toolchain/sun-jdk-64bit-1.7.0.75/jre/lib/amd64/server/libjvm.so
          #3  0x00007f905df9bcd7 in VMError::report_and_die() () from /opt/toolchain/sun-jdk-64bit-1.7.0.75/jre/lib/amd64/server/libjvm.so
          #4  0x00007f905de1eb6f in JVM_handle_linux_signal () from /opt/toolchain/sun-jdk-64bit-1.7.0.75/jre/lib/amd64/server/libjvm.so
          #5  <signal handler called>
          #6  0x0000000001bf2953 in tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**) ()
          #7  0x0000000001bf2d51 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) ()
          #8  0x0000000001c003c3 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) ()
          #9  0x0000000001c0f5a8 in tc_newarray ()
          #10 0x0000000000b0f585 in allocate (this=0x7f9017cce648, __x=...) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/gcc-4.9.2/include/c++/4.9.2/ext/new_allocator.h:104
          #11 allocate (this=0x7f9017cce648, __x=...) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/gcc-4.9.2/include/c++/4.9.2/bits/alloc_traits.h:357
          #12 _M_allocate (this=0x7f9017cce648, __x=...) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/gcc-4.9.2/include/c++/4.9.2/bits/stl_vector.h:170
          #13 _M_allocate_and_copy<__gnu_cxx::__normal_iterator<impala::TSlotDescriptor const*, std::vector<impala::TSlotDescriptor> > > (this=0x7f9017cce648, __x=...) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/gcc-4.9.2/include/c++/4.9.2/bits/stl_vector.h:1224
          #14 std::vector<impala::TSlotDescriptor, std::allocator<impala::TSlotDescriptor> >::operator= (this=0x7f9017cce648, __x=...) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/gcc-4.9.2/include/c++/4.9.2/bits/vector.tcc:195
          #15 0x0000000000dcc1bd in operator= (this=0xa782d00, fragment=..., rpc_params=0x7f9017cce3c0) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/generated-sources/gen-cpp/Descriptors_types.h:373
          #16 __set_desc_tbl (this=0xa782d00, fragment=..., rpc_params=0x7f9017cce3c0) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/generated-sources/gen-cpp/ImpalaInternalService_types.h:953
          #17 impala::Coordinator::SetExecPlanDescriptorTable (this=0xa782d00, fragment=..., rpc_params=0x7f9017cce3c0) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/runtime/coordinator.cc:1852
          #18 0x0000000000dd594b in impala::Coordinator::SetExecPlanFragmentParams (this=0xa782d00, params=..., rpc_params=0x7f9017cce3c0) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/runtime/coordinator.cc:1741
          #19 0x0000000000dd672e in impala::Coordinator::ExecRemoteFInstance (this=0xa782d00, exec_params=..., debug_options=0x0) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/runtime/coordinator.cc:1267
          #20 0x0000000000a17a79 in operator() (thread_id=) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/boost-1.57.0-p1/include/boost/function/function_template.hpp:767
          #21 impala::CallableThreadPool::Worker(int, const boost::function<void()> &) (thread_id=) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/util/thread-pool.h:174
          #22 0x0000000000a195f9 in operator() (this=0x6fa5980, thread_id=7) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/boost-1.57.0-p1/include/boost/function/function_template.hpp:767
          #23 impala::ThreadPool<boost::function<void()> >::WorkerThread(int) (this=0x6fa5980, thread_id=7) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/util/thread-pool.h:125
          #24 0x0000000000be28e9 in operator() (name=) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/boost-1.57.0-p1/include/boost/function/function_template.hpp:767
          #25 impala::Thread::SuperviseThread (name=) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/util/thread.cc:325
          #26 0x0000000000be3334 in operator()<void (*)(const std::basic_string<char>&, const std::basic_string<char>&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list0> (this=0x8a61400) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/boost-1.57.0-p1/include/boost/bind/bind.hpp:457
          #27 operator() (this=0x8a61400) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/boost-1.57.0-p1/include/boost/bind/bind_template.hpp:20
          #28 boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list4<boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void()> >, boost::_bi::value<impala::Promise<long int>*> > > >::run(void) (this=0x8a61400) at /data/jenkins/workspace/impala-umbrella-build-and-test/Impala-Toolchain/boost-1.57.0-p1/include/boost/thread/detail/thread.hpp:116
          #29 0x0000000000e4ea3a in thread_proxy ()
          #30 0x00000037cd807851 in start_thread () from /lib64/libpthread.so.0
          #31 0x00000037cd4e894d in clone () from /lib64/libc.so.6
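
          Side note (a suggestion, not something from this report): since the abort is inside tcmalloc's free-list code, the heap was almost certainly corrupted earlier, and this allocation is just the first victim. If I'm remembering the gperftools API correctly, linking the debug allocator (libtcmalloc_debug) and calling MallocExtension::instance()->VerifyAllMemory() near suspect code paths can move the abort much closer to the corrupting site:

          // Hypothetical debug helper; assumes gperftools' MallocExtension with
          // the debug tcmalloc, where VerifyAllMemory() actually validates the
          // heap (in release tcmalloc it is effectively a no-op).
          #include <cstdlib>
          #include <gperftools/malloc_extension.h>

          inline void CheckHeapOrDie() {
            // Abort here, at the call site, if tcmalloc's structures are
            // already corrupted, instead of crashing later in an unrelated
            // allocation like the stack above.
            if (!MallocExtension::instance()->VerifyAllMemory()) std::abort();
          }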
          

          This smells like a query lifecycle issue:

          #17 impala::Coordinator::SetExecPlanDescriptorTable (this=0xa782d00, fragment=..., rpc_params=0x7f9017cce3c0) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/runtime/coordinator.cc:1852
          #18 0x0000000000dd594b in impala::Coordinator::SetExecPlanFragmentParams (this=0xa782d00, params=..., rpc_params=0x7f9017cce3c0) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/runtime/coordinator.cc:1741
          #19 0x0000000000dd672e in impala::Coordinator::ExecRemoteFInstance (this=0xa782d00, exec_params=..., debug_options=0x0) at /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/be/src/runtime/coordinator.cc:1267
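
          For illustration, here is a minimal C++ sketch of that kind of lifecycle race, using hypothetical names rather than Impala's actual classes: one thread copies descriptor-table vectors into RPC parameters (as in frame #17 above) while another frees the query state out from under it. The freed buffers can corrupt tcmalloc's free lists, so the crash surfaces later, inside the allocator, on an unrelated allocation:

          #include <string>
          #include <thread>
          #include <vector>

          // Stand-ins for the Thrift types in the stack (e.g. TSlotDescriptor);
          // these names are hypothetical.
          struct DescriptorTable {
            std::vector<std::string> slot_descriptors;
          };
          struct QueryState {
            DescriptorTable desc_tbl;
          };

          int main() {
            auto* query = new QueryState{{{"slot0", "slot1", "slot2"}}};

            // Worker thread: copies the descriptor table into RPC params,
            // mirroring Coordinator::SetExecPlanDescriptorTable.
            std::thread worker([query] {
              std::vector<std::string> rpc_params = query->desc_tbl.slot_descriptors;
              (void)rpc_params;
            });

            // Teardown path. BUG (deliberate, to illustrate the race): nothing
            // orders this delete after the worker's copy, so the copy can read
            // freed memory.
            delete query;

            worker.join();
            return 0;
          }

          If cancellation or query close can win that race against fragment startup, the crash site (tcmalloc) ends up far from the actual bug, which would be consistent with both stacks reported here.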
          
          mmulder Matthew Mulder added a comment -

          I only saw it once in 2598e3b26449a03011ab419a4cc1171cee249427.

          dhecht Dan Hecht added a comment -

          Matthew Mulder Is this reproducing? What githash did it first fail at?


            People

            • Assignee:
              joemcdonnell Joe McDonnell
            • Reporter:
              mmulder Matthew Mulder
            • Votes:
              0
            • Watchers:
              10
