IMPALA-4539

Parquet scanner memory bug: I/O buffer is attached to output batch while scratch batch rows still reference it

Details

    Description

      In HdfsScanner, the RowBatch takes ownership of some I/O buffers once the scanner has completed some I/O reads:

        // We need to pass the row batch to the scan node if there is too much memory attached,
        // which can happen if the query is very selective. We need to release memory even
        // if no rows passed predicates.
        if (batch_->AtCapacity() || context_->num_completed_io_buffers() > 0) {
          context_->ReleaseCompletedResources(batch_, /* done */ false);
        }
      

      When the row batch is reset, those I/O buffers are either returned to the I/O manager's free-buffer pool or actually freed:

        if (!FLAGS_disable_mem_pools && free_buffers_[idx].size() < FLAGS_max_free_io_buffers) {
          free_buffers_[idx].push_back(buffer);
          if (ImpaladMetrics::IO_MGR_NUM_UNUSED_BUFFERS != NULL) {
            ImpaladMetrics::IO_MGR_NUM_UNUSED_BUFFERS->Increment(1L);
          }
        } else {
          process_mem_tracker_->Release(buffer_size);
          num_allocated_buffers_.Add(-1);
          delete[] buffer;
          if (ImpaladMetrics::IO_MGR_NUM_BUFFERS != NULL) {
            ImpaladMetrics::IO_MGR_NUM_BUFFERS->Increment(-1L);
          }
          if (ImpaladMetrics::IO_MGR_TOTAL_BYTES != NULL) {
            ImpaladMetrics::IO_MGR_TOTAL_BYTES->Increment(-buffer_size);
          }
        }
      

      Here is the bug: an I/O buffer owned by row batch A may at the same time be referenced by row batch B returned from the next ScanNode::GetNext(). By the time row batch B is read, that I/O buffer may already have been released because row batch A was reset. For example:

      1. In the scanner, row batch A is produced and takes ownership of I/O buffer O1.
      2. Row batch A is consumed, and I/O buffer O1 is released.
      3. In the scanner, row batch B is produced, but some tuples in row batch B (for example, string tuples) still point into I/O buffer O1. Especially when row batch A is AtCapacity(), the I/O buffer is very likely referenced by more than just row batch A.
      4. When row batch B is consumed, some tuples yield corrupted data, because I/O buffer O1 has already been released.
      

      This bug is easy to reproduce with the startup option "disable_mem_pools=true", because in that case the I/O buffers are actually freed instead of being returned to the free-buffer pool.
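
      To make the failure mode concrete, below is a minimal, self-contained C++ sketch of the dangling-pointer pattern described above. IoBuffer and StringSlot are hypothetical stand-ins, not Impala's real classes; the point is that string slots store pointers into the I/O buffer rather than copies, so releasing the buffer invalidates them.

        #include <cstddef>
        #include <cstring>
        #include <iostream>
        #include <string>

        // Hypothetical stand-ins for an I/O buffer and a string slot,
        // for illustration only (not Impala's real classes).
        struct IoBuffer {
          char* data;
          explicit IoBuffer(const char* s) : data(new char[std::strlen(s) + 1]) {
            std::strcpy(data, s);
          }
          ~IoBuffer() { delete[] data; }  // releasing the buffer frees the bytes
        };

        // A string slot keeps a pointer into the I/O buffer instead of copying
        // the bytes, which is why string columns are exposed to this bug.
        struct StringSlot {
          const char* ptr;
          std::size_t len;
        };

        int main() {
          IoBuffer* o1 = new IoBuffer("yydbyydb");   // buffer O1 filled by an I/O read
          StringSlot slot_in_batch_b{o1->data, 8};   // a tuple in row batch B references O1

          delete o1;  // row batch A, which owns O1, is reset -> O1 is released

          // Reading the slot afterwards is a use-after-free. Under ASAN it is
          // reported as heap-use-after-free; otherwise it can silently produce
          // corrupt or empty string values.
          std::cout << std::string(slot_in_batch_b.ptr, slot_in_batch_b.len) << std::endl;
          return 0;
        }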

          Activity

            tarmstrong Tim Armstrong added a comment -

            What release version/git hash did you see this on? Do you have more information about what query you ran to reproduce the problem and the symptoms of the problem?

            cfreely Fu Lili added a comment -
            git status
            # On branch cdh5-2.7.0_5.9.0
            
            ${IMPALA_HOME}/bin/start-impala-cluster.py --build_type=release -s 3 --impalad_args '--disable_mem_pools=true' --log_level 1
             
            impala-shell.sh
            
            [localhost.localdomain:21000] > select distinct p_type from bugtest where p_type='yydbyydb';
            Query submitted at: 2016-11-29 00:25:27 (Coordinator: http://10-10-212-60:25000)
            Query progress can be monitored at: http://10-10-212-60:25000/query_plan?query_id=4846650025a1f96b:a3c5552100000000
            +------------+
            | p_type |
            +------------+
            | yydbyydb   |
            |            |
            +------------+
            Fetched 2 row(s) in 7.64s
            [localhost.localdomain:21000] >
            
            
            tarmstrong Tim Armstrong added a comment -

            Did you see a crash or incorrect results? I wasn't able to reproduce any problems when running with disable_mem_pools=true. I tried a few different file formats.

            I don't understand based on your description why row batch B would reference memory from an IO buffer that was attached earlier. If your scenario is possible, the bug is that row batch B is referencing io_buffer O1.

            We have a wiki page that describes the memory transfer model here: https://cwiki.apache.org/confluence/display/IMPALA/Impala+Row+Batches

            tarmstrong Tim Armstrong added a comment -

            I wasn't able to reproduce the bug locally or see a memory management error in the code. It's not clear from the description whether there was a crash or other symptoms.

            Feel free to reopen if you have more info.

            cfreely Fu Lili added a comment -

            I can reproduce the bug in my test environment, but the test data is too large (about ~10 GB) to provide. But I have already done some debugging and tried to find the cause of the bug, as follows (see the sketch after the steps):

            1. Row batch A is initialized in HdfsParquetScanner::StartNewParquetRowBatch.
            2. In HdfsParquetScanner::AssembleRows, the scanner reads one data page into I/O buffer O1 and fills scratch_batch_.
            3. In TransferScratchTuples, tuple rows are transferred from scratch_batch_ to row batch A. In this case, row batch A fills up and is returned, but some tuple rows are still left in scratch_batch_.
            4. In CommitRows, because row batch A is full, I/O buffer O1 is attached to row batch A in ReleaseCompletedResources.
            5. Row batch B is initialized in HdfsParquetScanner::StartNewParquetRowBatch.
            6. In HdfsParquetScanner::GetNextInternal, the remaining tuple rows in scratch_batch_ are added to row batch B, but these tuple rows still reference memory in I/O buffer O1.
            7. Row batch A is consumed, and I/O buffer O1 is released.
            8. Row batch B now contains corrupted data.
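
            Below is a rough, runnable simulation of this eight-step sequence, using simplified stand-in types (IoBuffer, RowBatch) rather than Impala's real classes. It only demonstrates the ordering problem: the output batch fills up and takes ownership of the buffer while leftover scratch rows still point into it.

            #include <cstddef>
            #include <cstring>
            #include <iostream>
            #include <memory>
            #include <vector>

            // Hypothetical stand-ins for illustration only, not Impala's classes.
            struct IoBuffer {
              std::unique_ptr<char[]> data;
              explicit IoBuffer(std::size_t n) : data(new char[n]()) {}
            };

            struct RowBatch {
              std::size_t capacity;
              std::vector<const char*> string_rows;             // slots point into I/O buffers
              std::vector<std::unique_ptr<IoBuffer>> attached;  // buffers owned by this batch
              bool AtCapacity() const { return string_rows.size() >= capacity; }
              void Reset() { string_rows.clear(); attached.clear(); }  // frees attached buffers
            };

            int main() {
              // Steps 1-2: a data page is read into buffer O1 and the scratch
              // batch is filled with rows that point into O1.
              auto o1 = std::make_unique<IoBuffer>(64);
              std::strcpy(o1->data.get(), "uuid-aaa");
              std::strcpy(o1->data.get() + 16, "uuid-bbb");
              std::strcpy(o1->data.get() + 32, "uuid-ccc");
              std::vector<const char*> scratch = {o1->data.get(), o1->data.get() + 16,
                                                  o1->data.get() + 32};

              // Step 3: transfer scratch rows into batch A until it hits capacity;
              // one scratch row is left over.
              RowBatch batch_a{2};
              std::size_t next = 0;
              while (next < scratch.size() && !batch_a.AtCapacity()) {
                batch_a.string_rows.push_back(scratch[next++]);
              }

              // Step 4: batch A is full, so O1 is attached to it (as in
              // ReleaseCompletedResources), even though scratch rows still reference O1.
              batch_a.attached.push_back(std::move(o1));

              // Steps 5-6: the leftover scratch row is copied into batch B,
              // still pointing into O1.
              RowBatch batch_b{2};
              while (next < scratch.size()) batch_b.string_rows.push_back(scratch[next++]);

              // Step 7: batch A is consumed and reset, which releases O1.
              batch_a.Reset();

              // Step 8: batch B now dereferences freed memory (ASAN: heap-use-after-free).
              std::cout << batch_b.string_rows[0] << std::endl;
              return 0;
            }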

            tarmstrong Tim Armstrong added a comment -

            Thanks for the more detailed explanation. Can you please tell me what symptoms you saw so I can know whether I reproduced it - did you see incorrect results, a crash, or something else?

            I think this may require non-dictionary-encoded string columns to trigger, which may explain why we haven't seen it.

            tarmstrong Tim Armstrong added a comment -

            This looks like a real bug - the root cause is that resources can be attached to an output batch while still referenced by rows in scratch_batch_.

            tarmstrong Tim Armstrong added a comment -

            I was able to reproduce under ASAN with --disable_mem_pools=true. This only seems to be reproducible with plain-encoded, uncompressed strings. I reproduced it more quickly by hacking the Parquet writer to always write plain-encoded data, but the writer will automatically switch to plain encoding when writing distinct UUID values.

            start-impala-cluster.py --impalad_args="--disable_mem_pools=true"
            
            [tarmstrong-box.ca.cloudera.com:21000] > set compression_codec=none;
            COMPRESSION_CODEC set to none
            [tarmstrong-box.ca.cloudera.com:21000] > create table big_uuids stored as parquet as select uuid() from tpch_20_parquet.lineitem;
            Query: create table big_uuids stored as parquet as select uuid() from tpch_20_parquet.lineitem
            Query submitted at: 2016-12-07 13:08:37 (Coordinator: http://tarmstrong-box:25000)
            Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=b74cc52a10d50c79:a367b9600000000
            +---------------------------+
            | summary                   |
            +---------------------------+
            | Inserted 119994608 row(s) |
            +---------------------------+
            Fetched 1 row(s) in 69.20s
            [tarmstrong-box.ca.cloudera.com:21000] > select ndv(_c0) from big_uuids where substring(_c0, 1, 2) != 'a1' limit 10;
            Query: select ndv(_c0) from big_uuids where substring(_c0, 1, 2) != 'a1' limit 10
            Query submitted at: 2016-12-07 13:10:54 (Coordinator: http://tarmstrong-box:25000)
            Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=204990812e8ba7ab:83b7aa3d00000000
            
            
            WARNINGS: Cancelled due to unreachable impalad(s): tarmstrong-box:22002
            
            =================================================================
            ==10656==ERROR: AddressSanitizer: heap-use-after-free on address 0x7f15c0782354 at pc 0x000000f9cbf9 bp 0x7f162dec3240 sp 0x7f162dec29f0
            READ of size 2 at 0x7f15c0782354 thread T316
                #0 0xf9cbf8 in __interceptor_strncmp /data/jenkins/workspace/verify-impala-toolchain-package-build/label/ec2-package-ubuntu-16-04/toolchain/source/llvm/llvm-3.8.0.src-p1/projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:253
                #1 0x7f1641b70244  (<unknown module>)
                #2 0x17e5a79 in impala::HdfsParquetScanner::TransferScratchTuples(impala::RowBatch*) /tmp/be/src/exec/hdfs-parquet-scanner.cc:642:25
                #3 0x17e3ced in impala::HdfsParquetScanner::GetNextInternal(impala::RowBatch*) /tmp/be/src/exec/hdfs-parquet-scanner.cc:389:29
                #4 0x17e36c4 in impala::HdfsParquetScanner::ProcessSplit() /tmp/be/src/exec/hdfs-parquet-scanner.cc:351:31
                #5 0x1775375 in impala::HdfsScanNode::ProcessSplit(std::vector<impala::FilterContext, std::allocator<impala::FilterContext> > const&, impala::DiskIoMgr::ScanRange*) /tmp/be/src/exec/hdfs-scan-node.cc:539:12
                #6 0x1774541 in impala::HdfsScanNode::ScannerThread() /tmp/be/src/exec/hdfs-scan-node.cc:430:16
                #7 0x177e457 in boost::_bi::bind_t<void, boost::_mfi::mf0<void, impala::HdfsScanNode>, boost::_bi::list1<boost::_bi::value<impala::HdfsScanNode*> > >::operator()() /opt/Impala-Toolchain/boost-1.57.0/include/boost/bind/bind_template.hpp:20:16
                #8 0x12eb4e2 in boost::function0<void>::operator()() const /opt/Impala-Toolchain/boost-1.57.0/include/boost/function/function_template.hpp:766:14
                #9 0x16b21f5 in impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*) /tmp/be/src/util/thread.cc:317:3
                #10 0x16bafda in void boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> >::operator()<void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list0&, int) /opt/Impala-Toolchain/boost-1.57.0/include/boost/bind/bind.hpp:457:9
                #11 0x16bae67 in boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> > >::operator()() /opt/Impala-Toolchain/boost-1.57.0/include/boost/bind/bind_template.hpp:20:16
                #12 0x1cfe579 in thread_proxy (/home/tarmstrong/Impala/incubator-impala/be/build/debug/service/impalad+0x1cfe579)
                #13 0x7f18a3920709 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x7709)
                #14 0x7f18a344082c in clone /build/glibc-Qz8a69/glibc-2.23/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:109
            
            0x7f15c0782354 is located 8313684 bytes inside of 8388608-byte region [0x7f15bff94800,0x7f15c0794800)
            freed by thread T310 here:
                #0 0x1025350 in operator delete[](void*) /data/jenkins/workspace/verify-impala-toolchain-package-build/label/ec2-package-ubuntu-16-04/toolchain/source/llvm/llvm-3.8.0.src-p1/projects/compiler-rt/lib/asan/asan_new_delete.cc:98
                #1 0x13b3d5f in impala::DiskIoMgr::FreeBufferMemory(impala::DiskIoMgr::BufferDescriptor*) /tmp/be/src/runtime/disk-io-mgr.cc:806:7
                #2 0x13ae80c in impala::DiskIoMgr::ReturnBuffer(impala::DiskIoMgr::BufferDescriptor*) /tmp/be/src/runtime/disk-io-mgr.cc:664:7
                #3 0x13ae641 in impala::DiskIoMgr::BufferDescriptor::Return() /tmp/be/src/runtime/disk-io-mgr.cc:259:3
                #4 0x13eb4d6 in impala::RowBatch::Reset() /tmp/be/src/runtime/row-batch.cc:314:5
                #5 0x18852dc in impala::PartitionedAggregationNode::Open(impala::RuntimeState*) /tmp/be/src/exec/partitioned-aggregation-node.cc:354:5
                #6 0x1c86f02 in impala::PlanFragmentExecutor::OpenInternal() /tmp/be/src/runtime/plan-fragment-executor.cc:329:31
                #7 0x1c86872 in impala::PlanFragmentExecutor::Open() /tmp/be/src/runtime/plan-fragment-executor.cc:301:14
                #8 0x158c3a5 in impala::FragmentMgr::FragmentExecState::Exec() /tmp/be/src/service/fragment-exec-state.cc:53:9
                #9 0x157fd75 in impala::FragmentMgr::FragmentThread(impala::TUniqueId) /tmp/be/src/service/fragment-mgr.cc:86:3
                #10 0x15853fd in boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>::operator()(impala::FragmentMgr*, impala::TUniqueId) const /opt/Impala-Toolchain/boost-1.57.0/include/boost/bind/mem_fn_template.hpp:165:16
                #11 0x1585257 in void boost::_bi::list2<boost::_bi::value<impala::FragmentMgr*>, boost::_bi::value<impala::TUniqueId> >::operator()<boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>, boost::_bi::list0>(boost::_bi::type<void>, boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>&, boost::_bi::list0&, int) /opt/Impala-Toolchain/boost-1.57.0/include/boost/bind/bind.hpp:313:9
                #12 0x1585107 in boost::_bi::bind_t<void, boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>, boost::_bi::list2<boost::_bi::value<impala::FragmentMgr*>, boost::_bi::value<impala::TUniqueId> > >::operator()() /opt/Impala-Toolchain/boost-1.57.0/include/boost/bind/bind_template.hpp:20:16
                #13 0x12eb4e2 in boost::function0<void>::operator()() const /opt/Impala-Toolchain/boost-1.57.0/include/boost/function/function_template.hpp:766:14
                #14 0x16b21f5 in impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*) /tmp/be/src/util/thread.cc:317:3
                #15 0x16bafda in void boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> >::operator()<void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list0&, int) /opt/Impala-Toolchain/boost-1.57.0/include/boost/bind/bind.hpp:457:9
                #16 0x16bae67 in boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> > >::operator()() /opt/Impala-Toolchain/boost-1.57.0/include/boost/bind/bind_template.hpp:20:16
                #17 0x1cfe579 in thread_proxy (/home/tarmstrong/Impala/incubator-impala/be/build/debug/service/impalad+0x1cfe579)
            
            previously allocated by thread T226 here:
                #0 0x1024d50 in operator new[](unsigned long) /data/jenkins/workspace/verify-impala-toolchain-package-build/label/ec2-package-ubuntu-16-04/toolchain/source/llvm/llvm-3.8.0.src-p1/projects/compiler-rt/lib/asan/asan_new_delete.cc:64
            tarmstrong@tarmstrong-box:~/Impala/incubator-impala$ gvim be/src/exec/hdfs-parquet-scanner.cc 
            
            tarmstrong Tim Armstrong added a comment -

            I also hit this DCHECK while experimenting with some interesting row batch/I/O buffer size combinations.

            F1207 14:18:50.471175 20525 hdfs-parquet-scanner.cc:257] Check failed: scratch_batch_->mem_pool()->total_allocated_bytes() == 0 (17408 vs. 0) 
            *** Check failure stack trace: ***
                @          0x2ae7c36  google::DumpStackTraceAndExit()
                @          0x2ae10ad  google::LogMessage::Fail()
                @          0x2ae39d6  google::LogMessage::SendToLog()
                @          0x2ae0bcd  google::LogMessage::Flush()
                @          0x2ae447e  google::LogMessageFatal::~LogMessageFatal()
                @          0x17e30b8  impala::HdfsParquetScanner::Close()
                @          0x17a92be  impala::HdfsScanNodeMt::Close()
                @          0x173ceb7  impala::ExecNode::Close()
                @          0x188c449  impala::PartitionedAggregationNode::Close()
                @          0x1c89453  impala::PlanFragmentExecutor::Close()
                @          0x158c400  impala::FragmentMgr::FragmentExecState::Exec()
                @          0x157fd76  impala::FragmentMgr::FragmentThread()
                @          0x15853fe  boost::_mfi::mf1<>::operator()()
                @          0x1585258  boost::_bi::list2<>::operator()<>()
                @          0x1585108  boost::_bi::bind_t<>::operator()()
                @          0x12eb4e3  boost::function0<>::operator()()
                @          0x16b21f6  impala::Thread::SuperviseThread()
                @          0x16bafdb  boost::_bi::list4<>::operator()<>()
                @          0x16bae68  boost::_bi::bind_t<>::operator()()
                @          0x1cfe5ba  thread_proxy
                @     0x7fa856a2070a  start_thread
                @     0x7fa85654082d  (unknown)
            Picked up JAVA_TOOL_OPTIONS: -agentlib:jdwp=transport=dt_socket,address=30000,server=y,suspend=n 
            
            tarmstrong Tim Armstrong added a comment -

            IMPALA-4539: fix bug when scratch batch references I/O buffers

            The bug occurs when scanning uncompressed plain-encoded Parquet
            string data.

            Testing:
            I could manually reproduce it reliably with ASAN and
            --disable_mem_pools=true using the following steps:

            1. The repro is more predictable when there is no row batch queue.
              set mt_dop=1;
            2. A unique string column should switch to plain encoding
              set compression_codec=none;
              create table big_uuids stored as parquet as select uuid() from tpch_20_parquet.lineitem;
            3. The repro requires that some rows are filtered out, so that we end
               up in a state where the output batch is full before all rows are
               copied from the scratch batch.
              select ndv(_c0) from big_uuids where substring(_c0, 1, 2) != 'a1' limit 10;

            After the fix it no longer reproduces.

            I do not yet have a practical test case that triggers the bug on a
            normal ASAN setup. I will continue to try to create one.

            Change-Id: Ic27e7251e0f633cb694b506f6eb62beed6e66ad9
            Reviewed-on: http://gerrit.cloudera.org:8080/5406
            Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
            Tested-by: Impala Public Jenkins

            M be/src/exec/hdfs-parquet-scanner.cc
            1 file changed, 10 insertions, 8 deletions

            Approvals:
            Impala Public Jenkins: Verified
            Tim Armstrong: Looks good to me, approved

            tarmstrong Tim Armstrong added a comment -

            cfreely thank you for the bug report and your persistence.

            alex.behm Alexander Behm added a comment -

            In addition to Tim's instructions above, I've found that lowering the batch_size really helps with reproducing the issue, even with scan nodes that have a row batch queue. For example, a batch_size of 100 reproduced the issue reliably for me.


            People

              tarmstrong Tim Armstrong
              cfreely Fu Lili