IMPALA-4539

Parquet scanner memory bug: I/O buffer is attached to output batch while scratch batch rows still reference it

Details

    Description

      In HdfsScanner, the RowBatch takes ownership of some I/O buffers once the scanner has completed some I/O reads:

        // We need to pass the row batch to the scan node if there is too much memory attached,
        // which can happen if the query is very selective. We need to release memory even
        // if no rows passed predicates.
        if (batch_->AtCapacity() || context_->num_completed_io_buffers() > 0) {
          context_->ReleaseCompletedResources(batch_, /* done */ false);
        }
      

      When the row batch is reset, those I/O buffers are either returned to the I/O manager's free-buffer pool or actually freed:

        if (!FLAGS_disable_mem_pools && free_buffers_[idx].size() < FLAGS_max_free_io_buffers) {
          free_buffers_[idx].push_back(buffer);
          if (ImpaladMetrics::IO_MGR_NUM_UNUSED_BUFFERS != NULL) {
            ImpaladMetrics::IO_MGR_NUM_UNUSED_BUFFERS->Increment(1L);
          }
        } else {
          process_mem_tracker_->Release(buffer_size);
          num_allocated_buffers_.Add(-1);
          delete[] buffer;
          if (ImpaladMetrics::IO_MGR_NUM_BUFFERS != NULL) {
            ImpaladMetrics::IO_MGR_NUM_BUFFERS->Increment(-1L);
          }
          if (ImpaladMetrics::IO_MGR_TOTAL_BYTES != NULL) {
            ImpaladMetrics::IO_MGR_TOTAL_BYTES->Increment(-buffer_size);
          }
        }
      

      Here is the bug: an I/O buffer owned by row batch A may at the same time be referenced by row batch B returned from the next ScanNode::GetNext(). By the time row batch B is read, that I/O buffer may already have been released because row batch A was reset. For example:

      1. In the scanner, row batch A is produced and takes ownership of I/O buffer O1.
      2. Row batch A is consumed, and I/O buffer O1 is released.
      3. In the scanner, row batch B is produced, but some tuples in row batch B (for example, string tuples) still point into I/O buffer O1. Especially when row batch A is AtCapacity(), the I/O buffer is very likely referenced by more than just row batch A.
      4. When row batch B is consumed, some tuples yield corrupted data, because I/O buffer O1 has already been released.
      

      This bug is easy to reproduce with the startup option "disable_mem_pools=true", because in that case the I/O buffers are actually freed instead of being returned to the free-buffer pool.
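
      To make the failure mode concrete, below is a minimal, self-contained C++ sketch of the dangling-pointer pattern described above. IoBuffer and StringSlot are hypothetical stand-ins, not Impala's real classes; the point is that string slots store pointers into the I/O buffer rather than copies, so releasing the buffer invalidates them.

        #include <cstddef>
        #include <cstring>
        #include <iostream>
        #include <string>

        // Hypothetical stand-ins for an I/O buffer and a string slot,
        // for illustration only (not Impala's real classes).
        struct IoBuffer {
          char* data;
          explicit IoBuffer(const char* s) : data(new char[std::strlen(s) + 1]) {
            std::strcpy(data, s);
          }
          ~IoBuffer() { delete[] data; }  // releasing the buffer frees the bytes
        };

        // A string slot keeps a pointer into the I/O buffer instead of copying
        // the bytes, which is why string columns are exposed to this bug.
        struct StringSlot {
          const char* ptr;
          std::size_t len;
        };

        int main() {
          IoBuffer* o1 = new IoBuffer("yydbyydb");   // buffer O1 filled by an I/O read
          StringSlot slot_in_batch_b{o1->data, 8};   // a tuple in row batch B references O1

          delete o1;  // row batch A, which owns O1, is reset -> O1 is released

          // Reading the slot afterwards is a use-after-free. Under ASAN it is
          // reported as heap-use-after-free; otherwise it can silently produce
          // corrupt or empty string values.
          std::cout << std::string(slot_in_batch_b.ptr, slot_in_batch_b.len) << std::endl;
          return 0;
        }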

          Activity

            tarmstrong Tim Armstrong added a comment -

            What release version/git hash did you see this on? Do you have more information about what query you ran to reproduce the problem and the symptoms of the problem?

            cfreely Fu Lili added a comment -
            git status
            # On branch cdh5-2.7.0_5.9.0
            
            ${IMPALA_HOME}/bin/start-impala-cluster.py --build_type=release -s 3 --impalad_args '--disable_mem_pools=true' --log_level 1
             
            impala-shell.sh
            
            [localhost.localdomain:21000] > select distinct p_type from bugtest where p_type='yydbyydb';
            Query submitted at: 2016-11-29 00:25:27 (Coordinator: http://10-10-212-60:25000)
            Query progress can be monitored at: http://10-10-212-60:25000/query_plan?query_id=4846650025a1f96b:a3c5552100000000
            +------------+
            | p_type |
            +------------+
            | yydbyydb   |
            |            |
            +------------+
            Fetched 2 row(s) in 7.64s
            [localhost.localdomain:21000] >
            
            
            tarmstrong Tim Armstrong added a comment -

            Did you see a crash or incorrect results? I wasn't able to reproduce any problems when running with disable_mem_pools=true. I tried a few different file formats.

            I don't understand based on your description why row batch B would reference memory from an IO buffer that was attached earlier. If your scenario is possible, the bug is that row batch B is referencing io_buffer O1.

            We have a wiki page that describes the memory transfer model here: https://cwiki.apache.org/confluence/display/IMPALA/Impala+Row+Batches

            tarmstrong Tim Armstrong added a comment -

            I wasn't able to reproduce the bug locally or see a memory management error in the code. It's not clear from the description whether there was a crash or other symptoms.

            Feel free to reopen if you have more info.

            cfreely Fu Lili added a comment -

            I can reproduce the bug in my test environment, but the test data is too large (about ~10 GB) to provide. But I have already done some debugging and tried to find the cause of the bug, as follows (see the sketch after the steps):

            1. Row batch A is initialized in HdfsParquetScanner::StartNewParquetRowBatch.
            2. In HdfsParquetScanner::AssembleRows, the scanner reads one data page into I/O buffer O1 and fills scratch_batch_.
            3. In TransferScratchTuples, tuple rows are transferred from scratch_batch_ to row batch A. In this case, row batch A fills up and is returned, but some tuple rows are still left in scratch_batch_.
            4. In CommitRows, because row batch A is full, I/O buffer O1 is attached to row batch A in ReleaseCompletedResources.
            5. Row batch B is initialized in HdfsParquetScanner::StartNewParquetRowBatch.
            6. In HdfsParquetScanner::GetNextInternal, the remaining tuple rows in scratch_batch_ are added to row batch B, but these tuple rows still reference memory in I/O buffer O1.
            7. Row batch A is consumed, and I/O buffer O1 is released.
            8. Row batch B now contains corrupted data.
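
            Below is a rough, runnable simulation of this eight-step sequence, using simplified stand-in types (IoBuffer, RowBatch) rather than Impala's real classes. It only demonstrates the ordering problem: the output batch fills up and takes ownership of the buffer while leftover scratch rows still point into it.

            #include <cstddef>
            #include <cstring>
            #include <iostream>
            #include <memory>
            #include <vector>

            // Hypothetical stand-ins for illustration only, not Impala's classes.
            struct IoBuffer {
              std::unique_ptr<char[]> data;
              explicit IoBuffer(std::size_t n) : data(new char[n]()) {}
            };

            struct RowBatch {
              std::size_t capacity;
              std::vector<const char*> string_rows;             // slots point into I/O buffers
              std::vector<std::unique_ptr<IoBuffer>> attached;  // buffers owned by this batch
              bool AtCapacity() const { return string_rows.size() >= capacity; }
              void Reset() { string_rows.clear(); attached.clear(); }  // frees attached buffers
            };

            int main() {
              // Steps 1-2: a data page is read into buffer O1 and the scratch
              // batch is filled with rows that point into O1.
              auto o1 = std::make_unique<IoBuffer>(64);
              std::strcpy(o1->data.get(), "uuid-aaa");
              std::strcpy(o1->data.get() + 16, "uuid-bbb");
              std::strcpy(o1->data.get() + 32, "uuid-ccc");
              std::vector<const char*> scratch = {o1->data.get(), o1->data.get() + 16,
                                                  o1->data.get() + 32};

              // Step 3: transfer scratch rows into batch A until it hits capacity;
              // one scratch row is left over.
              RowBatch batch_a{2};
              std::size_t next = 0;
              while (next < scratch.size() && !batch_a.AtCapacity()) {
                batch_a.string_rows.push_back(scratch[next++]);
              }

              // Step 4: batch A is full, so O1 is attached to it (as in
              // ReleaseCompletedResources), even though scratch rows still reference O1.
              batch_a.attached.push_back(std::move(o1));

              // Steps 5-6: the leftover scratch row is copied into batch B,
              // still pointing into O1.
              RowBatch batch_b{2};
              while (next < scratch.size()) batch_b.string_rows.push_back(scratch[next++]);

              // Step 7: batch A is consumed and reset, which releases O1.
              batch_a.Reset();

              // Step 8: batch B now dereferences freed memory (ASAN: heap-use-after-free).
              std::cout << batch_b.string_rows[0] << std::endl;
              return 0;
            }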

            tarmstrong Tim Armstrong added a comment -

            Thanks for the more detailed explanation. Can you please tell me what symptoms you saw so I can know whether I reproduced it - did you see incorrect results, a crash, or something else?

            I think this may require non-dictionary-encoded string columns to trigger, which may explain why we haven't seen it.

            tarmstrong Tim Armstrong added a comment -

            This looks like a real bug - the root cause is that resources can be attached to an output batch while still referenced by rows in scratch_batch_.

            tarmstrong Tim Armstrong added a comment -

            I was able to reproduce under ASAN with --disable_mem_pools=true. This only seems to be reproducible with plain-encoded, uncompressed strings. I reproduced it more quickly by hacking the Parquet writer to always write plain-encoded data, but the writer will automatically switch to plain encoding when writing distinct UUID values.

            start-impala-cluster.py --impalad_args="--disable_mem_pools=true"
            
            [tarmstrong-box.ca.cloudera.com:21000] > set compression_codec=none;
            COMPRESSION_CODEC set to none
            [tarmstrong-box.ca.cloudera.com:21000] > create table big_uuids stored as parquet as select uuid() from tpch_20_parquet.lineitem;
            Query: create table big_uuids stored as parquet as select uuid() from tpch_20_parquet.lineitem
            Query submitted at: 2016-12-07 13:08:37 (Coordinator: http://tarmstrong-box:25000)
            Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=b74cc52a10d50c79:a367b9600000000
            +---------------------------+
            | summary                   |
            +---------------------------+
            | Inserted 119994608 row(s) |
            +---------------------------+
            Fetched 1 row(s) in 69.20s
            [tarmstrong-box.ca.cloudera.com:21000] > select ndv(_c0) from big_uuids where substring(_c0, 1, 2) != 'a1' limit 10;
            Query: select ndv(_c0) from big_uuids where substring(_c0, 1, 2) != 'a1' limit 10
            Query submitted at: 2016-12-07 13:10:54 (Coordinator: http://tarmstrong-box:25000)
            Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=204990812e8ba7ab:83b7aa3d00000000
            
            
            WARNINGS: Cancelled due to unreachable impalad(s): tarmstrong-box:22002
            
            =================================================================
            ==10656==ERROR: AddressSanitizer: heap-use-after-free on address 0x7f15c0782354 at pc 0x000000f9cbf9 bp 0x7f162dec3240 sp 0x7f162dec29f0
            READ of size 2 at 0x7f15c0782354 thread T316
                #0 0xf9cbf8 in __interceptor_strncmp /data/jenkins/workspace/verify-impala-toolchain-package-build/label/ec2-package-ubuntu-16-04/toolchain/source/llvm/llvm-3.8.0.src-p1/projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:253
                #1 0x7f1641b70244  (<unknown module>)
                #2 0x17e5a79 in impala::HdfsParquetScanner::TransferScratchTuples(impala::RowBatch*) /tmp/be/src/exec/hdfs-parquet-scanner.cc:642:25
                #3 0x17e3ced in impala::HdfsParquetScanner::GetNextInternal(impala::RowBatch*) /tmp/be/src/exec/hdfs-parquet-scanner.cc:389:29
                #4 0x17e36c4 in impala::HdfsParquetScanner::ProcessSplit() /tmp/be/src/exec/hdfs-parquet-scanner.cc:351:31
                #5 0x1775375 in impala::HdfsScanNode::ProcessSplit(std::vector<impala::FilterContext, std::allocator<impala::FilterContext> > const&, impala::DiskIoMgr::ScanRange*) /tmp/be/src/exec/hdfs-scan-node.cc:539:12
                #6 0x1774541 in impala::HdfsScanNode::ScannerThread() /tmp/be/src/exec/hdfs-scan-node.cc:430:16
                #7 0x177e457 in boost::_bi::bind_t<void, boost::_mfi::mf0<void, impala::HdfsScanNode>, boost::_bi::list1<boost::_bi::value<impala::HdfsScanNode*> > >::operator()() /opt/Impala-Toolchain/boost-1.57.0/include/boost/bind/bind_template.hpp:20:16
                #8 0x12eb4e2 in boost::function0<void>::operator()() const /opt/Impala-Toolchain/boost-1.57.0/include/boost/function/function_template.hpp:766:14
                #9 0x16b21f5 in impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*) /tmp/be/src/util/thread.cc:317:3
                #10 0x16bafda in void boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> >::operator()<void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list0&, int) /opt/Impala-Toolchain/boost-1.57.0/include/boost/bind/bind.hpp:457:9
                #11 0x16bae67 in boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> > >::operator()() /opt/Impala-Toolchain/boost-1.57.0/include/boost/bind/bind_template.hpp:20:16
                #12 0x1cfe579 in thread_proxy (/home/tarmstrong/Impala/incubator-impala/be/build/debug/service/impalad+0x1cfe579)
                #13 0x7f18a3920709 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x7709)
                #14 0x7f18a344082c in clone /build/glibc-Qz8a69/glibc-2.23/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:109
            
            0x7f15c0782354 is located 8313684 bytes inside of 8388608-byte region [0x7f15bff94800,0x7f15c0794800)
            freed by thread T310 here:
                #0 0x1025350 in operator delete[](void*) /data/jenkins/workspace/verify-impala-toolchain-package-build/label/ec2-package-ubuntu-16-04/toolchain/source/llvm/llvm-3.8.0.src-p1/projects/compiler-rt/lib/asan/asan_new_delete.cc:98
                #1 0x13b3d5f in impala::DiskIoMgr::FreeBufferMemory(impala::DiskIoMgr::BufferDescriptor*) /tmp/be/src/runtime/disk-io-mgr.cc:806:7
                #2 0x13ae80c in impala::DiskIoMgr::ReturnBuffer(impala::DiskIoMgr::BufferDescriptor*) /tmp/be/src/runtime/disk-io-mgr.cc:664:7
                #3 0x13ae641 in impala::DiskIoMgr::BufferDescriptor::Return() /tmp/be/src/runtime/disk-io-mgr.cc:259:3
                #4 0x13eb4d6 in impala::RowBatch::Reset() /tmp/be/src/runtime/row-batch.cc:314:5
                #5 0x18852dc in impala::PartitionedAggregationNode::Open(impala::RuntimeState*) /tmp/be/src/exec/partitioned-aggregation-node.cc:354:5
                #6 0x1c86f02 in impala::PlanFragmentExecutor::OpenInternal() /tmp/be/src/runtime/plan-fragment-executor.cc:329:31
                #7 0x1c86872 in impala::PlanFragmentExecutor::Open() /tmp/be/src/runtime/plan-fragment-executor.cc:301:14
                #8 0x158c3a5 in impala::FragmentMgr::FragmentExecState::Exec() /tmp/be/src/service/fragment-exec-state.cc:53:9
                #9 0x157fd75 in impala::FragmentMgr::FragmentThread(impala::TUniqueId) /tmp/be/src/service/fragment-mgr.cc:86:3
                #10 0x15853fd in boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>::operator()(impala::FragmentMgr*, impala::TUniqueId) const /opt/Impala-Toolchain/boost-1.57.0/include/boost/bind/mem_fn_template.hpp:165:16
                #11 0x1585257 in void boost::_bi::list2<boost::_bi::value<impala::FragmentMgr*>, boost::_bi::value<impala::TUniqueId> >::operator()<boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>, boost::_bi::list0>(boost::_bi::type<void>, boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>&, boost::_bi::list0&, int) /opt/Impala-Toolchain/boost-1.57.0/include/boost/bind/bind.hpp:313:9
                #12 0x1585107 in boost::_bi::bind_t<void, boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>, boost::_bi::list2<boost::_bi::value<impala::FragmentMgr*>, boost::_bi::value<impala::TUniqueId> > >::operator()() /opt/Impala-Toolchain/boost-1.57.0/include/boost/bind/bind_template.hpp:20:16
                #13 0x12eb4e2 in boost::function0<void>::operator()() const /opt/Impala-Toolchain/boost-1.57.0/include/boost/function/function_template.hpp:766:14
                #14 0x16b21f5 in impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*) /tmp/be/src/util/thread.cc:317:3
                #15 0x16bafda in void boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> >::operator()<void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list0&, int) /opt/Impala-Toolchain/boost-1.57.0/include/boost/bind/bind.hpp:457:9
                #16 0x16bae67 in boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> > >::operator()() /opt/Impala-Toolchain/boost-1.57.0/include/boost/bind/bind_template.hpp:20:16
                #17 0x1cfe579 in thread_proxy (/home/tarmstrong/Impala/incubator-impala/be/build/debug/service/impalad+0x1cfe579)
            
            previously allocated by thread T226 here:
                #0 0x1024d50 in operator new[](unsigned long) /data/jenkins/workspace/verify-impala-toolchain-package-build/label/ec2-package-ubuntu-16-04/toolchain/source/llvm/llvm-3.8.0.src-p1/projects/compiler-rt/lib/asan/asan_new_delete.cc:64
            tarmstrong@tarmstrong-box:~/Impala/incubator-impala$ gvim be/src/exec/hdfs-parquet-scanner.cc 
            
            tarmstrong Tim Armstrong added a comment -

            I also hit this DCHECK while experimenting with some interesting row batch/I/O buffer size combinations.

            F1207 14:18:50.471175 20525 hdfs-parquet-scanner.cc:257] Check failed: scratch_batch_->mem_pool()->total_allocated_bytes() == 0 (17408 vs. 0) 
            *** Check failure stack trace: ***
                @          0x2ae7c36  google::DumpStackTraceAndExit()
                @          0x2ae10ad  google::LogMessage::Fail()
                @          0x2ae39d6  google::LogMessage::SendToLog()
                @          0x2ae0bcd  google::LogMessage::Flush()
                @          0x2ae447e  google::LogMessageFatal::~LogMessageFatal()
                @          0x17e30b8  impala::HdfsParquetScanner::Close()
                @          0x17a92be  impala::HdfsScanNodeMt::Close()
                @          0x173ceb7  impala::ExecNode::Close()
                @          0x188c449  impala::PartitionedAggregationNode::Close()
                @          0x1c89453  impala::PlanFragmentExecutor::Close()
                @          0x158c400  impala::FragmentMgr::FragmentExecState::Exec()
                @          0x157fd76  impala::FragmentMgr::FragmentThread()
                @          0x15853fe  boost::_mfi::mf1<>::operator()()
                @          0x1585258  boost::_bi::list2<>::operator()<>()
                @          0x1585108  boost::_bi::bind_t<>::operator()()
                @          0x12eb4e3  boost::function0<>::operator()()
                @          0x16b21f6  impala::Thread::SuperviseThread()
                @          0x16bafdb  boost::_bi::list4<>::operator()<>()
                @          0x16bae68  boost::_bi::bind_t<>::operator()()
                @          0x1cfe5ba  thread_proxy
                @     0x7fa856a2070a  start_thread
                @     0x7fa85654082d  (unknown)
            Picked up JAVA_TOOL_OPTIONS: -agentlib:jdwp=transport=dt_socket,address=30000,server=y,suspend=n 
            
            tarmstrong Tim Armstrong added a comment -

            IMPALA-4539: fix bug when scratch batch references I/O buffers

            The bug occurs when scanning uncompressed plain-encoded Parquet
            string data.

            Testing:
            I could manually reproduce it reliably with ASAN and
            --disable_mem_pools=true using the following steps:

            1. The repro is more predictable when there is no row batch queue.
              set mt_dop=1;
            2. A unique string column should switch to plain encoding
              set compression_codec=none;
              create table big_uuids stored as parquet as select uuid() from tpch_20_parquet.lineitem;
            3. The repro requires that some rows are filtered out, so that we end
               up in a state where the output batch is full before all rows are
               copied from the scratch batch.
              select ndv(_c0) from big_uuids where substring(_c0, 1, 2) != 'a1' limit 10;

            After the fix it no longer reproduces.

            I do not yet have a practical test case that triggers the bug on a
            normal ASAN setup. I will continue to try to create one.

            Change-Id: Ic27e7251e0f633cb694b506f6eb62beed6e66ad9
            Reviewed-on: http://gerrit.cloudera.org:8080/5406
            Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
            Tested-by: Impala Public Jenkins

            M be/src/exec/hdfs-parquet-scanner.cc
            1 file changed, 10 insertions, 8 deletions

            Approvals:
            Impala Public Jenkins: Verified
            Tim Armstrong: Looks good to me, approved

            tarmstrong Tim Armstrong added a comment -

            cfreely thank you for the bug report and your persistence.

            alex.behm Alexander Behm added a comment -

            In addition to Tim's instructions above, I've found that lowering the batch_size really helps with reproducing the issue, even with scan nodes that have a row batch queue. For example, a batch_size of 100 reproduced the issue reliably for me.


            People

              tarmstrong Tim Armstrong
              cfreely Fu Lili