Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: Impala 4.0.0, Impala 4.1.0, Impala 4.2.0, Impala 4.1.1
- None
- Labels: ghx-label-8
Description
We saw a crash in a query that aggregates the string partition column of an Avro table with MT_DOP set to 4. The query is quite simple:
create external table date_str_avro (v int) partitioned by (date_str string) stored as avro;
-- Import the files attached in this JIRA, then repeat the following query.
-- It will crash within 10 runs.
set MT_DOP=2;
select count(*), date_str from date_str_avro group by date_str;
Reproducing the crash requires a specific data set; the files and import steps are given below.
The crash also reproduces with codegen disabled ("set disable_codegen=1"). The stack trace is:
Crash reason: SIGSEGV /SEGV_MAPERR
Crash address: 0x0
Process uptime: not available
Thread 512 (crashed)
 0 impalad!impala::HashTableCtx::Hash(void const*, int, unsigned int) const [sse-util.h : 227 + 0x2]
 1 impalad!impala::HashTableCtx::HashVariableLenRow(unsigned char const*, unsigned char const*) const [hash-table.cc : 306 + 0x8]
 2 impalad!impala::HashTableCtx::HashRow(unsigned char const*, unsigned char const*) const [hash-table.cc : 255 + 0x5]
 3 impalad!void impala::GroupingAggregator::EvalAndHashPrefetchGroup<false>(impala::RowBatch*, int, impala::TPrefetchMode::type, impala::HashTableCtx*) [hash-table.inline.h : 39 + 0xe]
 4 impalad!impala::GroupingAggregator::AddBatchStreamingImpl(int, bool, impala::TPrefetchMode::type, impala::RowBatch*, impala::RowBatch*, impala::HashTableCtx*, int*) [grouping-aggregator-ir.cc : 185 + 0x1c]
 5 impalad!impala::GroupingAggregator::AddBatchStreaming(impala::RuntimeState*, impala::RowBatch*, impala::RowBatch*, bool*) [grouping-aggregator.cc : 520 + 0x2d]
 6 impalad!impala::StreamingAggregationNode::GetRowsStreaming(impala::RuntimeState*, impala::RowBatch*) [streaming-aggregation-node.cc : 120 + 0x3]
 7 impalad!impala::StreamingAggregationNode::GetNext(impala::RuntimeState*, impala::RowBatch*, bool*) [streaming-aggregation-node.cc : 77 + 0x19]
 8 impalad!impala::FragmentInstanceState::ExecInternal() [fragment-instance-state.cc : 446 + 0x3]
 9 impalad!impala::FragmentInstanceState::Exec() [fragment-instance-state.cc : 104 + 0xb]
10 impalad!impala::QueryState::ExecFInstance(impala::FragmentInstanceState*) [query-state.cc : 950 + 0x19]
11 impalad!impala::Thread::SuperviseThread(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long, (impala::PromiseMode)0>*) [function_template.hpp : 763 + 0x3]
12 impalad!boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long, (impala::PromiseMode)0>*), boost::_bi::list5<boost::_bi::value<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::ThreadDebugInfo*>, boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*> > > >::run() [bind.hpp : 531 + 0x3]
13 impalad!thread_proxy + 0x67
14 libpthread.so.0 + 0x76ba
15 libc.so.6 + 0x1074dd
This is reproduced on commit 2733d039a of the master branch.
Reproducing the bug requires the following conditions:
- Partitioned Avro table
- MT_DOP is set to a value larger than 1
- The query performs follow-up processing (e.g. GROUP BY or JOIN) on the partition values or on the default values of fields missing from the files.
- The number of files (blocks) is larger than the number of impalads, so multiple scan fragment instances run on one impalad.
- Some scan node instances finish earlier than others, e.g. when there are both small files and large files.
Steps to import the attached Avro data files:
$ tar zxf date_str_avro.tar.gz
$ hdfs dfs -put date_str_avro/* hdfs_location_of_table_dir
impala-shell> alter table date_str_avro recover partitions;
RCA
This is a bug introduced by IMPALA-9655.
Each Avro file requires at least two scan ranges. The initial range reads the file header and initializes the template tuple; the initial scanner then issues follow-up scan ranges to read the file content. The memory of the template tuple is transferred to the ScanNode. Note that partition values are materialized into the template tuple.
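To illustrate this ownership, below is a minimal, hypothetical C++ sketch (StringSlot, MemPoolLike, TemplateTuple and MaterializeTemplate are simplified stand-ins, not Impala's actual classes): a string slot in the template tuple stores only a pointer and a length, and the bytes behind the pointer live in a pool owned by one scan node instance, so the slot is only valid while that instance is alive.

// Hypothetical sketch of the ownership described above. The names are
// illustrative stand-ins, not Impala's real classes.
#include <cstring>
#include <memory>
#include <string>
#include <vector>

// Variable-length slot in a tuple: pointer + length, no ownership of the bytes.
struct StringSlot { const char* ptr = nullptr; size_t len = 0; };

// Per-scan-node pool that owns variable-length memory; it is freed when the
// owning scan node instance closes.
class MemPoolLike {
 public:
  const char* Copy(const std::string& s) {
    auto buf = std::make_unique<char[]>(s.size());
    std::memcpy(buf.get(), s.data(), s.size());
    storage_.push_back(std::move(buf));
    return storage_.back().get();
  }
 private:
  std::vector<std::unique_ptr<char[]>> storage_;
};

// Template tuple materialized by the header scan range: partition values (and
// Avro defaults for missing fields) are written here once per file and then
// copied into every row produced for that file.
struct TemplateTuple { StringSlot date_str; };

TemplateTuple MaterializeTemplate(MemPoolLike* node_pool, const std::string& part_value) {
  TemplateTuple t;
  t.date_str.ptr = node_pool->Copy(part_value);  // bytes owned by the node's pool
  t.date_str.len = part_value.size();
  return t;
}

int main() {
  MemPoolLike scan_node_pool;
  TemplateTuple t = MaterializeTemplate(&scan_node_pool, "2022-01-01");
  // Every output row for this file shares t.date_str.ptr; the pointer is only
  // valid while scan_node_pool (i.e. the owning scan node instance) is alive.
  return t.date_str.len == 10 ? 0 : 1;
}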
After IMPALA-9655, the ranges of a file can be scheduled to different ScanNode instances when MT_DOP > 1. In the following sequence there is an illegal "heap-use-after-free" memory access, which can cause a crash; a simplified simulation of the sequence is sketched after the timeline.
t0:
Scanner of ScanNode-1 reads header of a large avro file.
Scanner of ScanNode-2 reads header of a small avro file.
Varlen memory of the template_tuple is transferred to the corresponding ScanNode.
t1:
Scanner of ScanNode-1 reads content of the small avro file.
Scanner of ScanNode-2 reads content of the large avro file.
The scanner reuses the template_tuple created by the header scanner [1], so the RowBatches produced by ScanNode-2 actually reference memory owned by ScanNode-1.
t2:
ScanNode-1 finishes first and closes (assuming no more files to read).
The downstream consumer of ScanNode-2 will crash when it accesses the partition string values.
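To make the t0-t2 sequence concrete, here is a hedged, self-contained C++ simulation of the failure (again, every name is hypothetical; this is not Impala code, only an illustration of the ownership mistake it describes): ScanNode-1's pool owns the partition string for the large file, ScanNode-2's output rows reference it, and once ScanNode-1 closes and frees its pool the downstream aggregation is left hashing freed memory.

// Hypothetical stand-alone simulation of the t0-t2 sequence above. None of
// these names are Impala's real APIs; the point is only the ownership bug:
// a row batch referencing varlen memory owned by a scan node instance that
// closes earlier.
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

struct StringValue { const char* ptr; size_t len; };

// Simplified per-instance memory pool: freed when the instance closes.
struct ScanNodeInstance {
  std::vector<char*> pool;

  // The header scanner materializes a partition value into this instance's pool.
  StringValue MaterializePartitionValue(const std::string& v) {
    char* buf = new char[v.size()];
    std::memcpy(buf, v.data(), v.size());
    pool.push_back(buf);
    return {buf, v.size()};
  }

  void Close() {  // the instance finishes and frees everything it owns
    for (char* b : pool) delete[] b;
    pool.clear();
  }
};

int main() {
  ScanNodeInstance node1, node2;

  // t0: node1's header scanner reads the large file's header and materializes
  // the template tuple; the varlen bytes are owned by node1's pool.
  StringValue date_str = node1.MaterializePartitionValue("2022-01-01");

  // t1: node2 scans the content ranges of the large file and reuses that
  // template tuple, so its output rows reference node1's memory.
  std::vector<StringValue> node2_output_rows = {date_str};

  // t2: node1 has no more ranges, closes, and frees its pool.
  node1.Close();

  // The downstream consumer of node2 (e.g. the grouping aggregator) would now
  // hash a dangling pointer: heap-use-after-free, possibly SIGSEGV.
  const StringValue& row = node2_output_rows[0];
  std::printf("row references %zu freed bytes at %p\n", row.len,
              static_cast<const void*>(row.ptr));
  // (Actually reading row.ptr here would be undefined behavior.)
  node2.Close();
  return 0;
}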
Attachments
Issue Links
- is caused by: IMPALA-9655 Dynamic intra-node load balancing for HDFS scans (Resolved)
- links to