Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, Impala 2.9.0
-
ghx-label-1
Description
Canceling query while remote fragments are in PhjBuilder::Partition::InsertBatch kills the query but some of the remote fragments continue to spin indefinitely and doesn't release memory back.
stack for stuck thread
Thread 1 (process 26668): #0 0x00007f8f0a138500 in impala::PhjBuilder::Partition::InsertBatch () #1 0x0000000000d37f72 in impala::PhjBuilder::Partition::BuildHashTable(bool*) () #2 0x0000000000d38d65 in impala::PhjBuilder::BuildHashTablesAndPrepareProbeStreams() () #3 0x0000000000d392a0 in impala::PhjBuilder::FlushFinal(impala::RuntimeState*) () #4 0x0000000000d20447 in impala::Status impala::BlockingJoinNode::SendBuildInputToSink<true>(impala::RuntimeState*, impala::DataSink*) () #5 0x0000000000d1ed5f in impala::BlockingJoinNode::ProcessBuildInputAsync(impala::RuntimeState*, impala::DataSink*, impala::Status*) () #6 0x0000000000bec7d9 in impala::Thread::SuperviseThread(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::function<void ()()>, impala::Promise<long>*) () #7 0x0000000000bed194 in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::function<void ()()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void ()()> >, boost::_bi::value<impala::Promise<long>*> > > >::run() () #8 0x0000000000e540ea in ?? () #9 0x000000316da07aa1 in start_thread () from /lib64/libpthread.so.0 #10 0x000000316d6e893d in clone () from /lib64/libc.so.6
Query
SELECT /* +straight_join */ count(*) FROM lineitem JOIN /* +broadcast */ supplier ON l_suppkey = s_suppkey JOIN /* +broadcast */ part ON l_partkey = p_partkey JOIN /* +broadcast */ orders ON l_orderkey = o_orderkey WHERE s_name='Supplier#001880004' AND p_brand < 'Brand#30' AND o_orderdate < '1994-01-01'
Snapshot from memz page
Process: Limit=201.73 GB Total=148.77 GB Peak=182.63 GB Free Disk IO Buffers: Total=1.88 GB Peak=1.88 GB RequestPool=root.default: Total=133.39 GB Peak=167.99 GB Query(d84d4f69e2366d60:72d5ad1800000000): Total=133.39 GB Peak=167.99 GB Fragment d84d4f69e2366d60:72d5ad18000001d1: Total=39.18 GB Peak=163.04 GB Runtime Filter Bank: Total=4.01 MB Peak=4.01 MB AGGREGATION_NODE (id=7): Total=40.00 KB Peak=8.51 MB Exprs: Total=4.00 KB Peak=4.00 KB HASH_JOIN_NODE (id=6): Total=39.17 GB Peak=39.17 GB Hash Join Builder (join_node_id=6): Total=39.17 GB Peak=39.17 GB HASH_JOIN_NODE (id=5): Total=50.25 KB Peak=129.25 GB Hash Join Builder (join_node_id=5): Total=13.12 KB Peak=129.24 GB HASH_JOIN_NODE (id=4): Total=42.25 KB Peak=1.05 MB Hash Join Builder (join_node_id=4): Total=13.12 KB Peak=1.01 MB HDFS_SCAN_NODE (id=0): Total=0 Peak=2.24 GB EXCHANGE_NODE (id=8): Total=0 Peak=0 DataStreamRecvr: Total=0 Peak=4.01 KB EXCHANGE_NODE (id=9): Total=0 Peak=0 DataStreamRecvr: Total=0 Peak=29.98 MB EXCHANGE_NODE (id=10): Total=0 Peak=0 DataStreamRecvr: Total=0 Peak=16.39 MB DataStreamSender (dst_id=11): Total=16.00 KB Peak=16.00 KB CodeGen: Total=22.08 KB Peak=3.56 MB Block Manager: Limit=161.39 GB Total=133.38 GB Peak=161.38 GB Query(73440af14cf09128:21f5ed1400000000): Total=31.27 KB Peak=216.54 MB Fragment 73440af14cf09128:21f5ed140000003a: Total=31.27 KB Peak=216.54 MB AGGREGATION_NODE (id=1): Total=8.00 KB Peak=1.89 MB Exprs: Total=4.00 KB Peak=4.00 KB HDFS_SCAN_NODE (id=0): Total=0 Peak=216.45 MB DataStreamSender (dst_id=2): Total=14.22 KB Peak=14.22 KB CodeGen: Total=1.05 KB Peak=188.00 KB Block Manager: Limit=161.39 GB Total=0 Peak=0 Untracked Memory: Total=13.50 GB
From the impalad.INFO log
[mmokhtar@va1026 impalad]$ cat impalad.va1026.foo.net.impala.log.INFO.20170531-170204.96499 | grep d84d4f69e2366d60:72d5ad1800000000 | head -n 500 I0601 00:45:10.883314 96906 admission-controller.cc:442] Schedule for id=d84d4f69e2366d60:72d5ad1800000000 in pool_name=root.default cluster_mem_needed=17179.57 GB PoolConfig: max_requests=100 max_queued=200 max_mem=-1.00 B I0601 00:45:10.891855 96906 admission-controller.cc:453] Admitted query id=d84d4f69e2366d60:72d5ad1800000000 I0601 00:45:10.903158 96906 coordinator.cc:121] Exec() query_id=d84d4f69e2366d60:72d5ad1800000000 stmt=-- Description : Selective broadcast joins between lineitem, supplier, part and orders. I0601 00:45:10.935317 96906 coordinator.cc:383] starting execution on 130 backends for query d84d4f69e2366d60:72d5ad1800000000 I0601 00:45:10.996497 135837 impala-internal-service.cc:44] ExecQueryFInstances(): query_id=d84d4f69e2366d60:72d5ad1800000000 I0601 00:45:10.996796 135837 query-exec-mgr.cc:46] StartQueryFInstances() query_id=d84d4f69e2366d60:72d5ad1800000000 coord=va1026.foo.net:22000 I0601 00:45:10.997879 58950 query-state.cc:263] StartFInstances(): query_id=d84d4f69e2366d60:72d5ad1800000000 #instances=1 I0601 00:45:11.011160 58950 query-state.cc:276] descriptor table for query=d84d4f69e2366d60:72d5ad1800000000 I0601 00:45:11.012342 58951 query-state.cc:325] Executing instance. instance_id=d84d4f69e2366d60:72d5ad1800000000 fragment_idx=0 per_fragment_instance_idx=0 coord_state_idx=0 #in-flight=1 I0601 00:45:11.034893 58950 query-exec-mgr.cc:138] ReleaseQueryState(): query_id=d84d4f69e2366d60:72d5ad1800000000 refcnt=3 I0601 00:45:12.325824 96906 coordinator.cc:396] started execution on 130 backends for query d84d4f69e2366d60:72d5ad1800000000 I0601 00:45:12.353492 96906 impala-server.cc:926] Query d84d4f69e2366d60:72d5ad1800000000 has timeout of 10m I0601 02:42:08.973990 115288 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000 I0601 02:42:09.172730 11880 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000 I0601 02:42:09.579278 127412 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000 I0601 02:42:11.640924 112391 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000 I0601 02:42:19.061241 115418 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000 I0601 02:42:19.554117 112209 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000 I0601 02:42:29.405820 127414 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000 I0601 02:42:30.752419 145416 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000 I0601 02:42:36.649454 111018 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000 I0601 02:42:42.374414 123586 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000 I0601 02:42:47.766850 143075 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000 I0601 02:42:56.262635 11526 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000 I0601 02:43:00.584983 132521 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000 I0601 02:43:40.918494 132978 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000 I0601 08:34:45.016852 106703 impala-server.cc:937] UnregisterQuery(): query_id=d84d4f69e2366d60:72d5ad1800000000 I0601 08:34:45.017141 106703 impala-server.cc:1022] Cancel(): query_id=d84d4f69e2366d60:72d5ad1800000000 I0601 08:34:45.248535 106703 coordinator.cc:920] Cancel() query_id=d84d4f69e2366d60:72d5ad1800000000 I0601 08:34:45.248984 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1026.foo.net:22000 I0601 08:34:45.249454 120728 impala-internal-service.cc:63] CancelQueryFInstances(): query_id=d84d4f69e2366d60:72d5ad1800000000 I0601 08:34:45.249811 120728 query-exec-mgr.cc:83] QueryState: query_id=d84d4f69e2366d60:72d5ad1800000000 refcnt=3 I0601 08:34:45.250104 120728 query-state.cc:342] Cancel: query_id=d84d4f69e2366d60:72d5ad1800000000 I0601 08:34:45.250383 120728 data-stream-mgr.cc:258] cancelling all streams for fragment=d84d4f69e2366d60:72d5ad1800000000 I0601 08:34:45.250669 120728 data-stream-recvr.cc:237] cancelled stream: fragment_instance_id_=d84d4f69e2366d60:72d5ad1800000000 node_id=11 I0601 08:34:45.250993 120728 query-exec-mgr.cc:138] ReleaseQueryState(): query_id=d84d4f69e2366d60:72d5ad1800000000 refcnt=3 I0601 08:34:45.251452 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1029.foo.net:22000 I0601 08:34:45.253317 58951 data-stream-mgr.cc:226] DeregisterRecvr(): fragment_instance_id=d84d4f69e2366d60:72d5ad1800000000, node=11 I0601 08:34:45.255867 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1030.foo.net:22000 I0601 08:34:45.266269 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1031.foo.net:22000 I0601 08:34:45.273082 58951 query-state.cc:332] Instance completed. instance_id=d84d4f69e2366d60:72d5ad1800000000 #in-flight=0 status=CANCELLED: Cancelled I0601 08:34:45.273423 58951 query-state.cc:342] Cancel: query_id=d84d4f69e2366d60:72d5ad1800000000 I0601 08:34:45.273550 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1032.foo.net:22000 I0601 08:34:45.273679 58951 query-exec-mgr.cc:138] ReleaseQueryState(): query_id=d84d4f69e2366d60:72d5ad1800000000 refcnt=2 I0601 08:34:45.278386 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1033.foo.net:22000 I0601 08:34:45.283445 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1034.foo.net:22000 I0601 08:34:45.291430 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1035.foo.net:22000 I0601 08:34:45.296653 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1036.foo.net:22000 I0601 08:34:45.301268 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1037.foo.net:22000 I0601 08:34:45.306454 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1038.foo.net:22000 I0601 08:34:45.309927 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1039.foo.net:22000 I0601 08:34:45.315886 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1041.foo.net:22000 I0601 08:34:45.321213 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1042.foo.net:22000 I0601 08:34:45.328903 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1043.foo.net:22000 I0601 08:34:45.333417 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1045.foo.net:22000 I0601 08:34:45.338429 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1046.foo.net:22000 I0601 08:34:45.344604 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=vc1501.foo.net:22000 I0601 08:34:45.348870 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=vc1502.foo.net:22000 I0601 08:34:45.355267 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=vc1503.foo.net:22000 I0601 08:34:45.360085 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=vc1504.foo.net:22000 I0601 08:34:45.365375 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=vc1505.foo.net:22000 I0601 08:34:45.371361 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=vc1507.foo.net:22000 I0601 08:34:45.380437 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=vc1508.foo.net:22000 I0601 08:34:45.384851 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=vc1509.foo.net:22000