Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
2.1.0.0-incubating
None
Description
QE processes hang on some segment nodes after the QD and the QEs on the other segment nodes have terminated.
On segment test2:
[gpadmin@test2 ~]$ pp
gpadmin 21614 0.0 1.2 788636 407428 ? Ss Feb26 1:19 /usr/local/hawq_2_1_0_0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-YARN/product/segmentdd -p 31100 --silent-mode=true -M segment -i
gpadmin 21615 0.0 0.0 279896 6952 ? Ss Feb26 0:08 postgres: port 31100, logger process
gpadmin 21618 0.0 0.0 282128 6980 ? Ss Feb26 0:00 postgres: port 31100, stats collector process
gpadmin 21619 0.0 0.0 788636 7280 ? Ss Feb26 0:11 postgres: port 31100, writer process
gpadmin 21620 0.0 0.0 788636 7064 ? Ss Feb26 0:01 postgres: port 31100, checkpoint process
gpadmin 21621 0.0 0.0 793048 11752 ? S Feb26 0:19 postgres: port 31100, segment resource manager
gpadmin 91760 0.0 0.0 861000 16840 ? TNsl Feb26 0:07 postgres: port 31100, gpadmin parquetola... 10.32.35.141(15250) con558 seg4 cmd2 slice11 MPPEXEC SELECT
gpadmin 91762 0.0 0.0 861064 17116 ? SNsl Feb26 0:08 postgres: port 31100, gpadmin parquetola... 10.32.35.141(15253) con558 seg5 cmd2 slice11 MPPEXEC SELECT
gpadmin 216648 0.0 0.0 103244 788 pts/0 S+ 19:54 0:00 grep postgres
The stack trace of the hung QE is:
(gdb) bt
#0  0x00000032214e1523 in select () from /lib64/libc.so.6
#1  0x000000000069c2fa in shareinput_writer_waitdone (ctxt=0x1dae520, share_id=0, nsharer_xslice=7) at nodeShareInputScan.c:989
#2  0x0000000000695798 in ExecEndMaterial (node=0x1d2eb50) at nodeMaterial.c:512
#3  0x000000000067048d in ExecEndNode (node=0x1d2eb50) at execProcnode.c:1681
#4  0x000000000069c6b5 in ExecEndShareInputScan (node=0x1d2e6f0) at nodeShareInputScan.c:382
#5  0x000000000067042a in ExecEndNode (node=0x1d2e6f0) at execProcnode.c:1674
#6  0x00000000006ac9be in ExecEndSequence (node=0x1d23890) at nodeSequence.c:165
#7  0x00000000006705f0 in ExecEndNode (node=0x1d23890) at execProcnode.c:1583
#8  0x000000000069a0ab in ExecEndResult (node=0x1d214a0) at nodeResult.c:481
#9  0x000000000067060d in ExecEndNode (node=0x1d214a0) at execProcnode.c:1575
#10 0x000000000069a0ab in ExecEndResult (node=0x1d20860) at nodeResult.c:481
#11 0x000000000067060d in ExecEndNode (node=0x1d20860) at execProcnode.c:1575
#12 0x0000000000698fd2 in ExecEndMotion (node=0x1d20320) at nodeMotion.c:1230
#13 0x0000000000670434 in ExecEndNode (node=0x1d20320) at execProcnode.c:1713
#14 0x0000000000669da7 in ExecEndPlan (planstate=0x1d20320, estate=0x1cb6b40) at execMain.c:2896
#15 0x000000000066a311 in ExecutorEnd (queryDesc=0x1cabf20) at execMain.c:1407
#16 0x00000000006195f2 in PortalCleanupHelper (portal=0x1cbcc40) at portalcmds.c:365
#17 PortalCleanup (portal=0x1cbcc40) at portalcmds.c:317
#18 0x0000000000900544 in AtAbort_Portals () at portalmem.c:693
#19 0x00000000004e697f in AbortTransaction () at xact.c:2800
#20 0x00000000004e7565 in AbortCurrentTransaction () at xact.c:3377
#21 0x00000000007ed0fa in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, username=0x1b47f10 "gpadmin") at postgres.c:4630
#22 0x00000000007a05d0 in BackendRun () at postmaster.c:5915
#23 BackendStartup () at postmaster.c:5484
#24 ServerLoop () at postmaster.c:2163
#25 0x00000000007a3399 in PostmasterMain (argc=Unhandled dwarf expression opcode 0xf3 ) at postmaster.c:1454
#26 0x00000000004a52e9 in main (argc=9, argv=0x1b0cd10) at main.c:226
(gdb) p CurrentTransactionState->state
$1 = TRANS_ABORT
(gdb) p pctxt->donefd
No symbol "pctxt" in current context.
(gdb) f 1
#1  0x000000000069c2fa in shareinput_writer_waitdone (ctxt=0x1dae520, share_id=0, nsharer_xslice=7) at nodeShareInputScan.c:989
989 nodeShareInputScan.c: No such file or directory.
in nodeShareInputScan.c
(gdb) p pctxt->donefd
$2 = 15
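For illustration only: the trace shows the QE blocked in select() inside shareinput_writer_waitdone while the transaction state is already TRANS_ABORT. The sketch below is not the HAWQ code and not the fix applied for this issue. It only shows, under stated assumptions, how a wait on a "done" file descriptor can be bounded with a select() timeout and a cancellation check, so that the waiter gives up when the peers it is waiting for will never report. The names wait_done_bounded, stop_flag_set and wait_result are hypothetical, and the one-byte-per-notification protocol is an assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch; not HAWQ identifiers. */
typedef enum { WAIT_DONE, WAIT_CANCELLED, WAIT_ERROR } wait_result;

/* Example cancellation check: arg points at a bool set by the abort path. */
static bool stop_flag_set(void *arg)
{
    return *(volatile bool *) arg;
}

/*
 * Read `expected` one-byte notifications from `donefd` (assumed protocol).
 * select() wakes up at least every `poll_ms` milliseconds so `should_stop`
 * is re-checked instead of blocking forever. donefd must be < FD_SETSIZE.
 */
static wait_result wait_done_bounded(int donefd, int expected, int poll_ms,
                                     bool (*should_stop)(void *), void *arg)
{
    int seen = 0;

    while (seen < expected)
    {
        fd_set rset;
        struct timeval tv;
        int n;

        /* Give up if the caller says the wait is pointless (e.g. abort). */
        if (should_stop != NULL && should_stop(arg))
            return WAIT_CANCELLED;

        FD_ZERO(&rset);
        FD_SET(donefd, &rset);
        tv.tv_sec = poll_ms / 1000;
        tv.tv_usec = (poll_ms % 1000) * 1000;

        n = select(donefd + 1, &rset, NULL, NULL, &tv);
        if (n < 0)
        {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: retry */
            return WAIT_ERROR;
        }
        if (n == 0)
            continue;               /* timeout: loop and re-check should_stop */

        char c;
        ssize_t r = read(donefd, &c, 1);
        if (r == 1)
            seen++;                 /* one peer reported done */
        else if (r == 0)
            return WAIT_ERROR;      /* EOF: no peer can report any more */
        else if (errno != EINTR && errno != EAGAIN)
            return WAIT_ERROR;
    }
    return WAIT_DONE;
}
```

In this sketch the abort path would set the flag passed as arg before cleanup reaches the wait, so a waiter in the state captured above would return WAIT_CANCELLED instead of staying in select().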