Apache HAWQ (Retired) / HAWQ-1371

QE process hang in shared input scan

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0.0-incubating
    • Fix Version/s: 2.2.0.0-incubating
    • Component/s: Query Execution
    • Labels: None

    Description

      A QE process hangs on one segment node while the QD and the QEs on the other segment nodes have terminated.

      On segment test2:
      [gpadmin@test2 ~]$ pp
      gpadmin   21614  0.0  1.2 788636 407428 ?       Ss   Feb26   1:19 /usr/local/hawq_2_1_0_0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-YARN/product/segmentdd -p 31100 --silent-mode=true -M segment -i
      gpadmin   21615  0.0  0.0 279896  6952 ?        Ss   Feb26   0:08 postgres: port 31100, logger process
      gpadmin   21618  0.0  0.0 282128  6980 ?        Ss   Feb26   0:00 postgres: port 31100, stats collector process
      gpadmin   21619  0.0  0.0 788636  7280 ?        Ss   Feb26   0:11 postgres: port 31100, writer process
      gpadmin   21620  0.0  0.0 788636  7064 ?        Ss   Feb26   0:01 postgres: port 31100, checkpoint process
      gpadmin   21621  0.0  0.0 793048 11752 ?        S    Feb26   0:19 postgres: port 31100, segment resource manager
      gpadmin   91760  0.0  0.0 861000 16840 ?        TNsl Feb26   0:07 postgres: port 31100, gpadmin parquetola... 10.32.35.141(15250) con558 seg4 cmd2 slice11 MPPEXEC SELECT
      gpadmin   91762  0.0  0.0 861064 17116 ?        SNsl Feb26   0:08 postgres: port 31100, gpadmin parquetola... 10.32.35.141(15253) con558 seg5 cmd2 slice11 MPPEXEC SELECT
      gpadmin  216648  0.0  0.0 103244   788 pts/0    S+   19:54   0:00 grep postgres
      

      The QE stack trace is:

      (gdb) bt
      #0  0x00000032214e1523 in select () from /lib64/libc.so.6
      #1  0x000000000069c2fa in shareinput_writer_waitdone (ctxt=0x1dae520, share_id=0, nsharer_xslice=7) at nodeShareInputScan.c:989
      #2  0x0000000000695798 in ExecEndMaterial (node=0x1d2eb50) at nodeMaterial.c:512
      #3  0x000000000067048d in ExecEndNode (node=0x1d2eb50) at execProcnode.c:1681
      #4  0x000000000069c6b5 in ExecEndShareInputScan (node=0x1d2e6f0) at nodeShareInputScan.c:382
      #5  0x000000000067042a in ExecEndNode (node=0x1d2e6f0) at execProcnode.c:1674
      #6  0x00000000006ac9be in ExecEndSequence (node=0x1d23890) at nodeSequence.c:165
      #7  0x00000000006705f0 in ExecEndNode (node=0x1d23890) at execProcnode.c:1583
      #8  0x000000000069a0ab in ExecEndResult (node=0x1d214a0) at nodeResult.c:481
      #9  0x000000000067060d in ExecEndNode (node=0x1d214a0) at execProcnode.c:1575
      #10 0x000000000069a0ab in ExecEndResult (node=0x1d20860) at nodeResult.c:481
      #11 0x000000000067060d in ExecEndNode (node=0x1d20860) at execProcnode.c:1575
      #12 0x0000000000698fd2 in ExecEndMotion (node=0x1d20320) at nodeMotion.c:1230
      #13 0x0000000000670434 in ExecEndNode (node=0x1d20320) at execProcnode.c:1713
      #14 0x0000000000669da7 in ExecEndPlan (planstate=0x1d20320, estate=0x1cb6b40) at execMain.c:2896
      #15 0x000000000066a311 in ExecutorEnd (queryDesc=0x1cabf20) at execMain.c:1407
      #16 0x00000000006195f2 in PortalCleanupHelper (portal=0x1cbcc40) at portalcmds.c:365
      #17 PortalCleanup (portal=0x1cbcc40) at portalcmds.c:317
      #18 0x0000000000900544 in AtAbort_Portals () at portalmem.c:693
      #19 0x00000000004e697f in AbortTransaction () at xact.c:2800
      #20 0x00000000004e7565 in AbortCurrentTransaction () at xact.c:3377
      #21 0x00000000007ed0fa in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, username=0x1b47f10 "gpadmin") at postgres.c:4630
      #22 0x00000000007a05d0 in BackendRun () at postmaster.c:5915
      #23 BackendStartup () at postmaster.c:5484
      #24 ServerLoop () at postmaster.c:2163
      #25 0x00000000007a3399 in PostmasterMain (argc=Unhandled dwarf expression opcode 0xf3
      ) at postmaster.c:1454
      #26 0x00000000004a52e9 in main (argc=9, argv=0x1b0cd10) at main.c:226
      (gdb) p CurrentTransactionState->state
      $1 = TRANS_ABORT
      (gdb) p pctxt->donefd
      No symbol "pctxt" in current context.
      (gdb) f 1
      #1  0x000000000069c2fa in shareinput_writer_waitdone (ctxt=0x1dae520, share_id=0, nsharer_xslice=7) at nodeShareInputScan.c:989
      989    	nodeShareInputScan.c: No such file or directory.
             	in nodeShareInputScan.c
      (gdb) p pctxt->donefd
      $2 = 15
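
The trace above shows the QE in `TRANS_ABORT`, blocked in `select()` inside `shareinput_writer_waitdone` with `nsharer_xslice=7`, while the other processes for the query have already terminated. As a rough illustration of how such a wait can be made interruptible, here is a minimal sketch in C. It is not the HAWQ code or the committed fix: `wait_for_readers`, `wait_result`, the `aborting` flag, and the one-byte-per-reader notification scheme are all assumptions made for this example.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch; not HAWQ identifiers. */
typedef enum { WAIT_DONE, WAIT_ABORTED, WAIT_ERROR } wait_result;

/*
 * Wait until `expected` "done" notifications have arrived on donefd.
 * Assumption: each reader signals completion by writing one byte.
 * Instead of blocking indefinitely, select() uses a timeout of poll_ms
 * milliseconds so that *aborting is re-checked on every wakeup.
 */
static wait_result
wait_for_readers(int donefd, int expected,
                 const volatile bool *aborting, int poll_ms)
{
    int acked = 0;

    while (acked < expected) {
        /* Give up as soon as the caller reports that the query is aborting. */
        if (*aborting)
            return WAIT_ABORTED;

        fd_set rset;
        FD_ZERO(&rset);
        FD_SET(donefd, &rset);

        /* Rebuild the timeout each time; select() may modify it. */
        struct timeval tv;
        tv.tv_sec = poll_ms / 1000;
        tv.tv_usec = (poll_ms % 1000) * 1000;

        int n = select(donefd + 1, &rset, NULL, NULL, &tv);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            return WAIT_ERROR;
        }
        if (n == 0)
            continue;              /* timeout: loop back and re-check the flag */

        char byte;
        ssize_t r = read(donefd, &byte, 1);
        if (r == 1)
            acked++;               /* one reader has reported completion */
        else if (r == 0)
            return WAIT_ERROR;     /* every write end is closed: nobody left to wait for */
        else if (errno != EINTR)
            return WAIT_ERROR;
    }
    return WAIT_DONE;
}
```

A caller would pass the read end of the notification pipe, the number of cross-slice readers, and a flag that is set when the transaction starts to abort. The point of the sketch is only that a bounded `select()` timeout plus an abort check lets the waiter exit when its peers are gone, rather than sleeping forever as in frame #0.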
      

          People

            Assignee: Amy Bai (abai)
            Reporter: Amy Bai (abai)