Uploaded image for project: 'Apache HAWQ'
  1. Apache HAWQ
  2. HAWQ-1408

PANICs during COPY ... FROM STDIN

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.1.0.0-incubating
    • 2.2.0.0-incubating
    • Core
    • None

    Description

      We found PANIC (and respective core dumps). From the initial analysis from the logs and core dump, the query causing this PANIC is a "COPY ... FROM STDIN". This query does not always panic.
      This kind of queries are executed from Java/Scala code (by one of IG Spark Jobs). Connection to the DB is managed by connection pool (commons-dbcp2) and validated on borrow by “select 1” validation query. IG is using postgresql-9.4-1206-jdbc41 as a java driver to create those connections. I believe they should be using the driver from DataDirect, available in PivNet; however, I haven't found hard evidence pointing the driver as a root cause.

      My initial analysis on the packcore for the master PANIC. Not sure if this helps or makes sense.

      This is the backtrace of the packcore for process 466858:

      (gdb) bt
      #0  0x00007fd875f906ab in raise () from /data/logs/52280/packcore-core.postgres.466858/lib64/libpthread.so.0
      #1  0x00000000008c0b19 in SafeHandlerForSegvBusIll (postgres_signal_arg=11, processName=<optimized out>) at elog.c:4519
      #2  <signal handler called>
      #3  0x000000000053b9c3 in SetSegnoForWrite (existing_segnos=0x4c46ff0, existing_segnos@entry=0x0, relid=relid@entry=1195061, segment_num=segment_num@entry=6, forNewRel=forNewRel@entry=0 '\000', keepHash=keepHash@entry=1 '\001') at appendonlywriter.c:1166
      #4  0x000000000053c08f in assignPerRelSegno (all_relids=all_relids@entry=0x2b96d68, segment_num=6) at appendonlywriter.c:1212
      #5  0x00000000005f79e8 in DoCopy (stmt=stmt@entry=0x2b2a3d8, queryString=<optimized out>) at copy.c:1591
      #6  0x00000000007ef737 in ProcessUtility (parsetree=parsetree@entry=0x2b2a3d8, queryString=0x2c2f550 "COPY mis_data_ig_client_derived_attributes.client_derived_attributes_src (id, tracking_id, name, value_string, value_timestamp, value_number, value_boolean, environment, account, channel, device, feat"...,
          params=0x0, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x2b2a7c8, completionTag=completionTag@entry=0x7ffcb5e318e0 "") at utility.c:1076
      #7  0x00000000007ea95e in PortalRunUtility (portal=portal@entry=0x2b8eab0, utilityStmt=utilityStmt@entry=0x2b2a3d8, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x2b2a7c8, completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:1969
      #8  0x00000000007ec13e in PortalRunMulti (portal=portal@entry=0x2b8eab0, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x2b2a7c8, altdest=altdest@entry=0x2b2a7c8, completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:2079
      #9  0x00000000007ede95 in PortalRun (portal=portal@entry=0x2b8eab0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x2b2a7c8, altdest=altdest@entry=0x2b2a7c8, completionTag=completionTag@entry=0x7ffcb5e318e0 "") at pquery.c:1596
      #10 0x00000000007e5ad9 in exec_simple_query (query_string=query_string@entry=0x2b29100 "COPY mis_data_ig_client_derived_attributes.client_derived_attributes_src (id, tracking_id, name, value_string, value_timestamp, value_number, value_boolean, environment, account, channel, device, feat"...,
          seqServerHost=seqServerHost@entry=0x0, seqServerPort=seqServerPort@entry=-1) at postgres.c:1816
      #11 0x00000000007e6cb2 in PostgresMain (argc=<optimized out>, argv=<optimized out>, argv@entry=0x29d7820, username=0x29d75d0 "mis_ig") at postgres.c:4840
      #12 0x0000000000799540 in BackendRun (port=0x29afc50) at postmaster.c:5915
      #13 BackendStartup (port=0x29afc50) at postmaster.c:5484
      #14 ServerLoop () at postmaster.c:2163
      #15 0x000000000079c309 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at postmaster.c:1454
      #16 0x00000000004a4209 in main (argc=9, argv=0x29af010) at main.c:226
      

      Jumping into the frame 3 and running info locals, we found something odd for "status" variable:

      (gdb) f 3
      #3  0x000000000053b9c3 in SetSegnoForWrite (existing_segnos=0x4c46ff0, existing_segnos@entry=0x0, relid=relid@entry=1195061, segment_num=segment_num@entry=6, forNewRel=forNewRel@entry=0 '\000', keepHash=keepHash@entry=1 '\001') at appendonlywriter.c:1166
      1166	appendonlywriter.c: No such file or directory.
      (gdb) info locals
      status = 0x0
      [...]
      

      This panic comes from this piece of code in "appendonlywritter.c":

      for (int i = 0; i < segment_num; i++)
      {
          AOSegfileStatus *status = maxSegno4Segment[i];
          status->inuse = true;
          status->xid = CurrentXid;
          existing_segnos = lappend_int(existing_segnos,  status->segno);
      }
      

      So, we are pulling a 0x0 (null ?!) entry from maxSegno4Segment... That's extrange, because earlier in this function we populate this array, and we should not reach this section unless this maxSegno4Segment array completely populated.

      I'm assuming that the variable "segment_num" (used in the for loop stop condition) is 6 because we got (4096 MB, 1.00 CORE) x 6 resources, according to master log

      2017-02-15 00:12:20.121004 GMT,"mis_ig","ig",p466858,th2037315872,"10.33.188.8","35247",2017-02-14 08:41:58 GMT,30177010,con20836,cmd48711,seg-10000,,,x30177010,sx1,"LOG","00000","ConnID 42180. Acquired resource from resource manager, (4096 MB, 1.00
      0000 CORE) x 6."
      

      Attachments

        Issue Links

          Activity

            People

              mli Ming Li
              mli Ming Li
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: