  1. Apache HAWQ
  2. HAWQ-478

Bug when shutting down the cluster during recovery pass 3

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0.0-incubating
    • Component/s: Transaction
    • Labels: None

    Description

      Shutting down the cluster while the master is in recovery pass 3 causes an inconsistency between pg_class and the gp_persistent table,
      which results in data loss. The master log and the catalog state after restart are shown below.

      2016-03-01 01:56:33.032318 PST,,,p119941,th731297984,,,,0,,,seg-10000,,,,,"LOG","00000","checkpoint record is at 0/302AD30",,,,,,,0,,"xlog.c",6304,
      2016-03-01 01:56:33.032337 PST,,,p119941,th731297984,,,,0,,,seg-10000,,,,,"LOG","00000","redo record is at 0/302AD30; undo record is at 0/0; shutdown FALSE",,,,,,,0,,"xlog.c",6338,
      2016-03-01 01:56:33.032353 PST,,,p119941,th731297984,,,,0,,,seg-10000,,,,,"LOG","00000","next transaction ID: 0/1045; next OID: 24726",,,,,,,0,,"xlog.c",6342,
      2016-03-01 01:56:33.032367 PST,,,p119941,th731297984,,,,0,,,seg-10000,,,,,"LOG","00000","next MultiXactId: 1; next MultiXactOffset: 0",,,,,,,0,,"xlog.c",6345,
      2016-03-01 01:56:33.032382 PST,,,p119941,th731297984,,,,0,,,seg-10000,,,,,"LOG","00000","database system was not properly shut down; automatic recovery in progress",,,,,,,0,,"xlog.c",6434,
      2016-03-01 01:56:33.033329 PST,,,p119941,th731297984,,,,0,,,seg-10000,,,,,"LOG","00000","redo starts at 0/302AD80",,,,,,,0,,"xlog.c",6523,
      2016-03-01 01:56:33.089749 PST,,,p119941,th731297984,,,,0,,,seg-10000,,,,,"LOG","00000","record with zero length at 0/77A7708",,,,,,,0,,"xlog.c",4110,
      2016-03-01 01:56:33.089792 PST,,,p119941,th731297984,,,,0,,,seg-10000,,,,,"LOG","00000","redo done at 0/77A76D8",,,,,,,0,,"xlog.c",6560,
      2016-03-01 01:56:33.089893 PST,,,p119941,th731297984,,,,0,,,seg-10000,,,,,"LOG","00000","end of transaction log location is 0/77A7708",,,,,,,0,,"xlog.c",6582,
      2016-03-01 01:56:33.738889 PST,,,p119941,th731297984,,,,0,,,seg-10000,,,,,"LOG","00000","Finished startup pass 1.  Proceeding to startup crash recovery passes 2 and 3.",,,,,,,0,,"xlog.c",6816,
      2016-03-01 01:56:34.525387 PST,,,p118947,th731297984,,,,0,,,seg-10000,,,,,"LOG","00000","received smart shutdown request",,,,,,,0,,"postmaster.c",3447,
      2016-03-01 01:56:35.042857 PST,,,p119958,th731297984,,,,0,,,seg-10000,,,,,"WARNING","XX000","could not remove relation directory 16385/16536/20219: Success (smgr.c:1049)",,,,,"Dropping file-system object -- Relation Directory: '16385/16536/20219'",,0,,"smgr.c",1049,
      2016-03-01 01:56:35.131058 PST,,,p119958,th731297984,,,,0,,,seg-10000,,,,,"WARNING","XX000","could not remove relation directory 16385/16536/16894: Success (smgr.c:1049)",,,,,"Dropping file-system object -- Relation Directory: '16385/16536/16894'",,0,,"smgr.c",1049,
      2016-03-01 01:56:35.584893 PST,,,p119958,th731297984,,,,0,,,seg-10000,,,,,"LOG","00000","Finished startup crash recovery pass 2",,,,,,,0,,"xlog.c",6987,
      2016-03-01 01:56:35.590423 PST,,,p120017,th731297984,,,,0,,,seg-10000,,,,,"LOG","00000","shutting down",,,,,,,0,,"xlog.c",7853,
      2016-03-01 01:56:35.592973 PST,,,p120017,th731297984,,,,0,,,seg-10000,,,,,"LOG","00000","database system is shut down",,,,,,,0,,"xlog.c",7874,
      
      cr_workload=# select *  from pg_class where relname like 'create_insert%' and relname not like '%prt%';
          relname     | relnamespace | reltype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | reltoastrelid | reltoastidxid | relaosegrelid | relaosegidxid | relhasindex | relisshared | relkind | relstorage | relnatts | relchecks | reltriggers | relukeys | relfkeys | relrefs | relhasoids | relhaspkey | relhasrules | relhassubclass | relfrozenxid | relacl |    reloptions
      ----------------+--------------+---------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+---------------+---------------+-------------+-------------+---------+------------+----------+-----------+-------------+----------+----------+---------+------------+------------+-------------+----------------+--------------+--------+-------------------
       create_insert1 |         2200 |  696503 |       10 |     0 |      702761 |             0 |        0 |         0 |             0 |             0 |             0 |             0 | f           | f           | r       | a          |        3 |         0 |           0 |        0 |        0 |       0 | f          | f          | f           | t              |        11609 |        | {appendonly=true}
      (1 row)
      
      cr_workload=# \d
      No relations found.
      cr_workload=# select * from create_insert1;
      ERROR:  relation "create_insert1" does not exist
      LINE 1: select * from create_insert1;
                            ^
      cr_workload=# select * from gp_persistent_relation_node where relfilenode_oid = 702761;
       tablespace_oid | database_oid | relfilenode_oid | persistent_state | reserved | parent_xid | persistent_serial_num | previous_free_tid
      ----------------+--------------+-----------------+------------------+----------+------------+-----------------------+-------------------
                16385 |       696501 |          702761 |                2 |        0 |          0 |                 31380 | (0,0)
      (1 row)
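
The log excerpt above shows the smart shutdown request arriving after "Finished startup pass 1" and before recovery completed. A minimal sketch for spotting that pattern in a master log follows. It is not part of HAWQ: the function name is hypothetical, the message column index (18) is simply counted from the excerpt above, and the "database system is ready" marker is an assumption based on the usual PostgreSQL wording, since that message does not appear in this log.

```python
import csv
import io

# Index of the message column, counted from the CSV log lines in this report.
MESSAGE_FIELD = 18

PASS1_DONE = "Finished startup pass 1"     # appears in the excerpt
SHUTDOWN_REQ = "shutdown request"           # matches "received smart shutdown request"
READY = "database system is ready"          # assumed wording, not in the excerpt


def find_shutdown_during_recovery(log_text):
    """Return the timestamp of the first shutdown request logged while
    crash recovery passes 2 and 3 were pending, or None if there is none."""
    recovering = False
    for row in csv.reader(io.StringIO(log_text)):
        if len(row) <= MESSAGE_FIELD:
            continue  # blank line or not a full log record
        message = row[MESSAGE_FIELD]
        if PASS1_DONE in message:
            recovering = True
        elif READY in message:
            recovering = False
        elif recovering and SHUTDOWN_REQ in message:
            return row[0].strip()
    return None


if __name__ == "__main__":
    # Two records copied from the log excerpt in this report.
    sample = '''
2016-03-01 01:56:33.738889 PST,,,p119941,th731297984,,,,0,,,seg-10000,,,,,"LOG","00000","Finished startup pass 1.  Proceeding to startup crash recovery passes 2 and 3.",,,,,,,0,,"xlog.c",6816,
2016-03-01 01:56:34.525387 PST,,,p118947,th731297984,,,,0,,,seg-10000,,,,,"LOG","00000","received smart shutdown request",,,,,,,0,,"postmaster.c",3447,
'''
    print(find_shutdown_during_recovery(sample))
```

Run against the two sample records, the function returns the timestamp of the shutdown request. A match only indicates that a shutdown overlapped recovery; whether the catalog was actually damaged still has to be checked with queries like the ones above.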
      

            People

              mli Ming Li
              doli Dong Li