  1. Apache HAWQ (Retired)
  2. HAWQ-1094

Select on INTERNAL table returns wrong results when hdfs blocks have checksum errors

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: backlog
    • Component/s: Fault Tolerance
    • Labels: None

    Description

      I created a Parquet table and inserted the following values:

      sr37228_repro=# select * from number;
       id
      ----
        1
        1
        1
        1
        1
      (5 rows)
      

      I then modified the data in two of the three replicas of the file's single block and tried reading the data again.

      Modifying contents of internal table blocks...
      
      Found hdfs://hdm1.hdp.local:8020/hawq_default/16385/16543/17000/10 in hdfs
      
      Modifying block /hadoop/hdfs/data/current/BP-2023073008-172.28.21.63-1462922052672/current/finalized/subdir0/subdir0/blk_1073742008 on 172.28.21.155
      block_script.sh                                                                                                        100%  228     0.2KB/s   00:00
      Modifying block /hadoop/hdfs/data/current/BP-2023073008-172.28.21.63-1462922052672/current/finalized/subdir0/subdir0/blk_1073742008 on 172.28.21.156
      block_script.sh                                                                                                        100%  228     0.2KB/s   00:00
      
      Running count query again, this time with bad data in two of the three blocks
       count |    id
      -------+----------
           1 |        0
           2 |        1
           1 | 16777216
           1 | 16777217
      (4 rows)
      
      
      Checking Showing file health:
      
      Checking hdfs://hdm1.hdp.local:8020/hawq_default/16385/16543/17000/10 health
      Connecting to namenode via http://hdm1.hdp.local:50070/fsck?ugi=gpadmin&blocks=1&locations=1&files=1&path=%2Fhawq_default%2F16385%2F16543%2F17000%2F10
      FSCK started by gpadmin (auth:SIMPLE) from /172.28.21.157 for path /hawq_default/16385/16543/17000/10 at Mon Sep 26 12:07:53 PDT 2016
      /hawq_default/16385/16543/17000/10 206 bytes, 1 block(s):  OK
      0. BP-2023073008-172.28.21.63-1462922052672:blk_1073742008_1186 len=206 repl=3 [DatanodeInfoWithStorage[172.28.21.155:50010,DS-1a18c785-48e5-4ab8-9228-b3f6857b952a,DISK], DatanodeInfoWithStorage[172.28.19.211:50010,DS-6bf49ae7-6745-448b-803d-d12d93acad1d,DISK], DatanodeInfoWithStorage[172.28.21.156:50010,DS-d22b0f7f-7065-42c4-bb66-ea361ec5e56a,DISK]]
      
      Status: HEALTHY
       Total size:    206 B
       Total dirs:    0
       Total files:   1
       Total symlinks:                0
       Total blocks (validated):      1 (avg. block size 206 B)
       Minimally replicated blocks:   1 (100.0 %)
       Over-replicated blocks:        0 (0.0 %)
       Under-replicated blocks:       0 (0.0 %)
       Mis-replicated blocks:         0 (0.0 %)
       Default replication factor:    3
       Average block replication:     3.0
       Corrupt blocks:                0
       Missing replicas:              0 (0.0 %)
       Number of data-nodes:          3
       Number of racks:               1
      FSCK ended at Mon Sep 26 12:07:53 PDT 2016 in 0 milliseconds
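
      The wrong ids in the count output above (0, 16777216, 16777217) are what a 4-byte integer with the value 1 decodes to when individual bytes are overwritten. The sketch below assumes the ids are stored as little-endian 32-bit integers; the actual on-disk encoding used by HAWQ and the contents of block_script.sh are not shown in this report, so the byte patterns are illustrative only.

```python
import struct


def decode_le_int32(raw: bytes) -> int:
    """Decode 4 bytes as a little-endian signed 32-bit integer."""
    return struct.unpack("<i", raw)[0]


# Hypothetical byte patterns: the original value 1 and three ways of
# overwriting single bytes of its 4-byte representation.
patterns = {
    "original":               bytes([0x01, 0x00, 0x00, 0x00]),
    "low byte cleared":       bytes([0x00, 0x00, 0x00, 0x00]),
    "high byte set":          bytes([0x01, 0x00, 0x00, 0x01]),
    "low cleared, high set":  bytes([0x00, 0x00, 0x00, 0x01]),
}

for label, raw in patterns.items():
    print(f"{label:24s} {raw.hex()} -> {decode_le_int32(raw)}")
```

      Under that assumption, every value returned by the query is a plausible byte-level corruption of 1, which suggests the corrupted replica data was handed to the Parquet reader unverified.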
      

      When setupBlockReader reads a bad block using the LocalBlockReader, the reader correctly detects a bad checksum.

      2016-09-26 13:02:09.267021 PDT,,,p380682,th795609216,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 127.0.0.1",,,,,,,0,,"network_utils.c",210,
      2016-09-26 13:02:09.267171 PDT,,,p380682,th795609216,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 172.28.21.155",,,,,,,0,,"network_utils.c",210,
      2016-09-26 13:02:16.239048 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG1","00000","Dropping in memory mapping OidInMemHeapMapping",,,,,,"SET log_min_messages TO 'debug5'",0,,"cdbinmemheapam.c",293,
      2016-09-26 13:02:16.239289 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG3","00000","CommitTransactionCommand",,,,,,"SET log_min_messages TO 'debug5'",0,,"postgres.c",3131,
      2016-09-26 13:02:16.239435 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG3","00000","CommitTransaction",,,,,,"SET log_min_messages TO 'debug5'",0,,"xact.c",5103,
      2016-09-26 13:02:16.239819 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG3","00000","name: unnamed; blockState:       STARTED; state: INPROGR, xid/subid/cid: 6227/1/0, nestlvl: 1, children: <>",,,,,,"SET log_min_messages TO 'debug5'",0,,"xact.c",5128,
      2016-09-26 13:02:16.239978 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6227,con143,cmd72,seg1,,,x6227,sx1,"DEBUG1","00000","Dropping in memory mapping OidInMemOnlyMapping",,,,,,"SET log_min_messages TO 'debug5'",0,,"cdbinmemheapam.c",293,
      2016-09-26 13:02:25.600367 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,,seg1,,,,,"DEBUG5","00000","First char: 'M'; gp_role = 'execute'.",,,,,,,0,,"postgres.c",4737,
      2016-09-26 13:02:25.600639 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,cmd74,seg1,,,,,"DEBUG1","00000","Message type M received by from libpq, len = 1412",,,,,,,0,,"postgres.c",4813,
      2016-09-26 13:02:25.600742 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,cmd74,seg1,,,,,"DEBUG5","00000","MPP dispatched stmt from QD: explain analyze select * from number;.",,,,,,,0,,"postgres.c",4893,
      2016-09-26 13:02:25.600847 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,cmd74,seg1,,,,,"DEBUG1","00000","SetupProcessIdentity: receive msg: ProcessIdentity_Begin_slice_1_idx_0_gang_1_cmd_74_writer_t_End_ProcessIdentity",,,,,,,0,,"identity.c",365,
      2016-09-26 13:02:25.600997 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,cmd74,seg1,,,,,"DEBUG1","00000","ProcessIdentity is not init",,,,,,,0,,"identity.c",599,
      2016-09-26 13:02:25.601129 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,cmd74,seg1,,,,,"DEBUG1","00000","ProcessIdentity: slice 1 id 0 gang num 1 writer t",,,,,,,0,,"identity.c",602,
      2016-09-26 13:02:25.601250 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,cmd74,seg0,slice1,,,,"DEBUG5","00000","Get a temporary directory:/tmp/hawq/segment",,,,,,,0,,"cdbtmpdir.c",48,
      2016-09-26 13:02:25.601351 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,cmd74,seg0,slice1,,,,"DEBUG1","00000","getLocalTmpDirFromSegmentConfig session_id:143 command_id:74 qeidx:0 tmpdir:/tmp/hawq/segment",,,,,,,0,,"identity.c",418,
      2016-09-26 13:02:25.601784 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,0,con143,cmd74,seg0,slice1,,,,"DEBUG3","00000","StartTransactionCommand",,,,,,"explain analyze select * from number;",0,,"postgres.c",3107,
      2016-09-26 13:02:25.602075 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG3","00000","StartTransaction",,,,,,"explain analyze select * from number;",0,,"xact.c",5103,
      2016-09-26 13:02:25.602195 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG3","00000","name: unnamed; blockState:       DEFAULT; state: INPROGR, xid/subid/cid: 6228/1/0, nestlvl: 1, children: <>",,,,,,"explain analyze select * from number;",0,,"xact.c",5128,
      2016-09-26 13:02:25.602578 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 0 key 17000 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.602703 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 1 key 17000 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.602836 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 2 key 17000 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.602994 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 3 key 17000 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.603104 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 4 key 17000 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.603211 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 5 key 17000 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.603317 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 6 key 17000 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.603572 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 7 key 17000 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.603751 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 8 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.603881 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 9 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.604003 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 10 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.604110 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 11 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.604216 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 12 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.604323 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 13 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.604555 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 14 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.604697 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 15 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.604848 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 16 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.604959 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 17 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.605064 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","add index 18 key 17002 relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",624,
      2016-09-26 13:02:25.605591 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG3","00000","Resource enforcer finds cpu sub-system is disabled",,,,,,"explain analyze select * from number;",0,,"resourceenforcer.c",908,
      2016-09-26 13:02:25.605716 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG2","00000","Current nice level of the process: 19",,,,,,"explain analyze select * from number;",0,,"postgres.c",283,
      2016-09-26 13:02:25.605856 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG2","00000","Reniced process to level 19",,,,,,"explain analyze select * from number;",0,,"postgres.c",302,
      2016-09-26 13:02:25.606073 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG5","00000","GetSnapshotData setting globalxmin and xmin to 6228",,,,,,"explain analyze select * from number;",0,,"procarray.c",552,
      2016-09-26 13:02:25.606306 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","Inserted entry for query (sessionid=143, commandcnt=74)",,,,,,"explain analyze select * from number;",0,,"workfile_queryspace.c",283,
      2016-09-26 13:02:25.606748 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","Have both IPv6 and IPv4 choices",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",1291,
      2016-09-26 13:02:25.606978 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","receive socket ai_family 10 ai_socktype 2 ai_protocol 17",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",1303,
      2016-09-26 13:02:25.607098 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","receive socket 6 ai_family 10 ai_socktype 2 ai_protocol 17",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",1307,
      2016-09-26 13:02:25.607207 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","bind addrlen 28 fam 10",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",1318,
      2016-09-26 13:02:25.607320 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","UDP-IC: xmit default buffer size 124928 bytes",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",2200,
      2016-09-26 13:02:25.607555 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","UDP-IC: xmit use buffer size 2097152 bytes",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",2215,
      2016-09-26 13:02:25.607678 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","UDP-IC: xmit default buffer size 124928 bytes",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",2200,
      2016-09-26 13:02:25.607787 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","UDP-IC: xmit use buffer size 2097152 bytes",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",2215,
      2016-09-26 13:02:25.607939 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","GetSockAddr socket ai_family 2 ai_socktype 2 ai_protocol 17 for 172.28.21.157",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",3058,
      2016-09-26 13:02:25.608052 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","We are inet6, remote is inet.  Converting to v4 mapped address.",,,,,,"explain analyze select * from number;",0,,"ic_udp.c",3137,
      2016-09-26 13:02:25.608249 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 0 key 17000 for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
      2016-09-26 13:02:25.608706 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 1 key 17000 for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
      2016-09-26 13:02:25.608836 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 2 key 17000 for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
      2016-09-26 13:02:25.608966 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 3 key 17000 for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
      2016-09-26 13:02:25.609083 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 4 key 17000 for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
      2016-09-26 13:02:25.609200 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 5 key 17000 for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
      2016-09-26 13:02:25.609316 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 6 key 17000 for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
      2016-09-26 13:02:25.609657 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG1","00000","read index 7 key 17000 for relation pg_attribute",,,,,,"explain analyze select * from number;",0,,"cdbinmemheapam.c",499,
      2016-09-26 13:02:25.613152 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 12:32:31 PDT,6228,con143,cmd74,seg0,slice1,,x6228,sx1,"DEBUG5","00000","Parquet metadata file footer length index: 198",,,,,,"explain analyze select * from number;",0,,"cdbparquetfooterprocessor.c",141,
      2016-09-26 13:02:25.676719 PDT,,,p380675,th795609216,,,,0,,,seg-10000,,,,,"LOG","00000","3rd party error log:
      2016-09-26 13:02:25.676477, p384452, th140708219193472, ERROR cannot setup block reader for Block: [block pool ID: BP-2023073008-172.28.21.63-1462922052672 block ID 1073742008_1186] file /hawq_default/16385/16543/17000/10 on Datanode: hdw2.hdp.local(172.28.21.155).
      LocalBlockReader.cpp: 127: HdfsIOException: Failed to construct LocalBlockReader for block: [block pool ID: BP-2023073008-172.28.21.63-1462922052672 block ID 1073742008_1186].
              @       Hdfs::Internal::LocalBlockReader::LocalBlockReader(boost::shared_ptr<Hdfs::Internal::ReadShortCircuitInfo> const&, Hdfs::Internal::ExtendedBlock const&, long, bool, Hdfs::Internal::SessionConfig&, std::vector<char, std::allocator<char> >&)
              @       Hdfs::Internal::InputStreamImpl::setupBlockReader(bool)
              @       Hdfs::Internal::InputStreamImpl::readOneBlock(char*, int, bool)
              @       Hdfs::Internal::InputStreamImpl::readInternal(char*, int)
              @       Hdfs::Internal::InputStreamImpl::read(char*, int)
              @       hdfsRead
              @       gpfs_hdfs_read
              @       HdfsRead
              @       FileRead
              @       readParquetFooter
              @       ParquetStorageRead_OpenFile
              @       parquet_getnext
              @       ParquetScanNext
              @       ExecTableScan
              @       ExecProcNode
              @       ExecMotion
              @       ExecProcNode
              @       ExecutePlan
              @       ExecutorRun
              @       PortalRunSelect
              @       PortalRun
              @       PostgresMain
              @       BackendStartup
              @       ServerLoop
              @       PostmasterMain
              @       main
              @       __libc_start_main
              @       Unknown
      Caused by
      LocalBlockReader.cpp: 283: HdfsIOException: LocalBlockReader failed to skip from position: 0, length: 0, block: [block pool ID: BP-2023073008-172.28.21.63-1462922052672 block ID 1073742008_1186].
              @       Hdfs::Internal::LocalBlockReader::skip(long)
              @       Hdfs::Internal::LocalBlockReader::LocalBlockReader(boost::shared_ptr<Hdfs::Internal::ReadShortCircuitInfo> const&, Hdfs::Internal::ExtendedBlock const&, long, bool, Hdfs::Internal::SessionConfig&, std::vector<char, std::allocator<char> >&)
              @       Hdfs::Internal::InputStreamImpl::setupBlockReader(bool)
              @       Hdfs::Internal::InputStreamImpl::readOneBlock(char*, int, bool)
              @       Hdfs::Internal::InputStreamImpl::readInternal(char*, int)
              @       Hdfs::Internal::InputStreamImpl::read(char*, int)
              @       hdfsRead
              @       gpfs_hdfs_read
              @       HdfsRead
              @       FileRead
              @       readParquetFooter
              @       ParquetStorageRead_OpenFile
              @       parquet_getnext
              @       ParquetScanNext
              @       ExecTableScan
              @       ExecProcNode
              @       ExecMotion
              @       ExecProcNode
              @       ExecutePlan
              @       ExecutorRun
              @       PortalRunSelect
              @       PortalRun
              @       PostgresMain
              @       BackendStartup
              @       ServerLoop
              @       PostmasterMain
              @       main
              @       __libc_start_main
              @       Unknown
      Caused by
      LocalBlockReader.cpp: 156: ChecksumException: LocalBlockReader checksum not match for block: [block pool ID: BP-2023073008-172.28.21.63-1462922052672 block ID 1073742008_1186]
              @       Hdfs::Internal::LocalBlockReader::readAndVerify(int)
              @       Hdfs::Internal::LocalBlockReader::skip(long)
              @       Hdfs::Internal::LocalBlockReader::LocalBlockReader(boost::shared_ptr<Hdfs::Internal::ReadShortCircuitInfo> const&, Hdfs::Internal::ExtendedBlock const&, long, bool, Hdfs::Internal::SessionConfig&, std::vector<char, std::allocator<char> >&)
              @       Hdfs::Internal::InputStreamImpl::setupBlockReader(bool)
              @       Hdfs::Internal::InputStreamImpl::readOneBlock(char*, int, bool)
              @       Hdfs::Internal::InputStreamImpl::readInternal(char*, int)
              @       Hdfs::Internal::InputStreamImpl::read(char*, int)
              @       hdfsRead
              @       gpfs_hdfs_read
              @       HdfsRead
              @       FileRead
              @       readParquetFooter
              @       ParquetStorageRead_OpenFile
              @       parquet_getnext
              @       ParquetScanNext
              @       ExecTableScan
              @       ExecProcNode
              @       ExecMotion
              @       ExecProcNode
              @       ExecutePlan
              @       ExecutorRun
              @       PortalRunSelect
              @       PortalRun
              @       PostgresMain
              @       BackendStartup
              @       ServerLoop
              @       PostmasterMain
              @       main
              @       __libc_start_main
              @       Unknown
      
      retry the same node but disable read shortcircuit feature",,,,,,,,"SysLoggerMain","syslogger.c",518,
      2016-09-26 13:02:25.680638 PDT,"gpadmin","sr37228_repro",p384452,th795609216,"172.28.21.157","30347",2016-09-26 
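
      The ChecksumException at the bottom of the trace comes from LocalBlockReader::readAndVerify. A minimal sketch of that kind of check follows, assuming one checksum per fixed-size chunk. The names, the 512-byte chunk size, and the use of zlib.crc32 are assumptions made for illustration; this is not the libhdfs3 implementation, and HDFS may be configured with a different checksum type.

```python
import zlib
from typing import List

BYTES_PER_CHECKSUM = 512  # assumed chunk size for this sketch


class ChecksumMismatch(Exception):
    """Raised when a chunk's stored checksum does not match its data."""


def compute_checksums(data: bytes, chunk_size: int = BYTES_PER_CHECKSUM) -> List[int]:
    """Return one CRC32 per chunk, as a writer would store alongside the block."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def read_and_verify(data: bytes, checksums: List[int],
                    chunk_size: int = BYTES_PER_CHECKSUM) -> bytes:
    """Return the data only if every chunk matches its stored checksum."""
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        chunk = data[offset:offset + chunk_size]
        if zlib.crc32(chunk) != checksums[index]:
            raise ChecksumMismatch(
                f"checksum mismatch in chunk {index} at offset {offset}")
    return data
```

      With a check like this on the read path, overwriting even one byte of a replica's data file without updating its checksums makes the read fail, which matches what the local reader does here.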
      

      Even though the client correctly detected the bad checksum with LocalBlockReader, the subsequent read through RemoteBlockReader does not appear to detect it, and the read is allowed to go through:

      sr37228_repro=# select * from number;
          id
      ----------
       16777217
       16777216
              0
              1
              1
      (5 rows)
      
      Checking hdfs://hdm1.hdp.local:8020/hawq_default/16385/16543/17000/10 health
      
      Connecting to namenode via http://hdm1.hdp.local:50070/fsck?ugi=gpadmin&blocks=1&locations=1&files=1&path=%2Fhawq_default%2F16385%2F16543%2F17000%2F10
      FSCK started by gpadmin (auth:SIMPLE) from /172.28.21.157 for path /hawq_default/16385/16543/17000/10 at Mon Sep 26 12:07:53 PDT 2016
      /hawq_default/16385/16543/17000/10 206 bytes, 1 block(s):  OK
      0. BP-2023073008-172.28.21.63-1462922052672:blk_1073742008_1186 len=206 repl=3 [DatanodeInfoWithStorage[172.28.21.155:50010,DS-1a18c785-48e5-4ab8-9228-b3f6857b952a,DISK], DatanodeInfoWithStorage[172.28.19.211:50010,DS-6bf49ae7-6745-448b-803d-d12d93acad1d,DISK], DatanodeInfoWithStorage[172.28.21.156:50010,DS-d22b0f7f-7065-42c4-bb66-ea361ec5e56a,DISK]]
      
      Status: HEALTHY
       Total size:    206 B
       Total dirs:    0
       Total files:   1
       Total symlinks:                0
       Total blocks (validated):      1 (avg. block size 206 B)
       Minimally replicated blocks:   1 (100.0 %)
       Over-replicated blocks:        0 (0.0 %)
       Under-replicated blocks:       0 (0.0 %)
       Mis-replicated blocks:         0 (0.0 %)
       Default replication factor:    3
       Average block replication:     3.0
       Corrupt blocks:                0
       Missing replicas:              0 (0.0 %)
       Number of data-nodes:          3
       Number of racks:               1
      FSCK ended at Mon Sep 26 12:07:53 PDT 2016 in 0 milliseconds
      
      
      The filesystem under path '/hawq_default/16385/16543/17000/10' is HEALTHY
      

      The behavior of InputStreamImpl::setupBlockReader appears to be to:

      1. Attempt to read the block locally using LocalBlockReader
      2. If the local block read fails, attempt to read the block from the next available node using RemoteBlockReader
      3. Continue to read all the available blocks using RemoteBlockReader until we have no more blocks to read.

      In this case, RemoteBlockReader appears to ignore the bad checksum in the block and returns wrong results.
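
      For comparison, the behavior one would expect is that every read path verifies the checksum, a failing replica is skipped, and the failing replicas are remembered so they can be reported. The sketch below models that with hypothetical Replica and read_block names and a single whole-block CRC32; it is a simplified model of the expected policy, not the libhdfs3 code.

```python
import zlib
from typing import List, Tuple


class ChecksumMismatch(Exception):
    """Raised when replica data does not match its stored checksum."""


class Replica:
    """Hypothetical model of one block replica on a DataNode."""

    def __init__(self, node: str, data: bytes, stored_crc: int):
        self.node = node
        self.data = data
        self.stored_crc = stored_crc

    def read(self) -> bytes:
        # Verification happens on every read, local or remote.
        if zlib.crc32(self.data) != self.stored_crc:
            raise ChecksumMismatch(self.node)
        return self.data


def read_block(replicas: List[Replica], local_node: str) -> Tuple[bytes, List[str]]:
    """Try the local replica first, then the others, verifying each read.

    Returns the verified data and the nodes whose replicas failed
    verification, so the caller could report them as corrupt.
    """
    bad_nodes: List[str] = []
    # sorted() is stable, so the local replica moves to the front and the
    # remaining replicas keep their original order.
    for replica in sorted(replicas, key=lambda r: r.node != local_node):
        try:
            return replica.read(), bad_nodes
        except ChecksumMismatch:
            bad_nodes.append(replica.node)
    raise IOError("all replicas failed checksum verification")
```

      Applied to the three replicas in this report, with the local copy on 172.28.21.155 corrupted, such a policy would return the data from the intact replica on 172.28.19.211 and flag 172.28.21.155 as bad instead of returning corrupted rows.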

      Questions:

      1. When we detect a bad checksum on the local block, why do we not report the block as corrupt to the NameNode?
      2. When we read the block using RemoteBlockReader, why doesn't it detect the bad block?

          People

            mli Ming Li
            mli Ming Li