IMPALA-5448

Invalid number of files reported in Parquet scan node

    Description

      The number of files reported by the HDFS scan node appears to be miscounted when reading Parquet data. For the scan node below, the number of files should equal the number of row groups and footers (73), but the reported value is 219, which is 73 x NumColumns (3).

        HDFS_SCAN_NODE (id=0):(Total: 13s749ms, non-child: 13s749ms, % non-child: 100.00%)
                Hdfs split stats (<volume id>:<# splits>/<split lengths>): 7:9/1.90 GB 3:12/2.65 GB 2:5/936.63 MB 6:9/1.74 GB 1:8/1.66 GB 5:10/1.83 GB 0:9/2.07 GB 4:11/2.40 GB 
                ExecOption: PARQUET Codegen Enabled, Codegen enabled: 73 out of 73
                Runtime filters: Only following filters arrived: , waited 4s918ms
                Hdfs Read Thread Concurrency Bucket: 0:33.33% 1:48.48% 2:6.061% 3:12.12% 4:0% 5:0% 6:0% 7:0% 8:0% 9:0% 10:0% 11:0% 
                File Formats: PARQUET/SNAPPY:219 
                BytesRead(500.000ms): 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 200.00 KB, 129.86 MB, 314.73 MB, 562.12 MB, 1.09 GB, 1.32 GB, 2.37 GB, 3.68 GB, 4.34 GB, 4.87 GB, 5.22 GB, 5.39 GB, 5.58 GB, 5.63 GB, 5.66 GB, 5.69 GB, 5.71 GB, 5.75 GB, 5.78 GB, 5.82 GB, 5.86 GB, 5.90 GB, 5.94 GB, 5.97 GB
                 - FooterProcessingTime: (Avg: 711.035ms ; Min: 12.738ms ; Max: 1s958ms ; Number of samples: 73)
                 - AverageHdfsReadThreadConcurrency: 0.97 
                 - AverageScannerThreadConcurrency: 17.70 
                 - BytesRead: 6.01 GB (6452101777)
                 - BytesReadDataNodeCache: 0
                 - BytesReadLocal: 6.01 GB (6452101777)
                 - BytesReadRemoteUnexpected: 0
                 - BytesReadShortCircuit: 6.01 GB (6452101777)
                 - DecompressionTime: 16s189ms
                 - MaxCompressedTextFileLength: 0
                 - NumColumns: 3 (3)
                 - NumDisksAccessed: 8 (8)
                 - NumRowGroups: 73 (73)
                 - NumScannerThreadsStarted: 52 (52)
                 - PeakMemoryUsage: 2.09 GB (2248246487)
                 - PerReadThreadRawHdfsThroughput: 363.03 MB/sec
                 - RemoteScanRanges: 0 (0)
                 - RowBatchQueueGetWaitTime: 8s786ms
                 - RowBatchQueuePutWaitTime: 3s079ms
                 - RowsRead: 342.13M (342131176)
                 - RowsReturned: 2.54M (2537896)
                 - RowsReturnedRate: 184.58 K/sec
                 - ScanRangesComplete: 73 (73)
                 - ScannerThreadsInvoluntaryContextSwitches: 3.97K (3967)
                 - ScannerThreadsTotalWallClockTime: 4m41s
                   - MaterializeTupleTime(*): 13s302ms
                   - ScannerThreadsSysTime: 3s043ms
                   - ScannerThreadsUserTime: 26s263ms
                 - ScannerThreadsVoluntaryContextSwitches: 23.15K (23148)
                 - TotalRawHdfsReadTime(*): 16s949ms
                 - TotalReadThroughput: 359.75 MB/sec
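
      The mismatch can be checked mechanically against the counters in the profile above. The following is a minimal sketch, not part of Impala: the helper names `counter` and `file_format_count` are hypothetical, and the regular expressions assume the counter layout shown in this excerpt.

```python
import re

# Counter lines copied verbatim from the profile excerpt above.
PROFILE = """\
File Formats: PARQUET/SNAPPY:219
 - FooterProcessingTime: (Avg: 711.035ms ; Min: 12.738ms ; Max: 1s958ms ; Number of samples: 73)
 - NumColumns: 3 (3)
 - NumRowGroups: 73 (73)
 - ScanRangesComplete: 73 (73)
"""


def counter(profile: str, name: str) -> int:
    """Return the integer value of a '- Name: N (N)' counter line (hypothetical helper)."""
    match = re.search(rf"-\s*{re.escape(name)}:\s*(\d+)", profile)
    if match is None:
        raise KeyError(name)
    return int(match.group(1))


def file_format_count(profile: str) -> int:
    """Return the count from a 'File Formats: FORMAT/CODEC:N' line (hypothetical helper)."""
    match = re.search(r"File Formats:\s*[^:\s]+:(\d+)", profile)
    if match is None:
        raise KeyError("File Formats")
    return int(match.group(1))


reported = file_format_count(PROFILE)
row_groups = counter(PROFILE, "NumRowGroups")
columns = counter(PROFILE, "NumColumns")

# The reported file count matches row groups x columns, not row groups alone.
print(f"reported={reported}, row_groups={row_groups}, columns={columns}")
print("matches NumRowGroups:", reported == row_groups)
print("matches NumRowGroups * NumColumns:", reported == row_groups * columns)
```

      In this profile the reported count equals NumRowGroups x NumColumns rather than NumRowGroups, which is consistent with the file counter being incremented once per column instead of once per file.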
      

            People

              stigahuang Quanlong Huang
              mmokhtar Mostafa Mokhtar