IMPALA-5448

Invalid number of files reported in Parquet scan node


    Description

      It appears that the number of files reported by the HDFS scan node is miscounted when reading Parquet data. For the scan node below, the number of files should equal the number of row groups and footers (73 each), but the reported value is 219, which is 73 x NumColumns (3).

        HDFS_SCAN_NODE (id=0):(Total: 13s749ms, non-child: 13s749ms, % non-child: 100.00%)
                Hdfs split stats (<volume id>:<# splits>/<split lengths>): 7:9/1.90 GB 3:12/2.65 GB 2:5/936.63 MB 6:9/1.74 GB 1:8/1.66 GB 5:10/1.83 GB 0:9/2.07 GB 4:11/2.40 GB 
                ExecOption: PARQUET Codegen Enabled, Codegen enabled: 73 out of 73
                Runtime filters: Only following filters arrived: , waited 4s918ms
                Hdfs Read Thread Concurrency Bucket: 0:33.33% 1:48.48% 2:6.061% 3:12.12% 4:0% 5:0% 6:0% 7:0% 8:0% 9:0% 10:0% 11:0% 
                File Formats: PARQUET/SNAPPY:219 
                BytesRead(500.000ms): 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 200.00 KB, 129.86 MB, 314.73 MB, 562.12 MB, 1.09 GB, 1.32 GB, 2.37 GB, 3.68 GB, 4.34 GB, 4.87 GB, 5.22 GB, 5.39 GB, 5.58 GB, 5.63 GB, 5.66 GB, 5.69 GB, 5.71 GB, 5.75 GB, 5.78 GB, 5.82 GB, 5.86 GB, 5.90 GB, 5.94 GB, 5.97 GB
                 - FooterProcessingTime: (Avg: 711.035ms ; Min: 12.738ms ; Max: 1s958ms ; Number of samples: 73)
                 - AverageHdfsReadThreadConcurrency: 0.97 
                 - AverageScannerThreadConcurrency: 17.70 
                 - BytesRead: 6.01 GB (6452101777)
                 - BytesReadDataNodeCache: 0
                 - BytesReadLocal: 6.01 GB (6452101777)
                 - BytesReadRemoteUnexpected: 0
                 - BytesReadShortCircuit: 6.01 GB (6452101777)
                 - DecompressionTime: 16s189ms
                 - MaxCompressedTextFileLength: 0
                 - NumColumns: 3 (3)
                 - NumDisksAccessed: 8 (8)
                 - NumRowGroups: 73 (73)
                 - NumScannerThreadsStarted: 52 (52)
                 - PeakMemoryUsage: 2.09 GB (2248246487)
                 - PerReadThreadRawHdfsThroughput: 363.03 MB/sec
                 - RemoteScanRanges: 0 (0)
                 - RowBatchQueueGetWaitTime: 8s786ms
                 - RowBatchQueuePutWaitTime: 3s079ms
                 - RowsRead: 342.13M (342131176)
                 - RowsReturned: 2.54M (2537896)
                 - RowsReturnedRate: 184.58 K/sec
                 - ScanRangesComplete: 73 (73)
                 - ScannerThreadsInvoluntaryContextSwitches: 3.97K (3967)
                 - ScannerThreadsTotalWallClockTime: 4m41s
                   - MaterializeTupleTime(*): 13s302ms
                   - ScannerThreadsSysTime: 3s043ms
                   - ScannerThreadsUserTime: 26s263ms
                 - ScannerThreadsVoluntaryContextSwitches: 23.15K (23148)
                 - TotalRawHdfsReadTime(*): 16s949ms
                 - TotalReadThroughput: 359.75 MB/sec
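
      The arithmetic behind the report can be checked directly against the counters in the profile. The following is a minimal sketch for illustration only, not Impala code: `PROFILE_EXCERPT`, `parse_counters`, and the `FileFormatsCount` key are hypothetical names, and the excerpt copies only the four relevant lines from the profile above. It assumes, as the description does, that the file count should equal `NumRowGroups`.

```python
import re

# Four counters copied from the scan node profile above.
PROFILE_EXCERPT = """
File Formats: PARQUET/SNAPPY:219
 - NumColumns: 3 (3)
 - NumRowGroups: 73 (73)
 - ScanRangesComplete: 73 (73)
"""


def parse_counters(profile_text):
    """Extract integer counters from a profile excerpt (hypothetical helper)."""
    counters = {}
    # "File Formats: PARQUET/SNAPPY:219" -> the count after the last colon.
    match = re.search(r"File Formats:\s*[\w/]+:(\d+)", profile_text)
    if match:
        counters["FileFormatsCount"] = int(match.group(1))
    # Lines such as " - NumColumns: 3 (3)" -> name and first integer.
    for name, value in re.findall(r"-\s*(\w+):\s*(\d+)\s*\(\d+\)", profile_text):
        counters[name] = int(value)
    return counters


counters = parse_counters(PROFILE_EXCERPT)
reported = counters["FileFormatsCount"]
# Assumption from the description: the expected file count equals NumRowGroups.
expected = counters["NumRowGroups"]
inflated_by_columns = reported == expected * counters["NumColumns"]

print(f"reported={reported}, expected={expected}, "
      f"matches expected x NumColumns: {inflated_by_columns}")
```

      With the values from this profile, the reported count equals the expected count multiplied by `NumColumns`, which is consistent with the miscount described above.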
      


          People

            stigahuang Quanlong Huang
            mmokhtar Mostafa Mokhtar