Hive / HIVE-22495

Parquet count(*) read in all data


Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Reader
    • Labels: None

    Description

      Running a Hive query on a Parquet table:

      select count(*) from test_table

      The query reads all data (all columns) instead of only the file metadata.

      For comparison, Hive 0.13 and Spark read far less data on my test table:

       

      Engine        HDFS data read
      Hive 2.3.4    452.9 MB
      Hive 0.13     22.5 KB
      Spark         41.6 KB
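To put the figures above in proportion, the short calculation below expresses each engine's read volume relative to Hive 0.13. It is only a back-of-the-envelope sketch: the report does not say whether the units are binary or decimal, so 1 MB = 1024 KB is an assumption.

```python
# Read volumes taken from the table in this report.
# Assumption: binary units (1 MB = 1024 KB); the report does not specify.
KB_PER_MB = 1024

reads_kb = {
    "Hive 2.3.4": 452.9 * KB_PER_MB,
    "Hive 0.13": 22.5,
    "Spark": 41.6,
}

baseline = reads_kb["Hive 0.13"]
ratios = {engine: kb / baseline for engine, kb in reads_kb.items()}

for engine, ratio in ratios.items():
    print(f"{engine}: {ratio:,.1f}x the Hive 0.13 read volume")
```

Under that assumption, Hive 2.3.4 reads roughly twenty thousand times as much data as Hive 0.13 for the same query.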

       

      The cause appears to be that the Parquet read support falls back to the full file schema when indexColumnsWanted is empty. This logic still exists in the master branch.

      I don't know why this empty-list check was added; please advise whether changing it would have any other impact.
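The fallback described above can be illustrated with a simplified model. This is not the Hive source code: the function `select_read_schema`, its `fall_back_when_empty` flag, and the column names are all hypothetical, and only the idea of an empty wanted-column list triggering a full-schema read comes from this report.

```python
from typing import List, Sequence


def select_read_schema(file_schema: Sequence[str],
                       wanted_indexes: Sequence[int],
                       fall_back_when_empty: bool) -> List[str]:
    """Return the column names a reader would request from the file.

    Hypothetical model of the behaviour described in this issue,
    not the actual Hive implementation.
    """
    if not wanted_indexes and fall_back_when_empty:
        # Reported behaviour: an empty projection is treated as
        # "no projection", so every column in the file schema is read.
        return list(file_schema)
    # Otherwise honour the projection. An empty list then means no
    # column data is requested at all.
    return [file_schema[i] for i in wanted_indexes
            if 0 <= i < len(file_schema)]


file_schema = ["id", "name", "payload"]  # hypothetical table columns

# count(*) needs no column values, so the wanted list is empty.
with_fallback = select_read_schema(file_schema, [], fall_back_when_empty=True)
without_fallback = select_read_schema(file_schema, [], fall_back_when_empty=False)

print(with_fallback)     # ['id', 'name', 'payload']
print(without_fallback)  # []
```

In this model, the fallback turns a query that needs no columns into a read of every column, which would be consistent with the large difference in HDFS data read reported above. Whether any caller relies on the fallback for a legitimately empty list is the open question raised in this report.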

       

       

       

      Attachments

        1. HIVE-22495.patch
          1.0 kB
          Jason Xu
        2. HIVE-22495.patch
          1.0 kB
          Jason Xu


          People

            Assignee: Jason Xu (jason_xu)
            Reporter: Jason Xu (jason_xu)
