Hive / HIVE-22495

Parquet count(*) read in all data

Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Reader
    • Labels: None

    Description

      Running a Hive query on a Parquet table:

      select count(*) from test_table

      The query reads all the data (all columns) instead of just the metadata.

      For comparison, Hive 0.13 and Spark read much less data for the same test table.

      Engine        HDFS data read
      Hive 2.3.4    452.9 MB
      Hive 0.13     22.5 KB
      Spark         41.6 KB

      The cause seems to be that the Parquet read support falls back to the file schema when indexColumnsWanted is empty. This logic still exists in the master branch.

      I don't know why this empty-list check was added. Please advise if changing it has any other impact.
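
      To make the suspected behavior concrete, here is a minimal, self-contained sketch of the fallback. It is a simplified model, not Hive's actual read-support code: the class ProjectionSketch, its method names, and the example columns are hypothetical, and only the name indexColumnsWanted comes from the report. The second method shows one possible alternative and does not claim to reproduce the attached patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simplified model of column projection; NOT Hive's actual implementation.
final class ProjectionSketch {

    // Behavior described in the report: an empty list of wanted column
    // indexes is treated as "no projection", so every file column is requested.
    static List<String> withFallback(List<String> fileSchema, List<Integer> indexColumnsWanted) {
        if (indexColumnsWanted.isEmpty()) {
            return new ArrayList<>(fileSchema); // fall back to the full file schema
        }
        return project(fileSchema, indexColumnsWanted);
    }

    // Hypothetical alternative: an empty list means no column values are
    // needed, so the requested schema stays empty.
    static List<String> withoutFallback(List<String> fileSchema, List<Integer> indexColumnsWanted) {
        return project(fileSchema, indexColumnsWanted);
    }

    // Keep only the columns whose positions were asked for, skipping
    // indexes that do not exist in this file.
    private static List<String> project(List<String> fileSchema, List<Integer> indexes) {
        List<String> requested = new ArrayList<>();
        for (int i : indexes) {
            if (i >= 0 && i < fileSchema.size()) {
                requested.add(fileSchema.get(i));
            }
        }
        return requested;
    }

    public static void main(String[] args) {
        List<String> fileSchema = Arrays.asList("id", "name", "payload");
        List<Integer> none = Collections.emptyList(); // count(*) needs no column values

        System.out.println("with fallback:    " + withFallback(fileSchema, none));
        System.out.println("without fallback: " + withoutFallback(fileSchema, none));
    }
}
```

      With an empty index list, the first method requests all three columns while the second requests none, which matches the difference between reading every column and reading only metadata.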


      Attachments

        1. HIVE-22495.patch
          1.0 kB
          Jason Xu

          People

            jason_xu Jason Xu
            Votes: 1
            Watchers: 2