HIVE-25827

Parquet file footer is read multiple times when multiple splits are created in the same file


Details

    Description

      With large files, multiple splits can be created for the same file. In the current codebase, "ParquetRecordReaderBase" reads the file footer once for each split, so the same footer is read repeatedly.

      This can be optimized so that the footer information is read only once per file.

       

      https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java#L160

       

      https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L91
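
      One possible shape for the optimization described above is a per-file footer cache that record readers consult before going to the file. The following is only a minimal sketch under stated assumptions, not the fix applied in Hive: `FooterCache`, `getFooter`, `readCount`, and the `Function`-based reader are hypothetical names introduced for illustration, and the footer type is left generic so the example does not depend on any Parquet API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical per-file footer cache (illustration only, not Hive's implementation). */
final class FooterCache<F> {
  // Cached footers keyed by file path; every split of a file shares one entry.
  private final ConcurrentMap<String, F> footersByPath = new ConcurrentHashMap<>();
  // Assumed callback that performs the real (expensive) footer read.
  private final Function<String, F> footerReader;
  // Counts how often the underlying reader was actually invoked.
  private final AtomicInteger reads = new AtomicInteger();

  FooterCache(Function<String, F> footerReader) {
    this.footerReader = footerReader;
  }

  /** Returns the footer for filePath, reading it only on the first request. */
  F getFooter(String filePath) {
    return footersByPath.computeIfAbsent(filePath, path -> {
      reads.incrementAndGet();
      return footerReader.apply(path);
    });
  }

  int readCount() {
    return reads.get();
  }
}

final class FooterCacheDemo {
  public static void main(String[] args) {
    // Stand-in reader: a real one would parse the Parquet footer from storage.
    FooterCache<String> cache = new FooterCache<>(path -> "footer:" + path);
    // Three splits of the same large file each ask for the footer.
    for (int split = 0; split < 3; split++) {
      cache.getFooter("/warehouse/t/part-0.parquet");
    }
    System.out.println(cache.readCount()); // prints 1
  }
}
```

      A cache like this only helps when the splits of a file are handled in the same process, and a production version would likely also need eviction and a key that reflects file changes (for example length or modification time); both points are assumptions of this sketch rather than details taken from the issue.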

       

       

       

      Attachments

        1. image-2021-12-21-03-19-38-577.png
          901 kB
          Rajesh Balamohan


            People

              szita Ádám Szita
              rajesh.balamohan Rajesh Balamohan
              Votes: 0
              Watchers: 4


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1.5h