PARQUET-1192

Parquet Pushdown


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.9.0
    • Fix Version/s: None
    • Component/s: parquet-pig
    • Labels: None
    • Environment: Apache Hadoop 2.7.0
      Apache Pig 0.17.0
      Apache Parquet 1.9.0

    Description

      Hi,
      I am running some experiments with Apache Parquet to test predicate pushdown and the effect of different row group sizes. My assumptions are:

      1) The Parquet reader first reads the metadata to filter out row groups and data pages.
      2) It then reads only the row groups and data pages that match the filter.
      3) The total number of bytes read should be the sum of the matching row group sizes and the size of the metadata.
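
      The three assumptions above can be sketched as a small model. This is a simplified illustration only, not the actual parquet-mr reader: the `RowGroupStats` class, the helper names, and the four sample row groups are all hypothetical, and the model ignores page-level filtering and any read-ahead or split handling in the input format.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for the min/max statistics of one row group."""
    index: int
    min_value: int
    max_value: int
    total_bytes: int


def select_row_groups(row_groups, target):
    """Assumptions 1 and 2: keep row groups whose [min, max] may contain target."""
    return [rg for rg in row_groups if rg.min_value <= target <= rg.max_value]


def estimate_bytes_read(row_groups, target, metadata_bytes):
    """Assumption 3: bytes read = metadata size + size of the surviving row groups."""
    selected = select_row_groups(row_groups, target)
    return metadata_bytes + sum(rg.total_bytes for rg in selected)


# Made-up row groups over a sorted, unique long column (values are illustrative).
ROW_GROUPS = [
    RowGroupStats(0, 0, 99, 16_777_216),
    RowGroupStats(1, 100, 199, 16_777_216),
    RowGroupStats(2, 200, 299, 16_777_216),
    RowGroupStats(3, 300, 399, 16_777_216),
]

if __name__ == "__main__":
    hits = select_row_groups(ROW_GROUPS, 150)
    print("row groups kept:", [rg.index for rg in hits])
    print("estimated bytes:", estimate_bytes_read(ROW_GROUPS, 150, 1_000))
```

      Because the filter column is sorted and unique, the min/max ranges do not overlap, so under this model an equality predicate keeps at most one row group.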

      I have a wide table with 1184 columns. Two columns are of type long and the remaining columns are binary. One of the long columns is sorted and unique. I disabled dictionary encoding and compression. The file is 34GB as CSV. I converted it to Parquet and tried two options:

      1) Generate a single Parquet file (43GB).
      2) Generate multiple Parquet files (43GB overall).

      I allow only 1 mapper to eliminate the effect of parallelism.

      I have a query that searches for 1 record in the sorted column. The results below are for a row group size of 16MB and a data page size of 1MB.
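
      As a rough back-of-the-envelope check of this layout, the sketch below works out how many row groups a 43GB file would contain and how large an average column chunk would be. It assumes that "16MB", "1MB" and "43GB" are binary units, that the configured row group size is reached exactly, and that all 1184 columns are equally wide. None of this is confirmed by the report, and the function names are illustrative.

```python
ROW_GROUP_BYTES = 16 * 1024 * 1024   # 16MB row group, assumed to be 16 MiB
PAGE_BYTES = 1024 * 1024             # 1MB data page, assumed to be 1 MiB
NUM_COLUMNS = 1184                   # from the table description
FILE_BYTES = 43 * 1024 ** 3          # assumption: "43GB" means 43 GiB


def approx_row_groups(file_bytes, row_group_bytes):
    """Ceiling division: row groups needed if each one is exactly full."""
    return -(-file_bytes // row_group_bytes)


def avg_column_chunk_bytes(row_group_bytes, num_columns):
    """Average bytes per column chunk if every column were equally wide."""
    return row_group_bytes / num_columns


if __name__ == "__main__":
    print("approx. row groups:", approx_row_groups(FILE_BYTES, ROW_GROUP_BYTES))
    chunk = avg_column_chunk_bytes(ROW_GROUP_BYTES, NUM_COLUMNS)
    print("avg. column chunk bytes:", round(chunk))
    print("chunk smaller than one page:", chunk < PAGE_BYTES)
```

      Under these assumptions an average column chunk is far smaller than the 1MB page size, so the page size setting may have little influence on a table this wide.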

      With a single Parquet file:
      Input(s):
      Successfully read 1 records (22135659519 bytes) from: "/output/wide/16777216/1048576"

      With multiple Parquet files:
      Input(s):
      Successfully read 1 records (800413428 bytes) from: "/output/wide/16777216/1048576"

      My questions are:
      1) Why is there such a big difference? With one file I am reading 22GB, and with multiple files it is reading 800MB. Is this a bug?
      2) Why does it not read 16MB plus the size of the metadata (which is 252MB)? Why does it read more than that?
      3) Can I rely on the Pig statistics for estimating bytes read?
      4) Are my assumptions correct, or am I missing something?
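
      To put numbers on the gap described in questions 1 and 2, the sketch below compares the byte counts reported by Pig with the expected "one row group plus metadata" figure. The observed values are copied from the two runs above. Treating "16MB" and "252MB" as binary units is an assumption, and the helper name is illustrative.

```python
MIB = 1024 * 1024

ROW_GROUP_BYTES = 16 * MIB          # one 16MB row group (assumed 16 MiB)
METADATA_BYTES = 252 * MIB          # assumption: "252MB" means 252 MiB
SINGLE_FILE_BYTES = 22_135_659_519  # bytes reported for the single-file run
MULTI_FILE_BYTES = 800_413_428      # bytes reported for the multi-file run

# Expected read size under assumption 3: one row group plus the metadata.
EXPECTED_BYTES = ROW_GROUP_BYTES + METADATA_BYTES


def overhead_factor(observed_bytes, expected_bytes):
    """How many times larger the observed read is than the expected one."""
    return observed_bytes / expected_bytes


if __name__ == "__main__":
    print("expected bytes:", EXPECTED_BYTES)
    print("single file / expected:",
          round(overhead_factor(SINGLE_FILE_BYTES, EXPECTED_BYTES), 2))
    print("multiple files / expected:",
          round(overhead_factor(MULTI_FILE_BYTES, EXPECTED_BYTES), 2))
    print("single file / multiple files:",
          round(SINGLE_FILE_BYTES / MULTI_FILE_BYTES, 2))
```

      The comparison only quantifies the discrepancy. It does not explain it, and it does not show whether the Pig counters reflect the bytes actually read from disk.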

      Could you please look into this problem and tell me whether it is a bug?

      The logs are attached.

      Thank you

      Regards
      Rana Faisal

      Attachments

        1. selc_32_16777216_1048576_sorted_multiplefile.log
          272 kB
          Rana Faisal Munir
        2. selc_32_16777216_1048576_sorted.log
          492 kB
          Rana Faisal Munir

          People

            Assignee: Unassigned
            Reporter: Rana Faisal Munir (ranafaisal342)
            Votes: 0
            Watchers: 1
