Apache Drill / DRILL-5266

Parquet Reader produces "low density" record batches - bits vs. bytes


Details

    Description

      Testing with the managed sort revealed that, for at least one file, the Parquet reader produces "low-density" batches: batches in which only about 5% of each value vector holds actual data, with the rest being unused space. When such batches are fed into the sort, 95% of the buffered memory is wasted and only 5% of available memory holds actual query data. The result is poor sort performance, because the sort must spill to disk far more frequently than expected.

      The managed sort analyzes incoming batches to prepare accurate memory-use estimates. The following is its output for the Parquet file in question:

      Actual batch schema & sizes {
        T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
        T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
        T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
      ...
        c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
        Records: 1129, Total size: 32006144, Row width:28350, Density:5}
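The density figures above appear to be the data size expressed as a percentage of the allocated vector size (e.g. 4516 / 131072 ≈ 3.4%, reported as 4). A minimal sketch of that calculation, assuming the percentage is rounded up (this is an illustration, not Drill's actual batch-analysis code; the class and method names are hypothetical):

```java
// Hypothetical sketch: reproducing the per-column "density" numbers
// from the batch-analysis output above. Density is the share of the
// allocated vector memory that holds real data, as a percentage,
// assumed here to be rounded up.
public class BatchDensity {

    /** Percent of allocated vector bytes that contain actual data, rounded up. */
    static int densityPercent(long dataSize, long vectorSize) {
        // Integer ceiling division, so a nearly-empty vector still
        // reports at least 1% rather than 0%.
        return (int) ((dataSize * 100 + vectorSize - 1) / vectorSize);
    }

    public static void main(String[] args) {
        // cs_sold_date_sk: 4516 data bytes in a 131072-byte vector.
        System.out.println(densityPercent(4516, 131072));
        // c_email_address: 30327 data bytes in a 49152-byte vector.
        System.out.println(densityPercent(30327, 49152));
    }
}
```

With the log's values this yields 4 and 62, matching the densities reported for cs_sold_date_sk and c_email_address.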
      

            People

              Paul Rogers
              Rahul Kumar Challapalli
              Votes: 0
              Watchers: 6
