HIVE-20523: Improve table statistics for Parquet format

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Physical Optimizer
    • Labels: None

    Description

      Currently, in the basic table statistics, the raw data size of a row in the Parquet format is always 1, regardless of the data types of its columns. This underestimates the real size when columns hold complex data structures such as arrays.

      An underestimated raw data size makes Hive assign fewer containers (mappers/reducers) to the table, which slows down the overall query.
      Heavy underestimation also makes Hive choose a MapJoin instead of a ShuffleJoin, and the MapJoin can then fail with OOM errors.
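
      To make the join problem concrete, here is a toy model of a size-based join decision. It is only a sketch: the class, the method, and the 10 MB threshold are hypothetical, and the threshold merely stands in for a limit such as Hive's hive.auto.convert.join.noconditionaltask.size setting. The 400-byte row size is likewise an assumption (an array of 100 four-byte integers per row).

```java
public class JoinChoiceSketch {
    // Hypothetical threshold in bytes; stands in for a configured MapJoin size limit.
    public static final long MAP_JOIN_THRESHOLD_BYTES = 10_000_000L;

    /** Picks a MapJoin only when the estimated table size fits under the threshold. */
    public static String chooseJoin(long estimatedRawDataSize) {
        return estimatedRawDataSize <= MAP_JOIN_THRESHOLD_BYTES ? "MapJoin" : "ShuffleJoin";
    }

    public static void main(String[] args) {
        long rows = 5_000_000L;
        long underestimated = rows * 1L;     // raw data size of 1 per row, as in the description
        long closerToReality = rows * 400L;  // assumed: 100 ints of 4 bytes each per row

        System.out.println(chooseJoin(underestimated));   // prints MapJoin
        System.out.println(chooseJoin(closerToReality));  // prints ShuffleJoin
    }
}
```

      With the estimate of 1 per row, the table appears small enough to load into memory, even though a more realistic estimate is about 2 GB.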

      In this patch, I compute the column data sizes more accurately, taking complex structures into account. I followed the Writer implementation for the ORC format.
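
      The general idea of sizing nested values can be sketched as a recursive walk over each column value. This is not the code from the patch or from the ORC Writer: the class name, the per-type byte sizes, and the treatment of nulls as 0 bytes are all assumptions made for illustration, and plain Java collections stand in for Hive's own object model.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class RawSizeSketch {
    /** Estimates the raw size in bytes of one column value, recursing into nested structures. */
    public static long estimate(Object value) {
        if (value == null) return 0L;  // assumption: nulls contribute nothing
        if (value instanceof Boolean || value instanceof Byte) return 1L;
        if (value instanceof Short) return 2L;
        if (value instanceof Integer || value instanceof Float) return 4L;
        if (value instanceof Long || value instanceof Double) return 8L;
        if (value instanceof CharSequence) return ((CharSequence) value).length();
        if (value instanceof List) {
            // Arrays (and structs modeled as lists of fields): sum of the element sizes.
            long sum = 0L;
            for (Object element : (List<?>) value) sum += estimate(element);
            return sum;
        }
        if (value instanceof Map) {
            // Maps: sum of the sizes of all keys and values.
            long sum = 0L;
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                sum += estimate(e.getKey()) + estimate(e.getValue());
            }
            return sum;
        }
        throw new IllegalArgumentException("Unsupported type: " + value.getClass());
    }

    /** Raw data size of a row is the sum over its column values. */
    public static long estimateRow(List<?> columns) {
        return estimate(columns);
    }

    public static void main(String[] args) {
        // One row: an int, a 4-character string, and an array of three ints.
        List<Object> row = Arrays.<Object>asList(42, "hive", Arrays.asList(1, 2, 3));
        System.out.println(estimateRow(row));  // prints 20 (4 + 4 + 12)
    }
}
```

      Under these assumed sizes, a row with an array column grows with the number of elements instead of always counting as 1.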

      Attachments

        1. HIVE-20523.1.patch (14 kB, George Pachitariu)
        2. HIVE-20523.10.patch (267 kB, George Pachitariu)
        3. HIVE-20523.11.patch (252 kB, George Pachitariu)
        4. HIVE-20523.12.patch (252 kB, George Pachitariu)
        5. HIVE-20523.2.patch (15 kB, George Pachitariu)
        6. HIVE-20523.3.patch (92 kB, George Pachitariu)
        7. HIVE-20523.4.patch (15 kB, George Pachitariu)
        8. HIVE-20523.5.patch (16 kB, George Pachitariu)
        9. HIVE-20523.6.patch (17 kB, George Pachitariu)
        10. HIVE-20523.7.patch (260 kB, George Pachitariu)
        11. HIVE-20523.8.patch (979 kB, George Pachitariu)
        12. HIVE-20523.9.patch (979 kB, George Pachitariu)
        13. HIVE-20523.patch (5 kB, George Pachitariu)

            People

              Assignee: George Pachitariu (george.pachitariu)
              Reporter: George Pachitariu (george.pachitariu)
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved: