Hive / HIVE-18410

[Performance][Avro] Reading flat Avro tables is very expensive in Hive

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.1, 2.1.0, 2.3.2, 3.0.0
    • Fix Version/s: 3.1.0, 3.0.0
    • Component/s: None
    • Labels: None

    Description

      Reading flat Avro tables (tables with no nested fields) carries a performance penalty in Hive: reading the same flat dataset in Pig takes half the time. Profiling shows that much of the time is spent in AvroDeserializer.deserializeSingleItemNullableUnion(). The bulk of that time goes to GenericData.get().resolveUnion(), which calls GenericData.getSchemaName(Object datum), which in turn performs many instanceof checks. This path can be simplified for a performance gain. The attached patch describes one approach, which almost halves the runtime.
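
      The idea can be illustrated with a minimal, self-contained sketch. It is not the code from the attached patch and does not use the Avro API: the Type enum and the methods resolveGeneric, nonNullBranch, and resolveFast are hypothetical stand-ins. It assumes a two-branch union of the form [null, T] or [T, null], and that every non-null value already conforms to T. Under those assumptions, the non-null branch index can be computed once per column, and each value then needs only a null check instead of a chain of instanceof checks.

```java
import java.util.Arrays;
import java.util.List;

class NullableUnionSketch {
    // Hypothetical stand-in for Avro schema types; not org.apache.avro.Schema.Type.
    enum Type { NULL, BOOLEAN, INT, LONG, FLOAT, DOUBLE, STRING }

    // General path: classify the datum with a chain of instanceof checks,
    // then look the type up in the union. This cost is paid for every value.
    static int resolveGeneric(List<Type> union, Object datum) {
        Type type;
        if (datum == null) type = Type.NULL;
        else if (datum instanceof Boolean) type = Type.BOOLEAN;
        else if (datum instanceof Integer) type = Type.INT;
        else if (datum instanceof Long) type = Type.LONG;
        else if (datum instanceof Float) type = Type.FLOAT;
        else if (datum instanceof Double) type = Type.DOUBLE;
        else if (datum instanceof CharSequence) type = Type.STRING;
        else throw new IllegalArgumentException("Unsupported datum: " + datum.getClass());
        int index = union.indexOf(type);
        if (index < 0) throw new IllegalArgumentException(type + " is not a branch of " + union);
        return index;
    }

    // Computed once per column: index of the only non-null branch
    // of a union shaped like [null, T] or [T, null].
    static int nonNullBranch(List<Type> union) {
        if (union.size() != 2) {
            throw new IllegalArgumentException("Expected two branches: " + union);
        }
        boolean firstIsNull = union.get(0) == Type.NULL;
        boolean secondIsNull = union.get(1) == Type.NULL;
        if (firstIsNull == secondIsNull) {
            throw new IllegalArgumentException("Expected exactly one null branch: " + union);
        }
        return firstIsNull ? 1 : 0;
    }

    // Per-value fast path: a single null check replaces the instanceof chain.
    // Assumes a non-null datum already matches the non-null branch.
    static int resolveFast(int branch, Object datum) {
        return datum == null ? 1 - branch : branch;
    }

    public static void main(String[] args) {
        List<Type> union = Arrays.asList(Type.NULL, Type.STRING);
        int branch = nonNullBranch(union); // done once, outside the per-row loop
        for (Object datum : new Object[] {"hive", null}) {
            // Both paths should agree on the branch index for each value.
            System.out.println(resolveGeneric(union, datum) + " " + resolveFast(branch, datum));
        }
    }
}
```

      The trade-off in this sketch is that the fast path no longer validates the datum's type, so it is only safe where the writer schema already guarantees it.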

      Attachments

        1. profiling_with_patch.nps
          4 kB
          Ratandeep Ratti
        2. profiling_without_patch.nps
          11 kB
          Ratandeep Ratti
        3. profiling_with_patch.png
          238 kB
          Ratandeep Ratti
        4. profiling_without_patch.png
          257 kB
          Ratandeep Ratti
        5. HIVE-18410.patch
          13 kB
          Ratandeep Ratti
        6. HIVE-18410_1.patch
          11 kB
          Ratandeep Ratti
        7. HIVE-18410_2.patch
          13 kB
          Ratandeep Ratti
        8. HIVE-18410_3.patch
          13 kB
          Ratandeep Ratti

            People

              Assignee: rdsr Ratandeep Ratti
              Reporter: rdsr Ratandeep Ratti
              Votes: 2
              Watchers: 7
