  1. Apache Arrow
  2. ARROW-13965

[C++] dynamic_casts in parquet TypedColumnWriterImpl impacting performance


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.0.0
    • Component/s: C++
    • Environment: arrow 6.0.0-SNAPSHOT on both RHEL8 (gcc 8.4.1) and MacOS 11.5.2 (clang 11.0.0)

    Description

      The methods WriteDictionaryPage(), CheckDictionarySizeLimit(), WriteValues(), and WriteValuesSpaced() in TypedColumnWriterImpl (cpp/src/parquet/column_writer.cc) each dynamic_cast the current_dict_ object to either a DictEncoder or a ValueEncoderType pointer. When WriteBatch() is called with a large number of values, the cost of these casts is amortized and negligible, but when writing batches of 1 (as the stream API does), the casts can consume a great deal of CPU. Profiling with gperftools against code I wrote to do a log-structured merge of several Parquet files, I measured the dynamic_casts taking as much as 25% of execution time.

      By modifying TypedColumnWriterImpl to save downcast observer pointers of the appropriate types, I cut my execution time from 32 to 24 seconds, which validates the gperftools results. I've attached a patch showing the change.
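
For readers without the patch at hand, the general pattern can be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached patch: Encoder, PlainEncoder, CastingWriter, and CachingWriter are hypothetical stand-ins for the Parquet classes, and the real TypedColumnWriterImpl is considerably more involved. The first writer repeats the dynamic_cast on every single-value write; the second performs it once when the encoder is installed and keeps a non-owning (observer) pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the encoder hierarchy. These are NOT the real
// Arrow/Parquet classes, only enough structure to show the pattern.
class Encoder {
 public:
  virtual ~Encoder() = default;
};

class PlainEncoder : public Encoder {
 public:
  void Put(int32_t value) { values_.push_back(value); }
  const std::vector<int32_t>& values() const { return values_; }

 private:
  std::vector<int32_t> values_;
};

// Before: every single-value write pays for a dynamic_cast.
class CastingWriter {
 public:
  explicit CastingWriter(std::unique_ptr<Encoder> encoder)
      : encoder_(std::move(encoder)) {}

  void WriteValue(int32_t value) {
    auto* typed = dynamic_cast<PlainEncoder*>(encoder_.get());
    assert(typed != nullptr);
    typed->Put(value);
  }

  const std::vector<int32_t>& values() const {
    return dynamic_cast<const PlainEncoder&>(*encoder_).values();
  }

 private:
  std::unique_ptr<Encoder> encoder_;
};

// After: downcast once when the encoder is installed and cache the result
// as a non-owning observer pointer next to the owning pointer.
class CachingWriter {
 public:
  explicit CachingWriter(std::unique_ptr<Encoder> encoder) {
    SetEncoder(std::move(encoder));
  }

  // Any code path that replaces the encoder must go through here so the
  // cached pointer never dangles or refers to a stale object.
  void SetEncoder(std::unique_ptr<Encoder> encoder) {
    encoder_ = std::move(encoder);
    typed_encoder_ = dynamic_cast<PlainEncoder*>(encoder_.get());
  }

  void WriteValue(int32_t value) {
    assert(typed_encoder_ != nullptr);
    typed_encoder_->Put(value);  // no cast on the hot path
  }

  const std::vector<int32_t>& values() const {
    assert(typed_encoder_ != nullptr);
    return typed_encoder_->values();
  }

 private:
  std::unique_ptr<Encoder> encoder_;       // owning pointer
  PlainEncoder* typed_encoder_ = nullptr;  // observer, refreshed in SetEncoder
};
```

Both writers store the same values; the difference is only where the cast happens. The trade-off is that the cached pointer has to be refreshed whenever the owning pointer is reassigned, which is why the sketch funnels every replacement through SetEncoder().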

      Attachments

        1. arrow_downcast.patch
          5 kB
          Edward Seidl

            People

              Assignee: etseidl Edward Seidl
              Reporter: etseidl Edward Seidl
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 50m