Details
-
Improvement
-
Status: Resolved
-
Minor
-
Resolution: Fixed
-
None
-
arrow 6.0.0-SNAPSHOT on both RHEL8 (gcc 8.4.1) and macOS 11.5.2 (clang 11.0.0)
Description
The methods WriteDictionaryPage(), CheckDictionarySizeLimit(), WriteValues(), and WriteValuesSpaced() in TypedColumnWriterImpl (cpp/src/parquet/column_writer.cc) each perform a dynamic_cast of the current_dict_ object to either a DictEncoder or a ValueEncoderType pointer. When WriteBatch() is called with a large number of values, the cost of these casts is amortized and does not matter. When writing batches of 1 (as happens with the stream API), the casts run on every call and can consume a great deal of CPU. Profiling code I wrote to do a log-structured merge of several parquet files with gperftools, I measured the dynamic_casts taking as much as 25% of execution time.
By modifying TypedColumnWriterImpl to save downcasted observer pointers of the appropriate types, I was able to cut my execution time from 32 to 24 seconds, which validates the gperftools results. I've attached a patch to show what I did.
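For readers without the patch at hand, the idea of saving downcasted observer pointers can be sketched roughly as follows. This is not the attached patch and not the real Arrow code: Encoder, ValueEncoder, DictEncoder, ColumnWriterSketch, num_buffered(), and WriteOneAtATime() are hypothetical stand-ins, and the virtual inheritance is an assumption made only so that the downcast genuinely requires dynamic_cast. The point is that the cast is paid once at construction rather than on every single-value write.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the encoder hierarchy (not the real Arrow classes).
class Encoder {
 public:
  virtual ~Encoder() = default;
};

// Virtual inheritance (an assumption of this sketch) rules out static_cast
// from Encoder*, so the downcast needs dynamic_cast.
class ValueEncoder : virtual public Encoder {
 public:
  virtual void Put(const int32_t* values, int num_values) = 0;
};

class DictEncoder : public ValueEncoder {
 public:
  void Put(const int32_t* values, int num_values) override {
    values_.insert(values_.end(), values, values + num_values);
  }
  int num_buffered() const { return static_cast<int>(values_.size()); }

 private:
  std::vector<int32_t> values_;
};

class ColumnWriterSketch {
 public:
  explicit ColumnWriterSketch(std::unique_ptr<Encoder> encoder)
      : encoder_(std::move(encoder)),
        // Pay for each dynamic_cast once here, not on every write.
        value_encoder_(dynamic_cast<ValueEncoder*>(encoder_.get())),
        dict_encoder_(dynamic_cast<DictEncoder*>(encoder_.get())) {
    assert(value_encoder_ != nullptr);
  }

  // Hot path: uses the cached pointer, so no cast per call.
  void WriteBatch(const int32_t* values, int num_values) {
    value_encoder_->Put(values, num_values);
  }

  bool has_dictionary() const { return dict_encoder_ != nullptr; }
  int num_buffered() const {
    return dict_encoder_ != nullptr ? dict_encoder_->num_buffered() : 0;
  }

 private:
  std::unique_ptr<Encoder> encoder_;  // owns the encoder
  ValueEncoder* value_encoder_;       // non-owning observer, cached downcast
  DictEncoder* dict_encoder_;         // non-owning observer, null if not a dict encoder
};

// Mimics the stream-style access pattern: n calls with a batch size of 1.
int WriteOneAtATime(ColumnWriterSketch& writer, int n) {
  for (int32_t i = 0; i < n; ++i) {
    writer.WriteBatch(&i, 1);
  }
  return writer.num_buffered();
}
```

The cached raw pointers are only valid while encoder_ holds the same object, so any code that replaces the encoder would have to refresh them at the same time.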
Attachments