Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
The new filter API seems to be much slower (or perhaps I'm using it wrong :)
Code using an UnboundRecordFilter:
ColumnRecordFilter.column(column, ColumnPredicates.applyFunctionToBinary(input -> Binary.fromString(value).equals(input)));
vs. code using FilterPredicate:
eq(binaryColumn(column), Binary.fromString(value));
The latter runs about twice as slowly on the same Parquet file (built with 1.6.0rc2).
Note: the reader is constructed with
ParquetReader.builder(new ProtoReadSupport(), path).withFilter(filter).build();
The approach based on the new filter API also seems to create far more garbage (perhaps due to reassembling all the rows before filtering?).
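For context, a complete read loop using the FilterPredicate-based API might look like the sketch below. This is an illustration only: it assumes parquet-mr 1.6.x (the `parquet.*` packages were renamed to `org.apache.parquet.*` in 1.7.0), parquet-protobuf and Hadoop on the classpath, and placeholder values for the file path, column name, and match value.

```java
// Sketch of reading a Parquet file with a pushed-down FilterPredicate.
// Assumes parquet-mr 1.6.x; all names marked "placeholder" are illustrative.
import org.apache.hadoop.fs.Path;
import parquet.filter2.compat.FilterCompat;
import parquet.filter2.predicate.FilterPredicate;
import parquet.hadoop.ParquetReader;
import parquet.io.api.Binary;
import parquet.proto.ProtoReadSupport;

import static parquet.filter2.predicate.FilterApi.binaryColumn;
import static parquet.filter2.predicate.FilterApi.eq;

public class FilterPredicateExample {
  public static void main(String[] args) throws Exception {
    String column = "name";       // placeholder column name
    String value = "some-value";  // placeholder value to match

    // Build the predicate once; FilterCompat wraps it for the reader.
    FilterPredicate pred = eq(binaryColumn(column), Binary.fromString(value));

    ParquetReader<Object> reader = ParquetReader
        .builder(new ProtoReadSupport<>(), new Path(args[0])) // placeholder path
        .withFilter(FilterCompat.get(pred))
        .build();

    // read() returns only records matching the predicate, or null at EOF.
    Object record;
    while ((record = reader.read()) != null) {
      // process matching record
    }
    reader.close();
  }
}
```

In principle a FilterPredicate can be pushed down to skip whole row groups via column statistics, so the slowdown reported above is surprising; it may be that record-level evaluation still assembles each row first.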
Issue Links
- is related to PARQUET-182: FilteredRecordReader skips rows it shouldn't for schema with optional columns (Open)