[PARQUET-363] Cannot construct empty MessageType for ReadContext.requestedSchema - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.8.0, 1.8.1
Fix Version/s: 1.9.0, 1.8.2
Component/s: parquet-mr
Labels:
None

Description

In parquet-mr 1.8.1, constructing empty GroupType (and thus MessageType) is not allowed anymore (see ~~PARQUET-278~~). This change makes sense in most cases since Parquet doesn't support empty groups. However, there is one use case where an empty MessageType is valid, namely passing an empty MessageType as the requestedSchema constructor argument of ReadContext when counting rows in a Parquet file. The reason why it works is that, Parquet can retrieve row count from block metadata without materializing any columns. Take the following PySpark shell snippet (1.5-SNAPSHOT, which uses parquet-mr 1.7.0) as an example:

>>> path = 'file:///tmp/foo'
>>> # Writes 10 integers into a Parquet file
>>> sqlContext.range(10).coalesce(1).write.mode('overwrite').parquet(path)
>>> sqlContext.read.parquet(path).count()

10

Parquet related log lines:

15/08/21 12:32:04 INFO CatalystReadSupport: Going to read the following fields from the Parquet file:

Parquet form:
message root {
}


Catalyst form:
StructType()

15/08/21 12:32:04 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 10 records.
15/08/21 12:32:04 INFO InternalParquetRecordReader: at row 0. reading next block
15/08/21 12:32:04 INFO InternalParquetRecordReader: block read in memory in 0 ms. row count = 10

We can see that Spark SQL passes no requested columns to the underlying Parquet reader. What happens here is that:

Spark SQL creates a CatalystRowConverter with zero converters (and thus only generates empty rows).
InternalParquetRecordReader first obtain the row count from block metadata (here).
MessageColumnIO returns an EmptyRecordRecorder for reading the Parquet file (here).
InternalParquetRecordReader.nextKeyValue() is invoked n times, where n equals to the row count. Each time, it invokes the converter created by Spark SQL and produces an empty Spark SQL row object.

This issue is also the cause of HIVE-11611. Because when upgrading to Parquet 1.8.1, Hive worked around this issue by using tableSchema as requestedSchema when no columns are requested (here). IMO this introduces a performance regression in cases like counting, because now we need to materialize all columns just for counting.

Attachments

Issue Links

relates to

HIVE-11611 A bad performance regression issue with Parquet happens if Hive does not select any columns

Reopened

PARQUET-278 enforce non empty group on MessageType level

Resolved

links to

PR #263

Activity

People

Assignee:: Ryan Blue

Reporter:: Cheng Lian

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 21/Aug/15 06:46

Updated:: 23/Jun/24 03:27

Resolved:: 11/Sep/15 22:15