[FLINK-21397] BufferUnderflowException when read parquet - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.12.1
Fix Version/s: 1.13.0
Component/s: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Labels:
- auto-deprioritized-critical

Description

Reading a Parquet file fails with a BufferUnderflowException.

When every page in the Parquet file is encoded as PLAIN_DICTIONARY, reading works. If the file contains 3 pages where page 0 and page 1 are PLAIN_DICTIONARY and page 2 is PLAIN, Flink throws the exception after it finishes reading page 0 and page 1.
The source Parquet file was written by Flink 1.11.

Parquet file info:

row group 0
--------------------------------------------------------------------------------
oid: BINARY SNAPPY DO:0 FPO:4 SZ:625876/1748820/2.79 VC:95192 ENC:BIT [more]...oid TV=95192 RL=0 DL=1 DS: 36972 DE:PLAIN_DICTIONARY
--------------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY [more]... SZ:70314
page 1: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY [more]... SZ:74850
page 2: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[m [more]... SZ:568184
BINARY oid
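
For readers unfamiliar with the layout above, here is a minimal sketch of what "two PLAIN_DICTIONARY pages followed by one PLAIN page" means for a reader: the decoder has to be chosen per page, not once per column chunk. This is not Flink's or Parquet's reader code. `MixedEncodingSketch`, `Page`, and `readColumn` are hypothetical names, and dictionary ids are modelled as a plain `int[]` rather than the RLE/bit-packed form used on disk. The PLAIN layout for BINARY values (a 4-byte little-endian length followed by the bytes) follows the Parquet format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model; not Flink's or Parquet's actual reader classes.
class MixedEncodingSketch {

    enum Encoding { PLAIN_DICTIONARY, PLAIN }

    // One data page: holds either dictionary ids or PLAIN-encoded bytes.
    static final class Page {
        final Encoding encoding;
        final int[] dictionaryIds;   // set only for PLAIN_DICTIONARY pages
        final ByteBuffer plainBytes; // set only for PLAIN pages

        private Page(Encoding encoding, int[] ids, ByteBuffer bytes) {
            this.encoding = encoding;
            this.dictionaryIds = ids;
            this.plainBytes = bytes;
        }

        static Page dictionary(int[] ids) {
            return new Page(Encoding.PLAIN_DICTIONARY, ids, null);
        }

        // PLAIN BINARY layout: 4-byte little-endian length, then the bytes.
        static Page plain(String... values) {
            int size = 0;
            for (String v : values) {
                size += 4 + v.getBytes(StandardCharsets.UTF_8).length;
            }
            ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
            for (String v : values) {
                byte[] bytes = v.getBytes(StandardCharsets.UTF_8);
                buf.putInt(bytes.length);
                buf.put(bytes);
            }
            buf.flip();
            return new Page(Encoding.PLAIN, null, buf);
        }
    }

    // Reads every page of one column, switching decoder whenever the page encoding changes.
    static List<String> readColumn(String[] dictionary, List<Page> pages) {
        List<String> values = new ArrayList<>();
        for (Page page : pages) {
            if (page.encoding == Encoding.PLAIN_DICTIONARY) {
                // Dictionary page: each id is an index into the dictionary.
                for (int id : page.dictionaryIds) {
                    values.add(dictionary[id]);
                }
            } else {
                // PLAIN page: values are stored inline, length-prefixed.
                // duplicate() resets the byte order to big-endian, so set it again.
                ByteBuffer in = page.plainBytes.duplicate().order(ByteOrder.LITTLE_ENDIAN);
                while (in.hasRemaining()) {
                    byte[] bytes = new byte[in.getInt()];
                    in.get(bytes);
                    values.add(new String(bytes, StandardCharsets.UTF_8));
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
                Page.dictionary(new int[] {0, 1}),
                Page.dictionary(new int[] {1, 1}),
                Page.plain("c", "dd"));
        System.out.println(readColumn(new String[] {"a", "b"}, pages)); // prints [a, b, b, b, c, dd]
    }
}
```

The sketch only illustrates the page layout. It makes no claim about where the reader in this report goes wrong.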

Exception message:

Caused by: java.nio.BufferUnderflowException
    at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:151)
    at java.nio.ByteBuffer.get(ByteBuffer.java:715)
    at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:422)
    at org.apache.flink.formats.parquet.vector.reader.BytesColumnReader.readBatchFromDictionaryIds(BytesColumnReader.java:77)
    at org.apache.flink.formats.parquet.vector.reader.BytesColumnReader.readBatchFromDictionaryIds(BytesColumnReader.java:31)
    at org.apache.flink.formats.parquet.vector.reader.AbstractColumnReader.readToVector(AbstractColumnReader.java:186)
    at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat$ParquetReader.nextBatch(ParquetVectorizedInputFormat.java:363)
    at org.apache.flink.formats.parquet.ParquetVectorizedInputFormat$ParquetReader.readBatch(ParquetVectorizedInputFormat.java:334)
    at org.apache.flink.connector.file.src.impl.FileSourceSplitReader.fetch(FileSourceSplitReader.java:71)
    at org.apache.flink.connector.base.source.reader.fetcher.FetchTask.run(FetchTask.java:56)
    at org.apache.flink.connector.base.source.reader.fetcher.SplitFetcher.runOnce(SplitFetcher.java:138)
    ... 6 more
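
The top two frames of the trace are standard JDK behaviour: `ByteBuffer.get(byte[])` throws `BufferUnderflowException` when fewer bytes remain in the buffer than the destination array asks for. The sketch below reproduces only that mechanism with a heap buffer. `UnderflowSketch`, `copyBytes`, and `underflows` are hypothetical names, and the code does not show why the buffer was short in this report.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

// Hypothetical demo class; not part of Flink or Parquet.
class UnderflowSketch {

    // Copies `length` bytes starting at the buffer's current position.
    // duplicate() shares the content but has its own position and limit,
    // so the caller's buffer is not advanced.
    static byte[] copyBytes(ByteBuffer buffer, int length) {
        byte[] out = new byte[length];
        buffer.duplicate().get(out); // throws if length > remaining()
        return out;
    }

    // Returns true when copying `length` bytes would underflow the buffer.
    static boolean underflows(ByteBuffer buffer, int length) {
        try {
            copyBytes(buffer, length);
            return false;
        } catch (BufferUnderflowException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {1, 2, 3, 4});
        System.out.println(underflows(buf, 4)); // prints false: 4 bytes remain
        buf.position(2);                        // position moved; only 2 bytes remain
        System.out.println(underflows(buf, 4)); // prints true
    }
}
```

Any situation in which the expected value length and the buffer's remaining bytes disagree produces this exception, for example a stale position or a length taken from a different page or dictionary.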

Attachments

part-f33924c5-99c3-4177-9a9a-e2d5c71a799a-1-2324.snappy.parquet
18/Feb/21 11:50
611 kB
Lihe Ma

Issue Links

is duplicated by

FLINK-22202 Thread safety in ParquetColumnarRowInputFormat

Closed

People

Assignee: Unassigned

Reporter: Lihe Ma

Votes: 1

Watchers: 5

Dates

Created: 18/Feb/21 11:51

Updated: 07/May/21 02:27

Resolved: 07/May/21 02:27