[FLINK-22143] Flink returns less rows than expected when using limit in SQL - ASF JIRA

Flink's blink runtime returns fewer rows than expected when querying Hive tables with a limit clause.

// sql
select i_item_sk from tpcds_1g_snappy.item limit 5000;

The above query returns only 4998 rows in some cases.

This problem can be reproduced under the following conditions:

The Hive table is stored in Parquet format.
The SQL query uses `limit` and runs on the blink planner, in Flink 1.12.0 or later.
The input table is small: a single data file containing a single row group (e.g. the 1 GB TPCDS benchmark data).
The number of rows requested by `limit` exceeds the batch size (2048 by default).

Investigation shows that the bug is in the LimitableBulkFormat class.

In this class, for each batch, numRead is incremented one more time than the actual number of rows returned by reader.readBatch().

The reason is that numRead is incremented even when next() reaches the end of the current batch and has no row to return.

The limit is therefore considered reached before the requested number of rows has been emitted. If there is only 1 input split, no other split contributes rows to make up the difference in the final result.

As a result, Flink returns fewer rows than requested.
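
The arithmetic behind the 4998 figure can be checked with a small standalone simulation. This is only a sketch of the counting behaviour described above, not the actual LimitableBulkFormat source: the class name LimitCountDemo, the method rowsReturned, and the table size of 18000 rows are assumptions made for illustration. With a batch size of 2048 and a limit of 5000, two full batches are consumed before the limit is hit, so the counter runs two ahead of the rows actually emitted.

```java
// Simplified simulation of the counting problem; NOT the actual Flink code.
// All names here (LimitCountDemo, rowsReturned) are hypothetical.
final class LimitCountDemo {

    /**
     * Returns how many rows a limited reader emits when the input arrives
     * in batches. With countEndOfBatch = true, the counter also advances on
     * the call that finds the current batch exhausted (the reported bug).
     * With countEndOfBatch = false, only real rows are counted.
     */
    static long rowsReturned(long totalRows, int batchSize, long limit,
                             boolean countEndOfBatch) {
        long numRead = 0;          // counter compared against the limit
        long emitted = 0;          // rows actually handed to the caller
        long remaining = totalRows;

        while (remaining > 0 && numRead < limit) {
            int batch = (int) Math.min(batchSize, remaining);
            remaining -= batch;
            int pos = 0;
            while (numRead < limit) {
                boolean hasRow = pos < batch;
                if (hasRow || countEndOfBatch) {
                    numRead++;     // buggy mode counts the end-of-batch call too
                }
                if (!hasRow) {
                    break;         // end of this batch, fetch the next one
                }
                pos++;
                emitted++;
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // 18000 input rows is an assumed table size for illustration.
        System.out.println(rowsReturned(18000, 2048, 5000, true));
        System.out.println(rowsReturned(18000, 2048, 5000, false));
    }
}
```

In this model, counting only the calls that actually yield a row returns the full 5000 rows, and a limit below the batch size is unaffected in either mode, which is consistent with the reproduction conditions listed above.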

links to

GitHub Pull Request #15513
