FLINK-22143: Flink returns fewer rows than expected when using limit in SQL


Details

    Description

      Flink's blink runtime returns fewer rows than expected when querying Hive tables with limit.

      // sql
      select i_item_sk from tpcds_1g_snappy.item limit 5000;
      

       

      The above query returns only 4998 rows in some cases.

       

      This problem can be reproduced under the following conditions:

      1. The source is a Hive table in parquet format.
      2. The SQL uses `limit` and runs on the blink planner, in Flink 1.12.0 or later.
      3. The input table is small: only 1 data file, containing only 1 row group (e.g. 1 GB of TPCDS benchmark data).
      4. The number of rows requested by `limit` is larger than the batch size (2048 by default).

       

      After investigation, the bug was traced to the LimitableBulkFormat class.

      In this class, for each batch, numRead is increased by 1 more than the actual number of rows returned by reader.readBatch().

      The reason is that numRead is incremented even when next() reaches the end of the current batch.

      With a limit of 5000 and a batch size of 2048, two full batches are consumed before the limit is reached, so numRead is over-counted by 2, which is consistent with the 4998 rows observed.

      If there is only 1 input split, no further rows are merged into the final result.

      As a result, Flink returns fewer rows than requested.
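The counting problem described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the code of LimitableBulkFormat: the class name `LimitCountSketch`, the method `rowsReturned`, the flag `countEndOfBatch` and the table size of 18000 rows are all hypothetical and chosen for illustration. It models a single input split read in fixed-size batches, with a counter that is compared against the limit.

```java
// Simplified model of a limit-enforcing batch reader.
// NOT Flink's actual code; all names here are hypothetical.
final class LimitCountSketch {

    /**
     * Returns how many rows are emitted for the given limit when one split of
     * totalRows rows is read in batches of batchSize.
     *
     * If countEndOfBatch is true, the counter is also incremented by the
     * next() call that only detects the end of a batch (the behaviour
     * described in this issue). If false, only real rows are counted.
     */
    static int rowsReturned(int limit, int batchSize, int totalRows, boolean countEndOfBatch) {
        int numRead = 0;           // counter compared against the limit
        int emitted = 0;           // rows actually handed to the caller
        int remaining = totalRows; // rows left in the single split

        while (remaining > 0 && numRead < limit) {
            int batch = Math.min(batchSize, remaining);
            remaining -= batch;
            int pos = 0;
            // Each loop iteration models one call to next() on the current batch.
            while (numRead < limit) {
                boolean hasRow = pos < batch;
                if (hasRow || countEndOfBatch) {
                    numRead++;     // with countEndOfBatch, the end-of-batch call counts too
                }
                if (!hasRow) {
                    break;         // end of batch, move on to the next one
                }
                pos++;
                emitted++;
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Assumed table size of 18000 rows; limit and batch size as in the report.
        System.out.println(rowsReturned(5000, 2048, 18000, true));  // prints 4998
        System.out.println(rowsReturned(5000, 2048, 18000, false)); // prints 5000
    }
}
```

In this model, a limit below the batch size is unaffected, because the limit is reached before any end-of-batch call happens. That matches condition 4 above. Counting a row only when next() actually returns one is one possible way to avoid the over-count; whether the real fix takes this form is not stated in this issue.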

       

            People

              iyupeng Peng Yu


                Time Tracking

                  Estimated: 1h
                  Remaining: 1h
                  Logged: Not Specified