Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
2.4.0, 3.1.3, 4.0.0-alpha-2
-
None
Description
Without ORDER BY, Hive doesn't add an additional single reducer, and it pushes OFFSET and LIMIT to the original vertex. It can generate incorrect results because OFFSET should be counted globally unlike LIMIT.
We would make the following changes.
- To fix the incorrect behavior
- To add a new `hive.strict.checks.*` to prevent this usage. Mostly, OFFSET without ORDER BY is meaningless
We can reproduce the issue in the following steps.
Prepare test data
The following SQLs generate a test table with multiple files.
$ beeline -e " drop table test; create table test(id int); insert into test values (1), (2), (3), (4), (5), (6), (7), (8), (9), (10); insert into test values (11), (12), (13), (14), (15), (16), (17), (18), (19), (20); insert into test values (21), (22), (23), (24), (25), (26), (27), (28), (29), (30); "
Reproduce the issue
No rows are returned with `limit 10 offset 10`.
$ beeline -e "select * from test limit 10, 10" --hiveconf hive.fetch.task.conversion=none --hiveconf tez.grouping.min-size=1 --hiveconf tez.grouping.max-size=1 ... +----------+ | test.id | +----------+ +----------+ No rows selected (13.595 seconds)
Expected behavior
It should return any 10 rows like this.
$ beeline -e "select * from test limit 10, 10" ... +----------+ | test.id | +----------+ | 11 | | 12 | | 13 | | 14 | | 15 | | 16 | | 17 | | 18 | | 19 | | 20 | +----------+ 10 rows selected (0.175 seconds)