IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.
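
The open/write/close-one-by-one behaviour can be illustrated with a small
model. This is only a sketch in Python, not the HdfsTableSink code itself:
the ClusteredSink class, its method names, and the (partition key, value)
row layout are all hypothetical. It assumes the input is clustered, i.e. all
rows of a partition arrive contiguously, possibly spread over several row
batches.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Tuple[str, int]  # (partition key, value); hypothetical row layout


class ClusteredSink:
    """Toy sink that keeps at most one partition open at a time."""

    def __init__(self) -> None:
        self.files: Dict[str, List[int]] = {}  # closed partition -> values
        self.events: List[str] = []            # open/close log for inspection
        self._open_key: Optional[str] = None
        self._buffer: List[int] = []

    def send(self, batch: Iterable[Row]) -> None:
        """Consume one row batch; a partition may span several batches."""
        for key, value in batch:
            if key != self._open_key:
                # Key changed: the previous partition is complete.
                self._close_current()
                if key in self.files:
                    raise ValueError(f"input is not clustered: {key!r} reappeared")
                self._open_key = key
                self.events.append(f"open {key}")
            self._buffer.append(value)

    def close(self) -> None:
        """Flush the last open partition at end of input."""
        self._close_current()

    def _close_current(self) -> None:
        if self._open_key is None:
            return
        self.files[self._open_key] = self._buffer
        self.events.append(f"close {self._open_key}")
        self._open_key, self._buffer = None, []


sink = ClusteredSink()
sink.send([("p=1", 10), ("p=1", 11)])
sink.send([("p=1", 12), ("p=2", 20)])  # partition p=1 continues across batches
sink.close()
```

In this model only one partition is ever open, so the amount of open writer
state does not grow with the number of partitions in the insert.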
This change also adds/modifies tests in several ways:
- clustered insert tests switch from selecting all rows from
alltypessmall to alltypes. Together with varying settings for
batch_size, this results in a larger number of row batches.
- clustered insert tests select from alltypes instead of
functional.alltypes to make sure we also select from various input
formats.
- clustered insert tests have been added to select from alltypestiny to
create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
16, 0 (meaning default, 1024). This is limited to uncompressed parquet
files, to maintain a reasonable runtime. On my machine execution of
test.insert took 1778 seconds, compared to 1002 seconds with just the
default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
that insertion over several row batches only creates one file per
partition.
- It renames the test_insert method to make it unique in the file and
allow for effective filtering with -k.
- It adds tests to the Analyzer test suite.
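
The idea behind varying batch_size and checking the file count can be
sketched as follows. This is a simplified Python model, not the code in
test_insert_behaviour.py: the helper names and the month=N partition keys
are made up for illustration, and a "file" is counted each time the modelled
writer opens a partition. With clustered input, the count should be one per
partition regardless of how the rows are split into batches.

```python
from itertools import islice
from typing import Dict, Iterator, List, Sequence

DEFAULT_BATCH_SIZE = 1024  # batch_size=0 means "use the default"


def batches(rows: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Split rows into row batches of at most batch_size rows."""
    size = batch_size or DEFAULT_BATCH_SIZE
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def files_per_partition(rows: Sequence[str], batch_size: int) -> Dict[str, int]:
    """Count the files a one-partition-at-a-time writer would create."""
    counts: Dict[str, int] = {}
    open_key = None
    for batch in batches(rows, batch_size):
        for key in batch:
            if key != open_key:
                # A new file is started whenever the partition key changes.
                counts[key] = counts.get(key, 0) + 1
                open_key = key
    return counts


# Clustered input: 12 partitions with 50 rows each, keys are contiguous.
rows = [f"month={m}" for m in range(1, 13) for _ in range(50)]
results = {bs: files_per_partition(rows, bs) for bs in (1, 16, 0)}
```

Each entry of results maps a batch size to the per-partition file counts, so
a check only needs to verify that every count is 1.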
Reviewed-by: Lars Volker <firstname.lastname@example.org>
Reviewed-by: Tim Armstrong <email@example.com>
Tested-by: Internal Jenkins