IMPALA-2523

Make HdfsTableSink aware of clustered input

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.2, Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0
    • Fix Version/s: Impala 2.8.0
    • Component/s: Backend

      Description

      The HdfsParquetTableWriter needs to know that incoming data is clustered when the corresponding insert statement has a "clustered" hint. In that case only a single open partition should be maintained at a time, and it should be flushed by calling FinalizePartitionFile() whenever the partition-key values change.

      For now, no changes should be made to other table sinks.
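The behaviour requested above can be illustrated with a minimal Python sketch. It is not Impala code (the backend is C++); the class, the `(partition_key, value)` row shape, and the in-memory "files" are all hypothetical, and `_finalize_current` merely stands in for a call such as FinalizePartitionFile(). The sketch assumes the input really is clustered, i.e. all rows of one partition arrive contiguously.

```python
from typing import Iterable, List, Optional, Tuple

Row = Tuple[str, int]  # hypothetical row shape: (partition_key, value)


class ClusteredSinkSketch:
    """Toy sink that keeps at most one partition open at any time."""

    def __init__(self) -> None:
        self.current_key: Optional[str] = None  # key of the open partition
        self.buffer: List[int] = []             # rows of the open partition
        # Stands in for finalized files: one (key, rows) entry per file.
        self.finalized: List[Tuple[str, List[int]]] = []

    def write_batch(self, batch: Iterable[Row]) -> None:
        for key, value in batch:
            if key != self.current_key:
                # Partition key changed: flush the open partition first.
                self._finalize_current()
                self.current_key = key
            self.buffer.append(value)

    def _finalize_current(self) -> None:
        # Hypothetical stand-in for finalizing the partition file.
        if self.current_key is not None:
            self.finalized.append((self.current_key, self.buffer))
            self.buffer = []
            self.current_key = None

    def close(self) -> None:
        # Flush whatever partition is still open at the end of input.
        self._finalize_current()


def write_clustered(batches: Iterable[Iterable[Row]]) -> List[Tuple[str, List[int]]]:
    """Feed row batches through the sketch and return the finalized 'files'."""
    sink = ClusteredSinkSketch()
    for batch in batches:
        sink.write_batch(batch)
    sink.close()
    return sink.finalized
```

Because the partition stays open across calls to `write_batch`, a partition whose rows span two row batches still ends up in a single entry. If the input were not clustered, this sketch would emit one entry per contiguous run of a key, which is why it only makes sense together with the "clustered" hint.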

          Activity

          alex.behm Alexander Behm added a comment -

          This really is part of IMPALA-2521

          lv Lars Volker added a comment -

          Alexander Behm - Should we target both in a single change?

          alex.behm Alexander Behm added a comment -

          Lars Volker, let's keep them separate.

          lv Lars Volker added a comment -

          Partition handling seems to happen in the HDFS table sink. Should we really confine this to Parquet files only? It looks to me as if it naturally generalizes to all partitioned HDFS file formats.

          mmokhtar Mostafa Mokhtar added a comment -

          Lars Volker, ideally this should handle all supported file formats.

          lv Lars Volker added a comment -

          IMPALA-2523: Make HdfsTableSink aware of clustered input

          IMPALA-2521 introduced clustering for insert statements. This change
          makes the HdfsTableSink aware of clustered inputs, so that partitions
          are opened, written, and closed one by one.

          This change also adds/modifies tests in several ways:

          • clustered insert tests switch from selecting all rows from
            alltypessmall to alltypes. Together with varying settings for
            batch_size, this results in a larger number of row batches being
            written.
          • clustered insert tests select from alltypes instead of
            functional.alltypes to make sure we also select from various input
            formats.
          • clustered insert tests have been added to select from alltypestiny to
            create inserts with 1 and 2 rows per partition respectively.
          • exhaustive insert tests now use different values for batch_size: 1,
            16, 0 (meaning default, 1024). This is limited to uncompressed Parquet
            files, to maintain a reasonable runtime. On my machine execution of
            test.insert took 1778 seconds, compared to 1002 seconds with just the
            default row batch size.
          • There is additional testing in test_insert_behaviour.py to make sure
            that insertion over several row batches only creates one file per
            partition.
          • It renames the test_insert method to make it unique in the file and
            allow for effective filtering with -k.
          • It adds tests to the Analyzer test suite.

          Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
          Reviewed-on: http://gerrit.cloudera.org:8080/4863
          Reviewed-by: Lars Volker <lv@cloudera.com>
          Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
          Tested-by: Internal Jenkins
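The property the new tests check (one file per partition, regardless of how rows are split into row batches) can be sketched in Python as follows. This is an illustration only, not the code in test_insert_behaviour.py: the helper names and the `(partition_key, value)` row shape are hypothetical, and "files" are simply counted per contiguous run of a partition key. The only detail taken from the commit message is that a batch_size of 0 means the default of 1024.

```python
from itertools import islice
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple

Row = Tuple[Hashable, int]  # hypothetical row shape: (partition_key, value)

DEFAULT_BATCH_SIZE = 1024  # per the commit message, batch_size=0 means 1024


def make_batches(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Split rows into row batches of at most batch_size rows each."""
    size = batch_size if batch_size > 0 else DEFAULT_BATCH_SIZE
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def files_per_partition(batches: Iterable[List[Row]]) -> Dict[Hashable, int]:
    """Count files a one-open-partition sink would write per partition.

    A new file is started every time the partition key differs from the
    key of the previous row, even across row batch boundaries.
    """
    counts: Dict[Hashable, int] = {}
    started = False
    current = None
    for batch in batches:
        for key, _value in batch:
            if not started or key != current:
                counts[key] = counts.get(key, 0) + 1
                current = key
                started = True
    return counts


# Clustered input: 4 partitions with 25 contiguous rows each.
clustered_rows = [(month, i) for month in range(1, 5) for i in range(25)]
# Unclustered input: the same number of rows with interleaved keys.
interleaved_rows = [(i % 4, i) for i in range(100)]
```

With `clustered_rows`, the count is 1 for every partition for each of the batch sizes 1, 16 and 0, even though batches of 16 rows straddle partition boundaries. With `interleaved_rows`, every row starts a new file, which shows why the sink may only rely on this strategy when the planner has clustered the input.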


            People

            • Assignee: Lars Volker
            • Reporter: Mostafa Mokhtar