Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-3742

INSERTs into Kudu tables should partition and sort

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Kudu_Impala
    • Fix Version/s: Impala 2.9.0
    • Component/s: Backend
    • Labels:

      Description

      Inserts into Kudu tables should be partitioned (i.e. rows hashed using the same hash partitioning as the Kudu table) and, at the table sink, sorted on the primary key. This would significantly improve performance.

      This will require a local sort (IMPALA-2521), and support from Kudu to provide the partitioning.

        Issue Links

          Activity

          Hide
          lv Lars Volker added a comment -

          Is there a Jira for the second part ("...support from Kudu to provide the partitioning")?

          Show
          lv Lars Volker added a comment - Is there a Jira for the second part ("...support from Kudu to provide the partitioning")?
          Hide
          mjacobs Matthew Jacobs added a comment -

          I filed https://issues.apache.org/jira/browse/KUDU-1713

          If you're interested in doing that work on the Impala side we'll need to talk to the Kudu folks.

          Show
          mjacobs Matthew Jacobs added a comment - I filed https://issues.apache.org/jira/browse/KUDU-1713 If you're interested in doing that work on the Impala side we'll need to talk to the Kudu folks.
          Hide
          twmarshall Thomas Tauber-Marshall added a comment -

          commit 801c95f39f9de6c29380910274f97748ea8e47a9
          Author: Thomas Tauber-Marshall <tmarshall@cloudera.com>
          Date: Wed Apr 5 12:35:53 2017 -0700

          IMPALA-3742: Partitions and sort INSERTs for Kudu tables

          Bulk DMLs (INSERT, UPSERT, UPDATE, and DELETE) for Kudu
          are currently painful because we just send rows randomly,
          which creates a lot of work for Kudu since it partitions
          and sorts data before writing, causing writes to be slow
          and leading to timeouts.

          We can alleviate this by sending the rows to Kudu already
          partitioned and sorted. This patch partitions and sorts
          rows according to Kudu's partitioning scheme for INSERTs
          and UPSERTs. A followup patch will handle UPDATE and DELETE.

          It accomplishes this by inserting an exchange node and a sort
          node into the plan before the operation. Both the exchange and
          the sort are given a KuduPartitionExpr which takes a row and
          calls into the Kudu client to return its partition number.

          It also disallows INSERT hints for Kudu tables, since the
          hints that we support (SHUFFLE, CLUSTER, SORTBY), so longer
          make sense.

          Testing:

          • Updated planner tests.
          • Ran the Kudu functional tests.
          • Ran performance tests demonstrating that we can now handle much
            larger inserts without having timeouts.

          Change-Id: I84ce0032a1b10958fdf31faef225372c5c38fdc4
          Reviewed-on: http://gerrit.cloudera.org:8080/6559
          Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
          Tested-by: Impala Public Jenkins

          Show
          twmarshall Thomas Tauber-Marshall added a comment - commit 801c95f39f9de6c29380910274f97748ea8e47a9 Author: Thomas Tauber-Marshall <tmarshall@cloudera.com> Date: Wed Apr 5 12:35:53 2017 -0700 IMPALA-3742 : Partitions and sort INSERTs for Kudu tables Bulk DMLs (INSERT, UPSERT, UPDATE, and DELETE) for Kudu are currently painful because we just send rows randomly, which creates a lot of work for Kudu since it partitions and sorts data before writing, causing writes to be slow and leading to timeouts. We can alleviate this by sending the rows to Kudu already partitioned and sorted. This patch partitions and sorts rows according to Kudu's partitioning scheme for INSERTs and UPSERTs. A followup patch will handle UPDATE and DELETE. It accomplishes this by inserting an exchange node and a sort node into the plan before the operation. Both the exchange and the sort are given a KuduPartitionExpr which takes a row and calls into the Kudu client to return its partition number. It also disallows INSERT hints for Kudu tables, since the hints that we support (SHUFFLE, CLUSTER, SORTBY), so longer make sense. Testing: Updated planner tests. Ran the Kudu functional tests. Ran performance tests demonstrating that we can now handle much larger inserts without having timeouts. Change-Id: I84ce0032a1b10958fdf31faef225372c5c38fdc4 Reviewed-on: http://gerrit.cloudera.org:8080/6559 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Impala Public Jenkins

            People

            • Assignee:
              twmarshall Thomas Tauber-Marshall
              Reporter:
              mjacobs Matthew Jacobs
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development