IMPALA-3710

INSERT, UPDATE, DELETE should ignore conflicts by default

    Details

    • Docs Text:
      Doc conflict handling behavior.
    • Target Version:

      Description

      Currently, when an INSERT/UPDATE/DELETE/UPSERT statement on a Kudu table encounters an error, the query fails but any rows that were already inserted remain. The IGNORE keyword was added to avoid failing the query on certain errors (e.g. key already exists), but it does not cover all types of errors (e.g. nulls in non-nullable columns, or inserting with keys that fall outside every range partition). Until Kudu supports rolling back multi-row transactions, we cannot roll the entire statement back. However, a query that may fail part of the way through isn't useful, and we should at least simplify the existing behavior rather than provide extra knobs.

      We should remove the IGNORE option for these statements and instead always ignore conflicts. Once Kudu supports transactions, we can offer more traditional behavior, e.g. isolation levels.
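
      As an illustration only, the difference between the current fail-part-way behavior and the proposed always-ignore behavior can be modeled with a toy in-memory table. This is a sketch under stated assumptions, not Impala or Kudu code: the dict, `DuplicateKey`, `insert_fail_fast`, and `insert_best_effort` are all hypothetical names.

```python
# Toy model only: a dict stands in for a Kudu table keyed by primary key.
# None of these names come from Impala or Kudu.

class DuplicateKey(Exception):
    """Raised when a row's primary key is already present."""


def insert_row(table, key, value):
    if key in table:
        raise DuplicateKey(key)
    table[key] = value


def insert_fail_fast(table, rows):
    """Fail on the first conflict; rows inserted earlier stay (no rollback)."""
    for key, value in rows:
        insert_row(table, key, value)


def insert_best_effort(table, rows):
    """Skip conflicting rows and keep going; return (inserted, skipped)."""
    inserted, skipped = 0, 0
    for key, value in rows:
        try:
            insert_row(table, key, value)
            inserted += 1
        except DuplicateKey:
            skipped += 1
    return inserted, skipped


rows = [(1, "a"), (2, "b"), (2, "dup"), (3, "c")]

# Fail-fast: the statement errors out, yet rows 1 and 2 remain in the table.
t1 = {}
try:
    insert_fail_fast(t1, rows)
    failed = False
except DuplicateKey:
    failed = True

# Best effort: the duplicate is skipped and row 3 is still inserted.
t2 = {}
result = insert_best_effort(t2, rows)
```

      In the fail-fast case the table ends up partially modified and row 3 is never attempted; in the best-effort case every non-conflicting row is applied and the caller gets counts back.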

        Activity

        mjacobs Matthew Jacobs added a comment -

        Can we factor IMPALA-3145 into this?

        mjacobs Matthew Jacobs added a comment -

        I'm making this a blocker for the time being since this will be inconsistent with other databases and eventually the behavior of INSERT without IGNORE will be different (once Kudu can roll back inserted rows after a later row fails).

        mjacobs Matthew Jacobs added a comment -

        A few things to add:
        1) After discussion with Greg Rahn, we would probably want to require IGNORE but have an option to enable IGNORE by default, e.g. a gflag, in which case it wouldn't need to be explicitly specified. Then users at least opt into this behavior, and that may be necessary to not break other tools.
        2) Maybe this should be IGNORE DUPLICATES (per IMPALA-3710).

        mjacobs Matthew Jacobs added a comment -

        Greg Rahn I've updated the subject & description of this JIRA, does this look OK? Let me know if you wanna iterate on this more. We'll need to lock this down and assign this for someone to implement soon.

        mjacobs Matthew Jacobs added a comment -

        I broke this work into a few parts. The first part, which removes IGNORE and handles the duplicate key (INSERT) and key not found (UPDATE, DELETE) errors by default, went in:

        commit 08d89a5cc3d2135896a7d4518dac7d22e5e66ddf
        Author: Matthew Jacobs <mj@cloudera.com>
        Date: Tue Nov 1 17:52:21 2016 -0700

        IMPALA-3710: Kudu DML should ignore conflicts by default

        Removes the non-standard IGNORE syntax that was allowed for
        DML into Kudu tables to indicate that certain errors should
        be ignored, i.e. not fail the query and continue. However,
        because there is no way to 'roll back' mutations that
        occurred before an error occurs, tables are left in an
        inconsistent state and it's difficult to know what rows were
        successfully modified vs which rows were not. Instead, this
        change makes it so that we always 'ignore' these conflicts,
        i.e. a 'best effort'. In the future, when Kudu will provide
        the mechanisms Impala needs to provide a notion of isolation
        levels, then Impala will be able to provide options for more
        traditional semantics.

        After this change, the following errors are ignored:

        • INSERT where the PK already exists
        • UPDATE/DELETE where the PK doesn't exist

        Another follow-up patch will change other violations to be
        handled in this way as well, e.g. nulls inserted in
        non-nullable cols.

        Reporting:
        The number of rows inserted is reported to the coordinator,
        which makes the aggregate available to the shell and via the
        profile.
        TODO: Return rows modified for INSERT via HS2 (IMPALA-1789).
        TODO: Return rows modified for other CRUD (beeswax+hs2) (IMPALA-3713).
        TODO: Return error counts for specific warnings (IMPALA-4416).

        Testing:
        Updated tests. Ran all functional tests. More tests will be
        needed when other conflicts are handled in the same way.

        Change-Id: I83b5beaa982d006da4997a2af061ef7c22cad3f1
        Reviewed-on: http://gerrit.cloudera.org:8080/4911
        Reviewed-by: Alex Behm <alex.behm@cloudera.com>
        Tested-by: Internal Jenkins

        A second patch will ignore the following errors in the same way:

        • NULLs in non-nullable columns, i.e. null constraint violations.
        • Rows with PKs that are in an 'uncovered range'.
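
        To make the four categories concrete, here is a rough toy model of how each one would be skipped rather than failing the statement. Everything in it is an assumption for illustration: the column names, the range bounds, the status strings, and `apply_op` itself are hypothetical and do not reflect Impala or Kudu internals.

```python
# Toy model only; schema, range bounds, and status strings are hypothetical.
NON_NULLABLE = {"id", "name"}
RANGES = [(0, 100), (200, 300)]  # half-open [lo, hi) range partitions on "id"


def covered(pk):
    """True if the primary key falls inside some range partition."""
    return any(lo <= pk < hi for lo, hi in RANGES)


def apply_op(table, op, row):
    """Apply one DML operation; return 'ok' or the reason the row was skipped."""
    pk = row.get("id")
    if op == "insert":
        if any(row.get(col) is None for col in NON_NULLABLE):
            return "null_violation"
        if not covered(pk):
            return "uncovered_range"
        if pk in table:
            return "duplicate_key"
        table[pk] = dict(row)
        return "ok"
    if op in ("update", "delete"):
        if pk not in table:
            return "key_not_found"
        if op == "delete":
            del table[pk]
        else:
            table[pk].update(row)
        return "ok"
    raise ValueError(f"unknown op: {op}")


table = {}
ops = [
    ("insert", {"id": 1, "name": "a"}),    # applied
    ("insert", {"id": 1, "name": "b"}),    # PK already exists
    ("insert", {"id": 2, "name": None}),   # NULL in non-nullable column
    ("insert", {"id": 150, "name": "c"}),  # PK in an uncovered range
    ("update", {"id": 9, "name": "z"}),    # PK doesn't exist
    ("delete", {"id": 1}),                 # applied
]
statuses = [apply_op(table, op, row) for op, row in ops]
```

        Every operation is attempted and none of the skipped rows aborts the loop, which is the 'best effort' behavior described above.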
        mjacobs Matthew Jacobs added a comment -

        The 2nd part is in this commit:

        commit cfac09de10c996a48852b9a9d50c70cf24cf5f5f
        Author: Matthew Jacobs <mj@cloudera.com>
        AuthorDate: Mon Nov 7 15:55:42 2016 -0800
        Commit: Internal Jenkins <cloudera-hudson@gerrit.cloudera.org>
        CommitDate: Wed Nov 9 06:43:41 2016 +0000

        IMPALA-3710: Kudu DML should ignore conflicts, pt2

        Second part of IMPALA-3710, which removed the IGNORE DML
        option and changed the following errors on Kudu DML
        operations to be ignored:
        1) INSERT where the PK already exists
        2) UPDATE/DELETE where the PK doesn't exist

        This changes other data-related errors to be ignored as
        well:
        3) NULLs in non-nullable columns, i.e. null constraint
        violations.
        4) Rows with PKs that are in an 'uncovered range'.

        It became clear that we can't differentiate between (3) and
        (4) because both return a Kudu 'NotFound' error code. The
        Impala error codes have been simplified as well: we just
        report a generic KUDU_NOT_FOUND error in these cases.

        This also adds some metadata to the thrift report sent to
        the coordinator from sinks so the total number of rows with
        errors can be added to the profile. Note that this does not
        include a breakdown of error counts by type/code because we
        cannot differentiate between all of these cases yet.

        An upcoming change will add this new info to the beeswax
        interface and show it in the shell output (IMPALA-3713).

        Testing: Updated kudu_crud tests to check the number of rows
        with errors.

        Change-Id: I4eb1ad91dc355ea51de261c3a14df0f9d28c879c
        Reviewed-on: http://gerrit.cloudera.org:8080/4985
        Reviewed-by: Alex Behm <alex.behm@cloudera.com>
        Reviewed-by: Dan Hecht <dhecht@cloudera.com>
        Tested-by: Internal Jenkins
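
        The reporting described in this commit (each sink sends counts to the coordinator, which sums them for the profile) can be sketched roughly as follows. This is not the actual thrift structure: `SinkReport`, its field names, and `aggregate` are hypothetical. A single error counter is kept per sink because, as noted above, there is no breakdown by error type.

```python
from dataclasses import dataclass


@dataclass
class SinkReport:
    """Hypothetical per-sink summary sent to the coordinator."""
    rows_modified: int
    rows_with_errors: int  # one total; no per-type breakdown


def aggregate(reports):
    """Sum the per-sink counts into one query-wide report."""
    return SinkReport(
        rows_modified=sum(r.rows_modified for r in reports),
        rows_with_errors=sum(r.rows_with_errors for r in reports),
    )


# Three sinks report independently; the coordinator adds them up.
reports = [SinkReport(120, 3), SinkReport(80, 0), SinkReport(95, 7)]
totals = aggregate(reports)
```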


          People

          • Assignee:
            mjacobs Matthew Jacobs
            Reporter:
            dtsirogiannis Dimitris Tsirogiannis