Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.8.0
    • Fix Version/s: Impala 2.9.0
    • Component/s: Infrastructure
    • Labels:
      None

      Description

      After all the INSERT upfront work, UPSERT should be relatively straightforward.

      • Modify Query object to have an UPSERT statement, and implement UpsertClause.
      • Modify QueryProfile to choose columns to be upserted
      • Modify QueryGenerator to generate UPSERT statements
      • Modify SqlWriter to write UPSERT SQL in Impala
      • Modify SqlWriter to write UPSERT SQL in PostgresQL. Note this is really “INSERT ... ON CONFLICT DO UPDATE”.
      • Hook in the same DB comparison after UPSERT that is used for INSERT

        Activity

        Hide
        mikesbrown Michael Brown added a comment -

        Note that for UPSERT to work in PostgresQL, the sql writer needs to support ON CONFLICT DO UPDATE. The general idea is:

        1. Tables in postgresql need to be created with a primary key constraint (IMPALA-4338)
        2. Primary keys and the constraint name are stored in the Table model (IMPALA-4352)

        An UPSERT query needs to know the constraint to update when there's a conflict, and it needs to know the keys to update.

        Here's what a few very simple upserts look like in postgres 9.5:

        create table bar (
          id bigint,
          id2 bigint,
          char_col character varying(1000),
          char_col2 character varying(1000),
          constraint tpch_bar_keys primary key (id,id2)
        );
        
        # success
        insert into bar values (0,0,'a','a'),(0,1,'a','b'),(1,0,'b','a'),(100,100,'a','b');
        # fails
        insert into bar values (0,0,'a','z');
        # ignored
        insert into bar values (0,0,'a','z') on conflict do nothing;
        # updated (0,0,'a','a') becomes (0,0,'x','y')
        insert into bar values (0,0,'x','y') on conflict(id, id2) do update set char_col = excluded.char_col, char_col2 = excluded.char_col2;
        # updated+inserted (0,0,'x','y') becomes (0,0,'m','n') and (7,7,'a','a')
        # is inserted
        insert into bar values (0,0,'m','n'),(7,7,'a','a') on conflict(id, id2) do update set char_col = excluded.char_col, char_col2 = excluded.char_col2;
        

        What's import here:

        1. The conflict in the upsert is described using index inference (conflict(id, id2))
        2. excluded is a special table that holds references to values for insertion
        Show
        mikesbrown Michael Brown added a comment - Note that for UPSERT to work in PostgresQL, the sql writer needs to support ON CONFLICT DO UPDATE . The general idea is: Tables in postgresql need to be created with a primary key constraint ( IMPALA-4338 ) Primary keys and the constraint name are stored in the Table model ( IMPALA-4352 ) An UPSERT query needs to know the constraint to update when there's a conflict, and it needs to know the keys to update. Here's what a few very simple upserts look like in postgres 9.5: create table bar ( id bigint, id2 bigint, char_col character varying(1000), char_col2 character varying(1000), constraint tpch_bar_keys primary key (id,id2) ); # success insert into bar values (0,0,'a','a'),(0,1,'a','b'),(1,0,'b','a'),(100,100,'a','b'); # fails insert into bar values (0,0,'a','z'); # ignored insert into bar values (0,0,'a','z') on conflict do nothing; # updated (0,0,'a','a') becomes (0,0,'x','y') insert into bar values (0,0,'x','y') on conflict(id, id2) do update set char_col = excluded.char_col, char_col2 = excluded.char_col2; # updated+inserted (0,0,'x','y') becomes (0,0,'m','n') and (7,7,'a','a') # is inserted insert into bar values (0,0,'m','n'),(7,7,'a','a') on conflict(id, id2) do update set char_col = excluded.char_col, char_col2 = excluded.char_col2; What's import here: The conflict in the upsert is described using index inference ( conflict(id, id2) ) excluded is a special table that holds references to values for insertion
        Hide
        mikesbrown Michael Brown added a comment -

        I did a bit more testing, and the index inference column ordering doesn't seem to matter. In other words, it need not be the columns as ordered in either the table or any column list.

        Show
        mikesbrown Michael Brown added a comment - I did a bit more testing, and the index inference column ordering doesn't seem to matter. In other words, it need not be the columns as ordered in either the table or any column list.
        Hide
        mikesbrown Michael Brown added a comment -

        Another note:

        insert into bar values (0,0) on conflict(id, id2) do update set char_col = excluded.char_col, char_col2 = excluded.char_col2;
        insert into bar values (0,1) on conflict(id, id2) do update set char_col = excluded.char_col;
        

        The above are both allowed. For the matching row, char_col will be null in both, and char_col2 will be null in the first insert and left alone in the second.

        Show
        mikesbrown Michael Brown added a comment - Another note: insert into bar values (0,0) on conflict(id, id2) do update set char_col = excluded.char_col, char_col2 = excluded.char_col2; insert into bar values (0,1) on conflict(id, id2) do update set char_col = excluded.char_col; The above are both allowed. For the matching row, char_col will be null in both, and char_col2 will be null in the first insert and left alone in the second.
        Hide
        mikesbrown Michael Brown added a comment -

        A postgresql upsert into a table where there are only primary keys is an "insert ... on conflict do nothing": there are no updatable columns, but we must ignore conflicts.

        Show
        mikesbrown Michael Brown added a comment - A postgresql upsert into a table where there are only primary keys is an "insert ... on conflict do nothing": there are no updatable columns, but we must ignore conflicts.
        Hide
        mikesbrown Michael Brown added a comment -
        commit 0154ace61feff65917388dd08591cb9d0d4369ed
        Author: Michael Brown <mikeb@cloudera.com>
        Date:   Mon Jan 23 10:55:47 2017 -0800
        
            IMPALA-4359: qgen: add UPSERT support
        
            UPSERTs are very similar to INSERTs, so the UPSERT support is simply
            folded into that of INSERT. We do this by adding another "conflict
            action", CONFLICT_ACTION_UPDATE. The object responsible for holding the
            conflict_action attribute is now the InsertClause. This is needed here
            because the SqlWriter now needs to know the conflict_action both when
            writing the InsertClause (Impala) and at the tail end of the
            InsertStatement (PostgreSQL). We also add a few properties to the
            InsertStatement interface so that the PostgresqlSqlWriter can form the
            correct "DO UPDATE" conflic action, in which primary key columns and
            updatable columns must be known. More information on that here:
        
            https://www.postgresql.org/docs/9.5/static/sql-insert.html
        
            By default, we will tend to generate 3 UPSERTs for every 1 INSERT.
        
            In addition to adding unit tests to make sure UPSERTs are properly
            written, I used discrepancy_searcher.py --profile dmlonly, both with and
            without --explain-only, do run tests. I made sure we were generating
            syntactically valid UPSERT statements, and that the INSERT/UPSERT ratio
            was roughly 1/3 after 100 statements.
        
            Change-Id: I6382f6ab22ba29c117e39a5d90592d3637df4b25
            Reviewed-on: http://gerrit.cloudera.org:8080/5795
            Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
            Tested-by: Impala Public Jenkins
        
        Show
        mikesbrown Michael Brown added a comment - commit 0154ace61feff65917388dd08591cb9d0d4369ed Author: Michael Brown <mikeb@cloudera.com> Date: Mon Jan 23 10:55:47 2017 -0800 IMPALA-4359: qgen: add UPSERT support UPSERTs are very similar to INSERTs, so the UPSERT support is simply folded into that of INSERT. We do this by adding another "conflict action", CONFLICT_ACTION_UPDATE. The object responsible for holding the conflict_action attribute is now the InsertClause. This is needed here because the SqlWriter now needs to know the conflict_action both when writing the InsertClause (Impala) and at the tail end of the InsertStatement (PostgreSQL). We also add a few properties to the InsertStatement interface so that the PostgresqlSqlWriter can form the correct "DO UPDATE" conflic action, in which primary key columns and updatable columns must be known. More information on that here: https://www.postgresql.org/docs/9.5/static/sql-insert.html By default, we will tend to generate 3 UPSERTs for every 1 INSERT. In addition to adding unit tests to make sure UPSERTs are properly written, I used discrepancy_searcher.py --profile dmlonly, both with and without --explain-only, do run tests. I made sure we were generating syntactically valid UPSERT statements, and that the INSERT/UPSERT ratio was roughly 1/3 after 100 statements. Change-Id: I6382f6ab22ba29c117e39a5d90592d3637df4b25 Reviewed-on: http://gerrit.cloudera.org:8080/5795 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Impala Public Jenkins

          People

          • Assignee:
            mikesbrown Michael Brown
            Reporter:
            mikesbrown Michael Brown
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development