Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.8.0
    • Fix Version/s: Impala 2.8.0
    • Component/s: Infrastructure
    • Labels:
      None

      Description

      Since Kudu has a notion of primary keys, we need Postgres tables of migrated Impala/Kudu table to have the same. This Jira is about altering the migration support in data_generator.py to read source tables (DESCRIBE EXTENDED), detect whether they have primary keys kudu.key_columns, and create the postgres table with the same primary keys.

        Activity

        Hide
        mikesbrown Michael Brown added a comment -
        commit 822b2ca1c6a9bb09328857b59851035260be851f
        Author: Michael Brown <mikeb@cloudera.com>
        Date:   Tue Nov 1 17:54:14 2016 -0700
        
            IMPALA-4338: test infra data migrator: include tables' primary keys in PostgreSQL
        
            This patch adds the ability for the test infrastructure's
            Impala-to-PostgreSQL data migration tool to recognize whether the Impala
            source tables have primary keys, and if so, CREATE the tables in
            PostgreSQL with the same primary keys. This is needed especially for
            performing CRUD operations by the random query generator for comparison
            with Impala/Kudu tables and equivalent PostgreSQL tables.
        
            I modified the make_create_table_sql() implementation to check the
            "universal" Python object model of the table's columns. We generate
            CREATE TABLE statements with, or without, a PRIMARY KEY clause. For
            Impala-side tables that this tool may create, we also ensure that we
            only write such a clause when the table's format supports primary keys
            (currently Kudu).
        
            When the random query generator runs, it needs to know that the tables
            it's examining in both databases are equivalent. It does this by
            examining the tables' names, column names, and column types. I have
            added whether the column is a primary key as part of this equivalence
            test.
        
            Testing:
            - The patch includes some unit and system tests for the tool.
            - Actually migrated a few small Kudu and HDFS tables from Impala into
              both PostgreSQL 9.3 and 9.5 and examined the tables in PostgreSQL to
              make sure they had primary keys (or not) as expected.
            - Very short discrepancy_searcher.py --explain-only runs to test positive
              and negative cases of Impala/PostgreSQL equivalency.
        
            Change-Id: I447f022e2dc3d4fc8373b7f388c7875a869921b8
            Reviewed-on: http://gerrit.cloudera.org:8080/4951
            Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
            Tested-by: Internal Jenkins
        
        Show
        mikesbrown Michael Brown added a comment - commit 822b2ca1c6a9bb09328857b59851035260be851f Author: Michael Brown <mikeb@cloudera.com> Date: Tue Nov 1 17:54:14 2016 -0700 IMPALA-4338: test infra data migrator: include tables' primary keys in PostgreSQL This patch adds the ability for the test infrastructure's Impala-to-PostgreSQL data migration tool to recognize whether the Impala source tables have primary keys, and if so, CREATE the tables in PostgreSQL with the same primary keys. This is needed especially for performing CRUD operations by the random query generator for comparison with Impala/Kudu tables and equivalent PostgreSQL tables. I modified the make_create_table_sql() implementation to check the "universal" Python object model of the table's columns. We generate CREATE TABLE statements with, or without, a PRIMARY KEY clause. For Impala-side tables that this tool may create, we also ensure that we only write such a clause when the table's format supports primary keys (currently Kudu). When the random query generator runs, it needs to know that the tables it's examining in both databases are equivalent. It does this by examining the tables' names, column names, and column types. I have added whether the column is a primary key as part of this equivalence test. Testing: - The patch includes some unit and system tests for the tool. - Actually migrated a few small Kudu and HDFS tables from Impala into both PostgreSQL 9.3 and 9.5 and examined the tables in PostgreSQL to make sure they had primary keys (or not) as expected. - Very short discrepancy_searcher.py --explain-only runs to test positive and negative cases of Impala/PostgreSQL equivalency. Change-Id: I447f022e2dc3d4fc8373b7f388c7875a869921b8 Reviewed-on: http://gerrit.cloudera.org:8080/4951 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins
        Show
        mikesbrown Michael Brown added a comment - https://gerrit.cloudera.org/#/c/4951/
        Hide
        mikesbrown Michael Brown added a comment -

        So I thought this was needed in the Postgres CREATE TABLE syntax but I misread what was required. I also thought it was needed for upsert, but naming the constraint is optional (you can use index inference instead).

        This works:

        CREATE TABLE tablename (
            col_name col_type,
            [other cols]
            PRIMARY KEY (pk1 [, other pks])
        );
        

        ... and the index inference for upsert is ...ON CONFLICT(pk1, pk2) DO UPDATE SET....

        Show
        mikesbrown Michael Brown added a comment - So I thought this was needed in the Postgres CREATE TABLE syntax but I misread what was required. I also thought it was needed for upsert, but naming the constraint is optional (you can use index inference instead). This works: CREATE TABLE tablename ( col_name col_type, [other cols] PRIMARY KEY (pk1 [, other pks]) ); ... and the index inference for upsert is ...ON CONFLICT(pk1, pk2) DO UPDATE SET....
        Hide
        dtsirogiannis Dimitris Tsirogiannis added a comment -

        I don't think it really matters how you name it. As I recall, the only purpose of the name is when you want to remove a constraint, so you need to refer to it somehow. I don't expect us to be doing anything like that but you know better of course.

        Show
        dtsirogiannis Dimitris Tsirogiannis added a comment - I don't think it really matters how you name it. As I recall, the only purpose of the name is when you want to remove a constraint, so you need to refer to it somehow. I don't expect us to be doing anything like that but you know better of course.
        Hide
        mikesbrown Michael Brown added a comment -

        Quick note, the postgres syntax for this is:

        CREATE TABLE tablename (
            col_name col_type,
            [other cols]
            CONSTRAINT name PRIMARY KEY (pk1 [, other pks])
        );
        

        This form works when there is either a 1 single primary key column or more than one. I suggest we employ a name of <database>_<table>_pkey.

        Show
        mikesbrown Michael Brown added a comment - Quick note, the postgres syntax for this is: CREATE TABLE tablename ( col_name col_type, [other cols] CONSTRAINT name PRIMARY KEY (pk1 [, other pks]) ); This form works when there is either a 1 single primary key column or more than one. I suggest we employ a name of <database>_<table>_pkey .
        Hide
        dtsirogiannis Dimitris Tsirogiannis added a comment -

        They should show up but it's not going to be in the form of a table property, otherwise it will be quite confusing to end users. Ping me if you're blocked on this and can't work around it.

        Show
        dtsirogiannis Dimitris Tsirogiannis added a comment - They should show up but it's not going to be in the form of a table property, otherwise it will be quite confusing to end users. Ping me if you're blocked on this and can't work around it.
        Hide
        mikesbrown Michael Brown added a comment -

        Thanks Dimitris Tsirogiannis. For some reason I though the key_columns would still show up in DESCRIBE even though CREATE syntax was being simplified.

        Show
        mikesbrown Michael Brown added a comment - Thanks Dimitris Tsirogiannis . For some reason I though the key_columns would still show up in DESCRIBE even though CREATE syntax was being simplified.
        Hide
        dtsirogiannis Dimitris Tsirogiannis added a comment -

        Keep in mind that kudu.key_columns no longer exist in tblproperties. We probably need to fix the output of describe extended first. Alternatively, a hacky workaround is to parse the output of SHOW CREATE TABLE and look for PRIMARY KEY (col1, col2,.., coln).

        Show
        dtsirogiannis Dimitris Tsirogiannis added a comment - Keep in mind that kudu.key_columns no longer exist in tblproperties. We probably need to fix the output of describe extended first. Alternatively, a hacky workaround is to parse the output of SHOW CREATE TABLE and look for PRIMARY KEY (col1, col2,.., coln).

          People

          • Assignee:
            mikesbrown Michael Brown
            Reporter:
            mikesbrown Michael Brown
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development