Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.8.0
    • Fix Version/s: Impala 2.9.0
    • Component/s: Infrastructure
    • Labels:
      None

      Description

      Generate random INSERT queries for Impala/Kudu tables. The syntax is roughly:

      [with_statement]
      INSERT IGNORE INTO <KUDU_TBL> SELECT <Statement>
      INSERT IGNORE INTO <KUDU_TBL> <column list> VALUES <values list>
      
      • The WITH statement is optional
      • IGNORE will be required. This means ignore primary key duplications.
      • We can have IGNORE SELECT or IGNORE VALUES statements.

      The IGNORE requires comparison with Postgres 9.5 or higher (see IMPALA-4340).

      The scope of this Jira is to take advantage of dependent work (IMPALA-4340, IMPALA-4338, IMPALA-4343, IMPALA-4351, IMPALA-4352) and add methods to the QueryGenerator to generate Pythonic representations of queries.

      The primary key considerations are important:

      1. Primary keys can't be NULL
      2. Primary keys must be unique
      3. The IGNORE keyword means that duplicate-PK rows inserted will race to win. The determinism will be difficult to manage.
      4. The IGNORE keyword means that if a row with that PK already exists, any new rows attempted to be inserted with the same PK will also be ignored.

      This means the query generator needs to be smarter than before about the queries it generates. For example, it shouldn't generate a query in which the expression for the inserted rows' PK column evaluates to a constant: at most 1 of the rows would actually get inserted. One option (for example, in the case of a numerical PK) would be to employ a special expression that applies an offset from the MAX() value in the column.

        Activity

        Show
        mikesbrown Michael Brown added a comment - https://gerrit.cloudera.org/#/c/5486/
        Hide
        mikesbrown Michael Brown added a comment -
        commit db7facdee0017965e276c830e2aee16590f7ce37
        Author: Michael Brown <mikeb@cloudera.com>
        Date:   Wed Dec 7 14:20:05 2016 -0800
        
            IMPALA-4351,IMPALA-4353: [qgen] randomly generate INSERT statements
        
            - Generate INSERT statements that are either INSERT ... VALUES or INSERT
              ... SELECT
        
            - On both types of INSERTs, we either insert into all columns, or into
              some column list. If the column list exists, all primary keys will be
              present, and 0 or more additional columns will also be in the list.
              The ordering of the column list is random.
        
            - For INSERT ... SELECT, occasionally generate a WITH clause
        
            - For INSERT ... VALUES, generate non-null constants for the primary
              keys, but for the non-primary keys, randomly generate a value
              expression.
        
            The type system in the random statement/query generator isn't
            sophisticated enough to the implicit type of a SELECT item or a value
            expression. It knows it will be some INT-based type, but not if it's
            going to be a SMALLINT or a BIGINT. To get around this, the easiest
            thing seems to be to explicitly cast the SELECT items or value
            expressions to the columns' so-called exact_type attribute.
        
            Much of the testing here involved running discrepancy_searcher.py
            --explain-only on both tpch_kudu and a random HDFS table, using both the
            default profile and DML-only profile. This was done to quickly find bugs
            in the statement generation, as they tend to bubble up as analysis
            errors. I expect to make other changes as follow on patches and more
            random statements find small test issues.
        
            For actual use against Kudu data, you need to migrate data from Kudu
            into PostgreSQL 5 (instructions tests/comparison/POSTGRES.txt) and run
            something like:
        
            tests/comparison/discrepancy_searcher.py \
              --use-postgresql \
              --postgresql-port 5433 \
              --profile dmlonly \
              --timeout 300 \
              --db-name tpch_kudu \
              --query-count 10
        
            Change-Id: I842b41f0eed07ab30ec76d8fc3cdd5affb525af6
            Reviewed-on: http://gerrit.cloudera.org:8080/5486
            Reviewed-by: Jim Apple <jbapple-impala@apache.org>
            Tested-by: Impala Public Jenkins
        
        Show
        mikesbrown Michael Brown added a comment - commit db7facdee0017965e276c830e2aee16590f7ce37 Author: Michael Brown <mikeb@cloudera.com> Date: Wed Dec 7 14:20:05 2016 -0800 IMPALA-4351,IMPALA-4353: [qgen] randomly generate INSERT statements - Generate INSERT statements that are either INSERT ... VALUES or INSERT ... SELECT - On both types of INSERTs, we either insert into all columns, or into some column list. If the column list exists, all primary keys will be present, and 0 or more additional columns will also be in the list. The ordering of the column list is random. - For INSERT ... SELECT, occasionally generate a WITH clause - For INSERT ... VALUES, generate non-null constants for the primary keys, but for the non-primary keys, randomly generate a value expression. The type system in the random statement/query generator isn't sophisticated enough to the implicit type of a SELECT item or a value expression. It knows it will be some INT-based type, but not if it's going to be a SMALLINT or a BIGINT. To get around this, the easiest thing seems to be to explicitly cast the SELECT items or value expressions to the columns' so-called exact_type attribute. Much of the testing here involved running discrepancy_searcher.py --explain-only on both tpch_kudu and a random HDFS table, using both the default profile and DML-only profile. This was done to quickly find bugs in the statement generation, as they tend to bubble up as analysis errors. I expect to make other changes as follow on patches and more random statements find small test issues. For actual use against Kudu data, you need to migrate data from Kudu into PostgreSQL 5 (instructions tests/comparison/POSTGRES.txt) and run something like: tests/comparison/discrepancy_searcher.py \ --use-postgresql \ --postgresql-port 5433 \ --profile dmlonly \ --timeout 300 \ --db-name tpch_kudu \ --query-count 10 Change-Id: I842b41f0eed07ab30ec76d8fc3cdd5affb525af6 Reviewed-on: http://gerrit.cloudera.org:8080/5486 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins

          People

          • Assignee:
            mikesbrown Michael Brown
            Reporter:
            mikesbrown Michael Brown
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development