IMPALA-3641

DROP / CREATE sequence on same table failed with "table already exists"


      Description

      Michael, I assigned this to you hoping you'll know what's going on or have seen this before.

      This GVM failed due to what looks to me like a race condition in test_show_create_table.py: http://sandbox.jenkins.cloudera.com/job/impala-external-gerrit-verify-merge/2525/

      [gw2] linux2 -- Python 2.6.6 /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/bin/../infra/python/env/bin/python
      metadata/test_show_create_table.py:55: in test_show_create_table
          unique_database)
      metadata/test_show_create_table.py:109: in __run_show_create_table_test_case
          self.__exec(create_table_result)
      metadata/test_show_create_table.py:120: in __exec
          return self.execute_query_expect_success(self.client, sql_str)
      common/impala_test_suite.py:416: in wrapper
          return function(*args, **kwargs)
      common/impala_test_suite.py:423: in execute_query_expect_success
          result = cls.__execute_query(impalad_client, query, query_options)
      common/impala_test_suite.py:509: in __execute_query
          return impalad_client.execute(query, user=user)
      common/impala_connection.py:160: in execute
          return self.__beeswax_client.execute(sql_stmt, user=user)
      beeswax/impala_beeswax.py:173: in execute
          handle = self.__execute_query(query_string.strip(), user=user)
      beeswax/impala_beeswax.py:337: in __execute_query
          handle = self.execute_query_async(query_string, user=user)
      beeswax/impala_beeswax.py:333: in execute_query_async
          return self.__do_rpc(lambda: self.imp_service.query(query,))
      beeswax/impala_beeswax.py:458: in __do_rpc
          raise ImpalaBeeswaxException(self.__build_error_message(b), b)
      E   ImpalaBeeswaxException: ImpalaBeeswaxException:
      E    INNER EXCEPTION: <class 'beeswaxd.ttypes.BeeswaxException'>
      E    MESSAGE: AnalysisException: Table already exists: test_show_create_table_f1598d0b.i_1687_p
      
      Attachments

      1. consoleText.txt (1.02 MB) - Lars Volker
      2. extra_logging.patch (11 kB) - Alexander Behm
      3. impalad.INFO.bz2 (7.63 MB) - Alexander Behm


          Activity

          Michael Brown added a comment -

          Lars Volker The Jenkins job has gone away, so thanks for attaching the console log at least. Sorry I couldn't look into this sooner.

          I don't think pytest ran the test twice.

          I think what you are seeing is a trick of the eye.

          1. You're seeing the way py.test and py.test-xdist print progress during test execution. Each test is printed twice: once when the test starts executing, and once when it completes.
          2. You're seeing tests with very similar names. They have the same module and class name, but the two function names differ: test_show_create_table vs. test_kudu_show_create_table.

          If I search the attached console log for the failing test's full test ID:

          metadata/test_show_create_table.py::TestShowCreateTable::test_show_create_table[table_format: text/none]
          

          it appears only twice: once when the test starts executing, and once when it completes (with the failure).

          Here's what it looks like when I run this locally:

          mikeb@mikeb-desktop:0:~/Impala/tests (custom-cluster-skips) $ impala-py.test -n4 metadata/test_show_create_table.py 
          ======================================================================================================= test session starts ========================================================================================================
          platform linux2 -- Python 2.7.6 -- py-1.4.30 -- pytest-2.7.2 -- /home/mikeb/Impala/bin/../infra/python/env/bin/python
          rootdir: /home/mikeb/Impala/tests, inifile: pytest.ini
          plugins: random, xdist
          gw0 [2] / gw1 [2] / gw2 [2] / gw3 [2]
          scheduling tests via LoadScheduling
          
          metadata/test_show_create_table.py::TestShowCreateTable::test_show_create_table[table_format: text/none] 
          [gw1] PASSED metadata/test_show_create_table.py::TestShowCreateTable::test_show_create_table[table_format: text/none] 
          metadata/test_show_create_table.py::TestShowCreateTable::test_kudu_show_create_table[table_format: text/none] 
          [gw1] SKIPPED metadata/test_show_create_table.py::TestShowCreateTable::test_kudu_show_create_table[table_format: text/none] 
          
          ============================================================================================== 1 passed, 1 skipped in 153.43 seconds ===============================================================================================
          mikeb@mikeb-desktop:0:~/Impala/tests (custom-cluster-skips) $
          

          So, I don't think this is a pytest double-execution problem.

          Michael Brown added a comment -

          The actual failure is this:

          _____ TestShowCreateTable.test_show_create_table[table_format: text/none] ______
          [gw1] linux2 -- Python 2.6.6 /data/jenkins/workspace/impala-external-gerrit-verify-merge/repos/Impala/bin/../infra/python/env/bin/python
          metadata/test_show_create_table.py:52: in test_show_create_table
              unique_database)
          metadata/test_show_create_table.py:111: in __run_show_create_table_test_case
              self.__exec(create_table_result)
          metadata/test_show_create_table.py:122: in __exec
              return self.execute_query_expect_success(self.client, sql_str)
          common/impala_test_suite.py:395: in wrapper
              return function(*args, **kwargs)
          common/impala_test_suite.py:402: in execute_query_expect_success
              result = cls.__execute_query(impalad_client, query, query_options)
          common/impala_test_suite.py:487: in __execute_query
              return impalad_client.execute(query, user=user)
          common/impala_connection.py:161: in execute
              return self.__beeswax_client.execute(sql_stmt, user=user)
          beeswax/impala_beeswax.py:163: in execute
              handle = self.__execute_query(query_string.strip(), user=user)
          beeswax/impala_beeswax.py:327: in __execute_query
              handle = self.execute_query_async(query_string, user=user)
          beeswax/impala_beeswax.py:323: in execute_query_async
              return self.__do_rpc(lambda: self.imp_service.query(query,))
          beeswax/impala_beeswax.py:448: in __do_rpc
              raise ImpalaBeeswaxException(self.__build_error_message(b), b)
          E   ImpalaBeeswaxException: ImpalaBeeswaxException:
          E    INNER EXCEPTION: <class 'impala._thrift_gen.beeswax.ttypes.BeeswaxException'>
          E    MESSAGE: AnalysisException: Table already exists: test_show_create_table_f1598d0b.i_1687_p
          
          Michael Brown added a comment -

          Which tests use table i_1687_p?

          mikeb@mikeb-desktop:0:~/Impala (custom-cluster-skips) $ git grep i_1687_p
          testdata/workloads/functional-query/queries/QueryTest/create.test:drop table if exists ddl_test_db.i_1687_p;
          testdata/workloads/functional-query/queries/QueryTest/create.test:create table ddl_test_db.i_1687_p partitioned by (a) as select b, a from ddl_test_db.i_1687_t;
          testdata/workloads/functional-query/queries/QueryTest/create.test:select * from ddl_test_db.i_1687_p;
          testdata/workloads/functional-query/queries/QueryTest/show-create-table.test:create table i_1687_p partitioned by (int_col) as
          testdata/workloads/functional-query/queries/QueryTest/show-create-table.test:CREATE TABLE show_create_table_test_db.i_1687_p (
          mikeb@mikeb-desktop:0:~/Impala (custom-cluster-skips) $ 
          

          QueryTest/create does, but it doesn't use a database from unique_database. That leaves QueryTest/show-create-table.

          Is that test file used by multiple tests?

          mikeb@mikeb-desktop:0:~/Impala (custom-cluster-skips) $ git grep 'QueryTest/show-create-table'
          tests/metadata/test_show_create_table.py:    self.__run_show_create_table_test_case('QueryTest/show-create-table', vector,
          mikeb@mikeb-desktop:0:~/Impala (custom-cluster-skips) $ 
          

          Does that test run multiple times with different test vectors?

          mikeb@mikeb-desktop:0:~/Impala/tests (custom-cluster-skips) $ impala-py.test --exploration_strategy exhaustive -k "not kudu" --collect-only metadata/test_show_create_table.py
          ======================================================================================================= test session starts ========================================================================================================
          platform linux2 -- Python 2.7.6 -- py-1.4.30 -- pytest-2.7.2 -- /home/mikeb/Impala/bin/../infra/python/env/bin/python
          rootdir: /home/mikeb/Impala/tests, inifile: pytest.ini
          plugins: random, xdist
          collected 2 items 
          <Module 'metadata/test_show_create_table.py'>
            <Class 'TestShowCreateTable'>
              <Instance '()'>
                <Function 'test_show_create_table[table_format: text/none]'>
          
          ================================================================================================ 1 tests deselected by '-knot kudu' ================================================================================================
          ============================================================================================= 1 deselected, 3 warnings in 0.10 seconds =============================================================================================
          mikeb@mikeb-desktop:0:~/Impala/tests (custom-cluster-skips) $
          

          No, not even under the exhaustive exploration strategy.

          To me this eliminates py.test and inter-test function races as culprits.

          Michael Brown added a comment -

          Relevant query sequence:

          -- executing against localhost:21000
          create table test_show_create_table_f1598d0b.i_1687_p partitioned by (int_col) as
            select bigint_col, int_col from functional.alltypessmall;
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.i_1687_p;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.i_1687_p;
          
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.i_1687_p (bigint_col BIGINT) PARTITIONED BY (int_col INT) STORED AS TEXTFILE LOCATION 'hdfs://localhost:20500/test-warehouse/test_show_create_table_f1598d0b.db/i_1687_p' ;
          

          This is also the end of the captured query sequence.

          Michael Brown added a comment -

          When the test passes, on the other hand, this is the query sequence:

          [more queries with other tables]
          
          -- executing against localhost:21000
          create table test_show_create_table_f1598d0b.i_1687_p partitioned by (int_col) as
            select bigint_col, int_col from functional.alltypessmall;
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.i_1687_p;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.i_1687_p;
          
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.i_1687_p (bigint_col BIGINT) PARTITIONED BY (int_col INT) STORED AS TEXTFILE LOCATION 'hdfs://localhost:20500/test-warehouse/test_show_create_table_f1598d0b.db/i_1687_p' ;
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.i_1687_p;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.i_1687_p;
          
          [more queries with other tables]
          
          Lars Volker added a comment -

          Thanks, Michael Brown, for the explanation. I had looked at that failed test, too, and figured the only way it could conflict on its own table was by being run in parallel. I also saw it only once, during a GVM, and haven't been able to reproduce it.

          Any other ideas? Anything that could cause the first drop not to be fully effective when the second create is executed?

          Should we keep this Jira around and see if it happens more often?

          Michael Brown added a comment -

          Lars Volker I realized there's a flaw in some of the above: I didn't do the code inspection on the branch you were GVM'ing. Do you remember which gerrit review, and which patch set in that review, this GVM was for?

          Lars Volker added a comment -

          Michael Brown - Yes, it was this one: https://gerrit.cloudera.org/#/c/3202/11
          Michael Brown added a comment -

          Follow-up: I fetched refs/changes/02/3202/11 from gerrit and went through the same steps. The story was the same: it seems only one end-to-end test uses that table.

          Are DDLs on the same connection "atomic"? Or maybe synchronous is the right word. If that's not guaranteed, then our test is racy, and most tests are similarly racy. If it is guaranteed, then we might have a racy product bug. I can't find any way the test code or py.test could have caused this problem.
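
          For reference, Impala's sync_ddl query option is the knob that makes each DDL wait until the catalog change has propagated before the statement returns; the test harness sets it to False during setup, as the logs further down show. A minimal sketch of forcing it, assuming the option applies to this drop/create sequence:

          -- Sketch only: make each DDL wait for catalog propagation before returning.
          SET SYNC_DDL=1;
          drop table if exists test_show_create_table_f1598d0b.i_1687_p;
          CREATE TABLE test_show_create_table_f1598d0b.i_1687_p (bigint_col BIGINT)
            PARTITIONED BY (int_col INT) STORED AS TEXTFILE;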

          Let's leave this open for now and see if anyone else reports it.

          Michael Brown added a comment -

          I don't think "infra" is the right component here.

          Michael Brown added a comment -

          I also did a survey of tests to see if any run invalidate metadata while other tests execute in parallel. I couldn't find any, but admittedly that's difficult to survey.
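
          One rough way to survey is sketched here; it assumes the statement appears literally in the test sources and misses dynamically built SQL:

          $ # List test files that mention the statement anywhere (case-insensitive).
          $ git grep -il 'invalidate metadata' tests/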

          If this happens again, please include a link to a Jenkins job whose artifacts contain the Impala logs. You can also "lock" a build in Jenkins so that it lingers beyond the usual "garbage collection" time.

          Lars Volker added a comment -

          Thanks, Michael Brown, for the thorough investigation. I went through my Downloads folder and found the artifacts of the Jenkins job. The file is too large to attach here, so I uploaded it to impala-desktop.ca.cloudera.com:/home/dev/IMPALA-3641. Let me know if there's anything else I can do.

          Lars Volker added a comment -

          We hit this one again here: http://sandbox.jenkins.cloudera.com/job/impala-asf-master-core-asan/120/

          Michael Brown - Do you have time to take a look? I marked the Jenkins job as "keep". I'll assign this to you in the meantime; feel free to find someone else or assign it back to me if you're swamped. I'm also bumping the priority until we know more, since it seems to be a flaky test.

          Lars Volker added a comment -

          I still don't understand how the create statement can fail with "Table already exists" right after it has been dropped. I'll run this in a loop on my workstation for a while to see if I can trigger the issue.

          drop table i_1687_p;
          create table i_1687_p partitioned by (int_col) as select bigint_col, int_col from functional.alltypessmall;
          drop table i_1687_p;
          CREATE TABLE i_1687_p (bigint_col BIGINT) PARTITIONED BY (int_col INT) STORED AS TEXTFILE LOCATION 'hdfs://localhost:20500/test-warehouse/default.db/i_1687_p';
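
          A minimal loop sketch for that, assuming impala-shell is on the PATH, an impalad listens on localhost:21000, and impala-shell exits non-zero when a statement fails. The explicit LOCATION from the statements above is omitted since it is environment-specific:

          $ # Re-run the drop/create sequence until any statement fails.
          $ while impala-shell -i localhost:21000 -q "drop table if exists i_1687_p" \
              && impala-shell -i localhost:21000 -q "create table i_1687_p partitioned by (int_col) as select bigint_col, int_col from functional.alltypessmall" \
              && impala-shell -i localhost:21000 -q "drop table i_1687_p" \
              && impala-shell -i localhost:21000 -q "create table i_1687_p (bigint_col bigint) partitioned by (int_col int) stored as textfile"; do :; done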
          
          Michael Brown added a comment -

          Lars Volker, any update on your repro attempt? I'm going to remove the business about the test running twice from the Description, since we've shown that doesn't happen.

          Michael Brown added a comment -

          Lars Volker I'm sending this back to you. I can't find evidence this is a test issue. My recommendation is to treat this as an FE bug for investigation.

          From the test logs (under the ee_tests directory), we can see that we create a table, show it, drop it, and create it again. Meanwhile, no other connections are using test_show_create_table_f1598d0b.i_1687_p; we already showed in previous comments that no other tests use it.

          $ grep test_show_create_table_f1598d0b.i_1687_p * | grep beeswax | cut -d: -f2- | sort -u
          I1028 22:45:52.890895 21504 impala-beeswax-server.cc:53] query(): query=create table test_show_create_table_f1598d0b.i_1687_p partitioned by (int_col) as
          I1028 22:45:55.733433 21504 impala-beeswax-server.cc:53] query(): query=show create table test_show_create_table_f1598d0b.i_1687_p
          I1028 22:45:56.059283 21504 impala-beeswax-server.cc:53] query(): query=drop table test_show_create_table_f1598d0b.i_1687_p
          I1028 22:45:56.754441 21504 impala-beeswax-server.cc:53] query(): query=CREATE TABLE test_show_create_table_f1598d0b.i_1687_p (bigint_col BIGINT) PARTITIONED BY (int_col INT) STORED AS TEXTFILE LOCATION 'hdfs://localhost:20500/test-warehouse/test_show_create_table_f1598d0b.db/i_1687_p'
          

          It's this last CREATE TABLE that fails. Note that the LOCATION in the failing CREATE TABLE is the same location used by the first CREATE; the difference is that the second CREATE TABLE specifies it explicitly.

          Henry Robinson added a comment -

          Hit this again (build is http://sandbox.jenkins.cloudera.com/job/impala-asf-master-core-local-filesystem/127):

          =================================== FAILURES ===================================
          _____ TestShowCreateTable.test_show_create_table[table_format: text/none] ______
          [gw1] linux2 -- Python 2.6.6 /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/bin/../infra/python/env/bin/python
          metadata/test_show_create_table.py:55: in test_show_create_table
              unique_database)
          metadata/test_show_create_table.py:109: in __run_show_create_table_test_case
              self.__exec(create_table_result)
          metadata/test_show_create_table.py:120: in __exec
              return self.execute_query_expect_success(self.client, sql_str)
          common/impala_test_suite.py:416: in wrapper
              return function(*args, **kwargs)
          common/impala_test_suite.py:423: in execute_query_expect_success
              result = cls.__execute_query(impalad_client, query, query_options)
          common/impala_test_suite.py:509: in __execute_query
              return impalad_client.execute(query, user=user)
          common/impala_connection.py:160: in execute
              return self.__beeswax_client.execute(sql_stmt, user=user)
          beeswax/impala_beeswax.py:173: in execute
              handle = self.__execute_query(query_string.strip(), user=user)
          beeswax/impala_beeswax.py:337: in __execute_query
              handle = self.execute_query_async(query_string, user=user)
          beeswax/impala_beeswax.py:333: in execute_query_async
              return self.__do_rpc(lambda: self.imp_service.query(query,))
          beeswax/impala_beeswax.py:458: in __do_rpc
              raise ImpalaBeeswaxException(self.__build_error_message(b), b)
          E   ImpalaBeeswaxException: ImpalaBeeswaxException:
          E    INNER EXCEPTION: <class 'beeswaxd.ttypes.BeeswaxException'>
          E    MESSAGE: 
          E   ImpalaRuntimeException: Error making 'createTable' RPC to Hive Metastore: 
          E   CAUSED BY: AlreadyExistsException: Table test8 already exists
          ---------------------------- Captured stderr setup -----------------------------
          -- connecting to: localhost:21000
          SET sync_ddl=False;
          -- executing against localhost:21000
          DROP DATABASE IF EXISTS `test_show_create_table_f1598d0b` CASCADE;
          
          SET sync_ddl=False;
          -- executing against localhost:21000
          CREATE DATABASE `test_show_create_table_f1598d0b`;
          
          MainThread: Created database "test_show_create_table_f1598d0b" for test ID "metadata/test_show_create_table.py::TestShowCreateTable::()::test_show_create_table[table_format: text/none]"
          ----------------------------- Captured stderr call -----------------------------
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.test1 (
            id INT
          )
          STORED AS TEXTFILE;
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.test1;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.test1;
          
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.test1 (id INT) STORED AS TEXTFILE LOCATION 'file:/tmp/test-warehouse/test_show_create_table_f1598d0b.db/test1';
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.test1;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.test1;
          
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.test2 (
            year INT,
            month INT,
            id INT COMMENT 'Add a comment',
            bool_col BOOLEAN,
            tinyint_col TINYINT,
            smallint_col SMALLINT,
            int_col INT,
            bigint_col BIGINT,
            float_col FLOAT,
            double_col DOUBLE,
            date_string_col STRING,
            string_col STRING,
            timestamp_col TIMESTAMP
          )
          STORED AS TEXTFILE;
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.test2;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.test2;
          
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.test2 (year INT, month INT, id INT COMMENT 'Add a comment', bool_col BOOLEAN, tinyint_col TINYINT, smallint_col SMALLINT, int_col INT, bigint_col BIGINT, float_col FLOAT, double_col DOUBLE, date_string_col STRING, string_col STRING, timestamp_col TIMESTAMP) STORED AS TEXTFILE LOCATION 'file:/tmp/test-warehouse/test_show_create_table_f1598d0b.db/test2';
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.test2;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.test2;
          
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.test3 (
            year INT,
            month INT,
            id INT COMMENT 'Add a comment',
            bool_col BOOLEAN,
            tinyint_col TINYINT,
            smallint_col SMALLINT,
            int_col INT,
            bigint_col BIGINT,
            float_col FLOAT,
            double_col DOUBLE,
            date_string_col STRING,
            string_col STRING,
            timestamp_col TIMESTAMP
          )
          PARTITIONED BY (
            x INT,
            y INT,
            a BOOLEAN
          )
          STORED AS TEXTFILE;
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.test3;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.test3;
          
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.test3 (year INT, month INT, id INT COMMENT 'Add a comment', bool_col BOOLEAN, tinyint_col TINYINT, smallint_col SMALLINT, int_col INT, bigint_col BIGINT, float_col FLOAT, double_col DOUBLE, date_string_col STRING, string_col STRING, timestamp_col TIMESTAMP) PARTITIONED BY (x INT, y INT, a BOOLEAN) STORED AS TEXTFILE LOCATION 'file:/tmp/test-warehouse/test_show_create_table_f1598d0b.db/test3';
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.test3;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.test3;
          
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.test4 (
            year INT,
            month INT,
            id INT COMMENT 'Add a comment'
          )
          COMMENT 'This is a test'
          STORED AS TEXTFILE;
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.test4;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.test4;
          
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.test4 (year INT, month INT, id INT COMMENT 'Add a comment') COMMENT 'This is a test' STORED AS TEXTFILE LOCATION 'file:/tmp/test-warehouse/test_show_create_table_f1598d0b.db/test4';
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.test4;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.test4;
          
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.test5 (
            year INT,
            month INT,
            id INT COMMENT 'Add a comment'
          )
          ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n'
          STORED AS TEXTFILE;
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.test5;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.test5;
          
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.test5 (year INT, month INT, id INT COMMENT 'Add a comment') ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n' WITH SERDEPROPERTIES ('line.delim'='\n', 'field.delim'=',', 'serialization.format'=',', 'escape.delim'='\\') STORED AS TEXTFILE LOCATION 'file:/tmp/test-warehouse/test_show_create_table_f1598d0b.db/test5';
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.test5;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.test5;
          
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.test6 (
            year INT,
            month INT,
            id INT COMMENT 'Add a comment'
          )
          STORED AS PARQUET;
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.test6;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.test6;
          
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.test6 (year INT, month INT, id INT COMMENT 'Add a comment') STORED AS PARQUET LOCATION 'file:/tmp/test-warehouse/test_show_create_table_f1598d0b.db/test6';
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.test6;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.test6;
          
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.test7 (
            year INT,
            month INT,
            id INT COMMENT 'Add a comment'
          )
          STORED AS SEQUENCEFILE
          TBLPROPERTIES ('key3'='val3', 'key2'='val2', 'key1'='val1');
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.test7;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.test7;
          
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.test7 (year INT, month INT, id INT COMMENT 'Add a comment') STORED AS SEQUENCEFILE LOCATION 'file:/tmp/test-warehouse/test_show_create_table_f1598d0b.db/test7' TBLPROPERTIES ('key3'='val3', 'key2'='val2', 'key1'='val1');
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.test7;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.test7;
          
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.test8 (
            year INT,
            month INT,
            id INT COMMENT 'Add a comment'
          )
          STORED AS RCFILE;
          
          -- executing against localhost:21000
          show create table test_show_create_table_f1598d0b.test8;
          
          -- executing against localhost:21000
          drop table test_show_create_table_f1598d0b.test8;
          
          -- executing against localhost:21000
          CREATE TABLE test_show_create_table_f1598d0b.test8 (year INT, month INT, id INT COMMENT 'Add a comment') STORED AS RCFILE LOCATION 'file:/tmp/test-warehouse/test_show_create_table_f1598d0b.db/test8';
          
          Henry Robinson added a comment -

          Thomas Tauber-Marshall - could you investigate, please?

          Thomas Tauber-Marshall added a comment -

          Yep, looking into this.

          Tim Armstrong added a comment -

          I hit something that looks similar, but with databases, here: http://sandbox.jenkins.cloudera.com/job/impala-external-gerrit-verify-merge-ASF/442/

          de_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | query_type: SELECT | max_block_mgr_memory: 0 | cancel_delay: 3 | action: WAIT | query: compute stats lineitem]
          . data_errors/test_data_errors.py::TestTimestampErrors::()::test_timestamp_scan_agg_errors[exec_option: {'disable_codegen': True, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 
          'batch_size': 1, 'num_nodes': 0} | table_format: text/none]
          . metadata/test_compute_stats.py::TestIncompatibleColStats::()::test_incompatible_col_stats[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0
          , 'batch_size': 0, 'num_nodes': 0} | table_format: text/none]
          . query_test/test_cancellation.py::TestCancellationParallel::()::test_cancel_select[table_format: parquet/none | exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_no
          de_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | query_type: SELECT | max_block_mgr_memory: 0 | cancel_delay: 4 | action: WAIT | query: select * from lineitem order by l_orderkey]
          . hs2/test_fetch.py::TestFetch::()::test_alltypes_v6
          . data_errors/test_data_errors.py::TestTimestampErrors::()::test_timestamp_scan_agg_errors[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0,
           'batch_size': 1, 'num_nodes': 0} | table_format: text/none]
          . query_test/test_cancellation.py::TestCancellationParallel::()::test_cancel_select[table_format: avro/snap/block | exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single
          _node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | query_type: SELECT | max_block_mgr_memory: 0 | cancel_delay: 4 | action: WAIT | query: select * from lineitem limit 50]
          . failure/test_failpoints.py::TestFailpoints::()::test_failpoints[target_node: ('SORT', [9]) | table_format: text/none | exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_s
          ingle_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | query_type: SELECT | location: PREPARE | action: FAIL]
          . query_test/test_cancellation.py::TestCancellationParallel::()::test_cancel_select[table_format: avro/none | exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_
          rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | query_type: SELECT | max_block_mgr_memory: 0 | cancel_delay: 2 | action: WAIT | query: select l_returnflag from lineitem]
          . hs2/test_fetch.py::TestFetch::()::test_show_column_stats
          . query_test/test_cancellation.py::TestCancellationParallel::()::test_cancel_select[table_format: rc/none | exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_ro
          ws_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | query_type: SELECT | max_block_mgr_memory: 0 | cancel_delay: 2 | action: WAIT | query: select count(l_returnflag) pk from lineitem]
          . failure/test_failpoints.py::TestFailpoints::()::test_failpoints[target_node: ('SCAN HBASE', [0]) | table_format: seq/snap/block | exec_option: {'disable_codegen': False, 'abort_on_error':
           1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | query_type: SELECT | location: OPEN | action: CANCEL]
          . metadata/test_ddl.py::TestDdlStatements::()::test_drop_table_with_purge
          . query_test/test_cancellation.py::TestCancellationParallel::()::test_cancel_select[table_format: seq/snap/block | exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_
          node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | query_type: SELECT | max_block_mgr_memory: 0 | cancel_delay: 4 | action: WAIT | query: select * from lineitem limit 50]
          . query_test/test_cancellation.py::TestCancellationParallel::()::test_cancel_select[table_format: seq/gzip/block | exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_
          node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | query_type: SELECT | max_block_mgr_memory: 0 | cancel_delay: 1 | action: WAIT | query: select * from lineitem limit 50]
          . failure/test_failpoints.py::TestFailpoints::()::test_failpoints[target_node: ('EXCHANGE [BROADCAST]', [13, 10]) | table_format: rc/snap/block | exec_option: {'disable_codegen': False, 'ab
          ort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | query_type: SELECT | location: GETNEXT | action: MEM_LIMIT_EXCEEDED]
          . hs2/test_hs2.py::TestHS2::()::test_change_default_database_case_insensitive
          F metadata/test_ddl.py::TestDdlStatements::()::test_drop_cleans_hdfs_dirs
           metadata/test_ddl.py:130: in test_drop_cleans_hdfs_dirs
               self._create_db(unique_database)
           metadata/test_ddl_base.py:64: in _create_db
               impala_client.execute(ddl)
           common/impala_connection.py:160: in execute
               return self.__beeswax_client.execute(sql_stmt, user=user)
           beeswax/impala_beeswax.py:173: in execute
               handle = self.__execute_query(query_string.strip(), user=user)
           beeswax/impala_beeswax.py:337: in __execute_query
               handle = self.execute_query_async(query_string, user=user)
           beeswax/impala_beeswax.py:333: in execute_query_async
               return self.__do_rpc(lambda: self.imp_service.query(query,))
           beeswax/impala_beeswax.py:458: in __do_rpc
               raise ImpalaBeeswaxException(self.__build_error_message(b), b)
           E   ImpalaBeeswaxException: ImpalaBeeswaxException:
           E    INNER EXCEPTION: <class 'beeswaxd.ttypes.BeeswaxException'>
           E    MESSAGE: AnalysisException: Database already exists: test_drop_cleans_hdfs_dirs_fdfd4f8
          E metadata/test_ddl.py::TestDdlStatements::()::test_drop_cleans_hdfs_dirs
           conftest.py:273: in cleanup
               {'sync_ddl': sync_ddl})
           common/impala_test_suite.py:417: in wrapper
               return function(*args, **kwargs)
           common/impala_test_suite.py:424: in execute_query_expect_success
               result = cls.__execute_query(impalad_client, query, query_options)
           common/impala_test_suite.py:510: in __execute_query
               return impalad_client.execute(query, user=user)
           common/impala_connection.py:160: in execute
               return self.__beeswax_client.execute(sql_stmt, user=user)
           beeswax/impala_beeswax.py:173: in execute
               handle = self.__execute_query(query_string.strip(), user=user)
           beeswax/impala_beeswax.py:337: in __execute_query
               handle = self.execute_query_async(query_string, user=user)
           beeswax/impala_beeswax.py:333: in execute_query_async
               return self.__do_rpc(lambda: self.imp_service.query(query,))
           beeswax/impala_beeswax.py:458: in __do_rpc
               raise ImpalaBeeswaxException(self.__build_error_message(b), b)
           E   ImpalaBeeswaxException: ImpalaBeeswaxException:
           E    INNER EXCEPTION: <class 'beeswaxd.ttypes.BeeswaxException'>
           E    MESSAGE: 
           E   ImpalaRuntimeException: Error making 'dropDatabase' RPC to Hive Metastore: 
           E   CAUSED BY: NoSuchObjectException: test_drop_cleans_hdfs_dirs_fdfd4f8
          
          
          HuaisiXu Huaisi Xu added a comment -

          I can only get archived logs from http://sandbox.jenkins.cloudera.com/job/impala-asf-master-core-asan/120/. All logs from other tests were lost.

          Based on that log, I can see that the topic deletion was not sent to the impalad before that impalad accepted another "create table" command. My guess is that we used a different session for the "create", since the drop command is only synchronized within a single session.

          HuaisiXu Huaisi Xu added a comment -

          If this is true, then this is not a product bug, since the behavior is expected, and hence not a blocker, right? Perhaps we should make DDL synchronized on a single host?

          jbapple Jim Apple added a comment -

          An error like this one is consistently reproducible on an m4.4xlarge EC2 instance running Ubuntu 14.04:

          =================================== FAILURES ===================================
           TestUdfs.test_drop_function_while_running[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: text/none] 
          [gw0] linux2 -- Python 2.7.6 /home/ubuntu/tmp.Tn7IainMI3/bin/../infra/python/env/bin/python
          query_test/test_udfs.py:227: in test_drop_function_while_running
              get_fs_path('/test-warehouse/libTestUdfs.so'))
          common/impala_connection.py:160: in execute
              return self.__beeswax_client.execute(sql_stmt, user=user)
          beeswax/impala_beeswax.py:173: in execute
              handle = self.__execute_query(query_string.strip(), user=user)
          beeswax/impala_beeswax.py:337: in __execute_query
              handle = self.execute_query_async(query_string, user=user)
          beeswax/impala_beeswax.py:333: in execute_query_async
              return self.__do_rpc(lambda: self.imp_service.query(query,))
          beeswax/impala_beeswax.py:458: in __do_rpc
              raise ImpalaBeeswaxException(self.__build_error_message(b), b)
          E   ImpalaBeeswaxException: ImpalaBeeswaxException:
          E    INNER EXCEPTION: <class 'beeswaxd.ttypes.BeeswaxException'>
          E    MESSAGE: 
          E   CatalogException: Function drop_while_running already exists.
          ----------------------------- Captured stderr call -----------------------------
          -- executing against localhost:21000
          drop function if exists default.drop_while_running(BIGINT);
          
          -- executing against localhost:21000
          create function default.drop_while_running(BIGINT) returns BIGINT LOCATION '/test-warehouse/libTestUdfs.so' SYMBOL='Identity';
          
           generated xml file: /home/ubuntu/tmp.Tn7IainMI3/logs/ee_tests/results/TEST-impala-parallel.xml 
          =========================== short test summary info ============================
          FAIL query_test/test_udfs.py::TestUdfs::()::test_drop_function_while_running[exec_option: {'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: text/none]
          ==================== 1 failed, 19 passed in 239.22 seconds =====================
          

          That's building with -so and using /bin/run_all_tests.sh with EE_TEST set to true and EE_TEST_FILES set to query_test/test_udfs.py. It's at commit c68734ad6509869e8660cd1bad05e53461137474.

          HuaisiXu Huaisi Xu added a comment -

          How can I get access to that instance, Jim Apple? Where can I start something like this?

          jbapple Jim Apple added a comment -

          Unfortunately, I cannot give you login credentials to the machine.

          If you want to repro, try to use EC2. If you have trouble, I'd suggest that your questions will be more likely to be answered the more specific and detailed they are. I have tried to detail what you need to do in my previous comment.

          HuaisiXu Huaisi Xu added a comment -

          The problem is that we sometimes add a table multiple times from the catalog/statestore:

          I1114 12:38:38.149401  3365 ImpaladCatalog.java:364] adding default.i_1687_p7@1199 <--- first one
          ...
          ...
          ...
          I1114 12:38:38.847038  4534 query-exec-state.cc:882] Updating metastore with 10 altered partitions (int_col=0/, int_col=1/, int_col=2/, int_col=3/, int_col=4/, int_col=5/, int_col=6/, int_col=7/, int_col=8/, int_col=9/)
          I1114 12:38:38.847061  4534 query-exec-state.cc:896] Executing FinalizeDml() using CatalogService
          I1114 12:38:39.266376  4534 ImpaladCatalog.java:364] adding default.i_1687_p7@1201  <---- second one
          I1114 12:38:39.266626  4534 query-exec-state.cc:966] Inserted 100 row(s)
          I1114 12:38:39.282374  3365 impala-beeswax-server.cc:188] get_results_metadata(): query_id=4d4fe13406b02d4d:7e46d0c400000000
          I1114 12:38:39.283711  3365 impala-beeswax-server.cc:230] close(): query_id=4d4fe13406b02d4d:7e46d0c400000000
          I1114 12:38:39.283742  3365 impala-server.cc:908] UnregisterQuery(): query_id=4d4fe13406b02d4d:7e46d0c400000000
          I1114 12:38:39.283747  3365 impala-server.cc:994] Cancel(): query_id=4d4fe13406b02d4d:7e46d0c400000000
          I1114 12:38:39.283754  3365 coordinator.cc:1345] Cancel() query_id=4d4fe13406b02d4d:7e46d0c400000000
          I1114 12:38:39.283762  3365 coordinator.cc:1411] CancelFragmentInstances() query_id=4d4fe13406b02d4d:7e46d0c400000000, tried to cancel 0 fragment instances
          I1114 12:38:47.800526  3548 ImpaladCatalog.java:364] adding default.i_1687_p7@1201   <--- third one that causes problems
          

          The second "add" should be the last one we apply, because the query in the same session returned right after this point. The third one comes from the statestore later and can potentially cause problems for us (many, many race conditions).

          While I could not reproduce the reported exception, this is clearly the root cause. I will check something else and provide another update tomorrow.

          HuaisiXu Huaisi Xu added a comment -

          After further discussion with Henry Robinson, I think I can explain what went wrong:
          1. Based on the previous observation, a table (or any catalog object) with the same version can be added twice: a. when the catalog DDL returns; b. when the statestore sends an update.
          2. We have a removedCatalogObjects_ log to prevent 1.b from overwriting any drop applied in 1.a by a "drop table" statement.
          3. However, when a concurrent DDL (on any other object) returns from the catalog, it can trigger CatalogDeltaLog::garbageCollect() and clear removedCatalogObjects_, so 2 can no longer be guaranteed for older tables.

          The easiest way to fix this is to prevent the catalogd's return (direct update) from triggering garbageCollect(). If there is no objection, I will work in this direction tomorrow.
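
          To make the interaction concrete, here is a minimal sketch of the guard described in (2) and how (3) defeats it. The class shape, method names, and value types are simplified stand-ins rather than Impala's actual implementation; only removedCatalogObjects_ and garbageCollect() come from the discussion above.

          import java.util.SortedMap;
          import java.util.TreeMap;

          // Simplified stand-in for the impalad-side delta log discussed above.
          class CatalogDeltaLogSketch {
            // Deleted objects, keyed by the catalog version at which they were removed.
            private final SortedMap<Long, String> removedCatalogObjects_ = new TreeMap<>();

            // Step 1.a: the impalad applies a drop locally and records it.
            synchronized void addRemovedObject(long version, String objectName) {
              removedCatalogObjects_.put(version, objectName);
            }

            // Step 1.b: before applying an add from a statestore update, check whether
            // the object was dropped at the same or a later version; if so, skip the add.
            synchronized boolean wasObjectRemovedAfter(long updateVersion, String objectName) {
              return removedCatalogObjects_.tailMap(updateVersion).values().stream()
                  .anyMatch(objectName::equals);
            }

            // Step 3: a concurrent DDL's direct catalog update can trigger this and
            // discard older entries, so the check above stops protecting those drops.
            synchronized void garbageCollect(long lastProcessedVersion) {
              removedCatalogObjects_.headMap(lastProcessedVersion + 1).clear();
            }
          }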

          alex.behm Alexander Behm added a comment -

          Han Xu, nice catch! This does look like a bug. I also think your proposed fix makes sense.

          kwho Michael Ho added a comment -

          test_udfs.py with S3 just hit this problem again.

          17:29:53 =================================== FAILURES ===================================
          17:29:53  TestUdfs.test_native_functions[exec_option: {'disable_codegen': True, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 100, 'batch_size': 0, 'num_nodes': 0} | table_format: text/none] 
          17:29:53 [gw0] linux2 -- Python 2.6.6 /data/jenkins/workspace/impala-umbrella-build-and-test-s3/repos/Impala/bin/../infra/python/env/bin/python
          17:29:53 query_test/test_udfs.py:60: in test_native_functions
          17:29:53     self.run_test_case('QueryTest/udf-init-close', vector, use_db=unique_database)
          17:29:53 common/impala_test_suite.py:321: in run_test_case
          17:29:53     result = self.__execute_query(target_impalad_client, query, user=user)
          17:29:53 common/impala_test_suite.py:525: in __execute_query
          17:29:53     return impalad_client.execute(query, user=user)
          17:29:53 common/impala_connection.py:160: in execute
          17:29:53     return self.__beeswax_client.execute(sql_stmt, user=user)
          17:29:53 beeswax/impala_beeswax.py:173: in execute
          17:29:53     handle = self.__execute_query(query_string.strip(), user=user)
          17:29:53 beeswax/impala_beeswax.py:337: in __execute_query
          17:29:53     handle = self.execute_query_async(query_string, user=user)
          17:29:53 beeswax/impala_beeswax.py:333: in execute_query_async
          17:29:53     return self.__do_rpc(lambda: self.imp_service.query(query,))
          17:29:53 beeswax/impala_beeswax.py:458: in __do_rpc
          17:29:53     raise ImpalaBeeswaxException(self.__build_error_message(b), b)
          17:29:53 E   ImpalaBeeswaxException: ImpalaBeeswaxException:
          17:29:53 E    INNER EXCEPTION: <class 'beeswaxd.ttypes.BeeswaxException'>
          17:29:53 E    MESSAGE: AnalysisException: Table already exists: test_native_functions_514d1e6.udfinserttest
          17:29:53 ---------------------------- Captured stderr setup -----------------------------
          17:29:53 SET sync_ddl=False;
          17:29:53 -- executing against localhost:21000
          17:29:53 DROP DATABASE IF EXISTS `test_native_functions_514d1e6` CASCADE;
          17:29:53 
          17:29:53 SET sync_ddl=False;
          17:29:53 -- executing against localhost:21000
          17:29:53 CREATE DATABASE `test_native_functions_514d1e6`;
          17:29:53 
          17:29:53 MainThread: Created database "test_native_functions_514d1e6" for test ID "query_test/test_udfs.py::TestUdfs::()::test_native_functions[exec_option: {'disable_codegen': True, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 100, 'batch_size': 0, 'num_nodes': 0} | table_format: text/none]"
          17:29:53 ----------------------------- Captured stderr call -----------------------------
          
          alex.behm Alexander Behm added a comment -

          Updating this thread based on my investigation:

          • The test_udfs.py problem was a test race and has been fixed.
          • The other "Table already exists" issues appear to be product bugs.
          • I've tried very hard to reproduce the issue locally but so far without success.
          • I've tried running concurrent streams of add/drop, including drop database cascade, to mimic the test environment as closely as possible. While doing so, I tried adding sleeps in various places and increasing/decreasing the statestore topic update frequency. So far, still no local reproduction.
          alex.behm Alexander Behm added a comment -

          I am able to produce the same "Table already exists" error when running several concurrent streams doing drop+create+drop and invalidate metadata.
          However, I am very confident that no tests in the Jenkins builds are running a concurrent invalidate metadata. We fixed that a while ago, but I nevertheless checked again and could not find anything.

          alex.behm Alexander Behm added a comment -

          Without an easy reproduction, I think we should add targeted logging so we can better diagnose the failure the next time it happens.

          mikesbrown Michael Brown added a comment -

          Alexander Behm could you put up a patch that adds targeted logging? If such a patch is committed, more builds are likely to run through the targeted logging code path.

          alex.behm Alexander Behm added a comment -

          Michael Brown, to effectively find this issue we need pretty excessive logging, so I'd prefer not to check it into trunk. Last week, I created a private branch with more logging and set up a private Jenkins job to run periodically.

          We got lucky and my job hit the issue. I attached my commit that adds the extra logging as well as the resulting impalad.INFO from the build that hit the issue. It's indeed a product bug, explained as follows with snippets from the attached impalad.INFO.

          Summary of the problem
          The catalogd may incorrectly return a deleted object to an impalad in response to a DROP TABLE/DATABASE IF EXISTS request. The tricky part is that the version of that deleted object is wrong. See this problematic snippet from CatalogOpExecutor.java:

            private void dropDatabase(TDropDbParams params, TDdlExecResponse resp)
                throws ImpalaException {
          ...
                Db removedDb = catalog_.removeDb(params.getDb());
                // If no db was removed as part of this operation just return the current catalog
                // version.
                if (removedDb == null) {
                  removedObject.setCatalog_version(catalog_.getCatalogVersion());   <--- This can overwrite legitimate entries in an impalad's CatalogDeltaLog with the same version!
                } else {
                  removedObject.setCatalog_version(removedDb.getCatalogVersion());
                }
          ...
          }
          

          The problem is that an impalad's CatalogDeltaLog could have a legitimate entry with the same version, and this "dummy" deleted object may overwrite it. It is definitely not safe (and not even necessary) to return a deleted object with the current catalog version. As a reminder, the CatalogDeltaLog protects deletions from being undone by updates from the statestore (which may add the same object with a smaller version). That is exactly what is happening in this JIRA.
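
          To illustrate the overwrite with the concrete versions from the log analysis below, here is a minimal, self-contained sketch; the TreeMap is a simplification of the version-keyed CatalogDeltaLog, not the real data structure:

          import java.util.TreeMap;

          // Illustration of the overwrite described above, using the versions from
          // the log analysis below.
          public class DeltaLogOverwriteDemo {
            public static void main(String[] args) {
              TreeMap<Long, String> removedObjects = new TreeMap<>();

              // (1) A legitimate DROP TABLE is recorded at version 12214.
              removedObjects.put(12214L, "TABLE test_native_functions_b37009f0.udfinserttest");

              // (2) DROP DATABASE IF EXISTS on a non-existent database: the catalogd
              // returns a dummy removed object stamped with the *current* catalog
              // version -- also 12214 -- which silently replaces the entry from (1).
              removedObjects.put(12214L, "DATABASE test_ir_functions_35acba69");

              // (3) A statestore delta re-adds the dropped table at version 12207.
              // The guard looks for a removal of the table at version >= 12207,
              // finds none, and the deleted table reappears in the impalad's catalog.
              boolean dropProtected = removedObjects.tailMap(12207L).values().stream()
                  .anyMatch(v -> v.equals("TABLE test_native_functions_b37009f0.udfinserttest"));
              System.out.println("drop still protected? " + dropProtected);  // prints false
            }
          }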

          Analysis of snippets from the impalad.INFO
          The problematic table is "test_native_functions_b37009f0.udfinserttest".

          1. The table is dropped. The logs show the local application of the deletion. The CatalogDeltaLog contains an entry at version 12214 for the deletion.

          I1216 21:31:55.201782 15371 ImpaladCatalog.java:137] IMPALA-3641 12206 TUpdateCatalogCacheRequest(is_delta:true, catalog_service_id:TUniqueId(hi:606768694544452912, lo:-5884766320429217073), updated_objects:[], removed_objects:[TCatalogObject(type:TABLE, catalog_version:12214, table:TTable(db_name:test_native_functions_b37009f0, tbl_name:udfinserttest))])
          I1216 21:31:55.202005 15371 ImpaladCatalog.java:346] IMPALA-3641 Removing obj 12214 12206
          I1216 21:31:55.202183 15371 CatalogDeltaLog.java:84] IMPALA-3641 {12211=TCatalogObject(type:DATABASE, catalog_version:12211, db:TDatabase(db_name:test_ir_functions_a8de1584)), 12213=TCatalogObject(type:DATABASE, catalog_version:12213, db:TDatabase(db_name:test_ir_functions_a8de1584)), 12214=TCatalogObject(type:TABLE, catalog_version:12214, table:TTable(db_name:test_native_functions_b37009f0, tbl_name:udfinserttest))}
          

          2. The impalad locally applies a DROP DATABASE IF EXISTS. The database did not exist, so the removed object got the current catalog version, 12214. We can see that the entry from (1) was overwritten in the CatalogDeltaLog.

          I1216 21:31:55.203670 15357 ImpaladCatalog.java:137] IMPALA-3641 12206 TUpdateCatalogCacheRequest(is_delta:true, catalog_service_id:TUniqueId(hi:606768694544452912, lo:-5884766320429217073), updated_objects:[], removed_objects:[TCatalogObject(type:DATABASE, catalog_version:12214, db:TDatabase(db_name:test_ir_functions_35acba69))])
          I1216 21:31:55.203840 15357 ImpaladCatalog.java:346] IMPALA-3641 Removing obj 12214 12206
          I1216 21:31:55.204043 15357 CatalogDeltaLog.java:84] IMPALA-3641 {12211=TCatalogObject(type:DATABASE, catalog_version:12211, db:TDatabase(db_name:test_ir_functions_a8de1584)), 12213=TCatalogObject(type:DATABASE, catalog_version:12213, db:TDatabase(db_name:test_ir_functions_a8de1584)), 12214=TCatalogObject(type:DATABASE, catalog_version:12214, db:TDatabase(db_name:test_ir_functions_35acba69))}
          

          3. The impalad processes an update from the statestore that contains the addition of the problematic table at version 12207. This addition would have been prevented by the entry from (1) in the CatalogDeltaLog, had that entry not been overwritten.

          I1216 21:31:55.648103 15495 ImpaladCatalog.java:137] IMPALA-3641 12206 TUpdateCatalogCacheRequest(is_delta:true, catalog_service_id:TUniqueId(hi:606768694544452912, lo:-5884766320429217073), updated_objects:[TCatalogObject(type:TABLE, catalog_version:12210, table:TTable(db_name:test_ir_functions_2ddad415, tbl_name:udfinserttest, access_level:READ_WRITE, columns:[TColumn(columnName:a, columnType:TColumnType(types:[TTypeNode(type:SCALAR, scalar_type:TScalarType(type:INT))]), col_stats:TColumnStats(avg_size:4.0, max_size:4, num_distinct_values:-1, num_nulls:-1), position:1)], clustering_columns:[TColumn(columnName:udf_was_opened, columnType:TColumnType(types:[TTypeNode(type:SCALAR, scalar_type:TScalarType(type:STRING))]), col_stats:TColumnStats(avg_size:-1.0, max_size:-1, num_distinct_values:1, num_nulls:0), position:0)], table_type:HDFS_TABLE, hdfs_table:THdfsTable(hdfsBaseDir:hdfs://localhost:20500/test-warehouse/test_ir_functions_2ddad415.db/udfinserttest, colNames:[udf_was_opened, a], nullPartitionKeyValue:__HIVE_DEFAULT_PARTITION__, nullColumnValue:\N, partitions:{-1=THdfsPartition(lineDelim:10, fieldDelim:1, collectionDelim:1, mapKeyDelim:1, escapeChar:0, fileFormat:TEXT, partitionKeyExprs:[], blockSize:0, access_level:READ_WRITE, stats:TTableStats(num_rows:-1), is_marked_cached:false, id:-1, hms_parameters:{}), 6864=THdfsPartition(lineDelim:10, fieldDelim:1, collectionDelim:1, mapKeyDelim:1, escapeChar:0, fileFormat:TEXT, partitionKeyExprs:[TExpr(nodes:[TExprNode(node_type:STRING_LITERAL, type:TColumnType(types:[TTypeNode(type:SCALAR, scalar_type:TScalarType(type:STRING))]), num_children:0, string_literal:TStringLiteral(value:1))])], blockSize:0, file_desc:[THdfsFileDesc(file_name:8443900107e0eb54-6ead82bc00000000_1867147552_data.0., length:2, compression:NONE, last_modification_time:1481952712298, file_blocks:[THdfsFileBlock(offset:0, length:2, replica_host_idxs:[1, 0, 2], disk_ids:[2, 1, 0], is_replica_cached:[false, false, false])]), THdfsFileDesc(file_name:a64938758981c14d-9a46aebf00000000_1657836735_data.0., length:2, compression:NONE, last_modification_time:1481952712558, file_blocks:[THdfsFileBlock(offset:0, length:2, replica_host_idxs:[2, 0, 1], disk_ids:[0, 1, 2], is_replica_cached:[false, false, false])])], location:THdfsPartitionLocation(prefix_index:0, suffix:udf_was_opened=1), access_level:READ_WRITE, stats:TTableStats(num_rows:-1), is_marked_cached:false, id:6864, hms_parameters:{numFiles=1, transient_lastDdlTime=1481952711, COLUMN_STATS_ACCURATE=false, totalSize=2, numRows=-1, rawDataSize=-1})}, network_addresses:[TNetworkAddress(hostname:127.0.0.1, port:31002), TNetworkAddress(hostname:127.0.0.1, port:31000), TNetworkAddress(hostname:127.0.0.1, port:31001)], multiple_filesystems:false, partition_prefixes:[hdfs://localhost:20500/test-warehouse/test_ir_functions_2ddad415.db/udfinserttest/]), metastore_table:Table(tableName:udfinserttest, dbName:test_ir_functions_2ddad415, owner:jenkins, createTime:1481952680, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:a, type:int, comment:)], location:hdfs://localhost:20500/test-warehouse/test_ir_functions_2ddad415.db/udfinserttest, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:0, serdeInfo:SerDeInfo(name:, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], 
skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:udf_was_opened, type:string, comment:)], parameters:{transient_lastDdlTime=1481952711}, viewOriginalText:, viewExpandedText:, tableType:MANAGED_TABLE, temporary:false))),
          TCatalogObject(type:TABLE, catalog_version:12207, table:TTable(db_name:test_native_functions_b37009f0, tbl_name:udfinserttest, access_level:READ_WRITE, columns:[TColumn(columnName:udf_was_opened, columnType:TColumnType(types:[TTypeNode(type:SCALAR, scalar_type:TScalarType(type:BOOLEAN))]), col_stats:TColumnStats(avg_size:1.0, max_size:1, num_distinct_values:-1, num_nulls:-1), position:0)], clustering_columns:[], table_type:HDFS_TABLE, hdfs_table:THdfsTable(hdfsBaseDir:hdfs://localhost:20500/test-warehouse/test_native_functions_b37009f0.db/udfinserttest, colNames:[udf_was_opened], nullPartitionKeyValue:__HIVE_DEFAULT_PARTITION__, nullColumnValue:\N, partitions:{-1=THdfsPartition(lineDelim:10, fieldDelim:1, collectionDelim:1, mapKeyDelim:1, escapeChar:0, fileFormat:TEXT, partitionKeyExprs:[], blockSize:0, access_level:READ_WRITE, stats:TTableStats(num_rows:-1), is_marked_cached:false, id:-1, hms_parameters:{}), 6863=THdfsPartition(lineDelim:10, fieldDelim:1, collectionDelim:1, mapKeyDelim:1, escapeChar:0, fileFormat:TEXT, partitionKeyExprs:[], blockSize:0, file_desc:[THdfsFileDesc(file_name:7749820e73fde320-14b52cd500000000_1392177785_data.0., length:5, compression:NONE, last_modification_time:1481952711672, file_blocks:[THdfsFileBlock(offset:0, length:5, replica_host_idxs:[0, 1, 2], disk_ids:[1, 0, 2], is_replica_cached:[false, false, false])])], location:THdfsPartitionLocation(prefix_index:0, suffix:), access_level:READ_WRITE, stats:TTableStats(num_rows:-1), is_marked_cached:false, id:6863, hms_parameters:{})}, network_addresses:[TNetworkAddress(hostname:127.0.0.1, port:31002), TNetworkAddress(hostname:127.0.0.1, port:31001), TNetworkAddress(hostname:127.0.0.1, port:31000)], multiple_filesystems:false, partition_prefixes:[hdfs://localhost:20500/test-warehouse/test_native_functions_b37009f0.db/udfinserttest]), metastore_table:Table(tableName:udfinserttest, dbName:test_native_functions_b37009f0, owner:jenkins, createTime:1481952687, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:udf_was_opened, type:boolean, comment:)], location:hdfs://localhost:20500/test-warehouse/test_native_functions_b37009f0.db/udfinserttest, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:0, serdeInfo:SerDeInfo(name:, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{transient_lastDdlTime=1481952687}, viewOriginalText:, viewExpandedText:, tableType:MANAGED_TABLE, temporary:false))),
          TCatalogObject(type:CATALOG, catalog_version:12210, catalog:TCatalog(catalog_service_id:TUniqueId(hi:606768694544452912, lo:-5884766320429217073)))], removed_objects:[])
          

          4. We try to CREATE TABLE, but the impalad reports that the table already exists. The existing table is at version 12207 (the one added by the statestore).

          I1216 21:31:55.816988 15371 Analyzer.java:2472] IMPALA-3641 Tbl exists: test_native_functions_b37009f0.udfinserttest@12207 HdfsTable true
          I1216 21:31:55.817726 15371 jni-util.cc:169] org.apache.impala.common.AnalysisException: Table already exists: test_native_functions_b37009f0.udfinserttest
          

          Unfortunately, I have not been able to reproduce the issue locally despite vigorous attempts and despite knowing what the problem is.

          I believe the fix is to not return a deleted object when no object was actually deleted. It is definitely not safe to return a deleted object with the current catalog version.
          I'm currently testing a patch to fix this issue.
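
          A rough sketch of what that fix looks like, using simplified stand-in types rather than Impala's real Catalog, Db, and Thrift response classes (the committed patch is in the next comment):

          import java.util.Optional;

          // Sketch of the proposed fix: only report a removed object in the RPC
          // response when something was actually removed from the catalogd. All
          // types here are simplified stand-ins, not Impala's real classes.
          class DropIfExistsFixSketch {
            static class RemovedObject {
              final String name;
              final long catalogVersion;
              RemovedObject(String name, long catalogVersion) {
                this.name = name;
                this.catalogVersion = catalogVersion;
              }
            }

            interface Catalog {
              // Returns the removed Db's catalog version, or empty if it did not exist.
              Optional<Long> removeDb(String dbName);
            }

            // Before the fix, the empty() case produced a dummy removed object stamped
            // with the current catalog version, which could overwrite a legitimate
            // entry in an impalad's CatalogDeltaLog. After the fix, it produces nothing.
            static Optional<RemovedObject> dropDatabaseIfExists(Catalog catalog, String dbName) {
              return catalog.removeDb(dbName)
                  .map(version -> new RemovedObject(dbName, version));
            }
          }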

          alex.behm Alexander Behm added a comment -

          commit 6d15f03777afeb1375927ad9520a5db5bc9d42a1
          Author: Alex Behm <alex.behm@cloudera.com>
          Date: Tue Dec 20 00:12:51 2016 -0800

          IMPALA-3641: Fix catalogd RPC responses to DROP IF EXISTS.

          The main problem was that the catalogd's response to
          a DROP IF EXISTS operation included a removed object
          that was applied to the requesting impalad's catalog cache.
          In particular, a DROP DATABASE IF EXISTS that did not actually
          drop anything in the catalogd still returned the object name in
          the RPC response as a removed object with the current catalog
          version (i.e., without incrementing the catalog version).

          The above behavior led to a situation where a drop of
          a non-existent object overwrote a legitimate entry in
          an impalad's CatalogDeltaLog. Recall that the version of the
          dropped object was based on the current catalog version
          at some point in time, e.g., the same version as that of a
          legitimate entry in the CatalogDeltaLog.

          As a reminder, the CatalogDeltaLog protects deletions from
          being undone via updates from the statestore. So overwriting
          an object in the CatalogDeltaLog can lead to a dropped object
          appearing again with certain timing of a statestore update.

          Please see the JIRA for an analysis of logging output that
          shows the bug and its effect.

          The fix is simple: The RPC response of a DROP IF EXISTS should
          only contain a removed object if an object was actually
          removed from the catalogd.

          This fix, however, introduces a new consistency issue (IMPALA-4727).
          The new behavior is not ideal, but better than the old behavior,
          explained as follows:
          The behavior before this patch is problematic because the drop of a
          completely unrelated object can affect the consistency of a drop+add
          on another object.
          The behavior after this patch is that a drop+add may fail in the add
          if there is an ill-timed concurrent drop of the same object.

          Testing:

          • Unfortunately, I have not been able to reproduce the issue
            locally despite vigorous attempts and despite knowing what
            the problem is. Our existing tests seem to reproduce the
            issue pretty reliably, so it's not clear whether a targeted
            test is feasible or needed.
          • An exhaustive test run passed.

          Change-Id: Icb1f31eb2ecf05b9b51ef4e12e6bb78f44d0cf84
          Reviewed-on: http://gerrit.cloudera.org:8080/5556
          Reviewed-by: Alex Behm <alex.behm@cloudera.com>
          Tested-by: Impala Public Jenkins
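
          As a rough sketch of that fix (hypothetical names and a deliberately
          simplified catalog, not the actual Impala catalogd code), the no-op
          case now reports nothing instead of echoing the current catalog
          version:

          import java.util.HashSet;
          import java.util.Optional;
          import java.util.Set;

          // Sketch of the fixed vs. buggy DROP IF EXISTS response; the names
          // are hypothetical and the catalog is reduced to a set of tables.
          public class DropIfExistsSketch {
            private long catalogVersion = 12207;
            private final Set<String> tables = new HashSet<>(Set.of("db.T"));

            // Post-fix: a removed-object version is reported only when a
            // table was actually dropped, and the version is bumped for it.
            Optional<Long> dropTableIfExists(String name) {
              if (!tables.remove(name)) return Optional.empty();
              return Optional.of(++catalogVersion);
            }

            // Pre-fix (buggy): a non-existent table still yielded the
            // *current*, unincremented catalog version as a "removed object",
            // which the requesting impalad applied to its CatalogDeltaLog.
            Optional<Long> dropTableIfExistsBuggy(String name) {
              if (!tables.remove(name)) return Optional.of(catalogVersion);
              return Optional.of(++catalogVersion);
            }

            public static void main(String[] args) {
              DropIfExistsSketch catalog = new DropIfExistsSketch();
              System.out.println(catalog.dropTableIfExistsBuggy("db.missing"));  // Optional[12207]
              System.out.println(catalog.dropTableIfExists("db.missing"));       // Optional.empty
            }
          }

          Returning nothing in the no-op case leaves unrelated entries in each
          impalad's CatalogDeltaLog intact, at the cost of the drop+add race
          the commit message records as IMPALA-4727.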

          Hide
          dknupp David Knupp added a comment -

          I'm still seeing this issue intermittently on various test runs.

          Hide
          dknupp David Knupp added a comment -

          Closing again, with apologies – I looked at the date on the commit message, but somehow missed that this fix just went in yesterday.

          Hide
          jbapple Jim Apple added a comment -

          This is a bulk comment on all issues with Fix Version 2.8.0 that were resolved on or after 2016-12-09.

          2.8.0 was branched on December 9, with only two changes to master cherry-picked to the 2.8.0 release branch after that:

          https://github.com/apache/incubator-impala/commits/2.8.0

          Issues fixed after December 9 might not be fixed in 2.8.0. If you are the one who marked this issue Resolved, can you check to see if the patch is in 2.8.0 by using the link above? If the patch is not in 2.8.0, can you change the Fix Version to 2.9.0?

          Thank you!

          Hide
          lv Lars Volker added a comment -

          I checked and the fix is not included in 2.8.0. Setting Fix Version to 2.9.0.


            People

            • Assignee: alex.behm Alexander Behm
            • Reporter: lv Lars Volker
            • Votes: 0
            • Watchers: 12
