[IMPALA-5427] beeswax get_state() can return EXCEPTION before error is visible via get_log() - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: Impala 2.9.0
Fix Version/s: Impala 2.10.0
Component/s: Backend
Labels:
- broken-build
- flaky

Epic Color:
ghx-label-5

Description

We've seen this error once:

09:52:03 FAIL query_test/test_scanners.py::TestParquet::()::test_corrupt_files[exec_option: {'disable_codegen': True, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: parquet/none]
09:52:03 =================================== FAILURES ===================================
09:52:03  TestParquet.test_corrupt_files[exec_option: {'disable_codegen': True, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0, 'batch_size': 0, 'num_nodes': 0} | table_format: parquet/none] 
09:52:03 [gw2] linux2 -- Python 2.6.6 /data/jenkins/workspace/impala-umbrella-build-and-test/repos/Impala/bin/../infra/python/env/bin/python
09:52:03 query_test/test_scanners.py:254: in test_corrupt_files
09:52:03     self.run_test_case('QueryTest/parquet-abort-on-error', vector)
09:52:03 common/impala_test_suite.py:374: in run_test_case
09:52:03     self.__verify_exceptions(test_section['CATCH'], str(e), use_db)
09:52:03 common/impala_test_suite.py:253: in __verify_exceptions
09:52:03     (expected_str, actual_str)
09:52:03 E   AssertionError: Unexpected exception string. Expected: Column metadata states there are 50 values, but read 100 values from column element.
09:52:03 E   Not found in actual: ImpalaBeeswaxException: Query aborted:
09:52:03 ----------------------------- Captured stderr call -----------------------------
09:52:03 -- executing against localhost:21000
09:52:03 use functional_parquet;
09:52:03 
09:52:03 SET disable_codegen=True;
09:52:03 SET abort_on_error=0;
09:52:03 SET exec_single_node_rows_threshold=0;
09:52:03 SET batch_size=0;
09:52:03 SET num_nodes=0;
09:52:03 -- executing against localhost:21000
09:52:03 set num_nodes=1;
09:52:03 
09:52:03 -- executing against localhost:21000
09:52:03 
09:52:03 set num_scanner_threads=1;
09:52:03 
09:52:03 -- executing against localhost:21000
09:52:03 
09:52:03 select id, cnt from bad_column_metadata t, (select count(*) cnt from t.int_array) v;
09:52:03 
09:52:03 -- executing against localhost:21000
09:52:03 SET NUM_NODES=0;;
09:52:03 
09:52:03 -- executing against localhost:21000
09:52:03 SET NUM_SCANNER_THREADS=0;;
09:52:03 
09:52:03 -- executing against localhost:21000
09:52:03 set num_nodes=1;
09:52:03 
09:52:03 -- executing against localhost:21000
09:52:03 
09:52:03 set num_scanner_threads=1;
09:52:03 
09:52:03 -- executing against localhost:21000
09:52:03 
09:52:03 select id from bad_column_metadata;
09:52:03 
09:52:03 -- executing against localhost:21000
09:52:03 SET NUM_NODES=0;;
09:52:03 
09:52:03 -- executing against localhost:21000
09:52:03 SET NUM_SCANNER_THREADS=0;;
09:52:03 
09:52:03 -- executing against localhost:21000
09:52:03 SELECT * from bad_parquet_strings_negative_len;
09:52:03 
09:52:03 -- executing against localhost:21000
09:52:03 SELECT * from bad_parquet_strings_out_of_bounds;
09:52:03 
09:52:03 -- executing against localhost:21000
09:52:03 use functional_parquet;
09:52:03 
09:52:03 SET disable_codegen=True;
09:52:03 SET abort_on_error=1;
09:52:03 SET exec_single_node_rows_threshold=0;
09:52:03 SET batch_size=0;
09:52:03 SET num_nodes=0;
09:52:03 -- executing against localhost:21000
09:52:03 set num_nodes=1;
09:52:03 
09:52:03 -- executing against localhost:21000
09:52:03 
09:52:03 set num_scanner_threads=1;
09:52:03 
09:52:03 -- executing against localhost:21000
09:52:03 
09:52:03 select id, cnt from bad_column_metadata t, (select count(*) cnt from t.int_array) v;
09:52:03 
09:52:03 -- executing against localhost:21000
09:52:03 SET NUM_NODES=0;;
09:52:03 
09:52:03 -- executing against localhost:21000
09:52:03 SET NUM_SCANNER_THREADS=0;;
09:52:03 
09:52:03 ===== 1 failed, 10942 passed, 139 skipped, 500 xfailed in 15060.87 seconds =====

I was able to reproduce this once on my dev machine. To do so, I altered the affected test like so:

@SkipIfOldAggsJoins.nested_types
@pytest.mark.parametrize('multiplier', xrange(32))
def test_corrupt_files(self, vector, multiplier):

Then I ran it in a loop:

i=0; while [ $? -eq 0 ]; do ((++i)); echo "Run: $i"; impala-py.test tests/query_test/test_scanners.py::TestParquet::test_corrupt_files --exploration_strategy=exhaustive -n8; done

After a few hours the test failed. From the log files I could not figure out what had happened.

I also noticed that some queries don't report an error even when they should. For example:

[localhost:21000] > select id, cnt from functional_parquet.bad_column_metadata t, (select 1 cnt) u;
Query: select id, cnt from functional_parquet.bad_column_metadata t, (select 1 cnt) u
Query submitted at: 2017-06-01 22:34:55 (Coordinator: http://lv-desktop:25000)
Query progress can be monitored at: http://lv-desktop:25000/query_plan?query_id=8f401de1b07987c9:ae67b91d00000000
WARNINGS: Column metadata states there are 11 values, but read 10 values from column id. file=hdfs://localhost:20500/test-warehouse/bad_column_metadata_parquet/bad_column_metadata.parquet

Without the second subquery a proper error is returned:

[localhost:21000] > select id from functional_parquet.bad_column_metadata t;
Query: select id from functional_parquet.bad_column_metadata t
Query submitted at: 2017-06-01 22:38:03 (Coordinator: http://lv-desktop:25000)
Query progress can be monitored at: http://lv-desktop:25000/query_plan?query_id=c3456d826947f584:156f07da00000000
ERROR: Column metadata states there are 11 values, but read 10 values from column id. file=hdfs://localhost:20500/test-warehouse/bad_column_metadata_parquet/bad_column_metadata.parquet

beeswax get_state() can return EXCEPTION before error is visible via get_log()

Details

Description

Attachments

Attachments

Activity

People

Dates