  1. IMPALA
  2. IMPALA-3314

Altering table partition's storage format is not working and crashing the daemon

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.5.0
    • Fix Version/s: Impala 2.8.0
    • Component/s: Frontend
    • Labels:

      Description

      Steps to reproduce the problem:
      Step1:
      create external table sample (username string, tweet string, timewhen int) partitioned by (year string,month string) location '/tmp/test_avro/data/' TBLPROPERTIES ('avro.schema.url'='hdfs://host-10-17-80-187.coe.cloudera.com:8020/tmp/test_avro/schema/twitter.avsc');
      Step2:
      hadoop fs -mkdir /tmp/test_avro/data/year=2016/month=03
      hadoop fs -put twitter.avro /tmp/test_avro/data/year=2016/month=03
      Step3:
      alter table sample add partition (year="2016",month="03") location '/tmp/test_avro/data/year=2016/month=03';
      Step4:
      alter table sample partition (year="2016",month="03") set fileformat avro;
      Step5:
      select * from sample;

      Data and schema can be found here-
      https://github.com/miguno/avro-cli-examples

      core dump -
      0x00000000015c4c10 in impala::HdfsAvroScanner::ResolveSchemas (this=0x92bac60, table_root=..., file_root=0x8c742b8) at /home/bharath/Impala/be/src/exec/hdfs-avro-scanner.cc:186
      (gdb) print table_root
      $2 = (const impala::AvroSchemaElement &) @0x8c46250: {schema = 0x0, children = std::vector of length 0, capacity 0, null_union_position = -1, slot_desc = 0x0, static LLVM_CLASS_NAME = 0x279ccf0 "struct.impala::AvroSchemaElement"}

      So, the schema is clearly null and we are dereferencing a null pointer at
      if (table_root.schema->type != AVRO_RECORD) return Status("Table schema is not a record");
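
The crash above could be avoided by checking the pointer before reading through it. The following is a minimal sketch of such a guard, not the actual Impala patch: `Schema`, `SchemaElement`, `CheckTableRoot`, and the first error string are hypothetical, simplified stand-ins for `AvroSchemaElement` and `ResolveSchemas`, and errors are returned as plain strings rather than `Status` objects.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-ins for the real Impala/Avro types.
enum AvroType { AVRO_RECORD, AVRO_STRING };
struct Schema { AvroType type; };
struct SchemaElement { const Schema* schema = nullptr; };

// Returns an empty string on success, otherwise an error message.
// The null check comes first so that a missing table schema produces
// an error instead of a null-pointer dereference.
std::string CheckTableRoot(const SchemaElement& table_root) {
  if (table_root.schema == nullptr) {
    return "Missing Avro schema";  // illustrative message, not Impala's
  }
  if (table_root.schema->type != AVRO_RECORD) {
    return "Table schema is not a record";
  }
  return "";
}
```

With this ordering, a default-constructed `SchemaElement` (the state shown in the gdb output, `schema = 0x0`) yields an error message and the query fails cleanly rather than taking down the daemon.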

      The schema is NULL because hdfs_scan_node sees an empty Avro schema string:
      // Parse Avro table schema if applicable
      const string& avro_schema_str = hdfs_table_->avro_schema();  // <<<< Empty string.
      if (!avro_schema_str.empty()) {
        avro_schema_t avro_schema;
        int error = avro_schema_from_json_length(
            avro_schema_str.c_str(), avro_schema_str.size(), &avro_schema);
        if (error != 0) {
          return Status(Substitute("Failed to parse table schema: $0", avro_strerror()));
        }
        RETURN_IF_ERROR(AvroSchemaElement::ConvertSchema(avro_schema, avro_schema_.get()));
      }

      This information is usually passed on to the backend from the frontend table descriptor.
      avroSchema_ = hdfsTable.isSetAvroSchema() ? hdfsTable.getAvroSchema() : null;

      It is a per-table property, not a per-partition one. This means that the Avro schema URL is only passed on to the backend if the base table is Avro.
      if (HdfsFileFormat.fromJavaClassName(inputFormat) == HdfsFileFormat.AVRO) {
      ........
      avroSchema_ = AvroSchemaUtils.getAvroSchema(schemaSearchLocations);
      .........
      }

      Changing the base table format to avro works fine.
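
One way to avoid this, and the direction described in the commit log message on this issue, is to load the Avro schema whenever the base table or any of its partitions is Avro. The sketch below models that decision in C++ only for consistency with the backend snippets above; the real frontend code is Java, and `FileFormat`, `Partition`, `Table`, and `NeedsAvroSchema` are hypothetical names used purely for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, simplified model of a table with per-partition formats.
enum class FileFormat { TEXT, PARQUET, AVRO };

struct Partition { FileFormat format; };

struct Table {
  FileFormat base_format;
  std::vector<Partition> partitions;
};

// True if the Avro schema should be loaded: either the base table is
// Avro, or at least one partition has been altered to Avro.
bool NeedsAvroSchema(const Table& t) {
  if (t.base_format == FileFormat::AVRO) return true;
  return std::any_of(t.partitions.begin(), t.partitions.end(),
                     [](const Partition& p) {
                       return p.format == FileFormat::AVRO;
                     });
}
```

Under this model, the table from the reproduction steps (a non-Avro base table with one partition set to Avro) would qualify for schema loading, whereas a check on the base format alone would skip it.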

          Activity

          bharathv bharath v added a comment -

          Home: http://github.mtv.cloudera.com/CDH/Impala
          Commit: 6d90f4710439a9dcfaf26247d4a0feef76216a91
          http://github.com/Cloudera/Impala/commit/6d90f4710439a9dcfaf26247d4a0feef76216a91
          Author: Bharath Vissapragada <bharathv@cloudera.com>
          Date: 2016-05-21 (Sat, 21 May 2016)

          Changed paths:
          M be/src/exec/hdfs-avro-scanner.cc
          M be/src/exec/hdfs-avro-scanner.h
          M fe/src/main/java/com/cloudera/impala/catalog/HdfsPartition.java
          M fe/src/main/java/com/cloudera/impala/catalog/HdfsTable.java
          A testdata/workloads/functional-query/queries/QueryTest/avro-stale-schema.test
          M tests/query_test/test_avro_schema_resolution.py

          Log Message:
          -----------
          IMPALA-3314/IMPALA-3513: Fix querying tables/partitions altered to Avro format

          Bug: Impalads crash if we query an Avro table with stale metadata

          Cause: This happens because avroSchema_ is not set in HdfsTable,
          which is not propagated to the avro scanner and it doesn't have
          appropriate checks to make sure the schema is non-null.

          The patch fixes the following.

          1. Avro scanner should gracefully handle the case where the avro schema
          is not set. Appropriate null checks and a meaningful error message have
          been added.

          2. This is a special case with multi-fileformat partitioned tables.
          avroSchema_ should be set in HdfsTable even if any subset of the
          partitions are backed by avro. Without this patch, we only set it
          if the base table file format is Avro.

          Change-Id: I09262d3a7b85a2263c721f3beafd0cab2a1bdf4b
          Reviewed-on: http://gerrit.cloudera.org:8080/3136

          bharathv bharath v added a comment -

          This fix created a minor regression that can affect loading of Avro tables created in older versions of Hive. It has been fixed in https://github.com/cloudera/Impala/commit/465b2829d66d2881206c0ea0399f3fd9788ea7ab .

          bharathv bharath v added a comment -

          The fix is not complete yet, especially for partitioned tables. I'll submit one shortly.

          bharathv bharath v added a comment - https://github.com/apache/incubator-impala/commit/bb633393775691807843a2b6bac28b1750c2c5da

            People

            • Assignee:
              bharathv bharath v
              Reporter:
              anujphadke Anuj Phadke