IMPALA-1947

Avro cols may load incorrectly if schema inconsistent with StorageDescriptor

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.1, Impala 2.2
    • Fix Version/s: Impala 2.3.0
    • Component/s: None

      Description

      In some corner cases, an Avro table may be loaded from the Hive Metastore (HMS) incorrectly, so that columns declared as string in the Avro schema end up with a different type.

      This can happen when the Avro schema's column definitions are inconsistent with the HMS StorageDescriptor's column definitions. This is relatively hard to do, but a bug in the Kite SDK (CDK-974) did produce this issue (it has since been fixed: https://github.com/kite-sdk/kite/pull/347/files). Unfortunately, I have not been able to reproduce this easily with just Impala/Hive.

      In HdfsTable.java:

      // Load the fields from the Avro schema.
      // Since Avro does not include meta-data for CHAR or VARCHAR, an Avro type of
      // "string" is used for CHAR, VARCHAR and STRING. Default back to the storage
      // descriptor to determine the type for "string".
      List<FieldSchema> sdTypes = msTbl.getSd().getCols();
      int i = 0;
      List<Column> avroTypeList = AvroSchemaParser.parse(avroSchema_);
      boolean canFallBack = sdTypes.size() == avroTypeList.size();
      for (Column parsedCol: avroTypeList) {
        FieldSchema fs = new FieldSchema();
        fs.setName(parsedCol.getName());
        String avroType = parsedCol.getType().toSql();
        if (avroType.toLowerCase().equals("string") && canFallBack) {
          // check col names match and sdType is string/char/varchar
          // parsedCol.getName().equalsIgnoreCase(sdTypes.get(i).getName())
          fs.setType(sdTypes.get(i).getType());
        } else {
          fs.setType(avroType);
        }
        fs.setComment("from deserializer");
        tblFields.add(fs);
        i++;
      }
      
      

      We can't simply fall back to the SD type whenever the number of columns in the Avro schema equals the number in the SD. We should also check that the column names are the same and that the SD column type is STRING, CHAR, or VARCHAR.
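      A minimal sketch of the stricter check proposed above, not the actual patch. ColDef and AvroColumnResolver are hypothetical stand-ins for Impala's Column and the HMS FieldSchema, and the sketch assumes SD types arrive as lowercase Hive type strings such as "string", "char(5)" or "varchar(10)". The SD type is used only when the column counts match, the names match (ignoring case), and the SD type is string-like; otherwise the Avro type is kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for a column definition: a name plus a SQL type string. */
final class ColDef {
  final String name;
  final String type;

  ColDef(String name, String type) {
    this.name = name;
    this.type = type;
  }
}

/** Hypothetical helper illustrating the guarded fallback to the SD type. */
final class AvroColumnResolver {
  private AvroColumnResolver() {}

  /** True if the SD type is one that Avro can only represent as "string". */
  static boolean isStringLike(String sdType) {
    String t = sdType.trim().toLowerCase(Locale.ROOT);
    return t.equals("string")
        || t.equals("char") || t.startsWith("char(")
        || t.equals("varchar") || t.startsWith("varchar(");
  }

  /** Resolves the final column types from the Avro columns and the SD columns. */
  static List<ColDef> resolve(List<ColDef> avroCols, List<ColDef> sdCols) {
    boolean sameSize = avroCols.size() == sdCols.size();
    List<ColDef> result = new ArrayList<>();
    for (int i = 0; i < avroCols.size(); i++) {
      ColDef avro = avroCols.get(i);
      String type = avro.type;
      if (sameSize && avro.type.equalsIgnoreCase("string")) {
        ColDef sd = sdCols.get(i);
        // Fall back only if the names agree and the SD type is STRING/CHAR/VARCHAR.
        if (avro.name.equalsIgnoreCase(sd.name) && isStringLike(sd.type)) {
          type = sd.type;
        }
      }
      result.add(new ColDef(avro.name, type));
    }
    return result;
  }
}
```

      With this guard, an SD whose columns were renamed, reordered, or retyped relative to the Avro schema no longer overrides an Avro "string" column; the column simply keeps the Avro type.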

            People

            • Assignee: Alexander Behm (alex.behm)
            • Reporter: Matthew Jacobs (mjacobs)