IMPALA-1947

Avro cols may load incorrectly if schema inconsistent with StorageDescriptor

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.1, Impala 2.2
    • Fix Version/s: Impala 2.3.0
    • None

    Description

      In some corner cases, an Avro table may be loaded from the HMS incorrectly such that string columns may be loaded with other types.

      This can happen when the Avro schema's column definitions are inconsistent with the column definitions in the HMS StorageDescriptor. Such an inconsistency is relatively hard to produce, but a bug in the Kite SDK (CDK-974) did trigger this issue; that bug has since been fixed (https://github.com/kite-sdk/kite/pull/347/files). Unfortunately, I have not been able to reproduce this easily with just Impala/Hive.

      In HdfsTable.java:

      // Load the fields from the Avro schema.
      // Since Avro does not include meta-data for CHAR or VARCHAR, an Avro type of
      // "string" is used for CHAR, VARCHAR and STRING. Default back to the storage
      // descriptor to determine the type for "string".
      List<FieldSchema> sdTypes = msTbl.getSd().getCols();
      int i = 0;
      List<Column> avroTypeList = AvroSchemaParser.parse(avroSchema_);
      boolean canFallBack = sdTypes.size() == avroTypeList.size();
      for (Column parsedCol: avroTypeList) {
        FieldSchema fs = new FieldSchema();
        fs.setName(parsedCol.getName());
        String avroType = parsedCol.getType().toSql();
        if (avroType.toLowerCase().equals("string") && canFallBack) {
          // check col names match and sdType is string/char/varchar
          // parsedCol.getName().equalsIgnoreCase(sdTypes.get(i).getName())
          fs.setType(sdTypes.get(i).getType());
        } else {
          fs.setType(avroType);
        }
        fs.setComment("from deserializer");
        tblFields.add(fs);
        i++;
      }
      
      

      We can't simply "fall back" to the SD type whenever the number of columns happens to be the same in the Avro schema and the SD. For each column, we should also check that the Avro column name matches the SD column name and that the SD column type is STRING, CHAR, or VARCHAR.
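
      A minimal sketch of the stricter check described above. This is not the patch that resolved the issue; it is a standalone illustration. The class AvroSdTypeResolver, its nested Col type, and the helper isStringLike are hypothetical names that stand in for the real Column/FieldSchema handling in HdfsTable.java, and type strings are assumed to look like "string", "char(5)", or "varchar(10)".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, self-contained illustration; not Impala code.
public class AvroSdTypeResolver {
  /** Minimal stand-in for a column definition: a name plus a SQL type string. */
  public static final class Col {
    public final String name;
    public final String type;
    public Col(String name, String type) {
      this.name = name;
      this.type = type;
    }
  }

  /** Returns true if the SD type is one that Avro can only express as "string". */
  static boolean isStringLike(String sdType) {
    String t = sdType.trim().toLowerCase(Locale.ROOT);
    return t.equals("string") || t.startsWith("char(") || t.startsWith("varchar(");
  }

  /**
   * Resolves the final column types. An Avro "string" column takes the SD type
   * only if the column counts match, the names match at the same position, and
   * the SD type is STRING, CHAR, or VARCHAR. Otherwise the Avro type is kept.
   */
  public static List<Col> resolve(List<Col> avroCols, List<Col> sdCols) {
    boolean sameSize = avroCols.size() == sdCols.size();
    List<Col> result = new ArrayList<>();
    for (int i = 0; i < avroCols.size(); i++) {
      Col avro = avroCols.get(i);
      String type = avro.type;
      if (sameSize && avro.type.equalsIgnoreCase("string")) {
        Col sd = sdCols.get(i);
        if (avro.name.equalsIgnoreCase(sd.name) && isStringLike(sd.type)) {
          type = sd.type;
        }
      }
      result.add(new Col(avro.name, type));
    }
    return result;
  }
}
```

      With this check, an Avro "string" column whose SD counterpart has a different name or a non-string type (for example "int") keeps the Avro type instead of silently inheriting the SD type.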

          People

            alex.behm Alexander Behm
            mjacobs Matthew Jacobs