ORC-54

Evolve schemas based on field name rather than index

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: None
    • Labels: None

      Description

      Schema evolution as it stands today allows adding fields to the end of a schema or removing them from the end. However, because it matches columns by index, you can only ever add or remove columns, never both.

      ORC files have the full schema information of their contents, so there's actually enough metadata to support changing columns anywhere in the schema.
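      For illustration, a minimal sketch of what name-based evolution enables, using the ORC Java reader API. The schemas and file path are hypothetical:

      ```java
      // Minimal sketch (hypothetical schemas and path). Suppose the file was
      // written as struct<id:bigint,name:string,ts:timestamp>; the reader drops
      // a middle column and appends a new one -- legal when matching is by
      // field name, impossible when matching is by column index.
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.orc.OrcFile;
      import org.apache.orc.Reader;
      import org.apache.orc.RecordReader;
      import org.apache.orc.TypeDescription;

      public class EvolveByName {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          TypeDescription readerSchema = TypeDescription.fromString(
              "struct<id:bigint,ts:timestamp,tag:string>");
          Reader reader = OrcFile.createReader(new Path("/data/example.orc"),
              OrcFile.readerOptions(conf));
          // Ask the reader to reconcile the file schema with this reader schema.
          RecordReader rows = reader.rows(new Reader.Options().schema(readerSchema));
          rows.close();
        }
      }
      ```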


          Activity

          Mark Wagner added a comment -

          Files written by Hive with a writer version earlier than HIVE_4243 don't have the fix from HIVE-4243, which provides the necessary field-name metadata. If a file without the proper metadata is encountered, we'll need to either warn and fall back to the existing behavior, or be strict and throw an exception.

          Matt McCline added a comment -

          Mark Wagner Owen O'Malley Gunther Hagleitner I need to be involved in this.

          ASF GitHub Bot added a comment -

          GitHub user wagnermarkd opened a pull request:

          https://github.com/apache/orc/pull/40

          ORC-54: Evolve schemas based on field name rather than index

          Change evolution to compare column names when building the mapping.

          Misc notes:

          • I've added a setting orc.tolerate.missing.schema to allow working with data written before HIVE-4243 was fixed. This is important for anyone who has older data, though it limits the evolutions that can be supported (see the sketch after this list).
          • A lot of changes in RecordReaderImpl to make the reader-column vs. writer-column distinction explicit.
          • SARG tests are passing, but there's no test coverage for SARG + evolution. Will work on that.
          • Removed some unneeded checked exceptions.
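          A hedged sketch of flipping the new setting via a plain Hadoop Configuration; the key string comes from this PR, and the file path is made up:

          ```java
          // Sketch: enable best-effort, position-based evolution for files
          // written before HIVE-4243, which carry synthetic _colN names
          // instead of the real field names.
          import org.apache.hadoop.conf.Configuration;

          public class TolerateMissingSchema {
            public static void main(String[] args) {
              Configuration conf = new Configuration();
              conf.setBoolean("orc.tolerate.missing.schema", true);
              System.out.println(conf.get("orc.tolerate.missing.schema"));
            }
          }
          ```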

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/wagnermarkd/orc ORC-54-public

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/orc/pull/40.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #40


          commit c24200900e55437a242034eda6a5cd5122cc3acc
          Author: Mark Wagner <mwagner@apache.org>
          Date: 2016-05-18T21:24:16Z

          ORC-54: Evolve schemas based on field name rather than index


          Mark Wagner added a comment -

          Matt McCline, I've just posted a PR. Let me know your thoughts. Thanks!

          Matt McCline added a comment - (edited)

          Thanks for the heads up. Big change; it seems to be on a direct collision course with HIVE-13974, where the reader schema needs to be used for determining the included array. I need to understand how this relates to Hive. Will Hive be able to read these tables?

          cc Gunther Hagleitner, Owen O'Malley

          Mark Wagner added a comment -

          I'm reading through your patch on HIVE-13974; I wasn't aware of that work. I still need to read more of it to grok it, but I agree there's overlap: both patches make the distinction around the included array explicit.

          I don't see a reason Hive shouldn't be able to read these datasets as long as the OrcSerde declares the reader schema in the Options. We use Hive for reading ORC, so that's definitely something I'm paying attention to.

          ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/40#discussion_r69777020

          — Diff: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java —
          @@ -27,21 +27,7 @@
          import java.util.List;
          import java.util.Map;

          -import org.apache.orc.BooleanColumnStatistics;
          -import org.apache.orc.Reader;
          -import org.apache.orc.RecordReader;
          -import org.apache.orc.TypeDescription;
          -import org.apache.orc.ColumnStatistics;
          -import org.apache.orc.CompressionCodec;
          -import org.apache.orc.DataReader;
          -import org.apache.orc.DateColumnStatistics;
          -import org.apache.orc.DecimalColumnStatistics;
          -import org.apache.orc.DoubleColumnStatistics;
          -import org.apache.orc.IntegerColumnStatistics;
          -import org.apache.orc.OrcConf;
          -import org.apache.orc.StringColumnStatistics;
          -import org.apache.orc.StripeInformation;
          -import org.apache.orc.TimestampColumnStatistics;
          +import org.apache.orc.*;
          — End diff –

          Please configure your IDE to not replace imports with wildcards.

          ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/40#discussion_r69777633

          — Diff: java/core/src/java/org/apache/orc/impl/ReaderImpl.java —
          @@ -572,7 +572,14 @@ public RecordReader rows(Options options) throws IOException {
               boolean[] include = options.getInclude();
               // if included columns is null, then include all columns
               if (include == null) {
          -      include = new boolean[types.size()];
          +      int size;
          +      TypeDescription readSchema = options.getSchema();
          +      if (readSchema != null) {
          +        size = readSchema.getMaximumId() + 1;
          — End diff –

          I'd suggest doing
          ```
          if (readSchema == null) {
            readSchema = schema;
          }
          ```

          ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/40#discussion_r69778493

          — Diff: java/core/src/java/org/apache/orc/impl/SchemaEvolution.java —
          @@ -18,25 +18,33 @@

           package org.apache.orc.impl;

          -import java.io.IOException;
           import java.util.ArrayList;
           import java.util.HashMap;
           import java.util.List;
           import java.util.Map;
          +import java.util.regex.Pattern;

           import org.apache.orc.TypeDescription;
           import org.slf4j.Logger;
           import org.slf4j.LoggerFactory;

           /**
          - * Take the file types and the (optional) configuration column names/types and see if there
          - * has been schema evolution.
          + * Infer and track the evolution between the schema as stored in the file and
          + * the schema that has been requested by the reader.
            */
           public class SchemaEvolution {
          +
          +  public static class IllegalEvolutionException extends RuntimeException {
          +    public IllegalEvolutionException(String msg) {
          +      super(msg);
          +    }
          +  }
          +
             private final Map<Integer, TypeDescription> readerToFile;
             private final boolean[] included;
             private final TypeDescription readerSchema;
             private static final Logger LOG = LoggerFactory.getLogger(SchemaEvolution.class);
          +  private static final Pattern missingMetadataPattern = Pattern.compile("_col\\d+");
          — End diff –

          You should check for WriterVersion < HIVE_4243 before bothering to check the column names. The vast majority of files from before HIVE-4243 have the synthetic column names.
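          A hedged sketch of that gate; Reader.getWriterVersion() and the HIVE_4243 constant are real ORC API, while the includes() ordering comparison is an assumption about how writer versions are compared:

          ```java
          // Sketch only: skip the synthetic-name regex when the writer is new
          // enough. Writers at or after HIVE-4243 recorded real field names, so
          // only older files need the _colN pattern check.
          import org.apache.orc.OrcFile;
          import org.apache.orc.Reader;

          public class WriterVersionGate {
            static boolean mayHaveSyntheticNames(Reader reader) {
              // includes() is assumed to mean "this version is at or after the
              // given version".
              return !reader.getWriterVersion().includes(OrcFile.WriterVersion.HIVE_4243);
            }
          }
          ```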

          ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/40#discussion_r69779322

          — Diff: java/core/src/java/org/apache/orc/OrcConf.java —
          @@ -82,6 +82,12 @@
                 "If ORC reader encounters corrupt data, this value will be used to\n" +
                 "determine whether to skip the corrupt data or throw exception.\n" +
                 "The default behavior is to throw exception."),
          +  TOLERATE_MISSING_SCHEMA("orc.tolerate.missing.schema",
          +      "hive.exec.orc.tolerate.missing.schema",
          +      true,
          +      "Writers earlier than HIVE-4243 may have inaccurate schema metadata.\n"
          +          + "This setting will enable best effort schema evolution rather\n"
          — End diff –

          I'd suggest that the comment include something about using position in the top-level structure as the fallback, and that it supports adding new columns to the end and changing types, but not deleting or reordering the top-level columns.
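          An illustration with made-up schemas of what the positional fallback can and cannot reconcile, matching the suggested wording:

          ```java
          // Hypothetical schemas: a pre-HIVE-4243 file keeps only synthetic
          // top-level names, so the fallback matches top-level columns by position.
          import org.apache.orc.TypeDescription;

          public class PositionalFallback {
            public static void main(String[] args) {
              TypeDescription file =
                  TypeDescription.fromString("struct<_col0:int,_col1:string>");
              // Supported by position: append a column and/or convert a type.
              TypeDescription ok =
                  TypeDescription.fromString("struct<a:bigint,b:string,c:double>");
              // Not supported: deleting or reordering top-level columns would
              // silently misalign the data.
              TypeDescription bad =
                  TypeDescription.fromString("struct<b:string,a:bigint>");
              System.out.println(file + " -> " + ok + " : reconcilable by position");
              System.out.println(file + " -> " + bad + " : not reconcilable");
            }
          }
          ```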

          ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/40#discussion_r69783819

          — Diff: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java —
          @@ -933,7 +926,7 @@ void createStreams(List<OrcProto.Stream> streamDescriptions,
               for (OrcProto.Stream streamDesc : streamDescriptions) {
                 int column = streamDesc.getColumn();
                 if ((includeColumn != null &&
          -          (column < included.length && !includeColumn[column])) ||
          +          (column < includeColumn.length && !includeColumn[column])) ||
          — End diff –

          should this be writerIncluded?

          ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/40#discussion_r69784134

          — Diff: java/core/src/java/org/apache/orc/impl/SchemaEvolution.java —
          @@ -45,16 +53,65 @@ public SchemaEvolution(TypeDescription readerSchema, boolean[] included) {
             }

             public SchemaEvolution(TypeDescription fileSchema,
          -                         TypeDescription readerSchema,
          -                         boolean[] included) throws IOException {
          +                         TypeDescription readerSchema,
          +                         boolean allowMissingMetadata,
          +                         boolean[] included) {
               readerToFile = new HashMap<>(readerSchema.getMaximumId() + 1);
               this.included = included;
               if (checkAcidSchema(fileSchema)) {
                 this.readerSchema = createEventSchema(readerSchema);
               } else {
                 this.readerSchema = readerSchema;
               }
          -    buildMapping(fileSchema, this.readerSchema);
          +
          +    boolean useFieldNames;
          +    if (!hasColumnNames(fileSchema)) {
          +      if (!allowMissingMetadata && !fileSchema.equals(readerSchema)) {
          +        throw new RuntimeException("Found that schema metadata is missing from file. This is likely caused by a writer earlier than HIVE-4243. Will not try to reconcile schemas");
          +      } else {
          +        LOG.warn("Found that schema metadata is missing from file. This is likely caused by a writer earlier than HIVE-4243. Will try to reconcile schemas based on index");
          — End diff –

          How about "Found that column names are missing from file, which is caused by ORC writers before HIVE-4243."

          ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/40#discussion_r69784363

          — Diff: java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java —
          @@ -201,6 +201,10 @@ public void nextVector(ColumnVector previous,

             public BitFieldReader getPresent() {
               return present;
             }
          +
          — End diff –

          What do you think about moving the building of writerIncluded into the SchemaEvolution class? I think it would fit naturally there, and building it would fall out of buildMapping.
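          A hedged sketch of the suggested refactor; all names here are illustrative, not the patch itself:

          ```java
          // Sketch: SchemaEvolution computes the file-side (writer) included
          // array while building the reader-to-file mapping, instead of
          // TreeReaderFactory deriving it separately.
          import java.util.HashMap;
          import java.util.Map;

          public class EvolutionSketch {
            private final boolean[] fileIncluded;           // indexed by file column id
            private final Map<Integer, Integer> readerToFile = new HashMap<>();

            EvolutionSketch(int fileColumnCount) {
              fileIncluded = new boolean[fileColumnCount];
            }

            // Called whenever a reader column is mapped to a file column, so the
            // writer-side inclusion falls out of building the mapping.
            void mapColumn(int readerId, int fileId) {
              readerToFile.put(readerId, fileId);
              fileIncluded[fileId] = true;
            }

            boolean[] getFileIncluded() {
              return fileIncluded;
            }
          }
          ```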

          ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/40#discussion_r69785443

          — Diff: java/core/src/java/org/apache/orc/impl/SchemaEvolution.java —
          @@ -85,55 +142,78 @@ void buildMapping(TypeDescription fileType,
               // check the easy case first
               if (fileType.getCategory() == readerType.getCategory()) {
                 switch (readerType.getCategory()) {
          -      case BOOLEAN:
          -      case BYTE:
          -      case SHORT:
          -      case INT:
          -      case LONG:
          -      case DOUBLE:
          -      case FLOAT:
          -      case STRING:
          -      case TIMESTAMP:
          -      case BINARY:
          -      case DATE:
          -        // these are always a match
          -        break;
          -      case CHAR:
          -      case VARCHAR:
          -        // HIVE-13648: Look at ORC data type conversion edge cases (CHAR, VARCHAR, DECIMAL)
          -        isOk = fileType.getMaxLength() == readerType.getMaxLength();
          -        break;
          -      case DECIMAL:
          -        // HIVE-13648: Look at ORC data type conversion edge cases (CHAR, VARCHAR, DECIMAL)
          -        // TODO we don't enforce scale and precision checks, but probably should
          -        break;
          -      case UNION:
          -      case MAP:
          -      case LIST: {
          -        // these must be an exact match
          -        List<TypeDescription> fileChildren = fileType.getChildren();
          -        List<TypeDescription> readerChildren = readerType.getChildren();
          -        if (fileChildren.size() == readerChildren.size()) {
          -          for(int i=0; i < fileChildren.size(); ++i) {
          -            buildMapping(fileChildren.get(i), readerChildren.get(i));
          -          }
          -        } else {
          -          isOk = false;
          +        case BOOLEAN:
          +        case BYTE:
          +        case SHORT:
          +        case INT:
          +        case LONG:
          +        case DOUBLE:
          +        case FLOAT:
          +        case STRING:
          +        case TIMESTAMP:
          +        case BINARY:
          +        case DATE:
          +          // these are always a match
          +          break;
          +        case CHAR:
          +        case VARCHAR:
          +          // HIVE-13648: Look at ORC data type conversion edge cases (CHAR, VARCHAR, DECIMAL)
          +          isOk = fileType.getMaxLength() == readerType.getMaxLength();
          +          break;
          +        case DECIMAL:
          +          // HIVE-13648: Look at ORC data type conversion edge cases (CHAR, VARCHAR, DECIMAL)
          +          // TODO we don't enforce scale and precision checks, but probably should
          +          break;
          +        case UNION:
          +        case MAP:
          +        case LIST: {
          +          // these must be an exact match
          +          List<TypeDescription> fileChildren = fileType.getChildren();
          +          List<TypeDescription> readerChildren = readerType.getChildren();
          +          if (fileChildren.size() == readerChildren.size()) {
          +            for (int i = 0; i < fileChildren.size(); ++i) {
          +              buildMapping(fileChildren.get(i), readerChildren.get(i), useFieldNames);
          — End diff –

          When you are recursing, useFieldNames should always be true. Prior to HIVE-4243, only the top-level column names were lost; in nested types the field names were correct.
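          A standalone sketch (hypothetical names, not the patch itself) of that rule: positional matching only at the top level, by-name matching everywhere below it:

          ```java
          // Sketch: match top-level struct fields by position when names are
          // synthetic, but always match nested fields by name.
          import java.util.Arrays;
          import java.util.List;

          public class MatchSketch {
            // Returns the index in fileNames that a reader field maps to, or -1.
            static int mapField(List<String> fileNames, String readerField,
                                int readerPos, boolean useFieldNames) {
              if (useFieldNames) {
                return fileNames.indexOf(readerField);   // by-name lookup
              }
              return readerPos < fileNames.size() ? readerPos : -1;  // positional
            }

            public static void main(String[] args) {
              List<String> oldTop = Arrays.asList("_col0", "_col1"); // pre-HIVE-4243
              // Top level falls back to position...
              System.out.println(mapField(oldTop, "name", 1, false)); // 1
              // ...but nested structs kept their real names, so recurse with true.
              List<String> nested = Arrays.asList("street", "zip");
              System.out.println(mapField(nested, "zip", 0, true));   // 1
            }
          }
          ```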

          ASF GitHub Bot added a comment -

          Github user omalley commented on a diff in the pull request:

          https://github.com/apache/orc/pull/40#discussion_r69785648

          — Diff: java/core/src/java/org/apache/orc/impl/SchemaEvolution.java —
          @@ -85,55 +142,78 @@ void buildMapping(TypeDescription fileType,
               // check the easy case first
               if (fileType.getCategory() == readerType.getCategory()) {
                 switch (readerType.getCategory()) {
          -      case BOOLEAN:
          -      case BYTE:
          -      case SHORT:
          -      case INT:
          -      case LONG:
          -      case DOUBLE:
          -      case FLOAT:
          -      case STRING:
          -      case TIMESTAMP:
          -      case BINARY:
          -      case DATE:
          -        // these are always a match
          -        break;
          -      case CHAR:
          -      case VARCHAR:
          -        // HIVE-13648: Look at ORC data type conversion edge cases (CHAR, VARCHAR, DECIMAL)
          -        isOk = fileType.getMaxLength() == readerType.getMaxLength();
          -        break;
          -      case DECIMAL:
          -        // HIVE-13648: Look at ORC data type conversion edge cases (CHAR, VARCHAR, DECIMAL)
          -        // TODO we don't enforce scale and precision checks, but probably should
          -        break;
          -      case UNION:
          -      case MAP:
          -      case LIST: {
          -        // these must be an exact match
          -        List<TypeDescription> fileChildren = fileType.getChildren();
          -        List<TypeDescription> readerChildren = readerType.getChildren();
          -        if (fileChildren.size() == readerChildren.size()) {
          -          for(int i=0; i < fileChildren.size(); ++i) {
          -            buildMapping(fileChildren.get(i), readerChildren.get(i));
          -          }
          -        } else {
          -          isOk = false;
          -        }
          -        break;
          -      }
          -      case STRUCT: {
          -        // allow either side to have fewer fields than the other
          -        List<TypeDescription> fileChildren = fileType.getChildren();
          -        List<TypeDescription> readerChildren = readerType.getChildren();
          -        int jointSize = Math.min(fileChildren.size(), readerChildren.size());
          -        for(int i=0; i < jointSize; ++i) {
          -          buildMapping(fileChildren.get(i), readerChildren.get(i));
          +        case BOOLEAN:
          +        case BYTE:
          +        case SHORT:
          +        case INT:
          +        case LONG:
          +        case DOUBLE:
          +        case FLOAT:
          +        case STRING:
          +        case TIMESTAMP:
          +        case BINARY:
          +        case DATE:
          +          // these are always a match
          +          break;
          +        case CHAR:
          +        case VARCHAR:
          +          // HIVE-13648: Look at ORC data type conversion edge cases (CHAR, VARCHAR, DECIMAL)
          +          isOk = fileType.getMaxLength() == readerType.getMaxLength();
          +          break;
          +        case DECIMAL:
          +          // HIVE-13648: Look at ORC data type conversion edge cases (CHAR, VARCHAR, DECIMAL)
          +          // TODO we don't enforce scale and precision checks, but probably should
          +          break;
          +        case UNION:
          +        case MAP:
          +        case LIST: {
          +          // these must be an exact match
          +          List<TypeDescription> fileChildren = fileType.getChildren();
          +          List<TypeDescription> readerChildren = readerType.getChildren();
          +          if (fileChildren.size() == readerChildren.size()) {
          +            for (int i = 0; i < fileChildren.size(); ++i) {
          +              buildMapping(fileChildren.get(i), readerChildren.get(i), useFieldNames);
          +            }
          +          } else {
          +            isOk = false;
          +          }
          +          break;
          +        }
          +        case STRUCT: {
          +          List<TypeDescription> readerChildren = readerType.getChildren();
          +          List<String> readerFieldNames = readerType.getFieldNames();
          +
          +          List<String> fileFieldNames = fileType.getFieldNames();
          +          List<TypeDescription> fileChildren = fileType.getChildren();
          +
          +          if (useFieldNames) {
          +            Map<String, TypeDescription> fileTypesIdx = new HashMap<>();
          +            for (int i = 0; i < fileFieldNames.size(); i++) {
          +              fileTypesIdx.put(fileFieldNames.get(i), fileChildren.get(i));
          +            }
          +
          +            for (int i = 0; i < readerFieldNames.size(); i++) {
          +              String readerFieldName = readerFieldNames.get(i);
          +              TypeDescription readerField = readerChildren.get(i);
          +
          +              TypeDescription fileField = fileTypesIdx.get(readerFieldName);
          +              if (fileField == null) {
          +                continue;
          +              }
          +
          +              buildMapping(fileField, readerField, true);
          +            }
          +          } else {
          +            int jointSize = Math.min(fileChildren.size(), readerChildren.size());
          +            for (int i = 0; i < jointSize; ++i) {
          +              buildMapping(fileChildren.get(i), readerChildren.get(i), false);
          — End diff –

          This should also be true, to match field names below the root.

          ASF GitHub Bot added a comment -

          Github user omalley commented on the issue:

          https://github.com/apache/orc/pull/40

          This is looking good. I rebased it to the current trunk and undid some of the spacing changes. My version is here:
          https://github.com/omalley/orc/tree/orc-54

          I also left some comments in the pull request.

          ASF GitHub Bot added a comment -

          GitHub user omalley opened a pull request:

          https://github.com/apache/orc/pull/55

          ORC-54: Evolve schemas based on field name rather than index

          This is an updated version of Mark's patch that fixes evolution of ACID files and rebases it to the current trunk.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/omalley/orc orc-54

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/orc/pull/55.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #55


          commit e37bc797562f4c219bb733390f8e17a79d7ecec8
          Author: Mark Wagner <mwagner@apache.org>
          Date: 2016-08-16T22:30:51Z

          ORC-54: Evolve schemas based on field name rather than index


          ASF GitHub Bot added a comment -

          Github user prasanthj commented on a diff in the pull request:

          https://github.com/apache/orc/pull/55#discussion_r75038675

          — Diff: java/core/src/java/org/apache/orc/impl/SchemaEvolution.java —
          @@ -20,59 +20,132 @@

           import java.util.ArrayList;
           import java.util.Arrays;
          +import java.util.HashMap;
           import java.util.List;
          +import java.util.Map;
          +import java.util.regex.Pattern;

          +import org.apache.orc.Reader;
           import org.apache.orc.TypeDescription;
           import org.slf4j.Logger;
           import org.slf4j.LoggerFactory;

           /**
          - * Take the file types and the (optional) configuration column names/types and
          - * see if there has been schema evolution.
          + * Infer and track the evolution between the schema as stored in the file and
          + * the schema that has been requested by the reader.
            */
           public class SchemaEvolution {
             // indexed by reader column id
             private final TypeDescription[] readerFileTypes;
             // indexed by reader column id
          -  private final boolean[] included;
          +  private final boolean[] readerIncluded;
          +  // indexed by file column id
          +  private final boolean[] fileIncluded;
             private final TypeDescription fileSchema;
             private final TypeDescription readerSchema;
             private boolean hasConversion = false;
          +  private final boolean isAcid;
          +
             // indexed by reader column id
             private final boolean[] ppdSafeConversion;

          -  public SchemaEvolution(TypeDescription fileSchema, boolean[] includedCols) {
          -    this(fileSchema, null, includedCols);
          +  private static final Logger LOG =
          +      LoggerFactory.getLogger(SchemaEvolution.class);
          +  private static final Pattern missingMetadataPattern =
          +      Pattern.compile("_col\\d+");
          +
          +  public static class IllegalEvolutionException extends RuntimeException {
          +    public IllegalEvolutionException(String msg) {
          +      super(msg);
          +    }
          +  }
          +
          +  public SchemaEvolution(TypeDescription fileSchema,
          +                         Reader.Options options) {
          +    this(fileSchema, null, options);
             }

             public SchemaEvolution(TypeDescription fileSchema,
                                    TypeDescription readerSchema,
          -                         boolean[] includedCols) {
          -    this.included = includedCols == null ? null :
          +                         Reader.Options options) {
          +    boolean allowMissingMetadata = options.getTolerateMissingSchema();
          +    boolean[] includedCols = options.getInclude();
          +    this.readerIncluded = includedCols == null ? null :
                   Arrays.copyOf(includedCols, includedCols.length);
          +    this.fileIncluded = new boolean[fileSchema.getMaximumId() + 1];
               this.hasConversion = false;
               this.fileSchema = fileSchema;
          +    isAcid = checkAcidSchema(fileSchema);
               if (readerSchema != null) {
          -      if (checkAcidSchema(fileSchema)) {
          +      if (isAcid) {
                   this.readerSchema = createEventSchema(readerSchema);
                 } else {
                   this.readerSchema = readerSchema;
                 }
          -      this.readerFileTypes = new TypeDescription[this.readerSchema.getMaximumId() + 1];
          -      buildConversionFileTypesArray(fileSchema, this.readerSchema);
          +      this.readerFileTypes =
          +          new TypeDescription[this.readerSchema.getMaximumId() + 1];
          +      int positionalLevels = 0;
          +      if (!hasColumnNames(isAcid ? getBaseRow(fileSchema) : fileSchema)) {
          +        if (!this.fileSchema.equals(this.readerSchema)) {
          +          if (!allowMissingMetadata) {
          +            throw new RuntimeException("Found that schema metadata is missing"
          +                + " from file. This is likely caused by"
          +                + " a writer earlier than HIVE-4243. Will"
          +                + " not try to reconcile schemas");
          +          } else {
          +            LOG.warn("Column names are missing from this file. This is"
          +                + " caused by a writer earlier than HIVE-4243. The reader will"
          +                + " reconcile schemas based on index. File type: " +
          +                this.fileSchema + ", reader type: " + this.readerSchema);
          +            positionalLevels = isAcid ? 2 : 1;
          — End diff –

          What does positional level mean? Is it the real row level?
          Does the acid file schema look like struct<[acid_cols], struct<[real_cols]>>? If so, can you leave a comment about it?
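          For reference, a sketch of the ACID event schema that wraps the user row, which is why an ACID file needs two positional levels (the wrapper struct, then the row struct). The ACID field names match the ORC ACID layout; the user columns are made up:

          ```java
          // Illustration of the ACID wrapper: metadata columns plus a nested
          // "row" struct holding the actual user columns.
          import org.apache.orc.TypeDescription;

          public class AcidSchema {
            public static void main(String[] args) {
              TypeDescription acid = TypeDescription.fromString(
                  "struct<operation:int,originalTransaction:bigint,bucket:int," +
                  "rowId:bigint,currentTransaction:bigint," +
                  "row:struct<a:int,b:string>>");
              System.out.println(acid);
            }
          }
          ```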

          ASF GitHub Bot added a comment -

          Github user prasanthj commented on a diff in the pull request:

          https://github.com/apache/orc/pull/55#discussion_r75038709

          — Diff: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java —
          @@ -853,6 +862,15 @@ private void readStripe() throws IOException {
               }
             }

          +  private boolean isFullRead() {
          +    for (Boolean isColumnPresent : writerIncluded) {
          — End diff –

          nit: boolean

          Prasanth Jayachandran added a comment -

          Rebased patch lgtm, +1

          ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/orc/pull/55

          ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/orc/pull/40

          Owen O'Malley added a comment -

          I just committed this. Thanks, Mark!

          Lefty Leverenz added a comment -

          Should this be documented in the wiki?

          Lefty Leverenz added a comment -

          At least the new ORC configuration parameter (orc.tolerate.missing.schema) needs to be documented. But what about the field name functionality?

          Owen O'Malley added a comment -

          Released as part of ORC 1.2.0.


  People

  • Assignee: Mark Wagner
  • Reporter: Mark Wagner
  • Votes: 0
  • Watchers: 8