Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-1024

Impala BE cannot parse Avro schema that contains a trailing semi-colon

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: Impala 1.3.1
    • Fix Version/s: None
    • Component/s: Backend
    • Labels:

      Description

      The Impala BE cannot parse Avro schema that contain a trailing semi-colon, failing with error:

      ERRORS: 
      Backend 0:Failed to parse table schema: Error parsing JSON: end of file expected near ';' 
      Failed to parse table schema: Error parsing JSON: end of file expected near ';'
      

      Hive is able to properly query these tables.

        Activity

        Hide
        jbapple Jim Apple added a comment -

        Lenni Kuff, I'm trying to reproduce this. How did you generate an Avro file with a trailing semi-colon in the schema?

        Show
        jbapple Jim Apple added a comment - Lenni Kuff , I'm trying to reproduce this. How did you generate an Avro file with a trailing semi-colon in the schema?
        Hide
        csringhofer Csaba Ringhofer added a comment - - edited

        A way to reproduce this issue:

        1. create a table with a trailing semi-colon in the avro schema from Hive or Impala:
        CREATE TABLE t (col INT)
        STORED AS AVRO
        TBLPROPERTIES ('avro.schema.literal'='{
        "name": "t",
        "type": "record",
        "fields": [

        {"name": "col", "type": "int"}

        ]};');

        2. insert some values from Hive:
        INSERT INTO t (col) VALUES (1);

        3. trying to query from Impala (SELECT * from t; ) leads to an error message:
        Failed to parse table schema: Error parsing JSON: end of file expected near ';'

        the same query from Hive returns the inserted rows

        Hiva actually ignores anything after the last closing '}', this will work as well:

        CREATE TABLE t (col INT)
        STORED AS AVRO
        TBLPROPERTIES ('avro.schema.literal'='{
        "name": "t",
        "type": "record",
        "fields": [

        {"name": "col", "type": "int"}

        ]}I can write here whatever I want');

        Show
        csringhofer Csaba Ringhofer added a comment - - edited A way to reproduce this issue: 1. create a table with a trailing semi-colon in the avro schema from Hive or Impala: CREATE TABLE t (col INT) STORED AS AVRO TBLPROPERTIES ('avro.schema.literal'='{ "name": "t", "type": "record", "fields": [ {"name": "col", "type": "int"} ]};'); 2. insert some values from Hive: INSERT INTO t (col) VALUES (1); 3. trying to query from Impala (SELECT * from t; ) leads to an error message: Failed to parse table schema: Error parsing JSON: end of file expected near ';' the same query from Hive returns the inserted rows Hiva actually ignores anything after the last closing '}', this will work as well: CREATE TABLE t (col INT) STORED AS AVRO TBLPROPERTIES ('avro.schema.literal'='{ "name": "t", "type": "record", "fields": [ {"name": "col", "type": "int"} ]}I can write here whatever I want');
        Hide
        csringhofer Csaba Ringhofer added a comment -

        Greg Rahn:
        What do you think, should Impala/Hive tolerate these avro schemes?

        In my opinion, it is better to be strict here - any other tool can try to read the avro scheme, and they are probably not prepared for these invalid json strings.

        Show
        csringhofer Csaba Ringhofer added a comment - Greg Rahn : What do you think, should Impala/Hive tolerate these avro schemes? In my opinion, it is better to be strict here - any other tool can try to read the avro scheme, and they are probably not prepared for these invalid json strings.
        Hide
        grahn Greg Rahn added a comment -

        How does Avro define a valid schema?
        Just because Hive accepts it, does not mean it's the right behavior. It seems either a schema is valid or not and only valid schemas should be accepted.

        Show
        grahn Greg Rahn added a comment - How does Avro define a valid schema? Just because Hive accepts it, does not mean it's the right behavior. It seems either a schema is valid or not and only valid schemas should be accepted.
        Hide
        csringhofer Csaba Ringhofer added a comment - - edited

        >How does Avro define a valid schema?

        An Avro schema should be valid JSON, so it should start with { and end with } + maybe some white space, but no semi-colon.
        On the other hand, JSON parsers often ignore trailing characters, because they parse from an element from { to its } pair, and then stop. This can be actually useful, for example in https://en.wikipedia.org/wiki/JSON_Streaming#Concatenated_JSON
        
        The cause of the issue is that the java lib used by Hive and Impala FE ignores trailing characters, while the C++ lib used by Impala  BE doesn't.
        I see 3 ways to fix this:
        1. make Hive/Impala FE stricter
        2. make Impala BE more tolerant
        3. accept JSONs with trailing characters, but only store the valid JSON part
        
        I would go for 1, but that could cause problems with already existing tables. 
        If these tables have to be supported, then 2 must be done, and optionally 1 or 3 to avoid similar problems in the future.
        
        Show
        csringhofer Csaba Ringhofer added a comment - - edited >How does Avro define a valid schema? An Avro schema should be valid JSON, so it should start with { and end with } + maybe some white space, but no semi-colon. On the other hand, JSON parsers often ignore trailing characters, because they parse from an element from { to its } pair, and then stop. This can be actually useful, for example in https://en.wikipedia.org/wiki/JSON_Streaming#Concatenated_JSON The cause of the issue is that the java lib used by Hive and Impala FE ignores trailing characters, while the C++ lib used by Impala BE doesn't. I see 3 ways to fix this: 1. make Hive/Impala FE stricter 2. make Impala BE more tolerant 3. accept JSONs with trailing characters, but only store the valid JSON part I would go for 1, but that could cause problems with already existing tables. If these tables have to be supported, then 2 must be done, and optionally 1 or 3 to avoid similar problems in the future.
        Hide
        grahn Greg Rahn added a comment -

        If Avro schemas should be valid JSON, then there seems to be no bug to fix here. I'm opposed to loosening the restriction to accept trailing garbage even if Hive does so.

        Show
        grahn Greg Rahn added a comment - If Avro schemas should be valid JSON, then there seems to be no bug to fix here. I'm opposed to loosening the restriction to accept trailing garbage even if Hive does so.
        Hide
        lv Lars Volker added a comment -

        Csaba Ringhofer - What does our documentation say on this? Do we already state that we require valid JSON?

        Show
        lv Lars Volker added a comment - Csaba Ringhofer - What does our documentation say on this? Do we already state that we require valid JSON?
        Hide
        lv Lars Volker added a comment -

        I checked and the documentation does not explicitly mention that the we don't allow trailing characters. Greg Rahn, John Russell - Should we be more explicit in the docs?

        Show
        lv Lars Volker added a comment - I checked and the documentation does not explicitly mention that the we don't allow trailing characters. Greg Rahn , John Russell - Should we be more explicit in the docs?
        Hide
        grahn Greg Rahn added a comment -

        We should clarify that then, highlighting only valid JSON is accepted.

        Show
        grahn Greg Rahn added a comment - We should clarify that then, highlighting only valid JSON is accepted.
        Hide
        csringhofer Csaba Ringhofer added a comment -

        I have found a similar but probably more important issue:
        Impala BE does not support comments in Avro schemas, while Impala FE and Hive has no problem with them:
        CREATE TABLE t (col INT) STORED AS AVRO
        TBLPROPERTIES ('avro.schema.literal'='{
        "name": "t", "type": "record", "fields": [

        {"name": "col", /* This is comment! */ "type": "int"}

        ]}');

        the CREATE above will succeed in Impala, but select * from t; will fail with the following error;
        " Failed to parse table schema: Error parsing JSON: string or '}' expected near '/ '"

        I had to rethink the issue because of this - for the trailing stuff, it seemed like that the c++ Avro parser is correct, and the java parser should work similarly, but enabling comments can be a useful feature, so it would make sense to change c++ parser (or as it is a 3rd party lib, maybe it would be easier to preprocess the string before giving it to the parser).

        Zoltan Ivanfi: what do you think about comments in Avro? Should Impala support them (as Hive does)?

        Back to the trailing characters:
        I think that this should be discussed / solved on Avro side.
        As I understand, parsing happens here: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/Schema.java#L1035
        ..
        return Schema.parse(MAPPER.readTree(parser), names);
        ..
        readTree() only reads till the closing }, so parser could be checked after this call with nextToken() - it should return NULL, if there are no trailing non white space characters

        Show
        csringhofer Csaba Ringhofer added a comment - I have found a similar but probably more important issue: Impala BE does not support comments in Avro schemas, while Impala FE and Hive has no problem with them: CREATE TABLE t (col INT) STORED AS AVRO TBLPROPERTIES ('avro.schema.literal'='{ "name": "t", "type": "record", "fields": [ {"name": "col", /* This is comment! */ "type": "int"} ]}'); the CREATE above will succeed in Impala, but select * from t; will fail with the following error; " Failed to parse table schema: Error parsing JSON: string or '}' expected near '/ '" I had to rethink the issue because of this - for the trailing stuff, it seemed like that the c++ Avro parser is correct, and the java parser should work similarly, but enabling comments can be a useful feature, so it would make sense to change c++ parser (or as it is a 3rd party lib, maybe it would be easier to preprocess the string before giving it to the parser). Zoltan Ivanfi : what do you think about comments in Avro? Should Impala support them (as Hive does)? Back to the trailing characters: I think that this should be discussed / solved on Avro side. As I understand, parsing happens here: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/Schema.java#L1035 .. return Schema.parse(MAPPER.readTree(parser), names); .. readTree() only reads till the closing }, so parser could be checked after this call with nextToken() - it should return NULL, if there are no trailing non white space characters
        Hide
        zi Zoltan Ivanfi added a comment - - edited

        However painful and stupid it may be, the fact is that the JSON specification does not allow comments (neither trailing garbage). In my opinion every JSON interpreter should stick to these rules, otherwise users will come up with invalid JSON schemas without realizing that they are invalid. Those schemas will break at some point in the future when the user tries to use some other software to process them and the later they break the bigger the confusion they cause.

        (As a frightening example, being liberal in accepting invalid HTML has lead to the horrific nightmare of web page development where each page has to be checked with every browser as described in this insanely long article.)

        Show
        zi Zoltan Ivanfi added a comment - - edited However painful and stupid it may be, the fact is that the JSON specification does not allow comments (neither trailing garbage). In my opinion every JSON interpreter should stick to these rules, otherwise users will come up with invalid JSON schemas without realizing that they are invalid. Those schemas will break at some point in the future when the user tries to use some other software to process them and the later they break the bigger the confusion they cause. (As a frightening example, being liberal in accepting invalid HTML has lead to the horrific nightmare of web page development where each page has to be checked with every browser as described in this insanely long article .)
        Hide
        csringhofer Csaba Ringhofer added a comment - - edited

        JSON does not allow comments, but Avro seems to allow them, as some examples in the specification include // type comments:
        https://avro.apache.org/docs/1.8.2/spec.html

        This looks like an Avro inconsistency issue, as APIs for different languages treat comments differently, see:
        https://stackoverflow.com/questions/16704310/can-you-put-comments-in-avro-json-schema-files

        JSON's approach to comments seems to be that if you want them, run a preprocessor that removes them before parsing:
        https://plus.google.com/+DouglasCrockfordEsq/posts/RK8qyGVaGSr
        It would be possible to do this in C++, but I dislike the idea of adding +1 pass to parsing, just because there may be comments in a few cases.

        So I think that we should create an Avro ticket about the inconsistency, and it should be decided there whether to enable or disable comments/trailing garbage in all APIs.

        Adding trailing garbage support seems easy in C by adding flag JSON_DISABLE_EOF_CHECK to the call of json_loadb() in https://github.com/apache/avro/blob/master/lang/c/src/schema.c

        Adding comment support seems much harder, because the json parser (jansson) does not seem to support them. Removing the support from Java looks easy, as it is enabled by a config (FACTORY.enable(JsonParser.Feature.ALLOW_COMMENTS); ) in https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/Schema.java The support for comments seems to be there since ancient times.

        Show
        csringhofer Csaba Ringhofer added a comment - - edited JSON does not allow comments, but Avro seems to allow them, as some examples in the specification include // type comments: https://avro.apache.org/docs/1.8.2/spec.html This looks like an Avro inconsistency issue, as APIs for different languages treat comments differently, see: https://stackoverflow.com/questions/16704310/can-you-put-comments-in-avro-json-schema-files JSON's approach to comments seems to be that if you want them, run a preprocessor that removes them before parsing: https://plus.google.com/+DouglasCrockfordEsq/posts/RK8qyGVaGSr It would be possible to do this in C++, but I dislike the idea of adding +1 pass to parsing, just because there may be comments in a few cases. So I think that we should create an Avro ticket about the inconsistency, and it should be decided there whether to enable or disable comments/trailing garbage in all APIs. Adding trailing garbage support seems easy in C by adding flag JSON_DISABLE_EOF_CHECK to the call of json_loadb() in https://github.com/apache/avro/blob/master/lang/c/src/schema.c Adding comment support seems much harder, because the json parser (jansson) does not seem to support them. Removing the support from Java looks easy, as it is enabled by a config (FACTORY.enable(JsonParser.Feature.ALLOW_COMMENTS); ) in https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/Schema.java The support for comments seems to be there since ancient times.

          People

          • Assignee:
            csringhofer Csaba Ringhofer
            Reporter:
            lskuff Lenni Kuff
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:

              Development