Avro
  1. Avro
  2. AVRO-268

Replace lemon-generated JSON parser with simpler recursive descent parser

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: c
    • Labels:
      None

      Description

      This is a drop-in replacement for the current JSON parser which is based on lemon (a LALR parser generator).

      This parser

      • reads and returns a single JSON_value and its nested children (using recursive descent parsing)
      • allows you to process JSON from streams in addition to static memory buffers
      • correctly processes unicode \u escaping including surrogates
      • distinguishes between integer and real number representations
      • provides information about the line and character in JSON that failed to parse
      • is much simpler to understand and maintain (less lines of code and source files)
      • is written to allow error recovery to be added later

      This patch also adds more unit tests.

      1. AVRO-268.patch
        253 kB
        Matt Massie

        Activity

        Matt Massie created issue -
        Matt Massie made changes -
        Field Original Value New Value
        Assignee Matt Massie [ massie ]
        Hide
        Matt Massie added a comment -

        I noticed that two old directories still exist in svn

        $ rm -rf src/c/json/fail
        $ rm -rf src/c/json/pass

        will also be performed when this patch is committed.

        Show
        Matt Massie added a comment - I noticed that two old directories still exist in svn $ rm -rf src/c/json/fail $ rm -rf src/c/json/pass will also be performed when this patch is committed.
        Matt Massie made changes -
        Attachment AVRO-268.patch [ 12429103 ]
        Matt Massie made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        Matt Massie added a comment -

        I should also mention that a few unit tests are removed with this patch as well. Tests with trailing character that used to fail now succeed with the new parser.

        For example, the following JSON used to fail to parse

        { "key": "value" } foo bar baz
        

        while now, the new parser will return immediately when it hits the last '}' ignoring the trailing junk characters.

        Show
        Matt Massie added a comment - I should also mention that a few unit tests are removed with this patch as well. Tests with trailing character that used to fail now succeed with the new parser. For example, the following JSON used to fail to parse { "key" : "value" } foo bar baz while now, the new parser will return immediately when it hits the last '}' ignoring the trailing junk characters.
        Hide
        Doug Cutting added a comment -

        Sounds like a fine approach.

        Show
        Doug Cutting added a comment - Sounds like a fine approach.
        Hide
        Jeff Hammerbacher added a comment -

        Worth noting that the new behavior is not standard:

        JavaScript:

        js> var a = JSON.parse('{ "key": "value" } foo bar baz');
        js: "/Users/hammer/codebox/narwhal/engines/default/lib/json.js", line 474: exception from uncaught JavaScript throw: SyntaxError: JSON.parse
        

        Python:

        >>> b = json.loads('{ "key": "value" } foo bar baz')
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
          File "build/bdist.macosx-10.5-i386/egg/simplejson/__init__.py", line 307, in loads
          File "build/bdist.macosx-10.5-i386/egg/simplejson/decoder.py", line 338, in decode
        ValueError: Extra data: line 1 column 19 - line 1 column 30 (char 19 - 30)
        
        Show
        Jeff Hammerbacher added a comment - Worth noting that the new behavior is not standard: JavaScript: js> var a = JSON.parse('{ "key" : "value" } foo bar baz'); js: "/Users/hammer/codebox/narwhal/engines/ default /lib/json.js" , line 474: exception from uncaught JavaScript throw : SyntaxError: JSON.parse Python: >>> b = json.loads('{ "key" : "value" } foo bar baz') Traceback (most recent call last): File "<stdin>" , line 1, in <module> File "build/bdist.macosx-10.5-i386/egg/simplejson/__init__.py" , line 307, in loads File "build/bdist.macosx-10.5-i386/egg/simplejson/decoder.py" , line 338, in decode ValueError: Extra data: line 1 column 19 - line 1 column 30 ( char 19 - 30)
        Hide
        Matt Massie added a comment -

        I should have been clearer.

        I'm not saying that

        { "key" : "value" } foo bar baz
        

        is valid JSON. It's not.

        I was just speaking to the fact that the parser isn't greedy and will return as soon as it's completed a JSON value. However, if the stream in this example remained pointed to 'foo', the parser would throw an error the next time it's called.

        Show
        Matt Massie added a comment - I should have been clearer. I'm not saying that { "key" : "value" } foo bar baz is valid JSON. It's not. I was just speaking to the fact that the parser isn't greedy and will return as soon as it's completed a JSON value. However, if the stream in this example remained pointed to 'foo', the parser would throw an error the next time it's called.
        Hide
        Ryan King added a comment -

        Please don't use a non-standard json parser.

        I was under the understanding that part of the reason for choosing JSON is that it is a standard format (http://www.ietf.org/rfc/rfc4627.txt) with parsers available in many languages already. If you use a non-standard JSON parser, you lose that benefit.

        Show
        Ryan King added a comment - Please don't use a non-standard json parser. I was under the understanding that part of the reason for choosing JSON is that it is a standard format ( http://www.ietf.org/rfc/rfc4627.txt ) with parsers available in many languages already. If you use a non-standard JSON parser, you lose that benefit.
        Hide
        Matt Massie added a comment -

        Ryan-

        We posted comments about two minutes apart.

        I hope that my earlier comment clarifies that I didn't implement a non-standard JSON parser.

        -Matt

        Show
        Matt Massie added a comment - Ryan- We posted comments about two minutes apart. I hope that my earlier comment clarifies that I didn't implement a non-standard JSON parser. -Matt
        Hide
        Ryan King added a comment -

        Alright, I now understand a bit better what you're going for.

        Now forgive me if this is a naive question: whenever you need to read JSON for avro, do we know ahead of time how long the JSON blob will be?

        Most JSON parsers don't have the property of returning as soon as a full object has been parsed, so we'll need to be able to read just the appropriate length, then parse as JSON.

        Show
        Ryan King added a comment - Alright, I now understand a bit better what you're going for. Now forgive me if this is a naive question: whenever you need to read JSON for avro, do we know ahead of time how long the JSON blob will be? Most JSON parsers don't have the property of returning as soon as a full object has been parsed, so we'll need to be able to read just the appropriate length, then parse as JSON.
        Hide
        Matt Massie added a comment -

        No need for this work. I've finally found a high-quality C parser with a friendly license called Jansson which will serve as the JSON parser moving forward. Sorry for the JIRA noise.

        Show
        Matt Massie added a comment - No need for this work. I've finally found a high-quality C parser with a friendly license called Jansson which will serve as the JSON parser moving forward. Sorry for the JIRA noise.
        Matt Massie made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Won't Fix [ 2 ]
        Doug Cutting made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Open Open Patch Available Patch Available
        12m 2s 1 Matt Massie 29/Dec/09 23:18
        Patch Available Patch Available Resolved Resolved
        16d 19h 25m 1 Matt Massie 15/Jan/10 18:43
        Resolved Resolved Closed Closed
        44d 22h 25m 1 Doug Cutting 01/Mar/10 17:09

          People

          • Assignee:
            Matt Massie
            Reporter:
            Matt Massie
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development