[COUCHDB-1118] Adding a NIF based JSON decoding/encoding module

    Details

• Type: Improvement
    • Status: Closed
• Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2
    • Component/s: Database Core
    • Labels:
      None
    • Skill Level:
      Guru Level (Everyone buy this person a beer at the next conference!)

      Description

Currently, all the Erlang-based JSON encoders and decoders are very slow, and encoding and decoding JSON is something we do basically everywhere.

Via IRC, we recently discussed adding a JSON NIF encoder/decoder. Damien also started a thread on the development mailing list about adding NIFs to trunk.

The patch/branch at [1] adds such a JSON encoder/decoder. It is based on Paul Davis' eep0018 project [2]. Damien made some modifications [3] to it, mostly to add support for big numbers (Paul's eep0018 limits the precision to 32/64 bits) and a few optimizations. I made a few corrections and minor enhancements on top of Damien's fork as well [4]. Finally, Benoît identified some missing capabilities compared to mochijson2 (on encoding, allowing atoms as strings and strings as object properties).

Also, the version added in the patch at [1] falls back to mochijson2 when the C NIF is not loaded. The autotools configuration was adapted to compile the NIF only when building against an OTP release >= R13B04 (the R13B03 NIF API is too limited and changed significantly in R13B04 and R14) - therefore, via the mochijson2 fallback, it should work on any OTP release > R13B at least.
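
For illustration, a minimal sketch of that fallback pattern (the module and library path below are hypothetical, not the actual patch code): the Erlang stubs call mochijson2, and erlang:load_nif/2 replaces them with the C implementations when the library is available.

%% Hypothetical sketch of a NIF-with-fallback dispatch; not the actual
%% patch code. When erlang:load_nif/2 succeeds, it replaces these Erlang
%% stubs with their C counterparts.
-module(ejson_sketch).
-export([decode/1, encode/1]).
-on_load(init/0).

init() ->
    %% If the .so is missing (e.g. OTP < R13B04), keep the Erlang stubs.
    case erlang:load_nif("./ejson_nif", 0) of
        ok -> ok;
        {error, _Reason} -> ok
    end.

decode(Json) ->
    %% Reached only when the NIF failed to load.
    mochijson2:decode(Json).

encode(EJson) ->
    mochijson2:encode(EJson).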

      I successfully tested this on R13B03, R13B04 and R14B02 in an Ubuntu environment.
      I'm not sure if it builds at all on Windows - would appreciate if someone could verify it.
      Also, I'm far from being good with the autotools, so I probably missed something important or I'm doing something in a not very standard way.

This NIF encoder/decoder is about one order of magnitude faster than mochijson2 and other Erlang-only solutions such as jsx. A read/write test with relaximation shows it has a very positive impact, especially on reads (the EJSON encoding is more expensive than the JSON decoding) - http://graphs.mikeal.couchone.com/#/graph/698bf36b6c64dbd19aa2bef634052381

@Paul, since this is based on your eep0018 effort, do you think any other missing files should be added (README, etap tests, etc.)? Also, should we put a note somewhere that this is based on your project?

      [1] - https://github.com/fdmanana/couchdb/compare/json_nif
      [2] - https://github.com/davisp/eep0018
      [3] - https://github.com/Damienkatz/eep0018/commits/master
      [4] - https://github.com/fdmanana/eep0018/commits/final_damien

        Activity

        Damien Katz added a comment -

        Looks good to me. Check it in!

        Adam Kocoloski added a comment -

        Hi Filipe, which version are you proposing to commit?

        There must be some overhead associated with using a NIF. I thought in at least one version of Paul's work the overhead made the NIF version slower at decoding very small documents than pure Erlang. Is that no longer the case? Also, has anyone studied the impact that working with very large JSON bodies has on the soft real-time properties of the Erlang VM? Do you know what kind of throughput you see in terms of MB/sec? I guess if that number is large enough then any reasonably-sized JSON body (even a very big batch of KV pairs returned from a map_docs call) will be processed without mucking up the scheduling too much.
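
For anyone wanting to put a number on that, a minimal throughput probe might look like the following sketch (assuming the patch's ejson:decode/1; timing via erlang:now/0, which is period-appropriate here):

%% Decode Json N times and report MB/sec. ejson:decode/1 is assumed to
%% be the NIF-backed decoder from the proposed patch.
throughput(Json, N) ->
    T0 = erlang:now(),
    lists:foreach(fun(_) -> ejson:decode(Json) end, lists:seq(1, N)),
    Micros = timer:now_diff(erlang:now(), T0),
    MB = (byte_size(Json) * N) / (1024 * 1024),
    MB / (Micros / 1000000).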

        Paul Joseph Davis added a comment -

        I need to go back and look at the numbers for encoding/decoding again to try and pin down what the actual overhead/cost is for each method. I do remember there being some issues with tiny docs, but they were extremely tiny. Anything of any actual size is probably going to be faster in the NIF version.

As to the scheduler bits, I'm not sure I'm really that concerned about it. AFAIK, it's operating under the same principles as term_to_binary, so if the JSON part is a problem for us, then we should be looking into replacing term_to_binary with an Erlang version (which we're not, so I think we shouldn't care too much). Then again, figuring out a way to test these sorts of things might be a bonus regardless.
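
A quick way to eyeball that comparison is to time both paths on the same data (a sketch; ejson:decode/1 is assumed to be the patch's decoder):

%% Compare single-call latency of the JSON NIF against binary_to_term on
%% the same data, as a rough baseline for scheduler impact.
compare(Json) ->
    {TJson, EJson} = timer:tc(ejson, decode, [Json]),
    Bin = term_to_binary(EJson),
    {TTerm, _} = timer:tc(erlang, binary_to_term, [Bin]),
    io:format("ejson:decode: ~p us, binary_to_term: ~p us~n", [TJson, TTerm]).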

        Filipe Manana added a comment -

        Adam,

        The patch I proposed is this one:

        https://github.com/fdmanana/couchdb/compare/json_nif

From all the links I gave before, it's the only one that points to a diff against CouchDB, completely ready to integrate into trunk.

As for very small documents, it's not a problem; sorry, I forgot to mention it before. Part of the work Damien did was related to performance, not only support for big numbers. Here's a shell session that shows timings for a document under 300 bytes (if all whitespace is removed).

        Erlang R14B02 (erts-5.8.3) [source] [smp:2:2] [rq:2] [async-threads:4] [hipe] [kernel-poll:true]

        Eshell V5.8.3 (abort with ^G)
        1> Apache CouchDB 1.2.0aea00f2a-git (LogLevel=info) is starting.
        Apache CouchDB has started. Time to relax.
        [info] [<0.37.0>] Apache CouchDB has started on http://127.0.0.1:5984/

1> {ok, Json} = file:read_file("../seatoncouch/doc_100b.json").
{ok,<<"{\n\"data3\":\"ColYo\",\n\"data5\":{\n \"nested2\":{\n \"integers\":[756509,116117,776378,275045"...>>}

2> byte_size(Json).
        361
        3> element(1, timer:tc(ejson, decode, [Json])).
        2536
        4> element(1, timer:tc(ejson, decode, [Json])).
        66
        5> element(1, timer:tc(ejson, decode, [Json])).
        87
        6> element(1, timer:tc(ejson, decode, [Json])).
        107
        7> element(1, timer:tc(ejson, decode, [Json])).
        77
        8> element(1, timer:tc(ejson, decode, [Json])).
        71
        9> element(1, timer:tc(ejson, decode, [Json])).
        67
        10> element(1, timer:tc(ejson, decode, [Json])).
        70
        11> element(1, timer:tc(ejson, decode, [Json])).
        45
12> element(1, timer:tc(mochijson2, decode, [Json])).
        8364
        13> element(1, timer:tc(mochijson2, decode, [Json])).
        265
        14> element(1, timer:tc(mochijson2, decode, [Json])).
        324
        15> element(1, timer:tc(mochijson2, decode, [Json])).
        278
        16> element(1, timer:tc(mochijson2, decode, [Json])).
        292
        17> element(1, timer:tc(mochijson2, decode, [Json])).
        291
        18> element(1, timer:tc(mochijson2, decode, [Json])).
        239
        19> element(1, timer:tc(mochijson2, decode, [Json])).
        263
20> EJson = ejson:decode(Json).
{[{<<"data3">>,<<"ColYo">>},
  {<<"data5">>,
   {[{<<"nested2">>,
      {[{<<"integers">>,[756509,116117,776378,275045,703447,988947,450154]}]}}]}},
  {<<"data1">>,<<"9EVqHm5ARJPyBY0J">>},
  {<<"more_nested">>,
   {[{<<"nested1">>,
      {[{<<"integers">>,[685803,147958,941747,905651]}]}},
     {<<"nested2">>,
      {[{<<"integers">>,[756509,116117]}]}}]}}]}
21> element(1, timer:tc(ejson, encode, [EJson])).
        73
        22> element(1, timer:tc(ejson, encode, [EJson])).
        70
        23> element(1, timer:tc(ejson, encode, [EJson])).
        65
        24> element(1, timer:tc(ejson, encode, [EJson])).
        104
        25> element(1, timer:tc(ejson, encode, [EJson])).
        64
        26> element(1, timer:tc(ejson, encode, [EJson])).
        75
        27> element(1, timer:tc(ejson, encode, [EJson])).
        70
        28> element(1, timer:tc(ejson, encode, [EJson])).
        66
29> MochiDec = mochijson2:encoder([{handler,
        fun({L}) when is_list(L) -> {struct, L};
           (Bad) -> exit({json_encode, {bad_term, Bad}})
        end}]).
#Fun<mochijson2.0.93741038>
30> element(1, timer:tc(MochiDec, [EJson])).
        203
        31> element(1, timer:tc(MochiDec, [EJson])).
        205
        32> element(1, timer:tc(MochiDec, [EJson])).
        206
        33> element(1, timer:tc(MochiDec, [EJson])).
        209
        34> element(1, timer:tc(MochiDec, [EJson])).
        213
        35> element(1, timer:tc(MochiDec, [EJson])).
        214
        36> element(1, timer:tc(MochiDec, [EJson])).
        229
        37>

        So even for such small documents, the NIF solution is faster.

        Filipe Manana added a comment -

Also, contrary to the original eep0018, it doesn't limit the nesting level of the EJSON (a quick way to verify this is sketched after the document below).
The document tested before is this one:

{
  "data3": "ColYo",
  "data5": {
    "nested2": { "integers": [756509,116117,776378,275045,703447,988947,450154] }
  },
  "data1": "9EVqHm5ARJPyBY0J",
  "more_nested": {
    "nested1": { "integers": [685803,147958,941747,905651] },
    "nested2": { "integers": [756509,116117] }
  }
}
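
As for verifying the nesting claim, a minimal round-trip sketch (assuming the patch's ejson:encode/1 and ejson:decode/1):

%% Build a JSON array nested Depth levels deep ([[[...]]]) and round-trip
%% it through the NIF. The original eep0018 would fail beyond its fixed
%% nesting limit; this version should return ok for any Depth.
deep_roundtrip(Depth) ->
    Nested = lists:foldl(fun(_, Acc) -> [Acc] end, [], lists:seq(1, Depth)),
    Json = ejson:encode(Nested),
    Nested = ejson:decode(Json),
    ok.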

        Jan Lehnardt added a comment -

This builds fine on my Mac OS X 10.6.7 development machine. We should try it on a stock Mac OS X install, too.

        Couple of notes:

• We should be using ?JSON_ENCODE/?JSON_DECODE everywhere for consistency (see the sketch below).
        • Autotools integration looks sane, but I'll refer to Noah for final judgement.
        • At least Yajl will need to be added to the NOTICE file.
        • What other external code is added here?

        With the above sorted, I'd love to see this in trunk.
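
For context on the first point, hypothetical macro definitions (not the actual couch_db.hrl contents) showing the indirection they give:

%% Hypothetical definitions: call sites use ?JSON_ENCODE/?JSON_DECODE,
%% so swapping the backing module (mochijson2 today, the ejson NIF
%% tomorrow) is a one-line change in the header.
-define(JSON_ENCODE(V), ejson:encode(V)).
-define(JSON_DECODE(V), ejson:decode(V)).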

        Adam Kocoloski added a comment -

        Hi, just some minor comments about packaging:

        • I think it's customary for an initial release to have patch level 0, i.e. instead of version 0.0.1 this could be 0.1.0 or 1.0.0.
• The modules list in the .app file is empty (a filled-in example is sketched below).

        Are you interested in packaging up ejson as a standalone OTP application? I know that's the first thing I'll be doing once this code lands, but I don't want to take the credit for it. erlagner.org already lists at least four other JSON parsers; if this one is better for all non-streaming usage I'd like to see the community adopt it.
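
For reference on the second point, a sketch of what a filled-in ejson.app might look like (the version and module list are illustrative assumptions, not the patch's actual contents):

%% Sketch of a standalone ejson .app resource file; values are
%% assumptions for illustration only.
{application, ejson, [
    {description, "NIF based JSON encoder/decoder with mochijson2 fallback"},
    {vsn, "1.0.0"},
    {modules, [ejson]},
    {registered, []},
    {applications, [kernel, stdlib]}
]}.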

        Paul Joseph Davis added a comment -

Just to point out, of the other four: one is mochijson packaged by itself. One is jsx, which focuses on decoding JSON streams into tokens like our json_stream_thinger, and Alisdair said he hasn't focused at all on encoding, which more or less prompted this discussion. ktuo appears to be pretty close to mochijson, a bit cleaner on the source side, though I see it's not doing any sort of Unicode handling, which we spent so much time on. And lastly, jsonerl is basically a modified mochijson that redefines objects to be {{k1, v1}, {k2, v2}}, which would break our code.

Other than that, I'd say I need a definition and measurements for what "better" is. Can a NIF be much faster than all of the other implementations? Probably. But then some people wouldn't want to touch it because it's a NIF and not pure Erlang. On the other hand, a NIF could have an Erlang implementation as well. I'd probably focus on making a customized version of mochijson2 to ensure that both versions give the exact same output for a given input.

Also, whether such a project becomes the Erlang JSON package isn't too much of a concern to me. Sure, it'd be nice to see lots of people using our version so that we know it's been battle-tested, but with the number of couchers out there beating on this public-facing code, I'm not too concerned that we'll fail to uncover any bugs quite quickly.

        Paul Joseph Davis added a comment -

        Just a quick heads up on some work I did this weekend.

I was working on a thing that takes a different approach to JSON encoding/decoding than the token-based approach in ejson. Currently I return {ok, EJson}, {error, Error}, or {bignum, Terms}, and if it's bignum I have a function that goes through and parses all the bignums present before returning the EJson.

Encoding will have a similar method based on iolists, but currently it only supports the JSON available through the NIF API.
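
A sketch of dispatching on that three-way return at the call site (the tuple shapes come from the description above; nif_decode/1 and fix_bignums/1 are hypothetical placeholders):

%% fix_bignums/1 stands in for the pass that parses the bignum terms the
%% NIF could not build itself before returning the final EJson.
decode(Json) ->
    case nif_decode(Json) of
        {ok, EJson}     -> EJson;
        {bignum, Terms} -> fix_bignums(Terms);
        {error, Reason} -> throw({invalid_json, Reason})
    end.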

        I've finally managed to suss out the remaining memory bugs I was having last night and have slapped together a repo for testing the new project against ejson and mochijson.

The steps to run it are:

        $ git clone git://github.com/davisp/erljson_bench.git
        $ cd erljson_bench
        $ make
        $ ./bench

That'll spit out something like this:

        encode: jiffy: 2444465
        encode: ejson_test: 12169427
        encode: mochijson2: 25071078
        decode: jiffy: 1118045
        decode: ejson_test: 9873485
        decode: mochijson2: 25117838

The current test runs N workers (default 10) for M iterations per worker (default 1,000). Each iteration runs timer:tc(Module, encode|decode, [Doc]) (where Doc is roughly 5K by default). The number in the third column is the sum of the times reported by timer:tc; the harness shape is sketched below.
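
For readers who don't want to clone the repo, that shape amounts to roughly the following (details are assumptions; see erljson_bench for the real harness):

%% Spawn NWorkers processes, each timing MIters calls with timer:tc/3,
%% then sum the reported microseconds across all workers.
bench(Mod, Fun, Doc, NWorkers, MIters) ->
    Parent = self(),
    Pids = [spawn(fun() ->
                Total = lists:sum([element(1, timer:tc(Mod, Fun, [Doc]))
                                   || _ <- lists:seq(1, MIters)]),
                Parent ! {self(), Total}
            end) || _ <- lists:seq(1, NWorkers)],
    lists:sum([receive {Pid, Total} -> Total end || Pid <- Pids]).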

Here are some quick results on multiple machines from Dale Harvey, Koco, and me from earlier tonight (link in the next comment). I think all three of us are on relatively newish Mac laptops of some sort.

My next bit will be to add the rest of the special encoding, as well as some tests to look at how the Erlang VM reacts to having these types of NIF calls.

        Anyway, just a heads up on some hopeful gains to be had here.

Paul Joseph Davis added a comment -

Whoops, link for results: http://pastebin.me/24d7e33ce334e56087dcb65708006773

        Paul Joseph Davis added a comment -

        I should point out that this work shouldn't block the commit for the ejson stuff. The work I'm doing should be API and behaviour identical so it'd be a pretty easy upgrade once we're comfortable.

        Filipe Manana added a comment -

Your results are awesome, Paul. Great work!
Looking forward to seeing your new JSON parser in trunk whenever you think it's ready.

        Committed to trunk, thanks everyone.


          People

• Assignee: Paul Joseph Davis
• Reporter: Filipe Manana
• Votes: 1
• Watchers: 1
