SOLR-8582

/update/json/docs is 4x slower than /update for indexing a list of json docs

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.3.2, 5.4.1
    • Fix Version/s: 5.5, 6.0
    • Component/s: update
    • Labels:
      None

      Description

      While indexing a ~650 MB JSON file containing a list of 2.2 million JSON documents, I found that bin/post had become 4x slower after SOLR-7042. Memory consumption has also gone up, and I can no longer index this file with a 512 MB heap.

      The slowdown occurs because we now default to /update/json/docs instead of /update. This can be verified on trunk:

      time curl 'http://localhost:8983/solr/gettingstarted/update' --data-binary @/hdd/solr-data/imdb.json
      {"responseHeader":{"status":0,"QTime":161869}}

      real	2m42.044s
      user	0m0.292s
      sys	0m0.493s

      time curl 'http://localhost:8983/solr/gettingstarted/update/json/docs' --data-binary @/hdd/solr-data/imdb.json
      {"responseHeader":{"status":0,"QTime":686264}}

      real	11m26.478s
      user	0m0.324s
      sys	0m0.552s
      
      1. SOLR-8582.patch
        4 kB
        Noble Paul
      2. SOLR-8582.patch
        2 kB
        Noble Paul

        Activity

        Noble Paul added a comment -

        JsonRecordReader keeps around all keys seen in a root object. In this case the root happens to be a very large array. The fix is to create a new Set for each object in the array.
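
        To illustrate the comment above, here is a minimal, hypothetical model of the problem. It is not the actual JsonRecordReader code: the class KeySetSketch, the FieldMarker type, and both method names are invented for this sketch, and it assumes that the tracked entries never compare equal across objects. Under that assumption, one Set shared across the whole root array grows with every field of every document, while a Set created per object stays as small as a single document.

        ```java
        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.HashSet;
        import java.util.List;
        import java.util.Set;

        public class KeySetSketch {
            // Hypothetical per-field marker. It does not override equals/hashCode,
            // so every instance is a distinct element in a HashSet.
            static final class FieldMarker {
                final String name;
                FieldMarker(String name) { this.name = name; }
            }

            // Models the leak: one Set lives for the whole root array,
            // so it retains an entry for every field of every object.
            static int sharedSetSize(List<List<String>> docs) {
                Set<FieldMarker> found = new HashSet<>();
                for (List<String> keys : docs) {
                    for (String key : keys) {
                        found.add(new FieldMarker(key));
                    }
                }
                return found.size();
            }

            // Models the fix: a new Set per object, eligible for garbage
            // collection as soon as that object has been processed.
            static int perObjectPeak(List<List<String>> docs) {
                int peak = 0;
                for (List<String> keys : docs) {
                    Set<FieldMarker> found = new HashSet<>();
                    for (String key : keys) {
                        found.add(new FieldMarker(key));
                    }
                    peak = Math.max(peak, found.size());
                }
                return peak;
            }

            public static void main(String[] args) {
                List<List<String>> docs = new ArrayList<>();
                for (int i = 0; i < 1000; i++) {
                    docs.add(Arrays.asList("id", "title"));
                }
                System.out.println("shared set size: " + sharedSetSize(docs));     // 2000
                System.out.println("per-object peak size: " + perObjectPeak(docs)); // 2
            }
        }
        ```

        With 2.2 million documents, the shared variant in this model would hold millions of entries at once, which is consistent with the heap pressure reported in the description.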

        Shalin Shekhar Mangar added a comment -

        Thanks Noble. Your patch fixes the slowdown:

        ./bin/solr start -e schemaless -m 2g
        time curl 'http://localhost:8983/solr/gettingstarted/update' --data-binary @/solr-data/imdb.json    
        {"responseHeader":{"status":0,"QTime":195917}}
        
        real	3m16.231s
        user	0m0.274s
        sys	0m0.681s
        
        ./bin/solr stop
        rm -r example/schemaless
        ./bin/solr start -e schemaless -m 2g
        time curl 'http://localhost:8983/solr/gettingstarted/update/json/docs' --data-binary @/solr-data/imdb.json
        {"responseHeader":{"status":0,"QTime":192269}}
        
        real	3m12.596s
        user	0m0.268s
        sys	0m0.721s
        

        Memory consumption has also dropped: I can now index the same file with 512m of heap. There still seems to be some memory pressure, but it is not that bad. For example, the following runs use 512m of heap:

        ./bin/solr start -e schemaless
        time curl 'http://localhost:8983/solr/gettingstarted/update/json/docs' --data-binary @/solr-data/imdb.json
        {"responseHeader":{"status":0,"QTime":244608}}
        
        real	4m4.924s
        user	0m0.294s
        sys	0m0.780s
        
        ./bin/solr stop
        rm -r example/schemaless
        ./bin/solr start -e schemaless
        time curl 'http://localhost:8983/solr/gettingstarted/update' --data-binary @/solr-data/imdb.json
        {"responseHeader":{"status":0,"QTime":231332}}
        
        real	3m51.638s
        user	0m0.291s
        sys	0m0.745s
        

        Minor nit: JsonRecordReader#handleObjectStart has an unused argument, childrenFound.

        Shalin Shekhar Mangar added a comment -

        I think there was an underlying bug in JsonRecordReader that also affects json-line files, and your patch fixes it as well. Without your patch, I was not able to index a 549MB json-line file (one JSON document per line) even with a 2g heap; I had to bump the heap up to 4g to succeed. With your patch, I am able to index the same file with a 512m heap. Too bad we missed the 5.3.2 and 5.4.1 releases.

        +1 to commit

        Noble Paul added a comment -

        Thanks

        ASF subversion and git services added a comment -

        Commit 1726261 from Noble Paul in branch 'dev/trunk'
        [ https://svn.apache.org/r1726261 ]

        SOLR-8582 : memory leak in JsonRecordReader affecting /update/json/docs. Large payloads
        cause OOM

        ASF subversion and git services added a comment -

        Commit 1726271 from Noble Paul in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1726271 ]

        SOLR-8582 : memory leak in JsonRecordReader affecting /update/json/docs. Large payloads
        cause OOM

        Shalin Shekhar Mangar added a comment -

        Thanks Noble!


          People

          • Assignee:
            Noble Paul
            Reporter:
            Shalin Shekhar Mangar