Solr / SOLR-9668

Support cursor paging in SolrEntityProcessor

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.4, 7.0
    • Security Level: Public (Default Security Level. Issues are Public)
    • Flags: Patch

      Description

      SolrEntityProcessor paginates using the start and rows parameters, which can be very inefficient at large offsets. In fact, the current implementation is impractical for importing large amounts of data (10M+ documents) because the import rate degrades from 1000 docs/second to 10 docs/second and the import gets stuck.

      This patch introduces support for cursor paging, which offers largely predictable performance regardless of offset. In my tests, the time to fetch the 1st and 1000th pages was about the same, and the data import rate was stable throughout the entire import.
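
To illustrate why large offsets hurt, here is a rough cost model in Python. It assumes, as a simplification, that a start/rows request makes the server collect start + rows sorted documents, while a cursor request collects only rows. The function names are hypothetical, and the results are counts from this model, not measurements.

```python
def docs_collected_offset(total_docs: int, rows: int) -> int:
    """Documents collected over a full import with start/rows paging.

    Simplified model (an assumption): the request for a page at offset
    `start` collects min(start + rows, total_docs) sorted documents.
    """
    pages = -(-total_docs // rows)  # ceiling division
    return sum(min((page + 1) * rows, total_docs) for page in range(pages))


def docs_collected_cursor(total_docs: int, rows: int) -> int:
    """Documents collected over a full import with cursor paging.

    Simplified model (an assumption): each request collects only its own
    page, so the total equals the number of documents imported.
    """
    return total_docs


if __name__ == "__main__":
    total, rows = 10_000_000, 1000
    print("start/rows:", docs_collected_offset(total, rows))
    print("cursor:    ", docs_collected_cursor(total, rows))
```

Under this model the start/rows total grows quadratically with the size of the import, while the cursor total grows linearly.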

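      To enable cursor paging, a user needs to set cursorMark='true' and specify a sort that includes the uniqueKey field in the entity configuration: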

      <?xml version="1.0" encoding="UTF-8" ?>
      <dataConfig>
        <document>
          <entity name="se" processor="SolrEntityProcessor"
                  query="*:*"
                  rows="1000"
                  cursorMark='true'
                  sort="id asc"
                  url="http://localhost:8983/solr/collection1">
          </entity>
        </document>
      </dataConfig>
      

      If the cursorMark attribute is missing or is not 'true' then the default start/rows pagination is used.
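
The cursor protocol behind this option can be sketched in Python as follows. This is a minimal simulation, not the SolrEntityProcessor code: `INDEX`, `fetch_page`, and `import_all` are hypothetical names, and the in-memory list stands in for a collection sorted by its uniqueKey. As in Solr, the first request sends the mark `*` and the loop stops when the returned mark equals the one that was sent. Real Solr marks are opaque strings; here the last id seen is used as the mark for simplicity.

```python
from typing import List, Tuple

# Hypothetical in-memory stand-in for a collection, sorted by uniqueKey "id".
INDEX: List[str] = sorted(f"doc{n:04d}" for n in range(2500))


def fetch_page(cursor_mark: str, rows: int) -> Tuple[List[str], str]:
    """Simulate one request with sort="id asc".

    Returns the page of ids after the cursor and the next cursor mark.
    """
    if cursor_mark == "*":
        remaining = INDEX
    else:
        remaining = [doc_id for doc_id in INDEX if doc_id > cursor_mark]
    page = remaining[:rows]
    # An empty page returns the same mark, which signals the end.
    next_mark = page[-1] if page else cursor_mark
    return page, next_mark


def import_all(rows: int) -> List[str]:
    """Follow the cursor until the returned mark equals the mark sent."""
    imported: List[str] = []
    cursor_mark = "*"
    while True:
        page, next_mark = fetch_page(cursor_mark, rows)
        imported.extend(page)
        if next_mark == cursor_mark:
            break
        cursor_mark = next_mark
    return imported


if __name__ == "__main__":
    docs = import_all(rows=1000)
    print(len(docs), "documents imported")
```

The sort on the uniqueKey matters here: without a unique tie breaker, documents that compare equal could straddle a page boundary and the cursor could not resume at an unambiguous position.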

      1. SOLR-9668.patch
        18 kB
        Mikhail Khludnev
      2. SOLR-9668.patch
        13 kB
        Mikhail Khludnev

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user YegorKozlov opened a pull request:

          https://github.com/apache/lucene-solr/pull/101

          SOLR-9668 Support cursor paging in SolrEntityProcessor

          SolrEntityProcessor paginates using the start and rows parameters, which can be very inefficient at large offsets. In fact, the current implementation is impractical for importing large amounts of data (10M+ documents) because the import rate degrades from 1000 docs/second to 10 docs/second and the import gets stuck.
          This patch introduces support for cursor paging, which offers largely predictable performance regardless of offset. In my tests, the time to fetch the 1st and 1000th pages was about the same, and the data import rate was stable throughout the entire import.

          To enable cursor paging, a user needs to add a "sort" attribute to the entity configuration:
          ```xml
          <?xml version="1.0" encoding="UTF-8" ?>
          <dataConfig>
            <document>
              <!-- The sort attribute turns on cursor paging; it must include the uniqueKey field as a tie breaker. -->
              <entity name="se" processor="SolrEntityProcessor"
                      query="*:*"
                      rows="1000"
                      sort="id asc"
                      url="http://localhost:8983/solr/collection1">
              </entity>
            </document>
          </dataConfig>
          ```
          If the "sort" attribute is missing then the default start/rows pagination is used.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/YegorKozlov/lucene-solr SOLR-9668

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/lucene-solr/pull/101.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #101


          commit 840dfe68895fabdc8ff5458b3f114a678d0dd080
          Author: U-CEB\YKozlov <ykozlov@2504kx1.ceb.com>
          Date: 2016-10-20T08:34:55Z

          SOLR-9668 Support cursor paging in SolrEntityProcessor


          mkhludnev Mikhail Khludnev added a comment -

          What about SOLR-9668.patch?

          mkhludnev Mikhail Khludnev added a comment - - edited

          Are there any concerns?

          noble.paul Noble Paul added a comment -

          I haven't looked at the patch. Mikhail, do you have any specific concerns you want me to look at?

          mkhludnev Mikhail Khludnev added a comment -

          Not really. I just want to confirm that the configuration approach is fine.

          noble.paul Noble Paul added a comment -

          The config is fine.

          jira-bot ASF subversion and git services added a comment -

          Commit cc862d8e67f32d5447599d265f5d126541ed92c9 in lucene-solr's branch refs/heads/master from Mikhail Khludnev
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cc862d8 ]

          SOLR-9668: introduce cursorMark='true' for SolrEntityProcessor

          jira-bot ASF subversion and git services added a comment -

          Commit b2d54f645db6e365497660cee1b3e059c6c2b4ca in lucene-solr's branch refs/heads/branch_6x from Mikhail Khludnev
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b2d54f6 ]

          SOLR-9668: introduce cursorMark='true' for SolrEntityProcessor


            People

            • Assignee: Mikhail Khludnev (mkhludnev)
            • Reporter: Yegor Kozlov (yegor.kozlov)
            • Votes: 1
            • Watchers: 7
