Lucene - Core / LUCENE-2919

IndexSplitter that divides by primary key term

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 3.3, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Index splitter that divides by primary key term. The contrib MultiPassIndexSplitter we have divides by docid; however, to guarantee external constraints it is sometimes necessary to split by a primary key term ID. I think this implementation is a fairly trivial change.
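
A minimal plain-Java sketch of why the two strategies differ. It is a simplified model, not Lucene code: the class and method names are hypothetical, the docid split is modeled as two contiguous halves, and the key equal to the mid key is placed in the second part, as the discussion further down this issue describes.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model for illustration only; names are hypothetical. */
final class SplitComparisonSketch {

    /** Splits by position (docid): the first half of the list, then the rest. */
    static List<List<String>> splitByDocId(List<String> keysInDocIdOrder) {
        int mid = keysInDocIdOrder.size() / 2;
        return List.of(keysInDocIdOrder.subList(0, mid),
                       keysInDocIdOrder.subList(mid, keysInDocIdOrder.size()));
    }

    /** Splits by key: keys sorting before midKey, then keys at or after it. */
    static List<List<String>> splitByKey(List<String> keysInDocIdOrder, String midKey) {
        List<String> first = new ArrayList<>();
        List<String> second = new ArrayList<>();
        for (String key : keysInDocIdOrder) {
            if (key.compareTo(midKey) < 0) first.add(key); else second.add(key);
        }
        return List.of(first, second);
    }

    public static void main(String[] args) {
        // Documents were added in this order, so docid order differs from key order.
        List<String> keys = List.of("k3", "k1", "k4", "k2");
        System.out.println(splitByDocId(keys));      // [[k3, k1], [k4, k2]]
        System.out.println(splitByKey(keys, "k3"));  // [[k1, k2], [k3, k4]]
    }
}
```

In the docid split both parts contain keys from the whole key range, so a constraint such as "all keys below k3 live in the first index" cannot be guaranteed; the key split provides it by construction.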

      1. LUCENE-2919-filter.patch
        7 kB
        Uwe Schindler
      2. LUCENE-2919-filter.patch
        7 kB
        Uwe Schindler
      3. LUCENE-2919-filter.patch
        12 kB
        Uwe Schindler
      4. LUCENE-2919-3x.patch
        10 kB
        Uwe Schindler
      5. LUCENE-2919.patch
        8 kB
        Jason Rutherglen

          Activity

          Gavin made changes -
          Link This issue is depended upon by SOLR-2593 [ SOLR-2593 ]
          Gavin made changes -
          Link This issue blocks SOLR-2593 [ SOLR-2593 ]
          Uwe Schindler added a comment -

          Thanks!
          You should upgrade to 3.5; 3.1 and 3.2 contain serious index corruption bugs!

          Elmer van Chastelet added a comment -

          This was exactly what I was looking for!

          FTR, for this to work in Lucene 3.1.0 (and 3.2.0), only 2 calls to IOUtils.closeSafely(boolean suppressExceptions, Closeable... objects) need to be changed:

          IOUtils.closeSafely(!success, reader) -> IOUtils.closeSafely(reader)
          IOUtils.closeSafely(!success, w) -> IOUtils.closeSafely(w)
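
To make the difference between the two call shapes concrete, here is a plain-Java stand-in. It is not Lucene's IOUtils: the class name is hypothetical, and the semantics (close everything, then either swallow or rethrow the first IOException) are an assumption based on the parameter name suppressExceptions.

```java
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical stand-in for illustration; not Lucene's IOUtils. */
final class CloseSketch {

    /** Closes every non-null resource, then rethrows the first IOException, if any. */
    static void closeSafely(Closeable... objects) throws IOException {
        closeSafely(false, objects);
    }

    /**
     * Closes every non-null resource. When suppressExceptions is true, close
     * failures are swallowed so they cannot mask an exception already in flight.
     */
    static void closeSafely(boolean suppressExceptions, Closeable... objects) throws IOException {
        IOException first = null;
        for (Closeable c : objects) {
            try {
                if (c != null) c.close();
            } catch (IOException e) {
                if (first == null) first = e; // remember only the first failure
            }
        }
        if (first != null && !suppressExceptions) throw first;
    }

    public static void main(String[] args) {
        Closeable failing = () -> { throw new IOException("close failed"); };
        try {
            closeSafely(true, failing);   // failure is swallowed
            System.out.println("suppressed");
            closeSafely(failing);         // failure is rethrown
        } catch (IOException e) {
            System.out.println("rethrown: " + e.getMessage());
        }
    }
}
```

Under these assumptions, dropping the `!success` argument means a failure while closing is always reported, even when the surrounding work has already failed.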

          Jason Rutherglen added a comment -

          Sorry for the naive off/on-topic question.

          Ryan, what's the repository info that needs to be added to the pom.xml so that the project downloads the 4.0 snapshot?

          E.g., I don't think it's:

          <repository>
            <id>lucene</id>
            <url>https://builds.apache.org/job/Lucene-Solr-Maven-trunk/lastSuccessfulBuild/artifact/maven_artifacts/org/apache/</url>
            <snapshots>
              <enabled>true</enabled>
            </snapshots>
          </repository>
          
          Robert Muir made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Robert Muir added a comment -

          bulk close for 3.3

          Shalin Shekhar Mangar made changes -
          Link This issue blocks SOLR-2593 [ SOLR-2593 ]
          Ryan McKinley added a comment -

          Jason... not really sure what you are asking

          4.0-SNAPSHOT?
          https://builds.apache.org/job/Lucene-Solr-Maven-trunk/lastSuccessfulBuild/artifact/maven_artifacts/org/apache/lucene/lucene-core/4.0-SNAPSHOT/maven-metadata.xml

          Jason Rutherglen added a comment -

          @Ryan Thanks! What artifact info would one put in the pom.xml?

          Ryan McKinley added a comment -

          To get the current Maven build, check: https://builds.apache.org/job/Lucene-Solr-Maven-trunk/lastSuccessfulBuild/artifact/maven_artifacts/

          Jason Rutherglen added a comment -

          Thanks, committing this means I can remove a custom GitHub branch that carries only this patch. Also, it'd be great if we somehow published nightly versions to Maven repositories, though they'd accumulate over time.

          Michael McCandless added a comment -

          Thanks Uwe!

          Uwe Schindler made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 3.3 [ 12316470 ]
          Fix Version/s 4.0 [ 12314025 ]
          Resolution Fixed [ 1 ]
          Uwe Schindler added a comment -

          Committed 3.x revision: 1137166

          Uwe Schindler made changes -
          Attachment LUCENE-2919-3x.patch [ 12483038 ]
          Uwe Schindler added a comment -

          Patch for 3.x (not the merged one).

          Uwe Schindler made changes -
          Attachment LUCENE-2919-3x.patch [ 12483034 ]
          Uwe Schindler added a comment -

          Committed trunk revision: 1137162

          Backporting...

          Uwe Schindler made changes -
          Attachment LUCENE-2919-filter.patch [ 12483036 ]
          Uwe Schindler added a comment -

          Final patch:

          • improved tests
          • changed the API so that an arbitrary Filter can be passed

          This is ready to commit; I will do so soon, as the current trunk is unfortunately broken (it splits incorrectly).

          Uwe Schindler made changes -
          Assignee Uwe Schindler [ thetaphi ]
          Uwe Schindler added a comment -

          I will fix the test and commit this, then backport again, using your TermPositions.

          Michael McCandless made changes -
          Attachment LUCENE-2919-3x.patch [ 12483034 ]
          Michael McCandless added a comment -

          Here's a patch back-porting the original approach to 3.x.

          Michael McCandless added a comment -

          Patch looks great Uwe! I love how generic it is now, that you can just provide any Filter.

          Uwe Schindler made changes -
          Attachment LUCENE-2919-filter.patch [ 12483032 ]
          Uwe Schindler added a comment -

          New patch:

          • simplified the Filter logic
          • added an option to negate the filter in the IndexReader; this makes it possible to use only one TermRangeFilter and simply negate it for the second pass
          • made the code close resources correctly using IOUtils.closeSafely

          Tests are still ugly.
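
The "one filter, negated for the second pass" idea from the list above can be sketched in plain Java. This is a model only: `Predicate` stands in for a Lucene Filter, and the class and method names are hypothetical rather than taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Plain-Java model of a two-pass split; Predicate stands in for a Lucene Filter. */
final class NegatedFilterSketch {

    /** Keeps the documents the filter accepts, or the ones it rejects when negate is true. */
    static List<String> select(List<String> docs, Predicate<String> filter, boolean negate) {
        List<String> kept = new ArrayList<>();
        for (String doc : docs) {
            if (filter.test(doc) != negate) kept.add(doc);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("a", "b", "c", "d");
        Predicate<String> beforeC = key -> key.compareTo("c") < 0; // stand-in for a term range
        System.out.println(select(docs, beforeC, false)); // first pass:  [a, b]
        System.out.println(select(docs, beforeC, true));  // second pass: [c, d]
    }
}
```

Because the second pass is the exact complement of the first, every document ends up in exactly one of the two outputs.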

          Uwe Schindler made changes -
          Attachment LUCENE-2919-filter.patch [ 12483016 ]
          Uwe Schindler added a comment -

          Here is a patch that changes PKIndexSplitter to use a Filter of "allowed" documents.

          It is still hardcoded to use a TermRangeFilter, but a second, more flexible version could also use e.g. NumericRangeFilter, WildCardFilter or whatever.

          The test in the committed code had a bug (the second half of the index had to contain 1 more document, maybe that was the bug Mike mentioned or introduced?). The documentation says: If the midTerm is in the index, its document will be in the second index.

          I think the test should also be improved to check indexes with deleted documents.

          Maybe the Filter could automatically be negated by a boolean parameter to the FilterIndexReader's ctor.

          Uwe Schindler added a comment -

          Too late, already committed

          I will still provide patch tomorrow!

          Uwe Schindler added a comment -

          I would implement this a bit more flexibly:

          You could use a standard Filter to do the split, e.g. TermRangeFilter, and use its returned DocIdSet as a bit set (if the Filter does not return a bit set, which can be checked with instanceof Bits, wrap it in OpenBitSetDISI, as CachingWrapperFilter does). This would be more flexible, because the current code again duplicates some code from the other IndexSplitter and is again very specific. A simple tool, call it DocumentExtractor, could extract parts of a bigger index using any Filter.

          How about that?
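
A rough plain-Java illustration of the wrapping step suggested above: when the accepted doc ids are only available as an iterator, they are copied into a bit set so that membership can be checked per document. `java.util.BitSet` and `PrimitiveIterator.OfInt` are stand-ins chosen for this sketch; they are not the Lucene classes named in the comment, and the class and method names are hypothetical.

```java
import java.util.BitSet;
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

/** Plain-Java model; java.util.BitSet stands in for a Lucene bit set. */
final class DocIdBitsSketch {

    /** Copies the doc ids produced by an iterator into a BitSet for random access. */
    static BitSet toBitSet(PrimitiveIterator.OfInt docIds, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        while (docIds.hasNext()) {
            bits.set(docIds.nextInt());
        }
        return bits;
    }

    public static void main(String[] args) {
        // Doc ids accepted by some filter, available only as an iterator.
        BitSet allowed = toBitSet(IntStream.of(0, 2, 5).iterator(), 8);
        System.out.println(allowed);        // {0, 2, 5}
        System.out.println(allowed.get(2)); // true
        System.out.println(allowed.get(3)); // false
    }
}
```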

          Michael McCandless added a comment -

          Patch looks good Jason! Sorry for the long delay... I'll commit shortly.

          One small thing I fixed: I think the term.compareTo(endTermExcl) > 0 should be a >= 0?
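
Why the comparison matters for an exclusive end term can be shown with a small plain-Java loop. This is a sketch, not the code from the patch: plain strings stand in for terms, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of iterating sorted terms up to an exclusive end term. */
final class ExclusiveEndSketch {

    /** Collects the sorted terms that come strictly before endTermExcl. */
    static List<String> termsBefore(List<String> sortedTerms, String endTermExcl) {
        List<String> out = new ArrayList<>();
        for (String term : sortedTerms) {
            // With "> 0" a term equal to endTermExcl would still be collected,
            // which contradicts the end being exclusive.
            if (term.compareTo(endTermExcl) >= 0) break;
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(termsBefore(List.of("a", "b", "c", "d"), "c")); // [a, b]
    }
}
```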

          Jason Rutherglen made changes -
          Link This issue blocks HBASE-3529 [ HBASE-3529 ]
          Jason Rutherglen made changes -
          Attachment LUCENE-2919.patch [ 12472584 ]
          Jason Rutherglen added a comment -

          First cut. Roughly divides an index by the exclusive mid term given.

          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12562726 ] jira [ 12583654 ]
          Mark Thomas made changes -
          Field Original Value New Value
          Workflow jira [ 12552966 ] Default workflow, editable Closed status [ 12562726 ]
          Jason Rutherglen created issue -

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Jason Rutherglen
            • Votes:
              0
              Watchers:
              5
