Lucene - Core
LUCENE-6825

Add multidimensional byte[] indexing support to Lucene

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I think we should graduate the low-level block KD-tree data structure
      from sandbox into Lucene's core?

      This can be used for very fast 1D range filtering for numerics,
      removing the 8 byte (long/double) limit we have today, so e.g. we
      could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.

      It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
      (x/y/z with geo3d) geo shape intersection searches.

      The idea here is to add a new part of the Codec API (DimensionalFormat
      maybe?) that can do low-level N-dim point indexing and at runtime
      exposes only an "intersect" method.
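
      To make that concrete, here's a rough sketch of the shape such an
      intersect-only read API could take (all names below are hypothetical,
      nothing here is final):

        // Hypothetical sketch: the reader walks the tree, and the visitor
        // prunes whole cells by comparing against each cell's packed bounds.
        public interface DimensionalReaderSketch {

          enum Relation { CELL_INSIDE_QUERY, CELL_CROSSES_QUERY, CELL_OUTSIDE_QUERY }

          interface IntersectVisitor {
            // Called for each doc in a matching leaf; packedValue is
            // numDims * bytesPerDim bytes.
            void visit(int docID, byte[] packedValue);

            // Called once per tree cell with its min/max packed corners.
            Relation compare(byte[] minPackedValue, byte[] maxPackedValue);
          }

          void intersect(String field, IntersectVisitor visitor);
        }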

      It should give sizable performance gains (smaller index, faster
      searching) over what we have today, and even over what auto-prefix
      with efficient numeric terms would do.

      There are many steps here ... and I think adding this is analogous to
      how we added FSTs, where we first added low level data structure
      support and then gradually cutover the places that benefit from an
      FST.

      So for the first step, I'd like to just add the low-level block
      KD-tree impl into oal.util.bkd, but make a couple improvements over
      what we have now in sandbox:

      • Use byte[] as the value not int (@rjernst's good idea!)
      • Generalize it to arbitrary dimensions vs. specialized/forked 1D,
        2D, 3D cases we have now

      This is already hard enough. After that we can build the
      DimensionalFormat on top, then cutover the existing specialized block
      KD-trees. We also need to fix OfflineSorter to use the Directory API
      so we don't fill up /tmp when building a block KD-tree.
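
      For concreteness, the write-side API I have in mind is roughly this
      (a sketch only; the signatures are not the committed API):

        import java.io.IOException;

        // Sketch of the byte[]-based writer: one opaque, fixed-width packed
        // value per point. Names and signatures are illustrative only.
        interface BKDWriterSketch {
          // packedValue.length == numDims * bytesPerDim
          void add(byte[] packedValue, int docID) throws IOException;

          // Builds the tree from the buffered/spilled points, writes the
          // index file, and returns the root's file pointer for the reader.
          long finish() throws IOException;
        }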

      A block KD-tree is at heart an inverted data structure, like postings,
      but is also similar to auto-prefix in that it "picks" proper
      N-dimensional "terms" (leaf blocks) to index based on how the specific
      data being indexed is distributed. I think this is a big part of why
      it's so fast, i.e. in contrast to today where we statically slice up
      the space into the same terms regardless of the data (trie shifting,
      morton codes, geohash, hilbert curves, etc.)

      I'm marking this as trunk only for now... as we iterate we can see if
      it could maybe go back to 5.x...

      Attachments

      1. LUCENE-6825.patch
        102 kB
        Michael McCandless
      2. LUCENE-6825.patch
        72 kB
        Michael McCandless

        Activity

        mikemccand Michael McCandless added a comment -

        Initial patch.

        Many nocommits still, but the randomized multi-dimensional int[]
        indexing test case seems to pass at least once.

        There are many classes under oal.util.bkd, but all are package private
        except for the BKDWriter/Reader classes.

        One limitation here is that every dimension must be the same number of
        bytes. This is fine for all the KD-tree use cases today, but e.g. it
        means you can't have one dim that's a long, and another dim that's an
        int. I think this is fine for starters (progress not perfection!),
        and it is a big simplification of the code since it means all encoding
        while building the tree is fixed byte width per document.
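
        E.g. with this constraint, packing a point is pure offset
        arithmetic, roughly like this (hypothetical helper, not code from
        the patch):

          // Fixed-width packing: each dim is exactly bytesPerDim wide, so a
          // point is numDims * bytesPerDim bytes and dim i starts at offset
          // i * bytesPerDim. Assumes values were already mapped to sortable
          // (unsigned) order, e.g. by flipping the sign bit of signed ints.
          class PackSketch {
            static byte[] pack(int[] dims, int bytesPerDim) {
              byte[] packed = new byte[dims.length * bytesPerDim];
              for (int dim = 0; dim < dims.length; dim++) {
                int v = dims[dim];
                for (int i = bytesPerDim - 1; i >= 0; i--) {
                  packed[dim * bytesPerDim + i] = (byte) v;  // big-endian
                  v >>>= 8;
                }
              }
              return packed;
            }
          }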

        This is just the low-level data structure! It's like FST.java.
        Later (separate issues, separate commits) we need DimensionalFormat,
        queries that use the read-time API to execute, etc.

        I'll open a separate issue to cutover OfflineSorter to Directory API;
        I think it's a blocker for this one.

        rjernst Ryan Ernst added a comment -

        This looks good. I think this is a great plan, to separate this out into a separate codec format, and to do it in pieces. And this patch is a good start.

        A few comments on the patch:

        • I see a number of debugging s.o.p.s; these should be commented out or removed?
        • Is this nocommit still relevant?

          // nocommit can we get this back?
          //state.docs.grow(count);

        • There is a nocommit on the bytesToInt method. I think as a follow up we should investigate packing values, but is the nocommit still needed? Also, it seems to exist in both reader and writer? Could it be in one place, but still package private, perhaps in `oal.util.bkd.Util`?
        • Can we avoid the bitflip in bytesToInt by using a long accumulator and casting to int? we can assert no upper bits are set before casting?
        • Can we limit the num dims to 3 for now? I see a check for < 255 to fit in a byte, but it might be nice later to use those extra bits for some other information (essentially let's reserve the bits for now, instead of allowing a silly number of dimensions)
        • In `BKDWriter.finish()`, can the try/catch around building be simplified? I think you could remove the `success` marker and do the regular cleanup after build, and change the `finally` to a `catch`, then add any failures when destroying the per dim writers as suppressed exceptions to the original? (Rough sketch after this list.)
        • In `BKDWriter.build()`, there is a nocommit about destroying perdim writers, but I think that is handled by the caller in `finish()` mentioned in my previous comment? I also see some destroy calls below that...is there double destroying going on, or is this more complicated than it looks?
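
        Roughly what I mean for the `finish()` cleanup (a sketch; the method
        names are placeholders for the real BKDWriter internals):

          import java.io.IOException;

          // No success flag: normal cleanup runs after build; on failure the
          // catch attaches any cleanup exception as suppressed, then rethrows.
          abstract class FinishSketch {
            abstract void build() throws IOException;
            abstract void destroyPerDimWriters() throws IOException;

            void finish() throws IOException {
              try {
                build();
              } catch (Throwable t) {
                try {
                  destroyPerDimWriters();
                } catch (Throwable suppressed) {
                  t.addSuppressed(suppressed);
                }
                throw t;  // precise rethrow: only IOException/unchecked escape
              }
              destroyPerDimWriters();  // regular cleanup on success
            }
          }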
        nknize Nicholas Knize added a comment -

        +1000 for graduating the data structure to core.

        // Sort all docs once by lat, once by lon:

        • I'm assuming lat, lon specific naming will be refactored to more generalized naming?
        • In an XTree implementation (similar to BKD but with more rigorous split criteria) I ran into bias issues when sorting by incremental dimensions (simple for loop like this). This is often why the sort is done on a reduced-dimension encoding value (e.g., Hilbert, Morton). This is particularly important as the tree grows (which I'm guessing happens when BKD segments merge?). Maybe another new issue to investigate simple interleave packing and sorting on the packed value? (Sketch just below.)
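
        For reference, by interleave packing I mean the standard Morton
        (Z-order) bit interleave, e.g. for two 32-bit dims (illustrative
        code, not from the patch):

          // Bits of x land on even positions, bits of y on odd positions,
          // so sorting on the packed long groups spatially nearby points.
          class MortonSketch {
            static long interleave(int x, int y) {
              long packed = 0;
              for (int i = 0; i < 32; i++) {
                packed |= ((long) (x >>> i) & 1L) << (2 * i);
                packed |= ((long) (y >>> i) & 1L) << (2 * i + 1);
              }
              return packed;
            }
          }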

        // Find which dim has the largest span so we can split on it:

        • Maybe refactor this into a split method? It would give an opportunity to override it when investigating improvements based on other split criteria (e.g., squareness, area). (Sketch below.)
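
        Something like this, keeping the current largest-span behavior as
        the default (types and names are just a sketch):

          import java.math.BigInteger;

          // Overridable split criterion; the default picks the dim with the
          // largest span. Bounds are unsigned, fixed-width byte[] per dim.
          class SplitSketch {
            protected int split(byte[][] minPackedValue, byte[][] maxPackedValue) {
              int bestDim = 0;
              BigInteger bestSpan = BigInteger.valueOf(-1);
              for (int dim = 0; dim < minPackedValue.length; dim++) {
                BigInteger span = new BigInteger(1, maxPackedValue[dim])
                    .subtract(new BigInteger(1, minPackedValue[dim]));
                if (span.compareTo(bestSpan) > 0) {
                  bestSpan = span;
                  bestDim = dim;
                }
              }
              return bestDim;
            }
          }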
        mikemccand Michael McCandless added a comment -

        Thanks Ryan Ernst!

        I see a number of debugging s.o.p,

        I'll nuke all of these once things seem to be working ... also, those bytesToInt in the reader/writer are temporary just for debugging the test case that's encoding ints into the byte[].

        Is this nocommit still relevant?

        // nocommit can we get this back?
        //state.docs.grow(count);

        Unfortunately, yes. One source of efficiency for the 1D,2D,3D specialized BKDs now is they can "notify" the consumer of the incoming docIDs that a chunk is coming ... but I didn't expose any means to do this in the intersect API. I could add it but held back for now since that dirties the read-time API. Though for the 1D case, this was a big speedup, because in that case I can give a pretty accurate estimate up front of how large the result set will be ... I'll mull, probably leave as a TODO for now.

        Can we avoid the bitflip in bytesToInt by using a long accumulator and casting to int?

        Hmm I don't think that's the same thing? We flip that bit so that in binary space negative ints sort before positive ints? But remember this is only a test issue at this point (for this issue) ... in future issues we will need clever "encode/decode foobar as byte[]" ...
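
        For reference, the flip in question is roughly this (a sketch of the
        test helpers, not necessarily the patch's exact code):

          // XOR with 0x80000000 flips the sign bit, mapping signed order
          // onto unsigned order, so byte-wise comparison of the big-endian
          // encoding matches numeric order.
          class SortableIntSketch {
            static void intToBytes(int value, byte[] dest, int offset) {
              int sortable = value ^ 0x80000000;
              for (int i = 3; i >= 0; i--) {
                dest[offset + i] = (byte) sortable;  // big-endian
                sortable >>>= 8;
              }
            }

            static int bytesToInt(byte[] src, int offset) {
              int sortable = 0;
              for (int i = 0; i < 4; i++) {
                sortable = (sortable << 8) | (src[offset + i] & 0xFF);
              }
              return sortable ^ 0x80000000;
            }
          }

        E.g. -1 encodes as 7F FF FF FF and 0 as 80 00 00 00, so negatives
        sort first byte-wise.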

        Can we limit the num dims to 3 for now?

        How about limit to 15? I think 3 is a bit too low (and 255 is way too high!), but there are maybe interesting things to do w/ more than 3 dims, e.g. index household income along with x, y, z points or something. So this would still give us 4 free bits in case we want them later...

        In `BKDWriter.finish()`, can the try/catch around building be simplified?

        I'll try your suggestion. I think if I change destroy to close instead, I can just tap into IOUtils goodness...

        In `BKDWriter.build()`, there is a nocommit about destroying perdim writers, but I think that is handled by the caller in `finish()`

        It is confusing!

        Each level of the recursion may have written its own files, so I think we need the top-level destroy (removes the fully-sorted-per-dim files we created up front) and the destroy as we recurse (removes the partitioned subset for each dim).

        mikemccand Michael McCandless added a comment -

        Thanks Nicholas Knize!

        I'm assuming lat, lon specific naming will be refactored to more generalized naming?

        Woops, yes: I have a nocommit to make sure I don't talk about lat/lon anymore. I'll fix ...

        In an XTree implementation (similar to BKD but with more rigorous split criteria) I ran into bias issues when sorting by incremental dimensions (simple for loop like this).

        The bias only happens if we index shapes, right? I agree for indexing shapes things get more complex, but I was hoping to do that "later" and focus on making points work well for now.

        Maybe refactor this into a split method? It would give an opportunity to override for investigating improvements based on other split criteria (e.g., squareness, area)

        Good idea, will do!

        jira-bot ASF subversion and git services added a comment -

        Commit 1707202 from Michael McCandless in branch 'dev/branches/lucene6825'
        [ https://svn.apache.org/r1707202 ]

        LUCENE-6825: make branch

        jira-bot ASF subversion and git services added a comment -

        Commit 1707203 from Michael McCandless in branch 'dev/branches/lucene6825'
        [ https://svn.apache.org/r1707203 ]

        LUCENE-6825: starting patch

        mikemccand Michael McCandless added a comment -

        I put the current patch on this branch: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene6825

        I'll iterate there...

        jira-bot ASF subversion and git services added a comment -

        Commit 1707206 from Michael McCandless in branch 'dev/branches/lucene6825'
        [ https://svn.apache.org/r1707206 ]

        LUCENE-6825: rename classes

        jira-bot ASF subversion and git services added a comment -

        Commit 1707213 from Michael McCandless in branch 'dev/branches/lucene6825'
        [ https://svn.apache.org/r1707213 ]

        LUCENE-6825: fix some nocommits; remove some difficult try/finally logic

        dsmiley David Smiley added a comment -

        Exciting! Mike, have you considered a new module for this, like "bkd" or "multidimensional"? Or of course the existing spatial module? I wonder what others think.

        mikemccand Michael McCandless added a comment -

        have you considered a new module for this

        Well, I think a codec format is a natural way to expose this service, since (like postings, doc values, etc.), it's a low-level utility that can be used for diverse use cases (2D and 3D spatial, numeric range filtering, binary range filtering so we can support IPv6, BigInteger, BigDecimal, etc.). For it to be exposed as a part of the codec means it needs to be in core...

        jira-bot ASF subversion and git services added a comment -

        Commit 1708778 from Michael McCandless in branch 'dev/branches/lucene6825'
        [ https://svn.apache.org/r1708778 ]

        LUCENE-6825: merge trunk

        jira-bot ASF subversion and git services added a comment -

        Commit 1708913 from Michael McCandless in branch 'dev/branches/lucene6825'
        [ https://svn.apache.org/r1708913 ]

        LUCENE-6825: cutover to Directory API, fix some bugs

        jira-bot ASF subversion and git services added a comment -

        Commit 1709001 from Michael McCandless in branch 'dev/branches/lucene6825'
        [ https://svn.apache.org/r1709001 ]

        LUCENE-6825: remove sops, add more tests (e.g. BigInteger!); cutover to try-with-resources/TrackingDirectoryWrapper; remove some nocommits

        jira-bot ASF subversion and git services added a comment -

        Commit 1709330 from Michael McCandless in branch 'dev/branches/lucene6825'
        [ https://svn.apache.org/r1709330 ]

        LUCENE-6825: remove all nocommits; add missing MDW.createTempOutput wrapping; fix double-write per dim during build

        jira-bot ASF subversion and git services added a comment -

        Commit 1709331 from Michael McCandless in branch 'dev/branches/lucene6825'
        [ https://svn.apache.org/r1709331 ]

        LUCENE-6825: merge trunk

        jira-bot ASF subversion and git services added a comment -

        Commit 1709429 from Michael McCandless in branch 'dev/branches/lucene6825'
        [ https://svn.apache.org/r1709429 ]

        LUCENE-6825: simplify the writer/reader classes, remove DEBUG outputs, fix ant precommit

        jira-bot ASF subversion and git services added a comment -

        Commit 1709473 from Michael McCandless in branch 'dev/branches/lucene6825'
        [ https://svn.apache.org/r1709473 ]

        LUCENE-6825: merge trunk

        mikemccand Michael McCandless added a comment -

        I think this first step is ready. Tests pass, ant precommit passes.

        Here's a patch from the branch vs. current trunk.

        It just adds the low-level block KD data structure to oal.util.bkd, plus a couple other fixes:

        • I forgot to add a wrapper for createTempOutput to MDW in
          LUCENE-6829
        • Improved OfflineSorter a bit: don't create so much young garbage
          while sorting, and allow fixed-width byte[] sequence encoding so
          BKD doesn't have to double-write its inputs (rough sketch of the
          fixed-width idea below)
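
        The fixed-width point, roughly: when every entry is the same length
        there is no per-record length prefix, and record i is plain offset
        arithmetic (a sketch only; OfflineSorter's real API differs):

          // Fixed-width records: entry i starts at i * recordLen, so reads
          // and swaps during sorting need no length headers.
          class FixedWidthSketch {
            final byte[] block;   // records packed back-to-back
            final int recordLen;  // e.g. numDims * bytesPerDim + 4 (docID)

            FixedWidthSketch(byte[] block, int recordLen) {
              this.block = block;
              this.recordLen = recordLen;
            }

            // Copies record i into dest; dest.length must equal recordLen.
            void read(int i, byte[] dest) {
              System.arraycopy(block, i * recordLen, dest, 0, recordLen);
            }
          }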

        Next step (I'll open a follow-on issue) is to make a DimensionFormat that uses this data structure...

        jira-bot ASF subversion and git services added a comment -

        Commit 1709705 from Michael McCandless in branch 'dev/branches/lucene6825'
        [ https://svn.apache.org/r1709705 ]

        LUCENE-6825: pull out MAX_DIMS as constant; remove sop; add comment

        jira-bot ASF subversion and git services added a comment -

        Commit 1709778 from Michael McCandless in branch 'dev/branches/lucene6825'
        [ https://svn.apache.org/r1709778 ]

        LUCENE-6825: put this back

        jira-bot ASF subversion and git services added a comment -

        Commit 1709779 from Michael McCandless in branch 'dev/branches/lucene6825'
        [ https://svn.apache.org/r1709779 ]

        LUCENE-6825: merge trunk

        jira-bot ASF subversion and git services added a comment -

        Commit 1709783 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1709783 ]

        LUCENE-6825: add low-level support for block-KD trees

        jira-bot ASF subversion and git services added a comment -

        Commit 1709928 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1709928 ]

        LUCENE-6825: make sure we close open file on exception

        jira-bot ASF subversion and git services added a comment -

        Commit 1709930 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1709930 ]

        LUCENE-6825: fix test bug

        jira-bot ASF subversion and git services added a comment -

        Commit 1710752 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1710752 ]

        LUCENE-6825: add dimensionally indexed values

        steve_rowe Steve Rowe added a comment -

        My Jenkins found a seed that reproduces for me: TestDimensionalValues tests (testMultiValued() 100% and testMerge() sometimes) trigger an NPE in DimensionalWriter.merge() in the DimensionalReader.intersect() implementation there:

           [junit4] Suite: org.apache.lucene.index.TestDimensionalValues
           [junit4]   2> NOTE: download the large Jenkins line-docs file by running 'ant get-jenkins-line-docs' in the lucene directory.
           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestDimensionalValues -Dtests.method=testMultiValued -Dtests.seed=367B5FB4E6C5CEFF -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt -Dtests.locale=ga_IE -Dtests.timezone=Mexico/BajaSur -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
           [junit4] ERROR   0.57s J0 | TestDimensionalValues.testMultiValued <<<
           [junit4]    > Throwable #1: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([367B5FB4E6C5CEFF:E25B3B8628078EB7]:0)
           [junit4]    > 	at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:713)
           [junit4]    > 	at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:727)
           [junit4]    > 	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1457)
           [junit4]    > 	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1240)
           [junit4]    > 	at org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:173)
           [junit4]    > 	at org.apache.lucene.index.TestDimensionalValues.verify(TestDimensionalValues.java:829)
           [junit4]    > 	at org.apache.lucene.index.TestDimensionalValues.verify(TestDimensionalValues.java:791)
           [junit4]    > 	at org.apache.lucene.index.TestDimensionalValues.testMultiValued(TestDimensionalValues.java:212)
           [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
           [junit4]    > Caused by: java.lang.NullPointerException
           [junit4]    > 	at org.apache.lucene.codecs.DimensionalWriter$1.intersect(DimensionalWriter.java:56)
           [junit4]    > 	at org.apache.lucene.codecs.simpletext.SimpleTextDimensionalWriter.writeField(SimpleTextDimensionalWriter.java:139)
           [junit4]    > 	at org.apache.lucene.codecs.DimensionalWriter.merge(DimensionalWriter.java:45)
           [junit4]    > 	at org.apache.lucene.index.SegmentMerger.mergeDimensionalValues(SegmentMerger.java:168)
           [junit4]    > 	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:117)
           [junit4]    > 	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4055)
           [junit4]    > 	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3635)
           [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
           [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
           [junit4]   2> DFómh 27, 2015 6:47:10 A.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
           [junit4]   2> WARNING: Uncaught exception in thread: Thread[Lucene Merge Thread #1,5,TGRP-TestDimensionalValues]
           [junit4]   2> org.apache.lucene.index.MergePolicy$MergeException: java.lang.NullPointerException
           [junit4]   2> 	at __randomizedtesting.SeedInfo.seed([367B5FB4E6C5CEFF]:0)
           [junit4]   2> 	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:668)
           [junit4]   2> 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:648)
           [junit4]   2> Caused by: java.lang.NullPointerException
           [junit4]   2> 	at org.apache.lucene.codecs.DimensionalWriter$1.intersect(DimensionalWriter.java:56)
           [junit4]   2> 	at org.apache.lucene.codecs.simpletext.SimpleTextDimensionalWriter.writeField(SimpleTextDimensionalWriter.java:139)
           [junit4]   2> 	at org.apache.lucene.codecs.DimensionalWriter.merge(DimensionalWriter.java:45)
           [junit4]   2> 	at org.apache.lucene.index.SegmentMerger.mergeDimensionalValues(SegmentMerger.java:168)
           [junit4]   2> 	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:117)
           [junit4]   2> 	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4055)
           [junit4]   2> 	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3635)
           [junit4]   2> 	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
           [junit4]   2> 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
           [junit4]   2> 
           [junit4]   2> NOTE: download the large Jenkins line-docs file by running 'ant get-jenkins-line-docs' in the lucene directory.
           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestDimensionalValues -Dtests.method=testMerge -Dtests.seed=367B5FB4E6C5CEFF -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/lucene-data/enwiki.random.lines.txt -Dtests.locale=ga_IE -Dtests.timezone=Mexico/BajaSur -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
           [junit4] ERROR   0.14s J0 | TestDimensionalValues.testMerge <<<
           [junit4]    > Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=222, name=Lucene Merge Thread #1, state=RUNNABLE, group=TGRP-TestDimensionalValues]
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([367B5FB4E6C5CEFF:85DA8150E91E786A]:0)
           [junit4]    > Caused by: org.apache.lucene.index.MergePolicy$MergeException: java.lang.NullPointerException
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([367B5FB4E6C5CEFF]:0)
           [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:668)
           [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:648)
           [junit4]    > Caused by: java.lang.NullPointerException
           [junit4]    > 	at org.apache.lucene.codecs.DimensionalWriter$1.intersect(DimensionalWriter.java:56)
           [junit4]    > 	at org.apache.lucene.codecs.simpletext.SimpleTextDimensionalWriter.writeField(SimpleTextDimensionalWriter.java:139)
           [junit4]    > 	at org.apache.lucene.codecs.DimensionalWriter.merge(DimensionalWriter.java:45)
           [junit4]    > 	at org.apache.lucene.index.SegmentMerger.mergeDimensionalValues(SegmentMerger.java:168)
           [junit4]    > 	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:117)
           [junit4]    > 	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4055)
           [junit4]    > 	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3635)
           [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
           [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
           [junit4]   2> NOTE: leaving temporary files on disk at: /var/lib/jenkins/jobs/Lucene-Solr-tests-trunk/workspace/lucene/build/core/test/J0/temp/lucene.index.TestDimensionalValues_367B5FB4E6C5CEFF-001
           [junit4]   2> NOTE: test params are: codec=Asserting(Lucene53): {}, docValues:{}, sim=RandomSimilarityProvider(queryNorm=false,coord=yes): {}, locale=ga_IE, timezone=Mexico/BajaSur
           [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_45 (64-bit)/cpus=16,threads=1,free=263920904,total=390594560
           [junit4]   2> NOTE: All tests run in this JVM: [TestTermScorer, Test2BPostings, TestLucene50CompoundFormat, TestFilterLeafReader, TestStressNRT, TestBagOfPostings, TestLazyProxSkipping, Test2BBinaryDocValues, TestTimSorter, TestPagedBytes, TestTerms, TestSizeBoundedForceMerge, TestMultiTermQueryRewrites, TestFileSwitchDirectory, TestSpanSearchEquivalence, TestDeletionPolicy, TestVersion, TestNumericDocValuesUpdates, TestSpanCollection, TestSpanBoostQuery, TestExternalCodecs, TestDocsAndPositions, TestTermVectors, TestSimpleFSLockFactory, TestMergedIterator, Test2BPagedBytes, TestTopDocsMerge, TestIsCurrent, TestNot, TestWeakIdentityMap, TestBytesRefHash, TestConstantScoreQuery, TestBufferedChecksum, TestRollingBuffer, TestSimilarity2, TestLSBRadixSorter, TestRollingUpdates, TestParallelTermEnum, TestDimensionalValues]
           [junit4] Completed [83/400] on J0 in 27.01s, 25 tests, 2 errors <<< FAILURES!
        
        mikemccand Michael McCandless added a comment -

        Thanks Steve Rowe, I'll dig ...

        jira-bot ASF subversion and git services added a comment -

        Commit 1710830 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1710830 ]

        LUCENE-6825: don't NPE when trying to merge a segment that has no documents that indexed dimensional values

        lvton Lvton. Smith added a comment -

        For the 1D case, would a B+ tree perform as well as the BKD tree? I think so. What about you, Michael?

        mikemccand Michael McCandless added a comment -

        I think a B+ tree would be overkill in the 1D case? Also, Lucene segments are write-once, so the ability/freedom to "insert" in the tree (and "delete") would be unused.

        lvton Lvton. Smith added a comment -

        It should give sizable performance gains (smaller index, faster searching) over what we have today, and even over what auto-prefix with efficient numeric terms would do.

        smaller index
        yes!!

        faster searching??
        I cannot understand that.
        "Auto-prefix" uses fewer terms, though a larger index. But "bkd", I
        think, is equivalent to using every number term within the range.
        Why faster?


          People

          • Assignee:
            mikemccand Michael McCandless
          • Reporter:
            mikemccand Michael McCandless
          • Votes: 0
          • Watchers: 10
