Lucene - Core
  1. Lucene - Core
  2. LUCENE-3199

Add non-desctructive sort to BytesRefHash

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently the BytesRefHash is destructive. We can add a method that returns a non-destructively generated int[].

      1. LUCENE-3199.patch
        6 kB
        Jason Rutherglen
      2. LUCENE-3199.patch
        7 kB
        Jason Rutherglen
      3. LUCENE-3199.patch
        22 kB
        Simon Willnauer
      4. LUCENE-3199.patch
        22 kB
        Simon Willnauer

        Issue Links

          Activity

          Hide
          Jason Rutherglen added a comment -

          I think the issue with this, as it relates to realtime search, is in order to sort, we'll need to freeze indexing.

          Show
          Jason Rutherglen added a comment - I think the issue with this, as it relates to realtime search, is in order to sort, we'll need to freeze indexing.
          Hide
          Jason Rutherglen added a comment -

          Here's a version of this issue. Added are a couple of new methods to TestBytesRefHash to test the new frozen compact and sorting functionality of BytesRefHash.

          This is being posted now because it's useful in relation to LUCENE-2312 and a terms dictionary that is composed of sorted by term[id]s int[]s.

          Show
          Jason Rutherglen added a comment - Here's a version of this issue. Added are a couple of new methods to TestBytesRefHash to test the new frozen compact and sorting functionality of BytesRefHash. This is being posted now because it's useful in relation to LUCENE-2312 and a terms dictionary that is composed of sorted by term [id] s int[]s.
          Hide
          Jason Rutherglen added a comment -

          This is a minor update when compared with the last patch.

          It adds the option of pruning the [oversized] int[] returned by the compact method.

          Added are additional unit tests.

          Show
          Jason Rutherglen added a comment - This is a minor update when compared with the last patch. It adds the option of pruning the [oversized] int[] returned by the compact method. Added are additional unit tests.
          Hide
          Simon Willnauer added a comment -

          hey jason, I actually moved this a little further and added a ReadOnly View To BytesRefHash. This View provides next(), seekExact() and seekCeil() methods just like we have TermsEnum.
          The view is actually sorted if needed and can incrementally merge with a previously created view.
          Initially I wondered if this approach would be feasible performance wise but in fact this is actually really fast. I did some poor-mans benchmarks where I opened a new view every 500 to 1000 new unique terms and this takes around 0.001 to 0.01 millisecond on average. I have never seen it taking longer than 0.1 ms. I think it would be worth while exploring if we can go that simple and reopen such a view for each document while we are indexing. The view actually allocates only one additional array and reuses all other references from the BytesRefHash instance. It seems this one additional int[] is not too bad though.

          the patch is still rough. I will work further on it next week.

          Show
          Simon Willnauer added a comment - hey jason, I actually moved this a little further and added a ReadOnly View To BytesRefHash. This View provides next(), seekExact() and seekCeil() methods just like we have TermsEnum. The view is actually sorted if needed and can incrementally merge with a previously created view. Initially I wondered if this approach would be feasible performance wise but in fact this is actually really fast. I did some poor-mans benchmarks where I opened a new view every 500 to 1000 new unique terms and this takes around 0.001 to 0.01 millisecond on average. I have never seen it taking longer than 0.1 ms. I think it would be worth while exploring if we can go that simple and reopen such a view for each document while we are indexing. The view actually allocates only one additional array and reuses all other references from the BytesRefHash instance. It seems this one additional int[] is not too bad though. the patch is still rough. I will work further on it next week.
          Hide
          Simon Willnauer added a comment -

          new version, fixed one concurrency issue and added some doc strings

          Show
          Simon Willnauer added a comment - new version, fixed one concurrency issue and added some doc strings
          Hide
          Uwe Schindler added a comment -

          Cool idea with the view! Policeman work: SorterTemplate looks correct

          Show
          Uwe Schindler added a comment - Cool idea with the view! Policeman work: SorterTemplate looks correct
          Hide
          Jason Rutherglen added a comment -

          Simon,

          In summary this is using the BytesRefHash sort, performing array copies and
          then merge [sorting] into a new copy / view.

          Array copies are fast and counter intuitively generate far less garbage than
          objects (in Java).

          Instead of creating term 'segments' that would be merged while iterating the
          terms enum, we'll be generating static point-in-time terms dict views. These
          will be useful for enabling DocTermsIndex field caches for RT, the only
          remaining design 'challenge' for RT / LUCENE-2312. Because there is a terms
          hash, we can seek exact to the term rather than perform an [optimized] seek to
          the term.

          Show
          Jason Rutherglen added a comment - Simon, In summary this is using the BytesRefHash sort, performing array copies and then merge [sorting] into a new copy / view. Array copies are fast and counter intuitively generate far less garbage than objects (in Java). Instead of creating term 'segments' that would be merged while iterating the terms enum, we'll be generating static point-in-time terms dict views. These will be useful for enabling DocTermsIndex field caches for RT, the only remaining design 'challenge' for RT / LUCENE-2312 . Because there is a terms hash, we can seek exact to the term rather than perform an [optimized] seek to the term.
          Hide
          Jason Rutherglen added a comment -

          Simon, I think your patch should be in a different issue, eg, sorted bytes ref hash view or something.

          Show
          Jason Rutherglen added a comment - Simon, I think your patch should be in a different issue, eg, sorted bytes ref hash view or something.
          Hide
          Jason Rutherglen added a comment -

          I started integrating the patch into LUCENE-2312. I think the main functionality missing is a reverse int[] that points from a term id to the sorted ords array. That array would be used for implementing the RT version of DocTermsIndex, where a doc id -> term id -> sorted term id index.

          Show
          Jason Rutherglen added a comment - I started integrating the patch into LUCENE-2312 . I think the main functionality missing is a reverse int[] that points from a term id to the sorted ords array. That array would be used for implementing the RT version of DocTermsIndex, where a doc id -> term id -> sorted term id index.
          Hide
          Jason Rutherglen added a comment -

          Ok, solved the above comment by taking the sorted ord array and building a new reverse array from that...

          Show
          Jason Rutherglen added a comment - Ok, solved the above comment by taking the sorted ord array and building a new reverse array from that...

            People

            • Assignee:
              Unassigned
              Reporter:
              Jason Rutherglen
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:

                Development