Lucene - Core
  1. Lucene - Core
  2. LUCENE-5025

Allow more than 2.1B "tail nodes" when building FST

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4, 6.0
    • Component/s: core/FSTs
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      We recently relaxed some of the limits for big FSTs, but there is
      one more limit I think we should fix. E.g. Aaron hit it in building
      the world's biggest FST: http://aaron.blog.archive.org/2013/05/29/worlds-biggest-fst/

      The issue is NodeHash, which currently uses a GrowableWriter (packed
      ints impl that can grow both number of bits and number of values):
      it's indexed by int not long.

      This is a hash table that's used to share suffixes, so we need random
      get/put on a long index of long values, i.e. this is logically a long[].

      I think one simple way to do this is to make a "paged"
      GrowableWriter...

      Along with this we'd need to fix the hash codes to be long not
      int.

      1. LUCENE-5025.patch
        19 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Initial patch, also folding in Adrien's patch from LUCENE-5026.

        I had to make one change to that patch so that the last page (GrowableWriter) is sized down, ie if I use 1<<30 page size but only need 100 values then that single GrowableWriter should be size=100 I think.

        I'm running Test2BFSTs now ...

        Show
        Michael McCandless added a comment - Initial patch, also folding in Adrien's patch from LUCENE-5026 . I had to make one change to that patch so that the last page (GrowableWriter) is sized down, ie if I use 1<<30 page size but only need 100 values then that single GrowableWriter should be size=100 I think. I'm running Test2BFSTs now ...
        Hide
        Robert Muir added a comment -

        I had to make one change to that patch so that the last page (GrowableWriter) is sized down, ie if I use 1<<30 page size but only need 100 values then that single GrowableWriter should be size=100 I think.

        I played with this in the patch, seems to work fine!
        There is a stray print left in rehash(), but precommit should find that

        Show
        Robert Muir added a comment - I had to make one change to that patch so that the last page (GrowableWriter) is sized down, ie if I use 1<<30 page size but only need 100 values then that single GrowableWriter should be size=100 I think. I played with this in the patch, seems to work fine! There is a stray print left in rehash(), but precommit should find that
        Hide
        Michael McCandless added a comment -

        Thanks Rob.

        Test2BFSTs also passed, after 4.5 hours and 40 GB heap!

        I think it's ready ... I'll clean up that print and the nocommits and commit ... thanks Adrien for creating PagedGrowableWriter!

        Show
        Michael McCandless added a comment - Thanks Rob. Test2BFSTs also passed, after 4.5 hours and 40 GB heap! I think it's ready ... I'll clean up that print and the nocommits and commit ... thanks Adrien for creating PagedGrowableWriter!
        Hide
        Michael McCandless added a comment -

        Hmm just hit a failure in the test case for PagedGrowableWriter:

        ant test  -Dtestcase=TestPackedInts -Dtests.method=testPagedGrowableWriter -Dtests.seed=5F62E708DEB71329 -Dtests.slow=true -Dtests.locale=ar_BH -Dtests.timezone=America/Tijuana -Dtests.file.encoding=UTF-8
        

        I haven't looked into it yet ...

        Show
        Michael McCandless added a comment - Hmm just hit a failure in the test case for PagedGrowableWriter: ant test -Dtestcase=TestPackedInts -Dtests.method=testPagedGrowableWriter -Dtests.seed=5F62E708DEB71329 -Dtests.slow=true -Dtests.locale=ar_BH -Dtests.timezone=America/Tijuana -Dtests.file.encoding=UTF-8 I haven't looked into it yet ...
        Hide
        Robert Muir added a comment -

        I think that bug is in PagedGrowableWriter.resize().

        Currently, if you resize one where the last page isnt full, you try to access values that dont exist,
        because it just copies pagesize() values.

        I think instead it should be (something like):

        for (int i = 0; i < numCommonPages; ++i) {
              ...
              final int valuesToCopy;
              // if its the last page, it might be sized down
              if (i == subWriters.length - 1) {
                valuesToCopy = (int)(size() % pageSize());
              } else {
                valuesToCopy = pageSize();
              }
              PackedInts.copy(subWriters[i], 0, newWriter.subWriters[i], 0, valuesToCopy, copyBuffer);
            }
        
        Show
        Robert Muir added a comment - I think that bug is in PagedGrowableWriter.resize(). Currently, if you resize one where the last page isnt full, you try to access values that dont exist, because it just copies pagesize() values. I think instead it should be (something like): for ( int i = 0; i < numCommonPages; ++i) { ... final int valuesToCopy; // if its the last page, it might be sized down if (i == subWriters.length - 1) { valuesToCopy = ( int )(size() % pageSize()); } else { valuesToCopy = pageSize(); } PackedInts.copy(subWriters[i], 0, newWriter.subWriters[i], 0, valuesToCopy, copyBuffer); }
        Hide
        Michael McCandless added a comment -

        Thanks Rob! That's it (due to a similar fix I made in the ctor ...). I'll fix.

        Show
        Michael McCandless added a comment - Thanks Rob! That's it (due to a similar fix I made in the ctor ...). I'll fix.
        Hide
        Steve Rowe added a comment -

        Bulk close resolved 4.4 issues

        Show
        Steve Rowe added a comment - Bulk close resolved 4.4 issues

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development