Lucene - Core
  1. Lucene - Core
  2. LUCENE-3354

Extend FieldCache architecture to multiple Values

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None

      Description

      I would consider this a bug. It appears lots of people are working around this limitation,
      why don't we just change the underlying data structures to natively support multiValued fields in the FieldCache architecture?

      Then functions() will work properly, and we can do things like easily geodist() on a multiValued field.

      Thoughts?

      1. LUCENE-3354_testspeed.patch
        0.7 kB
        Robert Muir
      2. LUCENE-3354.patch
        12 kB
        Martijn van Groningen
      3. LUCENE-3354.patch
        6 kB
        Martijn van Groningen

        Issue Links

          Activity

          Hide
          Varun Thacker added a comment -

          Hi,

          I have a doubt on FieldCache supporting MultiValued fields in general. So FieldCache on a multiValued field works by consuming it from FieldCache.DocTermOrds but,

          • I was trying out FunctionQuery in Solr and still got a "cannot FieldCache on multiValued field" error. This is because any impl. of FieldCacheSource for example StrFieldSource#getValues() returns DocTermsIndexDocValues where FieldCache.DocTermsIndex instance loads up. Is this supposed to be consumed like this?
          • Secondly slightly off topic but I went through the lucene4547 branch where there was a discussion on how to consume DocValues. I'm still trying to figure a lot of stuff around DocValues, FieldCache etc. but do we need to discuss all these issues and it's impact on Solr and ES as a whole?
          Show
          Varun Thacker added a comment - Hi, I have a doubt on FieldCache supporting MultiValued fields in general. So FieldCache on a multiValued field works by consuming it from FieldCache.DocTermOrds but, I was trying out FunctionQuery in Solr and still got a "cannot FieldCache on multiValued field" error. This is because any impl. of FieldCacheSource for example StrFieldSource#getValues() returns DocTermsIndexDocValues where FieldCache.DocTermsIndex instance loads up. Is this supposed to be consumed like this? Secondly slightly off topic but I went through the lucene4547 branch where there was a discussion on how to consume DocValues. I'm still trying to figure a lot of stuff around DocValues, FieldCache etc. but do we need to discuss all these issues and it's impact on Solr and ES as a whole?
          Hide
          Martijn van Groningen added a comment -

          Test passes on Jenkins.

          Show
          Martijn van Groningen added a comment - Test passes on Jenkins.
          Hide
          Robert Muir added a comment -

          OK, thanks. I bet this was probably slowing things down for simpletext or something stupid

          Show
          Robert Muir added a comment - OK, thanks. I bet this was probably slowing things down for simpletext or something stupid
          Hide
          Martijn van Groningen added a comment -

          I don't think there is any reason for generating long unicode strings. Only the cache behavior needs to be tested.

          Show
          Martijn van Groningen added a comment - I don't think there is any reason for generating long unicode strings. Only the cache behavior needs to be tested.
          Hide
          Robert Muir added a comment -

          attached is a patch that seems to help for me, it doesn't create such long unicode strings in the test.

          Is there some reason why the test would want very long strings?

          Show
          Robert Muir added a comment - attached is a patch that seems to help for me, it doesn't create such long unicode strings in the test. Is there some reason why the test would want very long strings?
          Hide
          Robert Muir added a comment -

          Thanks Martijn: any idea how we can speed this test up? for our 'ant test' runs with multiplier=3, this takes a significant amount of time (over 15 minutes!), more than all the other tests combined.

          Before the commit my builds were taking about 9 minutes, log here: http://sierranevada.servebeer.com/

              [junit] Testsuite: org.apache.lucene.search.TestFieldCache
              [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1,062.362 sec
          
          Show
          Robert Muir added a comment - Thanks Martijn: any idea how we can speed this test up? for our 'ant test' runs with multiplier=3, this takes a significant amount of time (over 15 minutes!), more than all the other tests combined. Before the commit my builds were taking about 9 minutes, log here: http://sierranevada.servebeer.com/ [junit] Testsuite: org.apache.lucene.search.TestFieldCache [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1,062.362 sec
          Hide
          Martijn van Groningen added a comment -

          I committed a fix. Test pass now on my local box with -Dtests.multiplier=3.
          If build is successful on Jenkins we can close this issue.

          Show
          Martijn van Groningen added a comment - I committed a fix. Test pass now on my local box with -Dtests.multiplier=3. If build is successful on Jenkins we can close this issue.
          Hide
          Robert Muir added a comment -

          The new multivalued test in TestFieldCache exhibits some problems if NUM_ORD > 2.

          This is the case if you e.g. use -Dtests.multiplier=3 (like hudson does)... I temporarily disabled it and put in a loud system.out.println

          -    NUM_ORDS = atLeast(2);
          +    System.out.println("WARNING: NUM_ORDS is wired to 2, test fails otherwise!!!!!!!!!!!!!!!!!!!!!");
          +    NUM_ORDS = 2; //atLeast(2);
          
          Show
          Robert Muir added a comment - The new multivalued test in TestFieldCache exhibits some problems if NUM_ORD > 2. This is the case if you e.g. use -Dtests.multiplier=3 (like hudson does)... I temporarily disabled it and put in a loud system.out.println - NUM_ORDS = atLeast(2); + System.out.println("WARNING: NUM_ORDS is wired to 2, test fails otherwise!!!!!!!!!!!!!!!!!!!!!"); + NUM_ORDS = 2; //atLeast(2);
          Hide
          Robert Muir added a comment -

          reopening: there is a problem in the test

          Show
          Robert Muir added a comment - reopening: there is a problem in the test
          Hide
          Martijn van Groningen added a comment -

          Committed in revision 1158393 (trunk).

          Show
          Martijn van Groningen added a comment - Committed in revision 1158393 (trunk).
          Hide
          Michael McCandless added a comment -

          Patch looks good Martijn!

          Show
          Michael McCandless added a comment - Patch looks good Martijn!
          Hide
          Martijn van Groningen added a comment -

          Updated the patch. Added a test to TestFieldCache. I think this is ready to be committed. New issues should be concerned with integrating DocTermOrds into function queries, sorting, grouping and more.

          Show
          Martijn van Groningen added a comment - Updated the patch. Added a test to TestFieldCache. I think this is ready to be committed. New issues should be concerned with integrating DocTermOrds into function queries, sorting, grouping and more.
          Hide
          Martijn van Groningen added a comment -

          Oops, uploaded the wrong patch.

          Show
          Martijn van Groningen added a comment - Oops, uploaded the wrong patch.
          Hide
          Martijn van Groningen added a comment -

          Attached initial patch.
          FieldCache has a new method:

          FieldCache#getDocTermOrds(reader, field)
          

          The DocTermOrdsCreator currently doesn't validate any thing. I'm not sure what it should validate (DocTermsIndex doesn't validate either...).

          This patch does not rely on the patch in LUCENE-3360. Implement LUCENE-3360 properly might take some time. I think issue can be implemented much quicker.

          Show
          Martijn van Groningen added a comment - Attached initial patch. FieldCache has a new method: FieldCache#getDocTermOrds(reader, field) The DocTermOrdsCreator currently doesn't validate any thing. I'm not sure what it should validate (DocTermsIndex doesn't validate either...). This patch does not rely on the patch in LUCENE-3360 . Implement LUCENE-3360 properly might take some time. I think issue can be implemented much quicker.
          Hide
          Martijn van Groningen added a comment -

          I opened LUCENE-3360 for moving FieldCache to IndexReader. This issue should be concerned with adding getDocTermOrds() to FieldCache.

          Show
          Martijn van Groningen added a comment - I opened LUCENE-3360 for moving FieldCache to IndexReader. This issue should be concerned with adding getDocTermOrds() to FieldCache.
          Hide
          Bill Bell added a comment -

          Lots of activity... Can someone lead this?

          Bill

          Show
          Bill Bell added a comment - Lots of activity... Can someone lead this? Bill
          Hide
          Hoss Man added a comment -

          This would also remove the insanity issues.

          FWIW: the WeakHashMap isn't the sole source of "insanity" - that can also come about from inconsistent usage for a single field (ie: asking for string and int caches for the same field)

          Show
          Hoss Man added a comment - This would also remove the insanity issues. FWIW: the WeakHashMap isn't the sole source of "insanity" - that can also come about from inconsistent usage for a single field (ie: asking for string and int caches for the same field)
          Hide
          Uwe Schindler added a comment -

          If that ability is removed from Lucene, I guess we could always move some of the old FieldCache logic to Solr though.

          Solr can always use SlowMultiReaderWrapper (see above)

          Show
          Uwe Schindler added a comment - If that ability is removed from Lucene, I guess we could always move some of the old FieldCache logic to Solr though. Solr can always use SlowMultiReaderWrapper (see above)
          Hide
          Yonik Seeley added a comment -

          (icluding the broken Solr parts still using TopLevel FieldCache entries).

          Some top-level field cache uses are very much by design in Solr.
          If that ability is removed from Lucene, I guess we could always move some of the old FieldCache logic to Solr though.

          Show
          Yonik Seeley added a comment - (icluding the broken Solr parts still using TopLevel FieldCache entries). Some top-level field cache uses are very much by design in Solr. If that ability is removed from Lucene, I guess we could always move some of the old FieldCache logic to Solr though.
          Hide
          Michael McCandless added a comment -

          +1 to moving FC to atomic readers only, and let SlowMultiReaderWrapper absorb the insanity.

          Show
          Michael McCandless added a comment - +1 to moving FC to atomic readers only, and let SlowMultiReaderWrapper absorb the insanity.
          Hide
          Martijn van Groningen added a comment -

          What are thoughts on using DocValues rather then FieldCache?

          Maybe both should be available. Not all fields have indexed docvalues.

          We should start with this in 4.0! For backwards compatibility we could still have the FieldCache class, but just delegating.

          Changing the architecture seems like a big task to me. Maybe that should be done in a different issue. This issue will then depend on it.

          Show
          Martijn van Groningen added a comment - What are thoughts on using DocValues rather then FieldCache? Maybe both should be available. Not all fields have indexed docvalues. We should start with this in 4.0! For backwards compatibility we could still have the FieldCache class, but just delegating. Changing the architecture seems like a big task to me. Maybe that should be done in a different issue. This issue will then depend on it.
          Hide
          Robert Muir added a comment -

          +1, die insanity, die.

          Show
          Robert Muir added a comment - +1, die insanity, die.
          Hide
          Uwe Schindler added a comment -

          In general the FieldCache should come from the reader (and non-atomic readers should throw UOE) and not from a static method of a random abstract class somewhere in the search package. The orginal FieldCache design was broken and there are many issues around this. This would also remove the insanity issues. We can of course make SlowMultiReaderWrapper behave correct, but then all users know that they do something wrong (icluding the broken Solr parts still using TopLevel FieldCache entries).

          We should start with this in 4.0! For backwards compatibility we could still have the FieldCache class, but just delegating.

          Show
          Uwe Schindler added a comment - In general the FieldCache should come from the reader (and non-atomic readers should throw UOE) and not from a static method of a random abstract class somewhere in the search package. The orginal FieldCache design was broken and there are many issues around this. This would also remove the insanity issues. We can of course make SlowMultiReaderWrapper behave correct, but then all users know that they do something wrong (icluding the broken Solr parts still using TopLevel FieldCache entries). We should start with this in 4.0! For backwards compatibility we could still have the FieldCache class, but just delegating.
          Hide
          Ryan McKinley added a comment -

          What are thoughts on using DocValues rather then FieldCache?

          If we do choose to extend the FieldCache architecture, it would be so much cleaner if it were a simple Map<K,V> directly on the Reader rather then a static thing holding a WeakHashMap<Reader,Cache>

          Show
          Ryan McKinley added a comment - What are thoughts on using DocValues rather then FieldCache? If we do choose to extend the FieldCache architecture, it would be so much cleaner if it were a simple Map<K,V> directly on the Reader rather then a static thing holding a WeakHashMap<Reader,Cache>
          Hide
          Martijn van Groningen added a comment -

          +1. If DocTermOrds is available in FieldCache, then Grouping (Term based impl) can also use DocTermOrds.

          Show
          Martijn van Groningen added a comment - +1. If DocTermOrds is available in FieldCache, then Grouping (Term based impl) can also use DocTermOrds.
          Hide
          Michael McCandless added a comment -

          +1, though really this should be a Lucene issue (FieldCache is in Lucene).

          We actually have a start at this: the core part of UnInvertedField was factored into Lucene as oal.index.DocTermOrds. I think all we need to do is make this accessible through FieldCache.

          Show
          Michael McCandless added a comment - +1, though really this should be a Lucene issue (FieldCache is in Lucene). We actually have a start at this: the core part of UnInvertedField was factored into Lucene as oal.index.DocTermOrds. I think all we need to do is make this accessible through FieldCache.

            People

            • Assignee:
              Unassigned
              Reporter:
              Bill Bell
            • Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development