Lucene - Core
  1. Lucene - Core
  2. LUCENE-3121

FST should offer lookup-by-output API when output strictly increases

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: core/other
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff from "FST and FieldCache" java-dev thread http://lucene.markmail.org/thread/swoawlv3fq4dntvl

      FST is able to associate arbitrary outputs with the sorted input keys, but in the special (and, common) case where the function is strictly monotonic (each output only "increases" vs prior outputs), such as mapping to term ords or mapping to file offsets in the terms dict, we should offer a lookup-by-output API that efficiently walks the FST and locates input key (exact or floor or ceil) matching that output.

      1. LUCENE-3121.patch
        21 kB
        Michael McCandless

        Issue Links

          Activity

          Uwe Schindler made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Gavin made changes -
          Link This issue is depended upon by LUCENE-3729 [ LUCENE-3729 ]
          Gavin made changes -
          Link This issue blocks LUCENE-3729 [ LUCENE-3729 ]
          David Smiley made changes -
          Link This issue blocks LUCENE-3729 [ LUCENE-3729 ]
          Hide
          Michael McCandless added a comment -

          Yes, that is in fact the very idea I had originally that I raised on the list that you in turn created this JIRA issue for.

          Aha! Thanks for reminding me I think it would work well... the FST should compress the terms nicely; just a question of at what cost to lookup time. But, then, for some cases of DV/FC consumption the doc -> ord lookup is far more important (which wouldn't change), so it could be a good win...

          Show
          Michael McCandless added a comment - Yes, that is in fact the very idea I had originally that I raised on the list that you in turn created this JIRA issue for. Aha! Thanks for reminding me I think it would work well... the FST should compress the terms nicely; just a question of at what cost to lookup time. But, then, for some cases of DV/FC consumption the doc -> ord lookup is far more important (which wouldn't change), so it could be a good win...
          Hide
          David Smiley added a comment -

          That would be interesting! Or, maybe to DocValues instead...

          Yes, that is in fact the very idea I had originally that I raised on the list that you in turn created this JIRA issue for. It was almost a year ago so perhaps you forgot. Well this JIRA issue would be the first step, the next step would be another JIRA issue for Field Cache integration and another for DocValues. I feel a tad guilty for suggesting this work without taking part... I'm neck deep in geospatial Lucene extensions.

          Show
          David Smiley added a comment - That would be interesting! Or, maybe to DocValues instead... Yes, that is in fact the very idea I had originally that I raised on the list that you in turn created this JIRA issue for. It was almost a year ago so perhaps you forgot. Well this JIRA issue would be the first step, the next step would be another JIRA issue for Field Cache integration and another for DocValues. I feel a tad guilty for suggesting this work without taking part... I'm neck deep in geospatial Lucene extensions.
          Michael McCandless made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 3.6 [ 12319070 ]
          Resolution Fixed [ 1 ]
          Hide
          Michael McCandless added a comment -

          Would the next step being applying this to the Field Cache?

          That would be interesting! Or, maybe to DocValues instead...

          Today neither DocValues nor FieldCache use an FST to hold the term data, but you're right we could explore this now, since we can lookup-by-ord or lookup-by-address (with this patch).

          For example, the DocValues.BYTES_VAR/FIXED_SORTED_DEREF could hold the Term <-> ord/address map as an FST (doc -> ord/address would still be packed ints)... same for FieldCache.DocTermsIndex/.DocTermOrds This should be a sizable reduction in RAM required for the term data... but lookup time would get slower too.

          Show
          Michael McCandless added a comment - Would the next step being applying this to the Field Cache? That would be interesting! Or, maybe to DocValues instead... Today neither DocValues nor FieldCache use an FST to hold the term data, but you're right we could explore this now, since we can lookup-by-ord or lookup-by-address (with this patch). For example, the DocValues.BYTES_VAR/FIXED_SORTED_DEREF could hold the Term <-> ord/address map as an FST (doc -> ord/address would still be packed ints)... same for FieldCache.DocTermsIndex/.DocTermOrds This should be a sizable reduction in RAM required for the term data... but lookup time would get slower too.
          Hide
          Dawid Weiss added a comment -

          The patch looks good to me. Although I must say the API makes it more complex than it really is (those nextFinalOutputs and "real" vs non-real arcs in readFirstRealArc... brr...

          Show
          Dawid Weiss added a comment - The patch looks good to me. Although I must say the API makes it more complex than it really is (those nextFinalOutputs and "real" vs non-real arcs in readFirstRealArc... brr...
          Hide
          David Smiley added a comment -

          Awesome Mike! Would the next step being applying this to the Field Cache?

          Show
          David Smiley added a comment - Awesome Mike! Would the next step being applying this to the Field Cache?
          Michael McCandless made changes -
          Attachment LUCENE-3121.patch [ 12510981 ]
          Hide
          Michael McCandless added a comment -

          Patch, adding Util.getByOutput. I think it's ready.

          It only works w/ an FST whose outputs are strictly monotonically increasing w/ the inputs.

          Currently it only works with FST<Long> but we can "generify" that at some point...

          This is only the "exact" case; in theory we could do floor/ceiling too, but it'd be just as hairy as seekFloor/seekCeil in FSTEnum... let's wait for a real need.

          Show
          Michael McCandless added a comment - Patch, adding Util.getByOutput. I think it's ready. It only works w/ an FST whose outputs are strictly monotonically increasing w/ the inputs. Currently it only works with FST<Long> but we can "generify" that at some point... This is only the "exact" case; in theory we could do floor/ceiling too, but it'd be just as hairy as seekFloor/seekCeil in FSTEnum... let's wait for a real need.
          Michael McCandless made changes -
          Field Original Value New Value
          Assignee Michael McCandless [ mikemccand ]
          Michael McCandless created issue -

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              1 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development