Lucene - Core
  1. Lucene - Core
  2. LUCENE-3836

Most Codec.*Format().*Reader() methods should use SegmentReadState

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/codecs
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Codec formats API for opening readers is inconsistent - sometimes it uses SegmentReadState, in other cases it uses individual arguments that are already available via SegmentReadState. This complicates extending the API, e.g. if additional per-segment state would need to be passed to the readers.

      1. LUCENE-3836.patch
        17 kB
        Andrzej Bialecki

        Activity

        Hide
        Andrzej Bialecki added a comment -

        Patch that implements the change. If there are no objections I'd like to commit this soon.

        Show
        Andrzej Bialecki added a comment - Patch that implements the change. If there are no objections I'd like to commit this soon.
        Hide
        Robert Muir added a comment -

        I think this change is OK: I just want to mention that avoiding SegmentReadState
        was definitely intentional... well most of my issues are really based on
        SegmentWriteState, but I think the whole concept is broken, see below:

        SegmentWriteState is bad news, for many codec APIs
        they would be underpopulated, or even have bogus data!

        For example, what would be SegmentWriteState.numDocs for StoredFieldsWriter?

        I understand that at a glance having foo(A) where A has A.B and A.C and A.D seems simpler than foo(B, C),
        but I think its confusing to pass "A" at all if there is an A.D thats somehow bogus, invalid, etc.

        In that case its actually much clearer to pass B and C directly... personally I think we
        should revisit these 'argument holder' APIs and likely remove them completely.

        Because of that: for most codec APIs I avoided SegmentWriteState and also SegmentReadState (for symmetry).

        Show
        Robert Muir added a comment - I think this change is OK: I just want to mention that avoiding SegmentReadState was definitely intentional... well most of my issues are really based on SegmentWriteState, but I think the whole concept is broken, see below: SegmentWriteState is bad news, for many codec APIs they would be underpopulated, or even have bogus data! For example, what would be SegmentWriteState.numDocs for StoredFieldsWriter? I understand that at a glance having foo(A) where A has A.B and A.C and A.D seems simpler than foo(B, C), but I think its confusing to pass "A" at all if there is an A.D thats somehow bogus, invalid, etc. In that case its actually much clearer to pass B and C directly... personally I think we should revisit these 'argument holder' APIs and likely remove them completely. Because of that: for most codec APIs I avoided SegmentWriteState and also SegmentReadState (for symmetry).
        Hide
        Michael McCandless added a comment -

        I agree catch-all "argument holder" classes are dangerous... they can bloat over time and probably lead to bugs...

        Show
        Michael McCandless added a comment - I agree catch-all "argument holder" classes are dangerous... they can bloat over time and probably lead to bugs...
        Hide
        Andrzej Bialecki added a comment -

        I hear you .. SegmentWriteState is bad, I agree. But the argument about SegmentWriteState is not really applicable to SegmentReadState - write state is mutable and can change under your feet whereas SegmentReadState is immutable, created once in SegmentReader and used only for initialization of format readers. On the other hand, if we insist that we always pass individual arguments around then providing some additional segment-global context to format readers requires changing method signatures (adding arguments).

        The background for this issue is that I started looking at updateable fields, where updates are put in a segment (or reader) of its own and they provide an "overlay" for the main segment, with a special codec magic to pull and remap data from the "overlay" as the main data is accessed. However, in order to do that I need to provide this data when format readers are initialized. I can't do this when creating a Codec instance (Codec is automatically instantiated) or when creating Codec.*Format(), because format instances are usually shared as well.

        So the idea I had in mind was to use SegmentReaderState uniformly, and put this overlay data in SegmentReadState so that it's passed down to formats during format readers' creation. I'm open to other ideas...

        Show
        Andrzej Bialecki added a comment - I hear you .. SegmentWriteState is bad, I agree. But the argument about SegmentWriteState is not really applicable to SegmentReadState - write state is mutable and can change under your feet whereas SegmentReadState is immutable, created once in SegmentReader and used only for initialization of format readers. On the other hand, if we insist that we always pass individual arguments around then providing some additional segment-global context to format readers requires changing method signatures (adding arguments). The background for this issue is that I started looking at updateable fields, where updates are put in a segment (or reader) of its own and they provide an "overlay" for the main segment, with a special codec magic to pull and remap data from the "overlay" as the main data is accessed. However, in order to do that I need to provide this data when format readers are initialized. I can't do this when creating a Codec instance (Codec is automatically instantiated) or when creating Codec.*Format(), because format instances are usually shared as well. So the idea I had in mind was to use SegmentReaderState uniformly, and put this overlay data in SegmentReadState so that it's passed down to formats during format readers' creation. I'm open to other ideas...
        Hide
        Michael McCandless added a comment -

        The background for this issue is that I started looking at updateable fields, where updates are put in a segment (or reader) of its own and they provide an "overlay" for the main segment, with a special codec magic to pull and remap data from the "overlay" as the main data is accessed. However, in order to do that I need to provide this data when format readers are initialized. I can't do this when creating a Codec instance (Codec is automatically instantiated) or when creating Codec.*Format(), because format instances are usually shared as well.

        Sweet!

        Couldn't the stacking/overlaying live "above" codec? Eg, the codec thinks it's reading 3 segments, but really the code above knows there's 1 base segment and 2 stacked on top?

        Show
        Michael McCandless added a comment - The background for this issue is that I started looking at updateable fields, where updates are put in a segment (or reader) of its own and they provide an "overlay" for the main segment, with a special codec magic to pull and remap data from the "overlay" as the main data is accessed. However, in order to do that I need to provide this data when format readers are initialized. I can't do this when creating a Codec instance (Codec is automatically instantiated) or when creating Codec.*Format(), because format instances are usually shared as well. Sweet! Couldn't the stacking/overlaying live "above" codec? Eg, the codec thinks it's reading 3 segments, but really the code above knows there's 1 base segment and 2 stacked on top?
        Hide
        Andrzej Bialecki added a comment - - edited

        I think this could work, too - I would instantiate the "overlay" data in SegmentReader, and then I'd create the "overlay" codec's format readers in SegmentReader, using the original format readers plus the overlay data. I'll try this approach ... I'll create a separate issue to discuss this.

        (The reason I'm doing this at the codec level is that I wanted to avoid heavy mods to SegmentReader, and it's easier to visualize how this data is re-mapped and stacked at the level of fairly simple codec APIs).

        Let's close this as won't fix for now.

        Show
        Andrzej Bialecki added a comment - - edited I think this could work, too - I would instantiate the "overlay" data in SegmentReader, and then I'd create the "overlay" codec's format readers in SegmentReader, using the original format readers plus the overlay data. I'll try this approach ... I'll create a separate issue to discuss this. (The reason I'm doing this at the codec level is that I wanted to avoid heavy mods to SegmentReader, and it's easier to visualize how this data is re-mapped and stacked at the level of fairly simple codec APIs). Let's close this as won't fix for now.
        Hide
        Robert Muir added a comment -

        (The reason I'm doing this at the codec level is that I wanted to avoid heavy mods to SegmentReader, and it's easier to visualize how this data is re-mapped and stacked at the level of fairly simple codec APIs).

        But SegmentReader is fairly simple these days, its just basically a pointer to a core (SegmentCoreReaders) + deletes.

        Maybe it should stay the same, but instead we could have a StackedReader (perhaps a bad name), that points to multiple cores + deletes + mask files or whatever it needs and returns masked enums over the underlying Enums itself (e.g. combining enums from the underlying impls, passing masks down as Bits, and such). SegmentReader would stay as-is.

        Show
        Robert Muir added a comment - (The reason I'm doing this at the codec level is that I wanted to avoid heavy mods to SegmentReader, and it's easier to visualize how this data is re-mapped and stacked at the level of fairly simple codec APIs). But SegmentReader is fairly simple these days, its just basically a pointer to a core (SegmentCoreReaders) + deletes. Maybe it should stay the same, but instead we could have a StackedReader (perhaps a bad name), that points to multiple cores + deletes + mask files or whatever it needs and returns masked enums over the underlying Enums itself (e.g. combining enums from the underlying impls, passing masks down as Bits, and such). SegmentReader would stay as-is.
        Hide
        Andrzej Bialecki added a comment -

        Thanks for the insightful comments - this looks promising. I opened LUCENE-3837 to discuss a broader design for updateable fields.

        Show
        Andrzej Bialecki added a comment - Thanks for the insightful comments - this looks promising. I opened LUCENE-3837 to discuss a broader design for updateable fields.

          People

          • Assignee:
            Andrzej Bialecki
            Reporter:
            Andrzej Bialecki
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development