Details

    • Type: New Feature New Feature
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This issue adds two new Codec implementations:

      • TeeCodec: there have been attempts in the past to implement parallel writing to multiple indexes so that they are all synchronized. This was however complicated due to the complexity of IndexWriter/SegmentMerger logic. The solution presented here offers a similar functionality but working on a different level - as the name suggests, the TeeCodec duplicates index data into multiple output Directories.
      • TeeDirectory (used also in TeeCodec) is a simple abstraction to perform Directory operations on several directories in parallel (effectively mirroring their data). Optionally it's possible to specify a set of suffixes of files that should be mirrored so that non-matching files are skipped.
      • FilteringCodec is related in a remote way to the ideas of index pruning presented in LUCENE-1812 and the concept of tiered search. Since we can use TeeCodec to write to multiple output Directories in a synchronized way, we could also filter out or modify some of the data that is being written. The FilteringCodec provides this functionality, so that you can use like this:
        IndexWriter --> TeeCodec
                         |  |
                         |  +--> StandardCodec --> Directory1
                         +--> FilteringCodec --> StandardCodec --> Directory2
        

      The end result of this chain is two indexes that are kept in sync - one is the full regular index, and the other one is a filtered index.

      1. LUCENE-2632.patch
        163 kB
        Andrzej Bialecki
      2. LUCENE-2632.patch
        144 kB
        Andrzej Bialecki
      3. LUCENE-2632.patch
        110 kB
        Robert Muir
      4. LUCENE-2632.patch
        95 kB
        Andrzej Bialecki
      5. LUCENE-2632.patch
        52 kB
        Andrzej Bialecki
      6. LUCENE-2632-filtering.patch
        75 kB
        Shai Erera
      7. LUCENE-2632-trunk.patch
        172 kB
        Shai Erera

        Activity

        Hide
        Andrzej Bialecki added a comment -

        Initial implementation. TeeSinkCodec was only tested for writing - for reading it always uses Directory at position 0 on the list, so if it's not a full index (e.g. passed through FilteringCodec) funny things may happen during segment merging...

        Show
        Andrzej Bialecki added a comment - Initial implementation. TeeSinkCodec was only tested for writing - for reading it always uses Directory at position 0 on the list, so if it's not a full index (e.g. passed through FilteringCodec) funny things may happen during segment merging...
        Hide
        Michael McCandless added a comment -

        This looks very cool Andrzej!

        Show
        Michael McCandless added a comment - This looks very cool Andrzej!
        Hide
        Andrzej Bialecki added a comment -

        Patch updated to the latest trunk.

        A few notes:

        • FilteringCodec tests pass
        • TeeDirectory tests pass
        • TeeCodec tests pass only partially, failures occur during merge operations.
        Show
        Andrzej Bialecki added a comment - Patch updated to the latest trunk. A few notes: FilteringCodec tests pass TeeDirectory tests pass TeeCodec tests pass only partially, failures occur during merge operations.
        Hide
        Uwe Schindler added a comment -

        Hey cool, sounds like this unmaintainable ParallelReaders obsolete by doing the splitting to several directories/parallel fields in the codec - so merging automtically works correct with every MP?

        Show
        Uwe Schindler added a comment - Hey cool, sounds like this unmaintainable ParallelReaders obsolete by doing the splitting to several directories/parallel fields in the codec - so merging automtically works correct with every MP?
        Hide
        Andrzej Bialecki added a comment -

        That's one potential application, I'm sure there are dozen other things you can do with this (e.g. hot backup).

        Merges don't work well in this patch yet, which is a problem and also the handling of compound files is hackish (I didn't even try to handle the main CFS ones, but I added a workaround for the nrm.cfs / cfe files). I'd appreciate review and help in debugging/fixing.

        Show
        Andrzej Bialecki added a comment - That's one potential application, I'm sure there are dozen other things you can do with this (e.g. hot backup). Merges don't work well in this patch yet, which is a problem and also the handling of compound files is hackish (I didn't even try to handle the main CFS ones, but I added a workaround for the nrm.cfs / cfe files). I'd appreciate review and help in debugging/fixing.
        Hide
        Robert Muir added a comment -

        just some ideas taking a very quick glance

        • I like the idea of separate impl packages you used here...
        • the TODO/etc in term vectors makes me wish our codec consumer APIs for Fields/TermVectors were more consistent...
        • lots of methods have 'synchronized' where I think its not needed? e.g. TeeTermVectorsWriter
        • close() in Tee* should probably add all closeables to a list and then IOUtils.close() that.
        • TeeCodec.files() i think should call mainCodec.files, so that it doesn't rely upon mainCodec having the default implementation in Codec.java (some dont).

        Sorry, none of this fixes the merging test, which i think is somehow related to files(), but I didn't spot the bug yet at a glance.

        Show
        Robert Muir added a comment - just some ideas taking a very quick glance I like the idea of separate impl packages you used here... the TODO/etc in term vectors makes me wish our codec consumer APIs for Fields/TermVectors were more consistent... lots of methods have 'synchronized' where I think its not needed? e.g. TeeTermVectorsWriter close() in Tee* should probably add all closeables to a list and then IOUtils.close() that. TeeCodec.files() i think should call mainCodec.files, so that it doesn't rely upon mainCodec having the default implementation in Codec.java (some dont). Sorry, none of this fixes the merging test, which i think is somehow related to files(), but I didn't spot the bug yet at a glance.
        Hide
        Robert Muir added a comment -

        Hmm actually another thing to investigate is Directory's CFS-related methods with things that delegate like TeeDirectory?

        We had problems around here before with FileSwitchDirectory... LUCENE-3380, maybe you just found another one.

        Show
        Robert Muir added a comment - Hmm actually another thing to investigate is Directory's CFS-related methods with things that delegate like TeeDirectory? We had problems around here before with FileSwitchDirectory... LUCENE-3380 , maybe you just found another one.
        Hide
        Andrzej Bialecki added a comment -

        the TODO/etc in term vectors makes me wish our codec consumer APIs for Fields/TermVectors were more consistent...

        Also, the handling of segments.gen and compound files that bypasses codec actually forced me to implement TeeDirectory.

        Re. synchronization - yes, many of these should be removed. I synced everything for now to narrow down the source of merge problems. TeeCodec.files() - well spotted, this should be fixed.

        Show
        Andrzej Bialecki added a comment - the TODO/etc in term vectors makes me wish our codec consumer APIs for Fields/TermVectors were more consistent... Also, the handling of segments.gen and compound files that bypasses codec actually forced me to implement TeeDirectory. Re. synchronization - yes, many of these should be removed. I synced everything for now to narrow down the source of merge problems. TeeCodec.files() - well spotted, this should be fixed.
        Hide
        Robert Muir added a comment -

        Also, the handling of segments.gen and compound files that bypasses codec actually forced me to implement TeeDirectory.

        True, though I don't know of any simple solutions to either of these

        for CFS, we made some tiny steps in LUCENE-3728, but the codec only has limited control here (e.g. it can store certain things
        outside of CFS, this is how preflex codec reads separate norms). But it cannot yet customize the CFS filenames nor the actual
        format/packing process.

        Show
        Robert Muir added a comment - Also, the handling of segments.gen and compound files that bypasses codec actually forced me to implement TeeDirectory. True, though I don't know of any simple solutions to either of these for CFS, we made some tiny steps in LUCENE-3728 , but the codec only has limited control here (e.g. it can store certain things outside of CFS, this is how preflex codec reads separate norms). But it cannot yet customize the CFS filenames nor the actual format/packing process.
        Hide
        Robert Muir added a comment -

        Updated patch from andrzej's fixing the FNFE in merge (though it fails for other reasons, one step at a time).

        The problem is due to the strange way in which Norms-writers extend PerDocConsumer... they override canMerge/getDocValuesForMerge/getDocValuesType completely differently.

        Furthermore you could not delegate these, as they are protected.

        as a temporary hack i made them public and delegated in Tee's impl... but I'm going to instead open a separate issue to clean this up.

        Show
        Robert Muir added a comment - Updated patch from andrzej's fixing the FNFE in merge (though it fails for other reasons, one step at a time). The problem is due to the strange way in which Norms-writers extend PerDocConsumer... they override canMerge/getDocValuesForMerge/getDocValuesType completely differently. Furthermore you could not delegate these, as they are protected. as a temporary hack i made them public and delegated in Tee's impl... but I'm going to instead open a separate issue to clean this up.
        Hide
        Andrzej Bialecki added a comment -

        Updated patch, based on Robert's version:

        • catch up with trunk,
        • tweak WriteFilter api to be more convenient and useful,
        • add javadocs,
        • add "initialSync" functionality to TeeDirectory to sync existing files on open, and add a corresponding test.

        All tests pass.

        Show
        Andrzej Bialecki added a comment - Updated patch, based on Robert's version: catch up with trunk, tweak WriteFilter api to be more convenient and useful, add javadocs, add "initialSync" functionality to TeeDirectory to sync existing files on open, and add a corresponding test. All tests pass.
        Hide
        Andrzej Bialecki added a comment -

        Some modifications in WriteFilter API, and a better accounting of numDocs / term stats.

        Also, a bonus: a single-pass TeeIndexSplitter that uses a combination of TeeCodec/FilteringCodec.

        All tests pass.

        Show
        Andrzej Bialecki added a comment - Some modifications in WriteFilter API, and a better accounting of numDocs / term stats. Also, a bonus: a single-pass TeeIndexSplitter that uses a combination of TeeCodec/FilteringCodec. All tests pass.
        Hide
        Shai Erera added a comment -

        Hey Andrzej, this is so cool !

        I was interested in using MultiPassIndexSplitter, and while chatting with Mike about the need to process the input directory a couple of times, he pointed me to this issue (i.e. TeeIndexSplitter). I was wondering what's the state of the issue? Are there still pending tasks to complete or can it be committed?

        Show
        Shai Erera added a comment - Hey Andrzej, this is so cool ! I was interested in using MultiPassIndexSplitter, and while chatting with Mike about the need to process the input directory a couple of times, he pointed me to this issue (i.e. TeeIndexSplitter). I was wondering what's the state of the issue? Are there still pending tasks to complete or can it be committed?
        Hide
        Andrzej Bialecki added a comment -

        There have been some changes in 4x since the last patch, so I don't expect it would apply cleanly - but that last patch was working well, so it should be easy to bring it up to date.

        Show
        Andrzej Bialecki added a comment - There have been some changes in 4x since the last patch, so I don't expect it would apply cleanly - but that last patch was working well, so it should be easy to bring it up to date.
        Hide
        Shai Erera added a comment -

        I tried "patch --dry-run" and here are the results:

        • On 4x, all seem to be well except:
          • Lucene40NormsFormat.java and PreFlexRWNormsConsumer.java (some hunk failed)
          • SimpleTextNormsConsumer.java does not exist on 4x
        • On trunk, all seem to be well except:
          • Lucene40NormsFormat.java (hunk failure)
          • SimpleTextNormsConsumer.java not found
          • PreFlexRWNormsConsumer.java not found, but that's expected as 5.0 won't need to support pre-flex (3x) indexes

        I think then what needs to be done is make the patch apply-able against 4x, and then we'll 'svn merge' to trunk? I haven't looked at what doesn't apply in Lucene40NormsFormat though – will you check it?

        Show
        Shai Erera added a comment - I tried "patch --dry-run" and here are the results: On 4x, all seem to be well except: Lucene40NormsFormat.java and PreFlexRWNormsConsumer.java (some hunk failed) SimpleTextNormsConsumer.java does not exist on 4x On trunk, all seem to be well except: Lucene40NormsFormat.java (hunk failure) SimpleTextNormsConsumer.java not found PreFlexRWNormsConsumer.java not found, but that's expected as 5.0 won't need to support pre-flex (3x) indexes I think then what needs to be done is make the patch apply-able against 4x, and then we'll 'svn merge' to trunk? I haven't looked at what doesn't apply in Lucene40NormsFormat though – will you check it?
        Hide
        Shai Erera added a comment -

        Patch brings the latest patch upto trunk. There have been many changes to the API, and currently TestTeeCodec fails. I still didn't dig deep into it, but sometimes it succeeds to call IW.close(), only to fail on a missing segments_1 later on.

        I don't know if the test passed before, to judge if it's something that broke following the migration to trunk API. I'll try to dig deeper later.

        Also, I noticed that the patch includes a FilteringCodec. I thought that TeeCodec will extend it, but it doesn't. Also, LUCENE-4391 added FilterCodec, so I think FilteringCodec can be removed from this issue. And potentially we should make TeeCodec extend FilterCodec.

        I've kept FilteringCodec in this patch though.

        Show
        Shai Erera added a comment - Patch brings the latest patch upto trunk. There have been many changes to the API, and currently TestTeeCodec fails. I still didn't dig deep into it, but sometimes it succeeds to call IW.close(), only to fail on a missing segments_1 later on. I don't know if the test passed before, to judge if it's something that broke following the migration to trunk API. I'll try to dig deeper later. Also, I noticed that the patch includes a FilteringCodec. I thought that TeeCodec will extend it, but it doesn't. Also, LUCENE-4391 added FilterCodec, so I think FilteringCodec can be removed from this issue. And potentially we should make TeeCodec extend FilterCodec. I've kept FilteringCodec in this patch though.
        Hide
        Shai Erera added a comment -

        Reviewed the issue's description and patch again – FilteringCodec has nothing to do with FilterCodec. The latter is used if you want to override a specific Codec's behavior. FilteringCodec is about not filtering data that is written to a Directory, as Andrzej explains in the issue's description.

        Therefore FilteringCodec should remain. Maybe we can split this issue .. FilteringCodec is unrelated to TeeCodec/Directory/IndexSplitter as far as I could tell?

        Anyway, still need to make TestTeeCodec work.

        Show
        Shai Erera added a comment - Reviewed the issue's description and patch again – FilteringCodec has nothing to do with FilterCodec. The latter is used if you want to override a specific Codec's behavior. FilteringCodec is about not filtering data that is written to a Directory, as Andrzej explains in the issue's description. Therefore FilteringCodec should remain. Maybe we can split this issue .. FilteringCodec is unrelated to TeeCodec/Directory/IndexSplitter as far as I could tell? Anyway, still need to make TestTeeCodec work.
        Hide
        Shai Erera added a comment - - edited

        Patch includes FilteringCodec only files. I've fixed some minor issues such as license docs.

        About the *.impl package, I think that if all classes were under *.filtering, we could make all but FilteringCodec, WriteFilter and Noop* classes package-private, as everything seems to be controlled by WriteFilter. What do you think?

        Anyway, this isolated patch is cleaner and so now perhaps we can think of a different design, such as move WriteFilter functionality to the different Formats/Consumers and let users override that by using FilterCodec over FilteringCodec and providing their own Consumer/Formats. After all, WriteFilter by default doesn't filter anything ...

        And now that we have FilterCodec, perhaps we should rename FilteringCodec to something else, like IndexFilteringCodec, or DataFilteringCodec ... make it more distinguishable than FilterCodec.

        Comments are welcome.

        Show
        Shai Erera added a comment - - edited Patch includes FilteringCodec only files. I've fixed some minor issues such as license docs. About the *.impl package, I think that if all classes were under *.filtering, we could make all but FilteringCodec, WriteFilter and Noop* classes package-private, as everything seems to be controlled by WriteFilter. What do you think? Anyway, this isolated patch is cleaner and so now perhaps we can think of a different design, such as move WriteFilter functionality to the different Formats/Consumers and let users override that by using FilterCodec over FilteringCodec and providing their own Consumer/Formats. After all, WriteFilter by default doesn't filter anything ... And now that we have FilterCodec, perhaps we should rename FilteringCodec to something else, like IndexFilteringCodec, or DataFilteringCodec ... make it more distinguishable than FilterCodec. Comments are welcome.
        Hide
        Robert Muir added a comment -

        First glance, i wonder if this should go in misc/ or instead the codecs/ module?

        We should address the nocommits I put in here as well: I feel like they relate to LUCENE-3787 but I don't remember to be honest.

        Show
        Robert Muir added a comment - First glance, i wonder if this should go in misc/ or instead the codecs/ module? We should address the nocommits I put in here as well: I feel like they relate to LUCENE-3787 but I don't remember to be honest.
        Hide
        Robert Muir added a comment -

        Sorry, most of my comment is useless. I clicked on the wrong patch (the sort order is screwy because the filename was different)

        Show
        Robert Muir added a comment - Sorry, most of my comment is useless. I clicked on the wrong patch (the sort order is screwy because the filename was different)
        Hide
        Shai Erera added a comment -

        the sort order is screwy because the filename was different

        Sorry about that, I wanted to emphasize that this patch covers only FilteringCodec. And I've put it under the codec module too.

        We should address the nocommits I put in here as well

        I checked and they are not related to FilteringCodec. I'll remove them from the filtering patch. And yes, it seems that it's related to LUCENE-3787 and that TeeCodec needed them.

        Show
        Shai Erera added a comment - the sort order is screwy because the filename was different Sorry about that, I wanted to emphasize that this patch covers only FilteringCodec. And I've put it under the codec module too. We should address the nocommits I put in here as well I checked and they are not related to FilteringCodec. I'll remove them from the filtering patch. And yes, it seems that it's related to LUCENE-3787 and that TeeCodec needed them.
        Hide
        Robert Muir added a comment -

        Right but it still makes the change here to PerDocConsumer etc. I wasn't happy with this solution, I'll revisit
        and try to remember exactly why.

        Show
        Robert Muir added a comment - Right but it still makes the change here to PerDocConsumer etc. I wasn't happy with this solution, I'll revisit and try to remember exactly why.

          People

          • Assignee:
            Unassigned
            Reporter:
            Andrzej Bialecki
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:

              Development