Lucene - Core
  1. Lucene - Core
  2. LUCENE-2793

Directory createOutput and openInput should take an IOContext

    Details

    • Lucene Fields:
      New, Patch Available

      Description

      Today for merging we pass down a larger readBufferSize than for searching because we get better performance.

      I think we should generalize this to a class (IOContext), which would hold the buffer size, but then could hold other flags like DIRECT (bypass OS's buffer cache), SEQUENTIAL, etc.

      Then, we can make the DirectIOLinuxDirectory fully usable because we would only use DIRECT/SEQUENTIAL during merging.

      This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice/versa. Really, it's only all the open file handles that need to be different – we could in theory share del docs, norms, etc, if that were somehow possible.

      1. LUCENE-2793.patch
        31 kB
        Jason Rutherglen
      2. LUCENE-2793.patch
        16 kB
        Varun Thacker
      3. LUCENE-2793.patch
        2 kB
        Varun Thacker
      4. LUCENE-2793.patch
        15 kB
        Varun Thacker
      5. LUCENE-2793.patch
        24 kB
        Varun Thacker
      6. LUCENE-2793.patch
        75 kB
        Varun Thacker
      7. LUCENE-2793.patch
        3 kB
        Varun Thacker
      8. LUCENE-2793.patch
        94 kB
        Varun Thacker
      9. LUCENE-2793.patch
        92 kB
        Varun Thacker
      10. LUCENE-2793.patch
        106 kB
        Varun Thacker
      11. LUCENE-2793.patch
        140 kB
        Varun Thacker
      12. LUCENE-2793.patch
        132 kB
        Varun Thacker
      13. LUCENE-2793.patch
        198 kB
        Varun Thacker
      14. LUCENE-2793.patch
        212 kB
        Varun Thacker
      15. LUCENE-2793.patch
        223 kB
        Varun Thacker
      16. LUCENE-2793.patch
        231 kB
        Simon Willnauer
      17. LUCENE-2793.patch
        51 kB
        Varun Thacker
      18. LUCENE-2793-nrt.patch
        4 kB
        Michael McCandless
      19. LUCENE-2793.patch
        180 kB
        Varun Thacker
      20. LUCENE-2793.patch
        180 kB
        Varun Thacker
      21. LUCENE-2793.patch
        10 kB
        Varun Thacker
      22. LUCENE-2793.patch
        9 kB
        Varun Thacker
      23. LUCENE-2793.patch
        8 kB
        Varun Thacker
      24. LUCENE-2793.patch
        17 kB
        Varun Thacker
      25. LUCENE-2793.patch
        8 kB
        Varun Thacker
      26. LUCENE-2793.patch
        17 kB
        Simon Willnauer
      27. LUCENE-2793.patch
        17 kB
        Simon Willnauer
      28. LUCENE-2793_final.patch
        315 kB
        Simon Willnauer

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          Sounds great, at least in theory we could do the DIRECT thing on windows too for this case.
          and we can certainly give the sequential hint.

          in general hints like this are probably good for custom dir impls too, we never know what they are doing
          but its good to somehow inform them of our intentions so they have a chance to be optimal.

          Show
          Robert Muir added a comment - Sounds great, at least in theory we could do the DIRECT thing on windows too for this case. and we can certainly give the sequential hint. in general hints like this are probably good for custom dir impls too, we never know what they are doing but its good to somehow inform them of our intentions so they have a chance to be optimal.
          Hide
          Uwe Schindler added a comment -

          And MMapDir should return a simple/optimized direct FSIndexInput also. Because MMapping for merging is just wasted resources/address space.

          Show
          Uwe Schindler added a comment - And MMapDir should return a simple/optimized direct FSIndexInput also. Because MMapping for merging is just wasted resources/address space.
          Hide
          Robert Muir added a comment -

          There is another problem we should solve here, and that is the buffersize problem.

          This is totally broken at the moment for custom directories, here's an example.
          I wanted to set the buffersize by default to 4096 (since i measured this is like a 20% improvement for my directory impl).

          looking at the apis you would think that you simply override the openInput that takes no buffer size like this:

            @Override
            public IndexInput openInput(String name) throws IOException {
              return openInput(name, 4096);
            }
          

          unfortunately this doesnt work at all! instead you have to do something like this for it to actually "work":

             @Override
             public IndexInput openInput(String name, int bufferSize) throws IOException {
                ensureOpen();
                return new IndexInput(name, Math.max(bufferSize, 4096));
             }
          

          The problem is, throughout lucene's APIs, the directory's "default" is never used, instead the static BufferedIndexInput.BUFFER_SIZE is used everywhere... eg SegmentReader.get:

            public static SegmentReader get(boolean readOnly, SegmentInfo si, int termInfosIndexDivisor) throws CorruptIndexException, IOException {
              return get(readOnly, si.dir, si, BufferedIndexInput.BUFFER_SIZE, true, termInfosIndexDivisor);
            }
          

          So I think lucene's apis should never specify buffersize, we should remove it completely from the codecs api, and it should be replaced with IOContext.

          Show
          Robert Muir added a comment - There is another problem we should solve here, and that is the buffersize problem. This is totally broken at the moment for custom directories, here's an example. I wanted to set the buffersize by default to 4096 (since i measured this is like a 20% improvement for my directory impl). looking at the apis you would think that you simply override the openInput that takes no buffer size like this: @Override public IndexInput openInput(String name) throws IOException { return openInput(name, 4096); } unfortunately this doesnt work at all! instead you have to do something like this for it to actually "work": @Override public IndexInput openInput(String name, int bufferSize) throws IOException { ensureOpen(); return new IndexInput(name, Math.max(bufferSize, 4096)); } The problem is, throughout lucene's APIs, the directory's "default" is never used, instead the static BufferedIndexInput.BUFFER_SIZE is used everywhere... eg SegmentReader.get: public static SegmentReader get(boolean readOnly, SegmentInfo si, int termInfosIndexDivisor) throws CorruptIndexException, IOException { return get(readOnly, si.dir, si, BufferedIndexInput.BUFFER_SIZE, true, termInfosIndexDivisor); } So I think lucene's apis should never specify buffersize, we should remove it completely from the codecs api, and it should be replaced with IOContext.
          Hide
          Jason Rutherglen added a comment -

          Perhaps we can add the ability to throttle Directory level IO to this issue.

          Show
          Jason Rutherglen added a comment - Perhaps we can add the ability to throttle Directory level IO to this issue.
          Hide
          Jason Rutherglen added a comment -

          Shall I take this one? With the plan being to add config options to IWC so that IW uses the DirectIOLinuxDirectory (and it's variants) only for merging?

          Show
          Jason Rutherglen added a comment - Shall I take this one? With the plan being to add config options to IWC so that IW uses the DirectIOLinuxDirectory (and it's variants) only for merging?
          Hide
          Michael McCandless added a comment -

          Yes please!

          But, this issue only adds the IOContext, threading it down to when you open an input / create an output. That context should hold enough "information" to allow the Dir impl to make decisions like buffer sizes and avoiding buffer cache, etc.

          Show
          Michael McCandless added a comment - Yes please! But, this issue only adds the IOContext, threading it down to when you open an input / create an output. That context should hold enough "information" to allow the Dir impl to make decisions like buffer sizes and avoiding buffer cache, etc.
          Hide
          Jason Rutherglen added a comment -

          this issue only adds the IOContext, threading it down to when you open an input / create an output

          Does this mean we're not implementing this part?

          This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice/versa. Really, it's only all the open file handles that need to be different - we could in theory share del docs, norms, etc, if that were somehow possible.

          Show
          Jason Rutherglen added a comment - this issue only adds the IOContext, threading it down to when you open an input / create an output Does this mean we're not implementing this part? This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice/versa. Really, it's only all the open file handles that need to be different - we could in theory share del docs, norms, etc, if that were somehow possible.
          Hide
          Michael McCandless added a comment -

          Sorry, I think we should also do that as part of this issue. Basically the IOContext needs to become part of the cache key uses in IW's ReaderPool?

          Show
          Michael McCandless added a comment - Sorry, I think we should also do that as part of this issue. Basically the IOContext needs to become part of the cache key uses in IW's ReaderPool?
          Hide
          Jason Rutherglen added a comment -

          Basically the IOContext needs to become part of the cache key uses in IW's ReaderPool?

          Great, I'll implement this.

          Show
          Jason Rutherglen added a comment - Basically the IOContext needs to become part of the cache key uses in IW's ReaderPool? Great, I'll implement this.
          Hide
          Jason Rutherglen added a comment -

          Ok, I created an IOFactory class that generates input and output streams. It's settable on IOContext. The merge IOContext may be set on IWC. I will test once the object model gets the nod...

          Show
          Jason Rutherglen added a comment - Ok, I created an IOFactory class that generates input and output streams. It's settable on IOContext. The merge IOContext may be set on IWC. I will test once the object model gets the nod...
          Hide
          Earwin Burrfoot added a comment -

          Looks crazy. In a bad tangled way.
          You get IOFactory from Directory, put into IOContext, and then invoke it, passing it (wow!) an IOContext and a Directory. What if you pass totally different Directory? Different IOContext? It blows up eerily.

          And there's no justification for this - we already have an IOFactory, it's called Directory! It just needs an extra parameter on its factory methods (createInput/Output), that's all.

          Show
          Earwin Burrfoot added a comment - Looks crazy. In a bad tangled way. You get IOFactory from Directory, put into IOContext, and then invoke it, passing it (wow!) an IOContext and a Directory. What if you pass totally different Directory? Different IOContext? It blows up eerily. And there's no justification for this - we already have an IOFactory, it's called Directory! It just needs an extra parameter on its factory methods (createInput/Output), that's all.
          Hide
          Jason Rutherglen added a comment -

          You get IOFactory from Directory

          That's for the default, the main use is the static IOFactory class.

          we already have an IOFactory, it's called Directory

          Right, however we're basically trying to intermix Directory's, which doesn't work when pointed at the same underlying File. I thought about a meta-Directory that routes based on the IOContext, however we'd still need a way to create an IndexInput and IndexOutput, from different Directory implementations.

          Show
          Jason Rutherglen added a comment - You get IOFactory from Directory That's for the default, the main use is the static IOFactory class. we already have an IOFactory, it's called Directory Right, however we're basically trying to intermix Directory's, which doesn't work when pointed at the same underlying File. I thought about a meta-Directory that routes based on the IOContext, however we'd still need a way to create an IndexInput and IndexOutput, from different Directory implementations.
          Hide
          Earwin Burrfoot added a comment -

          You get IOFactory from Directory

          That's for the default, the main use is the static IOFactory class.

          You lost me here. If you got A from B, you don't have to pass B again to invoke A, if you do - that's 99% a design mistake.
          But still, my point was that you don't need IOFactory at all.

          Right, however we're basically trying to intermix Directory's, which doesn't work when pointed at the same underlying File. I thought about a meta-Directory that routes based on the IOContext, however we'd still need a way to create an IndexInput and IndexOutput, from different Directory implementations.

          What Directories are you trying to intermix? What for?

          I thought the only thing done in that issue is an attempt to give Directory hints as to why we're going to open its streams.
          A simple enum IOContext and extra parameter on createOutput/Input would suffice. But with Lucene's micromanagement attitude, an enum turns into slightly more complex thing, with bufferSizes and whatnot.
          Still - no need for mixing Directories.

          Show
          Earwin Burrfoot added a comment - You get IOFactory from Directory That's for the default, the main use is the static IOFactory class. You lost me here. If you got A from B, you don't have to pass B again to invoke A, if you do - that's 99% a design mistake. But still, my point was that you don't need IOFactory at all. Right, however we're basically trying to intermix Directory's, which doesn't work when pointed at the same underlying File. I thought about a meta-Directory that routes based on the IOContext, however we'd still need a way to create an IndexInput and IndexOutput, from different Directory implementations. What Directories are you trying to intermix? What for? I thought the only thing done in that issue is an attempt to give Directory hints as to why we're going to open its streams. A simple enum IOContext and extra parameter on createOutput/Input would suffice. But with Lucene's micromanagement attitude, an enum turns into slightly more complex thing, with bufferSizes and whatnot. Still - no need for mixing Directories.
          Hide
          Earwin Burrfoot added a comment -

          In fact, I suggest dropping bufferSize altogether. As far as I can recall, it was introduced as a precursor to IOContext and can now be safely replaced.

          Even if we want to give user control over buffer size for all streams, or only those opened in specific IOContext, he can pass these numbers as config parameters to his Directory impl.
          That makes total sense, as:
          1. IndexWriter/IndexReader couldn't care less about buffer sizes, it just passes them to the Directory. It's not their concern.
          2. A bunch of Directories doesn't use said bufferSize at all, making this parameter not only private Directory affairs, but even further - implementation-specific.

          So my bet is - introduce IOContext as a simple Enum, change bufferSize parameter on createInput/Output to IOContext, done.

          Show
          Earwin Burrfoot added a comment - In fact, I suggest dropping bufferSize altogether. As far as I can recall, it was introduced as a precursor to IOContext and can now be safely replaced. Even if we want to give user control over buffer size for all streams, or only those opened in specific IOContext, he can pass these numbers as config parameters to his Directory impl. That makes total sense, as: 1. IndexWriter/IndexReader couldn't care less about buffer sizes, it just passes them to the Directory. It's not their concern. 2. A bunch of Directories doesn't use said bufferSize at all, making this parameter not only private Directory affairs, but even further - implementation-specific. So my bet is - introduce IOContext as a simple Enum, change bufferSize parameter on createInput/Output to IOContext, done.
          Hide
          Robert Muir added a comment -

          In fact, I suggest dropping bufferSize altogether.

          +1

          https://issues.apache.org/jira/browse/LUCENE-2793?focusedCommentId=12966963&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12966963

          So my bet is - introduce IOContext as a simple Enum, change bufferSize parameter on createInput/Output to IOContext, done.

          I agree, its the directories job to then take that IOContext and create an appropriate IndexOutput.

          Show
          Robert Muir added a comment - In fact, I suggest dropping bufferSize altogether. +1 https://issues.apache.org/jira/browse/LUCENE-2793?focusedCommentId=12966963&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12966963 So my bet is - introduce IOContext as a simple Enum, change bufferSize parameter on createInput/Output to IOContext, done. I agree, its the directories job to then take that IOContext and create an appropriate IndexOutput.
          Hide
          Jason Rutherglen added a comment -

          Ok, lets remove bufferSize.

          its the directories job to then take that IOContext and create an appropriate IndexOutput.

          That's how the patch works. If that doesn't look good, the other option is placing the logic in FSDirectory and removing the FSDirectory subclasses?

          Show
          Jason Rutherglen added a comment - Ok, lets remove bufferSize. its the directories job to then take that IOContext and create an appropriate IndexOutput. That's how the patch works. If that doesn't look good, the other option is placing the logic in FSDirectory and removing the FSDirectory subclasses?
          Hide
          Robert Muir added a comment -

          That's how the patch works. If that doesn't look good, the other option is placing the logic in FSDirectory and removing the FSDirectory subclasses?

          We don't need to make any changes to Directory itself here, other than openInput/createOutput taking IOContext instead of bufferSize.
          The directory is the factory for opening the correct openInput/createOutput based on its parameters, this just needs to be one.

          As Mike said, the important thing is to not re-use an input that was opened with IOContext=Merging for search.
          If you ask the directory to openInput with IOContext=merging, thats all it should be used for, because that Directory
          could be implementing that with O_DIRECT or similar, which is slow for searching.

          we can muck with DirectIOLinuxDirectory later on another issue (e.g. make it a directory wrapper that just overrides
          openInput and only does something interesting when IOContext=Merging, or NIOFSDirectory subclass, or whatever
          we end up deciding).

          Show
          Robert Muir added a comment - That's how the patch works. If that doesn't look good, the other option is placing the logic in FSDirectory and removing the FSDirectory subclasses? We don't need to make any changes to Directory itself here, other than openInput/createOutput taking IOContext instead of bufferSize. The directory is the factory for opening the correct openInput/createOutput based on its parameters, this just needs to be one. As Mike said, the important thing is to not re-use an input that was opened with IOContext=Merging for search. If you ask the directory to openInput with IOContext=merging, thats all it should be used for, because that Directory could be implementing that with O_DIRECT or similar, which is slow for searching. we can muck with DirectIOLinuxDirectory later on another issue (e.g. make it a directory wrapper that just overrides openInput and only does something interesting when IOContext=Merging, or NIOFSDirectory subclass, or whatever we end up deciding).
          Hide
          Jason Rutherglen added a comment -

          The directory is the factory for opening the correct openInput/createOutput based on its parameters, this just needs to be one.

          Right, however which Directory impl is going to do this? Today they're tied to their specific implementations of IndexInputs and IndexOutputs, eg, NIOFSIndexInput, DirectIOLinuxIndexInput, etc. In order to get an NIOFSIndexInput we need to instantiate NIOFSDirectory, however if we also want to use DirectIOLinuxDirectory, then (in today's model) we need to instantiate it as well. To create an index or output, all we need is a parent FSDIrectory and the rest is encapsulated in the underlying input/output implementation.

          Show
          Jason Rutherglen added a comment - The directory is the factory for opening the correct openInput/createOutput based on its parameters, this just needs to be one. Right, however which Directory impl is going to do this? Today they're tied to their specific implementations of IndexInputs and IndexOutputs, eg, NIOFSIndexInput, DirectIOLinuxIndexInput, etc. In order to get an NIOFSIndexInput we need to instantiate NIOFSDirectory, however if we also want to use DirectIOLinuxDirectory, then (in today's model) we need to instantiate it as well. To create an index or output, all we need is a parent FSDIrectory and the rest is encapsulated in the underlying input/output implementation.
          Hide
          Shai Erera added a comment -

          If we drop bufferSize, how will I be able to tell Lucene I'm willing to spare, say, 1 MB buffer for this IndexInput/Output? Am I supposed to create my own Directory and open II/IO, depending on the IOContext.Mode and decide on the buffer size then?

          If so, then I don't understand how this would work (think that's what Jason is asking) - if I impl MyDirectory (extending Directory? FSDirectory?) which is probably going to be a wrapper Directory, what II/IO impl should I invoke? I'll need to extend BufferedIndexInput/Output, impl its abstract methods, just for delegating to the wrapped Directory's II/IO?

          If I misunderstood the intentions, I'd appreciate if you can clarify them. But if not, I think IOContext should include a bufferSize hint. If the Directory does not intend to do anything special for 'merging', then it can take bufferSize into account. If however it's a Linux Directory that wants to set the O_DIRECT, then it can ignore bufferSize. But I think it's a useful hint.

          Show
          Shai Erera added a comment - If we drop bufferSize, how will I be able to tell Lucene I'm willing to spare, say, 1 MB buffer for this IndexInput/Output? Am I supposed to create my own Directory and open II/IO, depending on the IOContext.Mode and decide on the buffer size then? If so, then I don't understand how this would work (think that's what Jason is asking) - if I impl MyDirectory (extending Directory? FSDirectory?) which is probably going to be a wrapper Directory, what II/IO impl should I invoke? I'll need to extend BufferedIndexInput/Output, impl its abstract methods, just for delegating to the wrapped Directory's II/IO? If I misunderstood the intentions, I'd appreciate if you can clarify them. But if not, I think IOContext should include a bufferSize hint. If the Directory does not intend to do anything special for 'merging', then it can take bufferSize into account. If however it's a Linux Directory that wants to set the O_DIRECT, then it can ignore bufferSize. But I think it's a useful hint.
          Hide
          Jason Rutherglen added a comment -

          In the patch the bufferSize is an int member of IOContext. We could feasibly go to a map of attributes model if we think there'll be a lot more options in the future? Or the IOContext class can be subclassed.

          Also, in the patch the IOContext for searching and merging is set in IWC, then should be reused. This way the bufferSize is settable by the user, which I don't think is easily possible today (perhaps the main motivation for this issue?).

          Show
          Jason Rutherglen added a comment - In the patch the bufferSize is an int member of IOContext. We could feasibly go to a map of attributes model if we think there'll be a lot more options in the future? Or the IOContext class can be subclassed. Also, in the patch the IOContext for searching and merging is set in IWC, then should be reused. This way the bufferSize is settable by the user, which I don't think is easily possible today (perhaps the main motivation for this issue?).
          Hide
          Robert Muir added a comment -

          If we drop bufferSize, how will I be able to tell Lucene I'm willing to spare, say, 1 MB buffer for this IndexInput/Output? Am I supposed to create my own Directory and open II/IO, depending on the IOContext.Mode and decide on the buffer size then?

          with .setBufferSize. The only place in lucene that intelligently uses buffersize (Skiplist reading) sets it this way, not with the ctor param.

          otherwise, i think you are confused if you think lucene actually passes any intelligent value for this ctor parameter.
          it doesn't... it usually passes the static BufferedIndexInput.BUFFER_SIZE

          If so, then I don't understand how this would work (think that's what Jason is asking) - if I impl MyDirectory (extending Directory? FSDirectory?) which is probably going to be a wrapper Directory, what II/IO impl should I invoke? I'll need to extend BufferedIndexInput/Output, impl its abstract methods, just for delegating to the wrapped Directory's II/IO?

          I don't understand whats so terribly difficult about:

          if (context == MERGING)
           return mySpecialmergingInput();
          else
           return super.openInput(...);
          
          Show
          Robert Muir added a comment - If we drop bufferSize, how will I be able to tell Lucene I'm willing to spare, say, 1 MB buffer for this IndexInput/Output? Am I supposed to create my own Directory and open II/IO, depending on the IOContext.Mode and decide on the buffer size then? with .setBufferSize. The only place in lucene that intelligently uses buffersize (Skiplist reading) sets it this way, not with the ctor param. otherwise, i think you are confused if you think lucene actually passes any intelligent value for this ctor parameter. it doesn't... it usually passes the static BufferedIndexInput.BUFFER_SIZE If so, then I don't understand how this would work (think that's what Jason is asking) - if I impl MyDirectory (extending Directory? FSDirectory?) which is probably going to be a wrapper Directory, what II/IO impl should I invoke? I'll need to extend BufferedIndexInput/Output, impl its abstract methods, just for delegating to the wrapped Directory's II/IO? I don't understand whats so terribly difficult about: if (context == MERGING) return mySpecialmergingInput(); else return super.openInput(...);
          Hide
          Shai Erera added a comment -

          This way the bufferSize is settable by the user, which I don't think is easily possible today (perhaps the main motivation for this issue?)

          Exactly ! That's what I thought of this issue in the first place - allow the app to more easily control the size of the buffers used by Lucene. That that IOContext will allow me to set different bufferSizes per the work that Lucene's doing (i.e. merge, search) is a bonus for me .

          I still think bufferSize can be used as a hint by the specific Directory impl, but it should be there.

          If we want to allow control of different buffer sizes per operation, then maybe we can do the following:

          • Set a default buffer size on IOContext.
          • Add a setBufferSize(OpType type, int/long bufferSize)
          • Add getBufferSize(OpType) – returns the set buffer size, or the default if not set.

          I don't think it's too complicated and gives enough control to the application?

          Show
          Shai Erera added a comment - This way the bufferSize is settable by the user, which I don't think is easily possible today (perhaps the main motivation for this issue?) Exactly ! That's what I thought of this issue in the first place - allow the app to more easily control the size of the buffers used by Lucene. That that IOContext will allow me to set different bufferSizes per the work that Lucene's doing (i.e. merge, search) is a bonus for me . I still think bufferSize can be used as a hint by the specific Directory impl, but it should be there. If we want to allow control of different buffer sizes per operation, then maybe we can do the following: Set a default buffer size on IOContext. Add a setBufferSize(OpType type, int/long bufferSize) Add getBufferSize(OpType) – returns the set buffer size, or the default if not set. I don't think it's too complicated and gives enough control to the application?
          Hide
          Shai Erera added a comment -

          Robert, nothing is too difficult to understand. But your code example tells me that if I want to control the buffer size used by Lucene (whether it's done intelligently or not today is what we try to fix here), I need to create my own Dir impl. That seems an overkill to me, for just controlling the bufferSize !?

          Show
          Shai Erera added a comment - Robert, nothing is too difficult to understand. But your code example tells me that if I want to control the buffer size used by Lucene (whether it's done intelligently or not today is what we try to fix here), I need to create my own Dir impl. That seems an overkill to me, for just controlling the bufferSize !?
          Hide
          Robert Muir added a comment -

          But your code example tells me that if I want to control the buffer size used by Lucene (whether it's done intelligently or not today is what we try to fix here), I need to create my own Dir impl. That seems an overkill to me, for just controlling the bufferSize !?

          No you dont... you can also call the .setBufferSize like the skiplist reader does.

          IndexInput input = dir.openInput(....)
          if (input instanceof BufferedIndexInput)
          ...setBufferSize(whatever_you_want);
          

          Because after all, buffersize only makes sense on Buffered*.

          It makes no sense to be on Directory's openInput, e.g. it makes no sense for directories like MMapDirectory.

          Show
          Robert Muir added a comment - But your code example tells me that if I want to control the buffer size used by Lucene (whether it's done intelligently or not today is what we try to fix here), I need to create my own Dir impl. That seems an overkill to me, for just controlling the bufferSize !? No you dont... you can also call the .setBufferSize like the skiplist reader does. IndexInput input = dir.openInput(....) if (input instanceof BufferedIndexInput) ...setBufferSize(whatever_you_want); Because after all, buffersize only makes sense on Buffered*. It makes no sense to be on Directory's openInput, e.g. it makes no sense for directories like MMapDirectory.
          Hide
          Shai Erera added a comment -

          No you dont... you can also call the .setBufferSize like the skiplist reader does.

          I'd still need to impl my Directory though right? That's the overkill I'm trying to avoid.

          But, I've been thinking about how would a merge/search code, which can only tell the Directory it's in SEARCH / MERGE context, get the buffer size the application wanted to use in that context. I don't think it has a way to do so without either using a static class, something we try to avoid, or propagating those settings down everywhere, which does not make sense either.

          So, Robert, I think you're right – bufferSize should not exist on IOContext. A custom Directory impl seems unavoided. So I think it'd be good if we can create a BufferedDirectoryWrapper which wraps a Directory and offers a BufferedIndexInput/OutputWrapper which delegate the necessary calls to the wrapped Directory. We've found ourselves needing to implement that kind of Directory several times already, and it'd be good if Lucene can offer one.

          That Directory would then take IOContext -> bufferSize map/properties/whatever and can take that into account in openInput/createOutput.

          If users will need to impl Directory wrapping, if they want to control buffer sizes, I suggest we make that as painless as possible.

          Show
          Shai Erera added a comment - No you dont... you can also call the .setBufferSize like the skiplist reader does. I'd still need to impl my Directory though right? That's the overkill I'm trying to avoid. But, I've been thinking about how would a merge/search code, which can only tell the Directory it's in SEARCH / MERGE context, get the buffer size the application wanted to use in that context. I don't think it has a way to do so without either using a static class, something we try to avoid, or propagating those settings down everywhere, which does not make sense either. So, Robert, I think you're right – bufferSize should not exist on IOContext. A custom Directory impl seems unavoided. So I think it'd be good if we can create a BufferedDirectoryWrapper which wraps a Directory and offers a BufferedIndexInput/OutputWrapper which delegate the necessary calls to the wrapped Directory. We've found ourselves needing to implement that kind of Directory several times already, and it'd be good if Lucene can offer one. That Directory would then take IOContext -> bufferSize map/properties/whatever and can take that into account in openInput/createOutput. If users will need to impl Directory wrapping, if they want to control buffer sizes, I suggest we make that as painless as possible.
          Hide
          Earwin Burrfoot added a comment -

          What's with ongoing crazyness?

          DirectIOLinuxDirectory

          First you introduce a kind of directory that is utterly useless except certain special situations. Then, instead of fixing the directory/folding its code somewhere normal, you try to workaround by switching between directories. What's the point of using abstract classes or interfaces, if you leak their implementation's logic all over the place?
          Or making DIOLD wrap something. Yeah! Wrap my RAMDir!

          bufferSize

          This value is only meaningful to a certain subset of Directory implementations. So the only logical place we want to see this value set - is these very impls.
          Sample code:

          Directory ramDir = new RAMDirectory();
          ramDir.createIndexInput(name, context);
          // See, ma? No bufferSizes, they are pointless for RAMDir
          
          Directory fsDir = new NIOFSDirectory();
          fsDir.setBufferSize(IOContext.NORMAL_READ, 1024);
          fsDir.setBufferSize(IOContext.MERGE, 4096);
          fsDir.createIndexInput(name, context)
          // See, ma? The only one who's really concerned with 'actual' buffer size is this concrete Directory impl
          // All client code is only concerned with the context.
          // It's NIOFSDirectory's business to give meaningful interpretation for IOContext and assign the buffer sizes.
          

          You don't need custom Directory impls to make DIOLD work, you should freakin' fix it.
          The proper way is to test out the things, and then move DirectIO code to the only place it makes sense in - FSDir? Probably make it switch on/off-able, maybe not.

          You don't need custom Directory impls to set buffer sizes (neither cast to BufferedIndexInput!), you should add the setting to these Directories, which make sense of it.

          Show
          Earwin Burrfoot added a comment - What's with ongoing crazyness? DirectIOLinuxDirectory First you introduce a kind of directory that is utterly useless except certain special situations. Then, instead of fixing the directory/folding its code somewhere normal, you try to workaround by switching between directories. What's the point of using abstract classes or interfaces, if you leak their implementation's logic all over the place? Or making DIOLD wrap something. Yeah! Wrap my RAMDir! bufferSize This value is only meaningful to a certain subset of Directory implementations. So the only logical place we want to see this value set - is these very impls. Sample code: Directory ramDir = new RAMDirectory(); ramDir.createIndexInput(name, context); // See, ma? No bufferSizes, they are pointless for RAMDir Directory fsDir = new NIOFSDirectory(); fsDir.setBufferSize(IOContext.NORMAL_READ, 1024); fsDir.setBufferSize(IOContext.MERGE, 4096); fsDir.createIndexInput(name, context) // See, ma? The only one who's really concerned with 'actual' buffer size is this concrete Directory impl // All client code is only concerned with the context. // It's NIOFSDirectory's business to give meaningful interpretation for IOContext and assign the buffer sizes. You don't need custom Directory impls to make DIOLD work, you should freakin' fix it. The proper way is to test out the things, and then move DirectIO code to the only place it makes sense in - FSDir? Probably make it switch on/off-able, maybe not. You don't need custom Directory impls to set buffer sizes (neither cast to BufferedIndexInput!), you should add the setting to these Directories, which make sense of it.
          Hide
          Shai Erera added a comment -

          I assume you mean setBufferSize(IOContext, size) should be added to specific Directory impls, and not Directory? Because in your example code above, it looks like it's added to Directory itself. Though we can add it to Directory as well, and do nothing there. It simplifies matters as you don't need to check whether the Dir you receive supports setting buffer size (in case you're not the one creating it).

          At any rate, this looks like it can work too.

          Show
          Shai Erera added a comment - I assume you mean setBufferSize(IOContext, size) should be added to specific Directory impls, and not Directory? Because in your example code above, it looks like it's added to Directory itself. Though we can add it to Directory as well, and do nothing there. It simplifies matters as you don't need to check whether the Dir you receive supports setting buffer size (in case you're not the one creating it). At any rate, this looks like it can work too.
          Hide
          Robert Muir added a comment -

          The proper way is to test out the things, and then move DirectIO code to the only place it makes sense in - FSDir? Probably make it switch on/off-able, maybe not.

          I'm not sure it should be there... at least not soon. its not even something you can implement in pure java?
          we definitely have to keep it still simple and possible for people to use the java library in a platform-indepedent way.
          its also a bit dangerous, whenever JNI is involved....even if its working.

          So I think its craziness, to put this direct-io stuff in fsdirectory itself.

          As I said before though, i wouldn't mind if we had something more like a 'modules/native' and FSDirectory checked, if this was available and automagically used it...
          but I can't see myself thinking that we should put this logic into fsdir itself, sorry.

          Sample code

          My problem with your sample code is that it appears that the .setBufferSize method is on Directory itself.
          Again i disagree with this because:

          • its useless to certain directories like MMapDirectory
          • its dangerous in the direct-io case (different platforms have strict requirements that things be sector-aligned etc, see the mac case where it actually 'works' if the buffer isnt, but is just slow).

          I definitely don't like the confusion regarding buffersizes now. A very small % of the time its actually meaningful and should be respected,
          but most of the time the value is completely bogus.

          Show
          Robert Muir added a comment - The proper way is to test out the things, and then move DirectIO code to the only place it makes sense in - FSDir? Probably make it switch on/off-able, maybe not. I'm not sure it should be there... at least not soon. its not even something you can implement in pure java? we definitely have to keep it still simple and possible for people to use the java library in a platform-indepedent way. its also a bit dangerous, whenever JNI is involved....even if its working. So I think its craziness, to put this direct-io stuff in fsdirectory itself. As I said before though, i wouldn't mind if we had something more like a 'modules/native' and FSDirectory checked, if this was available and automagically used it... but I can't see myself thinking that we should put this logic into fsdir itself, sorry. Sample code My problem with your sample code is that it appears that the .setBufferSize method is on Directory itself. Again i disagree with this because: its useless to certain directories like MMapDirectory its dangerous in the direct-io case (different platforms have strict requirements that things be sector-aligned etc, see the mac case where it actually 'works' if the buffer isnt, but is just slow). I definitely don't like the confusion regarding buffersizes now. A very small % of the time its actually meaningful and should be respected, but most of the time the value is completely bogus.
          Hide
          Earwin Burrfoot added a comment -

          Because in your example code above, it looks like it's added to Directory itself.

          My problem with your sample code is that it appears that the .setBufferSize method is on Directory itself.

          Ohoho. My fault, sorry. It should look like:

          RAMDirectory ramDir = new RAMDirectory();
          ramDir.setBufferSize(whatever) // Compilation error!
          ramDir.createIndexInput(name, context);
          
          NIOFSDirectory fsDir = new NIOFSDirectory();
          fsDir.setBufferSize(IOContext.NORMAL_READ, 1024);
          fsDir.setBufferSize(IOContext.MERGE, 4096);
          fsDir.createIndexInput(name, context)
          
          Show
          Earwin Burrfoot added a comment - Because in your example code above, it looks like it's added to Directory itself. My problem with your sample code is that it appears that the .setBufferSize method is on Directory itself. Ohoho. My fault, sorry. It should look like: RAMDirectory ramDir = new RAMDirectory(); ramDir.setBufferSize(whatever) // Compilation error! ramDir.createIndexInput(name, context); NIOFSDirectory fsDir = new NIOFSDirectory(); fsDir.setBufferSize(IOContext.NORMAL_READ, 1024); fsDir.setBufferSize(IOContext.MERGE, 4096); fsDir.createIndexInput(name, context)
          Hide
          Earwin Burrfoot added a comment -

          As I said before though, i wouldn't mind if we had something more like a 'modules/native' and FSDirectory checked, if this was available and automagically used it...
          but I can't see myself thinking that we should put this logic into fsdir itself, sorry.

          I'm perfectly OK with that approach (having some module FSDir checks). I also feel uneasy having JNI in core.
          What I don't want to see, is Directory impls that you can't use on their own. If you can only use it for merging, then it's not a Directory, it breaks the contract! - move the code elsewhere.

          Show
          Earwin Burrfoot added a comment - As I said before though, i wouldn't mind if we had something more like a 'modules/native' and FSDirectory checked, if this was available and automagically used it... but I can't see myself thinking that we should put this logic into fsdir itself, sorry. I'm perfectly OK with that approach (having some module FSDir checks). I also feel uneasy having JNI in core. What I don't want to see, is Directory impls that you can't use on their own. If you can only use it for merging, then it's not a Directory, it breaks the contract! - move the code elsewhere.
          Hide
          Robert Muir added a comment -

          I'm perfectly OK with that approach (having some module FSDir checks). I also feel uneasy having JNI in core.
          What I don't want to see, is Directory impls that you can't use on their own. If you can only use it for merging, then it's not a Directory, it breaks the contract! - move the code elsewhere.

          Right, i think we all agree we want to fix the DirectIOLinuxDirectory into being a 'real' directory?

          As i said before, from a practical perspective, it could be named LinuxDirectory, extend NIOFS,
          and when openInput(IOContext=Merge) it opens its special input. but personally i don't care how
          we actually implement it 'becoming a real directory'. this is another issue, unrelated to this one really.

          this issue is enough and should stand on its own... we should be able to do enough nice things
          here without dealing with JNI: improving our existing directory impls to use larger buffer sizes by
          default when merging, etc (like in your example).

          Show
          Robert Muir added a comment - I'm perfectly OK with that approach (having some module FSDir checks). I also feel uneasy having JNI in core. What I don't want to see, is Directory impls that you can't use on their own. If you can only use it for merging, then it's not a Directory, it breaks the contract! - move the code elsewhere. Right, i think we all agree we want to fix the DirectIOLinuxDirectory into being a 'real' directory? As i said before, from a practical perspective, it could be named LinuxDirectory, extend NIOFS, and when openInput(IOContext=Merge) it opens its special input. but personally i don't care how we actually implement it 'becoming a real directory'. this is another issue, unrelated to this one really. this issue is enough and should stand on its own... we should be able to do enough nice things here without dealing with JNI: improving our existing directory impls to use larger buffer sizes by default when merging, etc (like in your example).
          Hide
          Varun Thacker added a comment -

          Hi. I would be interested in taking this up as a GSOC project . Are there any resource that I can read to understand the problem in depth ?

          Show
          Varun Thacker added a comment - Hi. I would be interested in taking this up as a GSOC project . Are there any resource that I can read to understand the problem in depth ?
          Hide
          Simon Willnauer added a comment -

          Hi. I would be interested in taking this up as a GSOC project . Are there any resource that I can read to understand the problem in depth ?

          Hey, I just marked it as GSoC-able so you are free to take it. I would also volunteer to mentor on this issue if mike doesn't want to take it though. Let me try to explain you quickly what this issue is about:

          Lucene uses Directory as an abstraction on top of the filesystem or RAM or any other storage to write the index data. Yet, Lucene has different "stages" where we have different requirements to the underlying storage / directory. When you index documents you eventually flush the index to disk and continue indexing until you created enough "segments" on disk that you need to merged them. This operations should if possible not pollute the FS cache since its really just housekeeping. With Java such its currently not possible to use some flags like DIRECT / SEQUENTIAL, this is what we have the native DirectIODirectory implementations in contrib for. Yet, for reading stuff from the index while searching we want to have the FS cache helping us as much as possible so this has again different requriements. What we currently do is that we pass different read buffer sizes to the directory to improve performance. All those kinds of information should be passed to the directory on a IndexInput / IndexOutput (similar to Input and OutputStream just tailored & enhanced for Lucene) basis and this is what this issue is about.

          hope that helps.

          Show
          Simon Willnauer added a comment - Hi. I would be interested in taking this up as a GSOC project . Are there any resource that I can read to understand the problem in depth ? Hey, I just marked it as GSoC-able so you are free to take it. I would also volunteer to mentor on this issue if mike doesn't want to take it though. Let me try to explain you quickly what this issue is about: Lucene uses Directory as an abstraction on top of the filesystem or RAM or any other storage to write the index data. Yet, Lucene has different "stages" where we have different requirements to the underlying storage / directory. When you index documents you eventually flush the index to disk and continue indexing until you created enough "segments" on disk that you need to merged them. This operations should if possible not pollute the FS cache since its really just housekeeping. With Java such its currently not possible to use some flags like DIRECT / SEQUENTIAL, this is what we have the native DirectIODirectory implementations in contrib for. Yet, for reading stuff from the index while searching we want to have the FS cache helping us as much as possible so this has again different requriements. What we currently do is that we pass different read buffer sizes to the directory to improve performance. All those kinds of information should be passed to the directory on a IndexInput / IndexOutput (similar to Input and OutputStream just tailored & enhanced for Lucene) basis and this is what this issue is about. hope that helps.
          Hide
          Michael McCandless added a comment -

          For LUCENE-3092, it would be nice if the IOContext would optionally include the OneMerge if in fact the file is being opened for merging... this lets the Dir impl be smart if it wants to customize its behavior according to overall properties of the merge (eg total merged bytes).

          Show
          Michael McCandless added a comment - For LUCENE-3092 , it would be nice if the IOContext would optionally include the OneMerge if in fact the file is being opened for merging... this lets the Dir impl be smart if it wants to customize its behavior according to overall properties of the merge (eg total merged bytes).
          Hide
          Earwin Burrfoot added a comment -

          As mentioned @LUCENE-3092, it would be nice not to include the OneMerge, but some meaningful value like 'expectedSize', 'expectedSegmentSize' or whatnot, that would work both for merges and flushes, and also won't introduce needless dependency on MergePolicy.

          Show
          Earwin Burrfoot added a comment - As mentioned @ LUCENE-3092 , it would be nice not to include the OneMerge, but some meaningful value like 'expectedSize', 'expectedSegmentSize' or whatnot, that would work both for merges and flushes, and also won't introduce needless dependency on MergePolicy.
          Hide
          Varun Thacker added a comment -

          I have made a class IOContext and added a IOContext object to the Directory createOutput and openInput method parameters.

          If this is the right way to go ?

          Show
          Varun Thacker added a comment - I have made a class IOContext and added a IOContext object to the Directory createOutput and openInput method parameters. If this is the right way to go ?
          Hide
          Jason Rutherglen added a comment -

          I already posted a patch to this issue a while back, https://issues.apache.org/jira/secure/attachment/12468030/LUCENE-2793.patch It seems we're looping here.

          Show
          Jason Rutherglen added a comment - I already posted a patch to this issue a while back, https://issues.apache.org/jira/secure/attachment/12468030/LUCENE-2793.patch It seems we're looping here.
          Hide
          Varun Thacker added a comment -

          I should have seen the patch! I have taken this up as a GSoC project. I'll try to use that patch too if it's ok.

          Show
          Varun Thacker added a comment - I should have seen the patch! I have taken this up as a GSoC project. I'll try to use that patch too if it's ok.
          Hide
          Simon Willnauer added a comment - - edited

          I already posted a patch to this issue a while back...

          Jason I appreciate that you are still around on this issue. Varun is doing his GSoC project on this issue and others so there might be some duplication here and there or similarities with you patch but in this case this is ok though. He needs to get started as well as getting a feeling how things work here. So I hope you don't mind. I can only encourage you to help reviewing and commenting!

          If this is the right way to go ?

          Thanks for the patch man! Good to get started eventually. here are some comments for you patch:

          • When I look at IOContext I think it should be a little more sophisticated. What I have in mind is something similar to ScorerContext in org.apache.lucene.search.Weight.java. Some kind of a copy-on-write builder pattern that lets us provide some defaults for Merging, Searching and Indexing so by default we can reuse the same instance in many places. If we follow a this copy on write model we could also make this class non-final to let people customize the context if they needs to.
          • I am not sure if we really need the enumeration in IO Context unless me make decisions based on what an input / output is used for. IMO it might make more sense to have default instances for Searching Indexing and Merging and set the flags like SEQUENTIAL and the buffers to good defaults and only tweak them when we are aware of the context ie. when we pull the input / output. The most of the usecases need good defaults and only some need tweaks but if we hold the use-case information on IOContext some directories might want to be very smart and this might duplicate code.
          • I wonder if it makes sense to bind the IO Context instance to the directory and add a factory to the directory like Directory#getIOContext(Merge|Search|Index) to enable the directory impl to set platform specific defaults. I think that would make things easier and customization would be straight forward.
          • You should add a space after a comma in the arguments list like here: int bufferSize,IOContext context)
          • Enum elements should be CamelCase and start with a leading Uppercase character
          Show
          Simon Willnauer added a comment - - edited I already posted a patch to this issue a while back... Jason I appreciate that you are still around on this issue. Varun is doing his GSoC project on this issue and others so there might be some duplication here and there or similarities with you patch but in this case this is ok though. He needs to get started as well as getting a feeling how things work here. So I hope you don't mind. I can only encourage you to help reviewing and commenting! If this is the right way to go ? Thanks for the patch man! Good to get started eventually. here are some comments for you patch: When I look at IOContext I think it should be a little more sophisticated. What I have in mind is something similar to ScorerContext in org.apache.lucene.search.Weight.java. Some kind of a copy-on-write builder pattern that lets us provide some defaults for Merging, Searching and Indexing so by default we can reuse the same instance in many places. If we follow a this copy on write model we could also make this class non-final to let people customize the context if they needs to. I am not sure if we really need the enumeration in IO Context unless me make decisions based on what an input / output is used for. IMO it might make more sense to have default instances for Searching Indexing and Merging and set the flags like SEQUENTIAL and the buffers to good defaults and only tweak them when we are aware of the context ie. when we pull the input / output. The most of the usecases need good defaults and only some need tweaks but if we hold the use-case information on IOContext some directories might want to be very smart and this might duplicate code. I wonder if it makes sense to bind the IO Context instance to the directory and add a factory to the directory like Directory#getIOContext(Merge|Search|Index) to enable the directory impl to set platform specific defaults. I think that would make things easier and customization would be straight forward. You should add a space after a comma in the arguments list like here: int bufferSize,IOContext context) Enum elements should be CamelCase and start with a leading Uppercase character
          Hide
          Varun Thacker added a comment -

          I edited IOContext .

          I'm not sure on where to go from here. Should I add the context to all the createOutput/openImput parameters in the Directory implementations or add getIOContext method to Directory?

          Show
          Varun Thacker added a comment - I edited IOContext . I'm not sure on where to go from here. Should I add the context to all the createOutput/openImput parameters in the Directory implementations or add getIOContext method to Directory?
          Hide
          Michael McCandless added a comment -

          Great to see your patch here Varun!

          I think we should start with only high level details in the IOContext?
          Ie, the Merge/Search(Reader?)/Writer is great, but I think low level
          stuff (bufferSize, sequential/direct) should stay private to the Dir
          impl?

          Ideally, I would also like to see details about the
          merge/reader/writer "context", eg for merging I'd like to see the
          OneMerge instance, for Reader/Writer maybe a SegmentInfo instance?

          This would then make Dir impls like NRTCachingDirectory (LUCENE-3092)
          "clean" (vs the sneaky ConcurrentMergeScheduler entangling it now must
          do), though, in that particular case we could accomplish this by only
          adding an estimatedSegmentSizeBytes to the IOCtx.

          We should remove the bufferSize that now plumbs all up and down
          the APIs, and replace it with IOCtx?

          Show
          Michael McCandless added a comment - Great to see your patch here Varun! I think we should start with only high level details in the IOContext? Ie, the Merge/Search(Reader?)/Writer is great, but I think low level stuff (bufferSize, sequential/direct) should stay private to the Dir impl? Ideally, I would also like to see details about the merge/reader/writer "context", eg for merging I'd like to see the OneMerge instance, for Reader/Writer maybe a SegmentInfo instance? This would then make Dir impls like NRTCachingDirectory ( LUCENE-3092 ) "clean" (vs the sneaky ConcurrentMergeScheduler entangling it now must do), though, in that particular case we could accomplish this by only adding an estimatedSegmentSizeBytes to the IOCtx. We should remove the bufferSize that now plumbs all up and down the APIs, and replace it with IOCtx?
          Hide
          Michael McCandless added a comment -

          Some kind of a copy-on-write builder pattern

          Why not use a normal struct/bean like class here?

          I wonder if it makes sense to bind the IO Context instance to the directory and add a factory to the directory like Directory#getIOContext(Merge|Search|Index) to enable the directory impl to set platform specific defaults

          Hmm, but, if the IOCtxt is 'high level', and then details (bufferSize, seq/direct flags, etc.) are private to the dir impl, I think we (Lucene) can create the IOCtx and pass it down? Ie this is a one-way communication, I think (Lucene -> Dir impl).

          Show
          Michael McCandless added a comment - Some kind of a copy-on-write builder pattern Why not use a normal struct/bean like class here? I wonder if it makes sense to bind the IO Context instance to the directory and add a factory to the directory like Directory#getIOContext(Merge|Search|Index) to enable the directory impl to set platform specific defaults Hmm, but, if the IOCtxt is 'high level', and then details (bufferSize, seq/direct flags, etc.) are private to the dir impl, I think we (Lucene) can create the IOCtx and pass it down? Ie this is a one-way communication, I think (Lucene -> Dir impl).
          Hide
          Varun Thacker added a comment -

          I removed bufferSize from all the Directory implementations and made it part of IOContext.

          Show
          Varun Thacker added a comment - I removed bufferSize from all the Directory implementations and made it part of IOContext.
          Hide
          Michael McCandless added a comment -

          I removed bufferSize from all the Directory implementations and made it part of IOContext.

          I think it shouldn't even be part of IOCtx? Ie this is private to the Dir impl?

          Another thing we have to do is add IOCtx into the cache key used in IndexWriter's ReaderPool inner class. Ie, a reader opened for merging cannot be later shared with a reader opened for searching.

          Show
          Michael McCandless added a comment - I removed bufferSize from all the Directory implementations and made it part of IOContext. I think it shouldn't even be part of IOCtx? Ie this is private to the Dir impl? Another thing we have to do is add IOCtx into the cache key used in IndexWriter's ReaderPool inner class. Ie, a reader opened for merging cannot be later shared with a reader opened for searching.
          Hide
          Simon Willnauer added a comment -

          Why not use a normal struct/bean like class here?

          What I had in mind here is a more sophisticated IOContext. I'd like to see some default IOCtx for Merge, Search, Write etc. that folks can simply modify if needed. For that to work it should be immutable. The question here is if we go the high level way like you suggested and make the most of the properties private to the dir (like buffersize etc) or is we should make it available to the user of the API. Some Dirs might have different defaults than others though.

          I can see why this should be private to the directory and it might make more sense to simply indicate what the input / output is used for and let the directory figure out the details. So yeah +1 to keep it high level. If bean or copy-on-write builder I need to see what properties should be on it. so lets keep that for later.

          Ideally, I would also like to see details about the
          merge/reader/writer "context", eg for merging I'd like to see the
          OneMerge instance, for Reader/Writer maybe a SegmentInfo instance?

          I really don't like the idea of exploiting OneMerge to the Directory. We might introduce some other internface that holds enough metadata to get the info and let OneMerge implement this but we should not pass it down to the dir. Same is true for SI I don't want directory mess around with those classes.

          Show
          Simon Willnauer added a comment - Why not use a normal struct/bean like class here? What I had in mind here is a more sophisticated IOContext. I'd like to see some default IOCtx for Merge, Search, Write etc. that folks can simply modify if needed. For that to work it should be immutable. The question here is if we go the high level way like you suggested and make the most of the properties private to the dir (like buffersize etc) or is we should make it available to the user of the API. Some Dirs might have different defaults than others though. I can see why this should be private to the directory and it might make more sense to simply indicate what the input / output is used for and let the directory figure out the details. So yeah +1 to keep it high level. If bean or copy-on-write builder I need to see what properties should be on it. so lets keep that for later. Ideally, I would also like to see details about the merge/reader/writer "context", eg for merging I'd like to see the OneMerge instance, for Reader/Writer maybe a SegmentInfo instance? I really don't like the idea of exploiting OneMerge to the Directory. We might introduce some other internface that holds enough metadata to get the info and let OneMerge implement this but we should not pass it down to the dir. Same is true for SI I don't want directory mess around with those classes.
          Hide
          Michael McCandless added a comment -

          If bean or copy-on-write builder I need to see what properties should be on it. so lets keep that for later.

          OK let's defer this decision for now.

          I just see "builder pattern" being too much of a hammer in search of
          nails...

          I really don't like the idea of exploiting OneMerge to the Directory. We might introduce some other internface that holds enough metadata to get the info and let OneMerge implement this but we should not pass it down to the dir. Same is true for SI I don't want directory mess around with those classes.

          I agree, it's spooky letting such a low level class (Dir) have access
          to high level stuff (OneMerge, SegmentInfo)... but it's really the
          only way to ensure we don't "miss" useful context about this file?

          Ie these high level classes already encompass the full context for a
          given file.

          So, if we really don't want to include these classes, and instead pick &
          choose what "digested" properties to include... then we risk some Dir
          impl not having access to something "interesting" because we forgot to
          include it, which means they'll have to hack something up, like
          NRTCachingDir does now in coupling threads to OneMerge instances.

          But, maybe the restrictions won't be so bad in practice... if we can
          try to brainstorm what Dirs may want to "know" about the full context
          and include that in IOCtx.

          For example, just by adding estimatedFullSegmentSizeBytes, for both
          Merge and Write (hmm: maybe Flush?), NRTCachingDir can cutover to
          IOCtx.

          Show
          Michael McCandless added a comment - If bean or copy-on-write builder I need to see what properties should be on it. so lets keep that for later. OK let's defer this decision for now. I just see "builder pattern" being too much of a hammer in search of nails... I really don't like the idea of exploiting OneMerge to the Directory. We might introduce some other internface that holds enough metadata to get the info and let OneMerge implement this but we should not pass it down to the dir. Same is true for SI I don't want directory mess around with those classes. I agree, it's spooky letting such a low level class (Dir) have access to high level stuff (OneMerge, SegmentInfo)... but it's really the only way to ensure we don't "miss" useful context about this file? Ie these high level classes already encompass the full context for a given file. So, if we really don't want to include these classes, and instead pick & choose what "digested" properties to include... then we risk some Dir impl not having access to something "interesting" because we forgot to include it, which means they'll have to hack something up, like NRTCachingDir does now in coupling threads to OneMerge instances. But, maybe the restrictions won't be so bad in practice... if we can try to brainstorm what Dirs may want to "know" about the full context and include that in IOCtx. For example, just by adding estimatedFullSegmentSizeBytes, for both Merge and Write (hmm: maybe Flush?), NRTCachingDir can cutover to IOCtx.
          Hide
          Chris Male added a comment -

          So, if we really don't want to include these classes, and instead pick &
          choose what "digested" properties to include... then we risk some Dir
          impl not having access to something "interesting" because we forgot to
          include it, which means they'll have to hack something up, like
          NRTCachingDir does now in coupling threads to OneMerge instances.

          But, maybe the restrictions won't be so bad in practice... if we can
          try to brainstorm what Dirs may want to "know" about the full context
          and include that in IOCtx.

          I think the alternatives to this are definitely worse. I've been trying to think of some sort of intermediary component that could decouple the interests of Directories from the sources of that info, but it just seems to make everything way more complicated. I agree it seems best to stick to what Dirs want to know, present that as the IOCtx, and then work out the sources of that information.

          Show
          Chris Male added a comment - So, if we really don't want to include these classes, and instead pick & choose what "digested" properties to include... then we risk some Dir impl not having access to something "interesting" because we forgot to include it, which means they'll have to hack something up, like NRTCachingDir does now in coupling threads to OneMerge instances. But, maybe the restrictions won't be so bad in practice... if we can try to brainstorm what Dirs may want to "know" about the full context and include that in IOCtx. I think the alternatives to this are definitely worse. I've been trying to think of some sort of intermediary component that could decouple the interests of Directories from the sources of that info, but it just seems to make everything way more complicated. I agree it seems best to stick to what Dirs want to know, present that as the IOCtx, and then work out the sources of that information.
          Hide
          Varun Thacker added a comment -

          I have edited org.apache.lucene.store and org.apache.index where needed. If this is correct I'll change the codecs too

          Show
          Varun Thacker added a comment - I have edited org.apache.lucene.store and org.apache.index where needed. If this is correct I'll change the codecs too
          Hide
          Michael McCandless added a comment -

          Patch is looking good! Some feedback:

          • I think CompoundFileReader should not store the IOCtx passed to
            the ctor? It should just fwd to the dir.openInput...? It used to
            have to store the readBufferSize in case openInput was called w/o
            a readBufferSize, but we now require IOCtx to openInput.
          • Is IOCtx never allowed to be null? (I think so?). In which case
            we should jdoc this, and Dir impls can rely on it.
          • I don't think NRTCachingDir.unCache should take an IOCtx?
          • I think NRTCachineDir.doCacheWrite should take the IOCtx, and, we
            should remove all the merge thread hacks that it now does.
          • The super(bufferSize) call is needed in SimpleFSDir (and others).
            I think what must happen here is SimpleFSDir look @ the IOCtx and
            decides what buffer size to use (the logic in IW should be moved
            down into here), with getters/setters to be able to change
            read/write buffer size for merging vs reading/flushing.
          • IOContext.Context should also have Flush? And maybe rename Search
            -> Read?
          • IndexWriter ctor shouldn't take an incoming IOCtx; instead,
            IndexWriter should create a new IOCtx each time it opens reader /
            flushes segments / does merging.
          • The ReaderPool.get* methods (inside IndexWriter) should take an
            IOCtx, and the .context from that IOCtx should be part of the key
            used to cache that reader in the pool.
          Show
          Michael McCandless added a comment - Patch is looking good! Some feedback: I think CompoundFileReader should not store the IOCtx passed to the ctor? It should just fwd to the dir.openInput...? It used to have to store the readBufferSize in case openInput was called w/o a readBufferSize, but we now require IOCtx to openInput. Is IOCtx never allowed to be null? (I think so?). In which case we should jdoc this, and Dir impls can rely on it. I don't think NRTCachingDir.unCache should take an IOCtx? I think NRTCachineDir.doCacheWrite should take the IOCtx, and, we should remove all the merge thread hacks that it now does. The super(bufferSize) call is needed in SimpleFSDir (and others). I think what must happen here is SimpleFSDir look @ the IOCtx and decides what buffer size to use (the logic in IW should be moved down into here), with getters/setters to be able to change read/write buffer size for merging vs reading/flushing. IOContext.Context should also have Flush? And maybe rename Search -> Read? IndexWriter ctor shouldn't take an incoming IOCtx; instead, IndexWriter should create a new IOCtx each time it opens reader / flushes segments / does merging. The ReaderPool.get* methods (inside IndexWriter) should take an IOCtx, and the .context from that IOCtx should be part of the key used to cache that reader in the pool.
          Hide
          Varun Thacker added a comment -

          I have made the required changes. There are places where I have made the Context=Other , some of which might be wrong. Please suggest me where to make the necessary changes.

          Show
          Varun Thacker added a comment - I have made the required changes. There are places where I have made the Context=Other , some of which might be wrong. Please suggest me where to make the necessary changes.
          Hide
          Simon Willnauer added a comment -

          Very very quick review:

          • I think OneMerge should not be required for createing a IOContext we maybe should add a default ctor.
          • IOContext.Other is confusing I think. If a IOContext doesn't make lots of sense somewhere we should not need to pass it in. Can't we simply have overloaded methods? maybe I just don't like the name, maybe use DEFAULT?
          • IOContext seem to pretty straight forward you either read or write but it seems to be confused with high level operations like Merge and Flush. Either with go on a high level or we only have read and write here. Since read / write is implicit (you either pull input or output) we should make this high-level only. So maybe we have Query or Search instead of Read here? Maybe it makes sense to specify stuff like Consume or Sequential here too some high level APIs define sequential access so I think it does not conflict?
          Show
          Simon Willnauer added a comment - Very very quick review: I think OneMerge should not be required for createing a IOContext we maybe should add a default ctor. IOContext.Other is confusing I think. If a IOContext doesn't make lots of sense somewhere we should not need to pass it in. Can't we simply have overloaded methods? maybe I just don't like the name, maybe use DEFAULT? IOContext seem to pretty straight forward you either read or write but it seems to be confused with high level operations like Merge and Flush. Either with go on a high level or we only have read and write here. Since read / write is implicit (you either pull input or output) we should make this high-level only. So maybe we have Query or Search instead of Read here? Maybe it makes sense to specify stuff like Consume or Sequential here too some high level APIs define sequential access so I think it does not conflict?
          Hide
          Michael McCandless added a comment -

          Patch looks great! Comments:

          • I think IOCtx should have ctor taking only OneMerge, which would
            set the OneMerge and set the context as Merge?
          • Likewise, a ctor taking a SegmentInfo to mean context = Flush?
          • And finally a default ctor that maps to Other (or Default or
            Unknown or Unspecified or something)
          • You don't need to create the IOCtx in IW.maybeMerge?
          • Maybe we want a "readOnce" boolean in IOCtx? When we read del
            docs, norms, terms index, doc values, segments file /
            segments.gen, we would set this? (And UnixDir would send eg
            NO_REUSE down to the OS).
          • I think we'll need a NativeMMapDir as well as NativeDir (or
            NativeUnix/WindowsDir), because mmap can also take flags giving
            hints about access patterns. I'll open a new issue...
          • Why does SegmentCoreReaders hang onto the IOCtx? Seems like
            classes shouldn't hang onto it... (also: PreFlexFields).
          • Hmm.... does createOutput even need an IOCtx...? What would a dir
            do with this? I suppose if it's a merge and we had io
            prioritization (someday) we could set lower prio... OK let's keep
            it.
          • I think Codec.fieldsProducer/Consumer should take an IOCtx?
          • Still need to fix IW's ReaderPool to key off of IOCtx.Context plus
            the info. Maybe put a // nocommit in there so we remember...?
            Eg, where you commented out the readBufferSize = ... inside
            ReaderPool.get is a good place.
          Show
          Michael McCandless added a comment - Patch looks great! Comments: I think IOCtx should have ctor taking only OneMerge, which would set the OneMerge and set the context as Merge? Likewise, a ctor taking a SegmentInfo to mean context = Flush? And finally a default ctor that maps to Other (or Default or Unknown or Unspecified or something) You don't need to create the IOCtx in IW.maybeMerge? Maybe we want a "readOnce" boolean in IOCtx? When we read del docs, norms, terms index, doc values, segments file / segments.gen, we would set this? (And UnixDir would send eg NO_REUSE down to the OS). I think we'll need a NativeMMapDir as well as NativeDir (or NativeUnix/WindowsDir), because mmap can also take flags giving hints about access patterns. I'll open a new issue... Why does SegmentCoreReaders hang onto the IOCtx? Seems like classes shouldn't hang onto it... (also: PreFlexFields). Hmm.... does createOutput even need an IOCtx...? What would a dir do with this? I suppose if it's a merge and we had io prioritization (someday) we could set lower prio... OK let's keep it. I think Codec.fieldsProducer/Consumer should take an IOCtx? Still need to fix IW's ReaderPool to key off of IOCtx.Context plus the info. Maybe put a // nocommit in there so we remember...? Eg, where you commented out the readBufferSize = ... inside ReaderPool.get is a good place.
          Hide
          Michael McCandless added a comment -

          I think we'll need a NativeMMapDir as well as NativeDir (or
          NativeUnix/WindowsDir), because mmap can also take flags giving
          hints about access patterns. I'll open a new issue...

          I opened LUCENE-3178.

          Show
          Michael McCandless added a comment - I think we'll need a NativeMMapDir as well as NativeDir (or NativeUnix/WindowsDir), because mmap can also take flags giving hints about access patterns. I'll open a new issue... I opened LUCENE-3178 .
          Hide
          Varun Thacker added a comment -

          I just put up the IOContext class. If this is looking good then I'll make the necessary changes to the other classes.

          Show
          Varun Thacker added a comment - I just put up the IOContext class. If this is looking good then I'll make the necessary changes to the other classes.
          Hide
          Simon Willnauer added a comment -

          Hey varun,

          here are some more comments for the latest complete patch:

          • We should have a static instance for IOContext with Context.Other which you can use in BitVector / CheckIndex for instance Maybe IOContext#DEFAULT_CONTEXT
          • It seems that we don't need to provide IOContext to FieldInfos and SegmentInfo since we are reading them into memory anyway. I think you can just use a default context here without changing the constructors. Same is true for SegmentInfos
          • This is unrelated to your patch but in PreFlexFields we should use IndexFileNames.segmentFileName(info.name, "", PreFlexCodec.FREQ_EXTENSION) and IndexFileNames.segmentFileName(info.name, "", PreFlexCodec.PROX_EXTENSION) instead of info.name + ".frq" and info.name + ".prx"
          • it seems that we should communicate the IOContext to the codec somehow. I suggest we put IOContext to SegmentWriteState and SegmentReadState that way we don't need to change the Codec interface and clutter it with internals. This would also fix mikes comment for FieldsConsumer etc.
          • TermVectorsWriter is only used in Merges so maybe it should also get a Context.Merge for consistency?
          • I really don't like OneMerge I think we should add an abstract class (maybe MergeInfo) that exposes the estimatedMergeBytes, totalDocCount for now.
          • small typo in RamDirectory, there is a space missing after the second file here: dir.copy(this, file, file,context);
          • SegmentReader should also use the static Default IOContext - make sure its used where needed

          Regarding the IOContext class I think we should design for what we have right now and since SegementInfo is not used anywhere (as far as I can see) we should add it once we need it. OneMerge should not go in there but rather the interface / abstract class I talked about above.

          Show
          Simon Willnauer added a comment - Hey varun, here are some more comments for the latest complete patch: We should have a static instance for IOContext with Context.Other which you can use in BitVector / CheckIndex for instance Maybe IOContext#DEFAULT_CONTEXT It seems that we don't need to provide IOContext to FieldInfos and SegmentInfo since we are reading them into memory anyway. I think you can just use a default context here without changing the constructors. Same is true for SegmentInfos This is unrelated to your patch but in PreFlexFields we should use IndexFileNames.segmentFileName(info.name, "", PreFlexCodec.FREQ_EXTENSION) and IndexFileNames.segmentFileName(info.name, "", PreFlexCodec.PROX_EXTENSION) instead of info.name + ".frq" and info.name + ".prx" it seems that we should communicate the IOContext to the codec somehow. I suggest we put IOContext to SegmentWriteState and SegmentReadState that way we don't need to change the Codec interface and clutter it with internals. This would also fix mikes comment for FieldsConsumer etc. TermVectorsWriter is only used in Merges so maybe it should also get a Context.Merge for consistency? I really don't like OneMerge I think we should add an abstract class (maybe MergeInfo) that exposes the estimatedMergeBytes, totalDocCount for now. small typo in RamDirectory, there is a space missing after the second file here: dir.copy(this, file, file,context); SegmentReader should also use the static Default IOContext - make sure its used where needed Regarding the IOContext class I think we should design for what we have right now and since SegementInfo is not used anywhere (as far as I can see) we should add it once we need it. OneMerge should not go in there but rather the interface / abstract class I talked about above.
          Hide
          Michael McCandless added a comment -

          It seems that we don't need to provide IOContext to FieldInfos and SegmentInfo since we are reading them into memory anyway. I think you can just use a default context here without changing the constructors. Same is true for SegmentInfo

          I think we should pass down "readOnce=true" for these cases? EG some
          kind of caching dir (or something) would know not to bother caching
          such files...

          Same for del docs, terms index, doc values (well, sometimes), etc.

          it seems that we should communicate the IOContext to the codec somehow. I suggest we put IOContext to SegmentWriteState and SegmentReadState that way we don't need to change the Codec interface and clutter it with internals. This would also fix mikes comment for FieldsConsumer etc.

          +1 that's great.

          I really don't like OneMerge I think we should add an abstract class (maybe MergeInfo) that exposes the estimatedMergeBytes, totalDocCount for now.

          If we can't include OneMerge, and I agree it'd be nice not to, I think
          we should try hard to pull stuff out of OneMerge that may be of
          interest to a Dir impl? Maybe:

          • estimatedTotalSegmentSizeBytes
          • docCount
          • optimize/expungeDeletes
          • isExternal (so Dir can know if this is addIndexes vs "normal" merging)

          Regarding the IOContext class I think we should design for what we have right now and since SegementInfo is not used anywhere (as far as I can see) we should add it once we need it. OneMerge should not go in there but rather the interface / abstract class I talked about above.

          I agree, let's wait until we have a need.

          In fact... SegmentInfo for flush won't work: we go and open all files
          for flushing, write to them, close them, and only then do we make the
          SegmentInfo.

          So it seems like we should also have some abtracted stuff about the
          to-be-flushed segment? Maybe for starters the
          estimatedSegmentSizeBytes? EG, NRTCachingDir could use this to decide
          whether to cache the new segment (today it fragile-ly relies on the
          app to open new NRT reader frequently enough).

          Show
          Michael McCandless added a comment - It seems that we don't need to provide IOContext to FieldInfos and SegmentInfo since we are reading them into memory anyway. I think you can just use a default context here without changing the constructors. Same is true for SegmentInfo I think we should pass down "readOnce=true" for these cases? EG some kind of caching dir (or something) would know not to bother caching such files... Same for del docs, terms index, doc values (well, sometimes), etc. it seems that we should communicate the IOContext to the codec somehow. I suggest we put IOContext to SegmentWriteState and SegmentReadState that way we don't need to change the Codec interface and clutter it with internals. This would also fix mikes comment for FieldsConsumer etc. +1 that's great. I really don't like OneMerge I think we should add an abstract class (maybe MergeInfo) that exposes the estimatedMergeBytes, totalDocCount for now. If we can't include OneMerge, and I agree it'd be nice not to, I think we should try hard to pull stuff out of OneMerge that may be of interest to a Dir impl? Maybe: estimatedTotalSegmentSizeBytes docCount optimize/expungeDeletes isExternal (so Dir can know if this is addIndexes vs "normal" merging) Regarding the IOContext class I think we should design for what we have right now and since SegementInfo is not used anywhere (as far as I can see) we should add it once we need it. OneMerge should not go in there but rather the interface / abstract class I talked about above. I agree, let's wait until we have a need. In fact... SegmentInfo for flush won't work: we go and open all files for flushing, write to them, close them, and only then do we make the SegmentInfo. So it seems like we should also have some abtracted stuff about the to-be-flushed segment? Maybe for starters the estimatedSegmentSizeBytes? EG, NRTCachingDir could use this to decide whether to cache the new segment (today it fragile-ly relies on the app to open new NRT reader frequently enough).
          Hide
          Varun Thacker added a comment -

          I think I have successfully threaded IOContext to the codecs and index package wherever required. There might be instances where I have used Context.Default wrongly.

          I'll begin adding documentation.

          In NRTCachingDir.doCacheWrite method where IOContext is used if the context has it's OnceMergeInfo field null might lead to a bug ? Should cases like those added to the docs ?

          Show
          Varun Thacker added a comment - I think I have successfully threaded IOContext to the codecs and index package wherever required. There might be instances where I have used Context.Default wrongly. I'll begin adding documentation. In NRTCachingDir.doCacheWrite method where IOContext is used if the context has it's OnceMergeInfo field null might lead to a bug ? Should cases like those added to the docs ?
          Hide
          Varun Thacker added a comment -

          I had messed up the patch using eclipse. This should be ok.

          Show
          Varun Thacker added a comment - I had messed up the patch using eclipse. This should be ok.
          Hide
          Simon Willnauer added a comment -

          Varun, your patch doesn't apply cleanly to my latest trunk. i think you should update your local copy again!

          I have trouble to understand how you use MergeInfo etc. I figure there might be a misunderstanding so here is what I had in mind roughly:

          
          Index: lucene/src/java/org/apache/lucene/index/MergePolicy.java
          ===================================================================
          --- lucene/src/java/org/apache/lucene/index/MergePolicy.java	(revision 1134335)
          +++ lucene/src/java/org/apache/lucene/index/MergePolicy.java	(working copy)
          @@ -64,7 +64,7 @@
              *  subset of segments to be merged as well as whether the
              *  new segment should use the compound file format. */
           
          -  public static class OneMerge {
          +  public static class OneMerge extends MergeInfo {
           
               SegmentInfo info;               // used by IndexWriter
               boolean optimize;               // used by IndexWriter
          @@ -72,25 +72,26 @@
               long mergeGen;                  // used by IndexWriter
               boolean isExternal;             // used by IndexWriter
               int maxNumSegmentsOptimize;     // used by IndexWriter
          -    public long estimatedMergeBytes;       // used by IndexWriter
               List<SegmentReader> readers;        // used by IndexWriter
               List<SegmentReader> readerClones;   // used by IndexWriter
          -    public final List<SegmentInfo> segments;
          -    public final int totalDocCount;
          +    public final List<SegmentInfo> segments = new ArrayList<SegmentInfo>();
               boolean aborted;
               Throwable error;
               boolean paused;
           
               public OneMerge(List<SegmentInfo> segments) {
          +      super(getSegments(segments));
          +    }
          +
          +    private static int getSegments(List<SegmentInfo> segments) {
                 if (0 == segments.size())
                   throw new RuntimeException("segments must include at least one segment");
                 // clone the list, as the in list may be based off original SegmentInfos and may be modified
          -      this.segments = new ArrayList<SegmentInfo>(segments);
                 int count = 0;
                 for(SegmentInfo info : segments) {
                   count += info.docCount;
                 }
          -      totalDocCount = count;
          +      return count;
               }
           
               /** Record that an exception occurred while executing
          
          Show
          Simon Willnauer added a comment - Varun, your patch doesn't apply cleanly to my latest trunk. i think you should update your local copy again! I have trouble to understand how you use MergeInfo etc. I figure there might be a misunderstanding so here is what I had in mind roughly: Index: lucene/src/java/org/apache/lucene/index/MergePolicy.java =================================================================== --- lucene/src/java/org/apache/lucene/index/MergePolicy.java (revision 1134335) +++ lucene/src/java/org/apache/lucene/index/MergePolicy.java (working copy) @@ -64,7 +64,7 @@ * subset of segments to be merged as well as whether the * new segment should use the compound file format. */ - public static class OneMerge { + public static class OneMerge extends MergeInfo { SegmentInfo info; // used by IndexWriter boolean optimize; // used by IndexWriter @@ -72,25 +72,26 @@ long mergeGen; // used by IndexWriter boolean isExternal; // used by IndexWriter int maxNumSegmentsOptimize; // used by IndexWriter - public long estimatedMergeBytes; // used by IndexWriter List<SegmentReader> readers; // used by IndexWriter List<SegmentReader> readerClones; // used by IndexWriter - public final List<SegmentInfo> segments; - public final int totalDocCount; + public final List<SegmentInfo> segments = new ArrayList<SegmentInfo>(); boolean aborted; Throwable error; boolean paused; public OneMerge(List<SegmentInfo> segments) { + super (getSegments(segments)); + } + + private static int getSegments(List<SegmentInfo> segments) { if (0 == segments.size()) throw new RuntimeException( "segments must include at least one segment" ); // clone the list, as the in list may be based off original SegmentInfos and may be modified - this .segments = new ArrayList<SegmentInfo>(segments); int count = 0; for (SegmentInfo info : segments) { count += info.docCount; } - totalDocCount = count; + return count; } /** Record that an exception occurred while executing
          Hide
          Varun Thacker added a comment -

          I hope this time I haven't messed up the making the patch. I corrected MergeInfo too.

          Show
          Varun Thacker added a comment - I hope this time I haven't messed up the making the patch. I corrected MergeInfo too.
          Hide
          Simon Willnauer added a comment -

          hi varun,

          looks good, some quick comments,

          • Can OneMerge extend MergeInfo?
          • Can we introduce a static IOContext instance like public static final IOContext DEFAULT = new IOContext() and use this everywhere where we need to use the default. All those object creations are unnecessary.
          • Can IOContext() call this(false) instead of manually setting the values
          • I think IOContext should take MergeInfo instead of OneMerge if you extend MergeInfo in OneMerge you can just pass OneMerge and you are done.

          I will have a closer look tomorrow

          simon

          Show
          Simon Willnauer added a comment - hi varun, looks good, some quick comments, Can OneMerge extend MergeInfo? Can we introduce a static IOContext instance like public static final IOContext DEFAULT = new IOContext() and use this everywhere where we need to use the default. All those object creations are unnecessary. Can IOContext() call this(false) instead of manually setting the values I think IOContext should take MergeInfo instead of OneMerge if you extend MergeInfo in OneMerge you can just pass OneMerge and you are done. I will have a closer look tomorrow simon
          Hide
          Michael McCandless added a comment -

          Patch is looking good!

          • I thinkt the Context enum values should be ALL_CAPS
          • FieldsWriter's ctor should take an IOContext? Generally low level
            places shouldn't create a new IOContext; they should be passed
            one (though there are exceptions... eg I think it's fine that
            SegmentInfos.run uses DEFAULT).
          • Same for values/IntsImpl, values/Bytes, values/Floats,
            SegmentReader.get, TermVectorsTermsReader, TermVectorsWriter,
            DefaultSegmentInfosWriter, DefaultSegmentInfosReader,
            FixedGapTermsIndexReader, VariableGapTermsIndexReader,
            NormsWriter, BitVector (read & write)
          • I think MP.OneMerge should not extend the new MergeInfo; I'm
            worried about cases where classes hang onto the IOContext (eg,
            various places save the IOContxt away as a member) because this
            could then risk holding refs to the SegmentReaders created for
            merging. I think it's less risky to fully decouple the MergeInfo
            from OneMerge. Then, maybe MergeInfo should be a static inner
            class inside IOContext?
          • IndexWriter.ReaderPool.get should pass the IOCtx down to
            SegmentReader.get
          • IndexWriter.prepareFlushedSegment should be FLUSH not DEFAULT
            context.
          • Can you add assert inside IOContext: if the ctx is MERGE then the
            MergeInfo must not be null (ie in the ctor that takes only a
            Context).
          • MergeInfo needs a few more fields (eg optimize, isExternal)
          • When IndexWriter.addIndexes makes the MERGE context it should pass
            a MergeInfo with isExternal true
          • In IndexWriter.mergeMiddle, you should make a single IOCtx(MERGE)
            up front and pass to all readerPool.get calls
          • At the end of IndexWriter.mergeMiddle, when we open the
            mergedReader... we should not pass MERGE context here, somehow,
            because this open is very different than when we open the
            to-be-merged segments. IE, we are opening the merged segment.
            Hmm, maybe, we can add a new flag to the MergeInfo,
            "isMergedSegment"? Alternatively... we use Context.READ, but then
            I think we need something else that states what "READ" this really
            is – eg NRT reader, merged segment reader, "normal" (IR.open)
            reader?
          • TermVectorsTermsWriter should be a Context.FLUSH
          • SegmentWriteState.context can be final
          • Somehow, PerFieldCodecWrapper.java shows as deleted...
          • MergeInfo.java needs copygirht header & javadoc.
          • Need whitespace around '=', eg "this.context=context;" should be
            "this.context = context;"
          Show
          Michael McCandless added a comment - Patch is looking good! I thinkt the Context enum values should be ALL_CAPS FieldsWriter's ctor should take an IOContext? Generally low level places shouldn't create a new IOContext; they should be passed one (though there are exceptions... eg I think it's fine that SegmentInfos.run uses DEFAULT). Same for values/IntsImpl, values/Bytes, values/Floats, SegmentReader.get, TermVectorsTermsReader, TermVectorsWriter, DefaultSegmentInfosWriter, DefaultSegmentInfosReader, FixedGapTermsIndexReader, VariableGapTermsIndexReader, NormsWriter, BitVector (read & write) I think MP.OneMerge should not extend the new MergeInfo; I'm worried about cases where classes hang onto the IOContext (eg, various places save the IOContxt away as a member) because this could then risk holding refs to the SegmentReaders created for merging. I think it's less risky to fully decouple the MergeInfo from OneMerge. Then, maybe MergeInfo should be a static inner class inside IOContext? IndexWriter.ReaderPool.get should pass the IOCtx down to SegmentReader.get IndexWriter.prepareFlushedSegment should be FLUSH not DEFAULT context. Can you add assert inside IOContext: if the ctx is MERGE then the MergeInfo must not be null (ie in the ctor that takes only a Context). MergeInfo needs a few more fields (eg optimize, isExternal) When IndexWriter.addIndexes makes the MERGE context it should pass a MergeInfo with isExternal true In IndexWriter.mergeMiddle, you should make a single IOCtx(MERGE) up front and pass to all readerPool.get calls At the end of IndexWriter.mergeMiddle, when we open the mergedReader... we should not pass MERGE context here, somehow, because this open is very different than when we open the to-be-merged segments. IE, we are opening the merged segment. Hmm, maybe, we can add a new flag to the MergeInfo, "isMergedSegment"? Alternatively... we use Context.READ, but then I think we need something else that states what "READ" this really is – eg NRT reader, merged segment reader, "normal" (IR.open) reader? TermVectorsTermsWriter should be a Context.FLUSH SegmentWriteState.context can be final Somehow, PerFieldCodecWrapper.java shows as deleted... MergeInfo.java needs copygirht header & javadoc. Need whitespace around '=', eg "this.context=context;" should be "this.context = context;"
          Hide
          Michael McCandless added a comment -

          Can OneMerge extend MergeInfo?

          I think this is risky because it then makes IOContext an unexpectedly
          heavy object, since OneMerge holds SegmentReaders used for merging.

          There are already some places that save away an IOContext instance as
          a member (to be used later), and if something ever does this for a MERGE
          then we can unexpectedly hold keep a lot of garbage alive.

          I think, instead, when the context is MERGE, we should make a new
          (lightweight) MergeInfo instance that pulls the necessary details from
          the OneMerge, ie fully decouples from the OneMerge instance.

          Show
          Michael McCandless added a comment - Can OneMerge extend MergeInfo? I think this is risky because it then makes IOContext an unexpectedly heavy object, since OneMerge holds SegmentReaders used for merging. There are already some places that save away an IOContext instance as a member (to be used later), and if something ever does this for a MERGE then we can unexpectedly hold keep a lot of garbage alive. I think, instead, when the context is MERGE, we should make a new (lightweight) MergeInfo instance that pulls the necessary details from the OneMerge, ie fully decouples from the OneMerge instance.
          Hide
          Michael McCandless added a comment -

          LUCENE-3203 is another example where a Dir needs the IOContext so it can optionally rate limit the bytes/second if it's a merge.

          Show
          Michael McCandless added a comment - LUCENE-3203 is another example where a Dir needs the IOContext so it can optionally rate limit the bytes/second if it's a merge.
          Hide
          Varun Thacker added a comment -

          I made the changes. I also fixed test-framework but haven't touched the test cases yet.

          Show
          Varun Thacker added a comment - I made the changes. I also fixed test-framework but haven't touched the test cases yet.
          Hide
          Simon Willnauer added a comment -

          hey varun

          patch looks close!

          here are some comments:

          • the assert context == Context.MERGE should be assert context != Context.MERGE || mergeInfo != null;
          • can you move that assert into IOContext(Context, MergeInfo) and let other related constructors call this(context, mergeInfo) instead of initializing all members themself?
          • I think there should be a public static final IOContext READONCE = new IOContext(true); then you can make the corresponding constructor private. I think the context should be Context.READ instead of default in that case right?
          • IOContext(MergePolicy.OneMerge) seems to be unnecessary. I think you should add a method to OneMerge to get a MergeInfo from it and only have a MergeInfo ctor. Then you can move MergeInfo into OneMerge too.
          • PerFieldCodecWrapper still seems to be deleted
          • In IndexReader IOContext context=null; should be IOContext context= new IOContext(READ); no?
          • no commit should be nocommit - we have a script on jenkins that checks this
          • I still see some whitespace problems in SegmentWriteState.java
          • I think IOContext.DEFAULT_IOCONTEXT should be IOContext.DEFAULT since IOContext is implicit

          I am waiting for you fixing the tests before I review further. Yet, what is missing is still the decision what buffer size to used down in direcotries etc.

          good work so far!

          Show
          Simon Willnauer added a comment - hey varun patch looks close! here are some comments: the assert context == Context.MERGE should be assert context != Context.MERGE || mergeInfo != null; can you move that assert into IOContext(Context, MergeInfo) and let other related constructors call this(context, mergeInfo) instead of initializing all members themself? I think there should be a public static final IOContext READONCE = new IOContext(true); then you can make the corresponding constructor private. I think the context should be Context.READ instead of default in that case right? IOContext(MergePolicy.OneMerge) seems to be unnecessary. I think you should add a method to OneMerge to get a MergeInfo from it and only have a MergeInfo ctor. Then you can move MergeInfo into OneMerge too. PerFieldCodecWrapper still seems to be deleted In IndexReader IOContext context=null; should be IOContext context= new IOContext(READ); no? no commit should be nocommit - we have a script on jenkins that checks this I still see some whitespace problems in SegmentWriteState.java I think IOContext.DEFAULT_IOCONTEXT should be IOContext.DEFAULT since IOContext is implicit I am waiting for you fixing the tests before I review further. Yet, what is missing is still the decision what buffer size to used down in direcotries etc. good work so far!
          Hide
          Varun Thacker added a comment -

          Sorry for messing up the patch again!

          Show
          Varun Thacker added a comment - Sorry for messing up the patch again!
          Hide
          Varun Thacker added a comment -

          I have made changes suggested by Simon and have added Context to the test cases, though I've used DEFAULT in most of it.

          Also do we need the test- TestBufferedIndexInput ? I have added a IOContext.DEFAULT and fixed it though.

          Show
          Varun Thacker added a comment - I have made changes suggested by Simon and have added Context to the test cases, though I've used DEFAULT in most of it. Also do we need the test- TestBufferedIndexInput ? I have added a IOContext.DEFAULT and fixed it though.
          Hide
          Varun Thacker added a comment -

          I made some more changes to the earlier patch. I tried putting a nocommit wherever I thought the code was leading to a assertError or a bufferSize as 0 error.

          Show
          Varun Thacker added a comment - I made some more changes to the earlier patch. I tried putting a nocommit wherever I thought the code was leading to a assertError or a bufferSize as 0 error.
          Hide
          Simon Willnauer added a comment -

          hey varun thanks for the new patch, some comments:

          • the nocommit in IW you should maybe add a second ctor to MergeInfo that takes arguments and then use something like this in IW: final IOContext context = new IOContext(new MergeInfo(info.docCount, info.sizeInBytes(true), true, false));
          • It seems kind of odd to always prefix MergeInfo with OneMerge so maybe move it into its own file
          • regarding the read buffer problem, can you simply use BufferedIndexInput.BUFFER_SIZE to initialize it and put a TODO / nocommit on top
          • You should look into contrib/misc there are some compile errors in AppendingCodec as well as in solr land.
          • I think in Directory the openInput method should be abstract: public abstract IndexInput openInput(String name, IOContext context) throws IOException; and FSDirectory should not specify this method at all. Currently your code would produce a stack overflow since it calls itself.

          If I fix the nocommits in your patch test-core passes without problems. Looking good man we are getting closer!

          Simon

          Show
          Simon Willnauer added a comment - hey varun thanks for the new patch, some comments: the nocommit in IW you should maybe add a second ctor to MergeInfo that takes arguments and then use something like this in IW: final IOContext context = new IOContext(new MergeInfo(info.docCount, info.sizeInBytes(true), true, false)); It seems kind of odd to always prefix MergeInfo with OneMerge so maybe move it into its own file regarding the read buffer problem, can you simply use BufferedIndexInput.BUFFER_SIZE to initialize it and put a TODO / nocommit on top You should look into contrib/misc there are some compile errors in AppendingCodec as well as in solr land. I think in Directory the openInput method should be abstract: public abstract IndexInput openInput(String name, IOContext context) throws IOException; and FSDirectory should not specify this method at all. Currently your code would produce a stack overflow since it calls itself. If I fix the nocommits in your patch test-core passes without problems. Looking good man we are getting closer! Simon
          Hide
          Varun Thacker added a comment -

          Made the changes and I get a Build Successful too !

          Show
          Varun Thacker added a comment - Made the changes and I get a Build Successful too !
          Hide
          Simon Willnauer added a comment -

          I created a branch for this issue and follow up issues here: https://svn.apache.org/repos/asf/lucene/dev/branches/LUCENE2793/

          Show
          Simon Willnauer added a comment - I created a branch for this issue and follow up issues here: https://svn.apache.org/repos/asf/lucene/dev/branches/LUCENE2793/
          Hide
          Simon Willnauer added a comment -

          For the record - I went through the latest patch and added some nocommits where needed. I will take this patch and commit it to the branch. We should now work on that branch to fix all the remaining issues.

          Show
          Simon Willnauer added a comment - For the record - I went through the latest patch and added some nocommits where needed. I will take this patch and commit it to the branch. We should now work on that branch to fix all the remaining issues.
          Hide
          Varun Thacker added a comment -

          I am not sure whether the MergeInfo used in SegmentMerger#mergeFields

          I have kept most of the nocommits there even after correcting it for reference.

          In MockDirectoryWrapper#crash() to randomize IOContext I have used either a READONCE or DEFAULT or Merge context. Is this the correct way to go?

          In LuceneTeseCase#newDirectory(), MockDirectoryWrapper#createOutput(), MockDirectoryWrapper#openInput() will randomizing the context here help?

          Show
          Varun Thacker added a comment - I am not sure whether the MergeInfo used in SegmentMerger#mergeFields I have kept most of the nocommits there even after correcting it for reference. In MockDirectoryWrapper#crash() to randomize IOContext I have used either a READONCE or DEFAULT or Merge context. Is this the correct way to go? In LuceneTeseCase#newDirectory(), MockDirectoryWrapper#createOutput(), MockDirectoryWrapper#openInput() will randomizing the context here help?
          Hide
          Michael McCandless added a comment -

          Patch, fixing NRTCachingDir to no longer have anything to do with the merge scheduler (yay!).

          Show
          Michael McCandless added a comment - Patch, fixing NRTCachingDir to no longer have anything to do with the merge scheduler (yay!).
          Hide
          Michael McCandless added a comment -

          I took quick look @ the branch – it's looking good! Some small stuff:

          • Should IOContext and MergeInfo be in oal.store not .index?
          • I think SegmentMerger should receive an IOCtx from its caller, and
            then apss that to all the IO ops it invokes? But the code has a
            nocommit about tripping an assert – which one?
          • I think on flush IOContext should include num docs and estimated
            segment size (we can roughly pull this from RAM used for the
            segment), but we should include comment that this is only approx.
          • Somehow, lucene/contrib/demo/data is deleted on the branch. We
            should check if anything else is missing!
          Show
          Michael McCandless added a comment - I took quick look @ the branch – it's looking good! Some small stuff: Should IOContext and MergeInfo be in oal.store not .index? I think SegmentMerger should receive an IOCtx from its caller, and then apss that to all the IO ops it invokes? But the code has a nocommit about tripping an assert – which one? I think on flush IOContext should include num docs and estimated segment size (we can roughly pull this from RAM used for the segment), but we should include comment that this is only approx. Somehow, lucene/contrib/demo/data is deleted on the branch. We should check if anything else is missing!
          Hide
          Simon Willnauer added a comment -

          Should IOContext and MergeInfo be in oal.store not .index?

          +1

          I think SegmentMerger should receive an IOCtx from its caller, and

          yeah I think we should pass the IOContext in via the ctor. Yet, for IW#addIndexes you can simply build a best effort IOContext like:

          
          

          for (IndexReader indexReader : readers)

          { numDocs += indexReader.numDocs(); }

          final IOContext context = new IOContext(new MergeInfo(numDocs, -1, true, false));
          }

          I think on flush IOContext should include num docs and estimated

          +1 I think that is good no?

          Somehow, lucene/contrib/demo/data is deleted on the branch. We should check if anything else is missing!

          oh man... I will check

          you use new IOContext(Context.FLUSH) and new IOContext(Context.READ) in your patch but we have some static like IOContext.READ maybe we need FLUSH too?

          for the tests I think we should start randomizing the IOContext. I think you should add a newIOContext(Random random) to LuceneTestCase and get the context from there in a unit test. At the end of the day we should see same behavior whatever context you pass in right?

          simon

          Show
          Simon Willnauer added a comment - Should IOContext and MergeInfo be in oal.store not .index? +1 I think SegmentMerger should receive an IOCtx from its caller, and yeah I think we should pass the IOContext in via the ctor. Yet, for IW#addIndexes you can simply build a best effort IOContext like: for (IndexReader indexReader : readers) { numDocs += indexReader.numDocs(); } final IOContext context = new IOContext(new MergeInfo(numDocs, -1, true, false)); } I think on flush IOContext should include num docs and estimated +1 I think that is good no? Somehow, lucene/contrib/demo/data is deleted on the branch. We should check if anything else is missing! oh man... I will check you use new IOContext(Context.FLUSH) and new IOContext(Context.READ) in your patch but we have some static like IOContext.READ maybe we need FLUSH too? for the tests I think we should start randomizing the IOContext. I think you should add a newIOContext(Random random) to LuceneTestCase and get the context from there in a unit test. At the end of the day we should see same behavior whatever context you pass in right? simon
          Hide
          Varun Thacker added a comment -

          I have made the necessary changes. Still I might have missed out changing couple of Test Cases to random IOContext.

          I wanted to put it our so that you'll can have a look as soon as possible.

          Show
          Varun Thacker added a comment - I have made the necessary changes. Still I might have missed out changing couple of Test Cases to random IOContext. I wanted to put it our so that you'll can have a look as soon as possible.
          Hide
          Simon Willnauer added a comment -

          Varun, patch looks great!

          here are some comments:

          • in LuceneTestCase#newIOContext(Random) your switch statement misses necessary break; statements no matter what you will always get a flush
            context. I think you should also randominze the numbers for numDocs and sizeInBytes. nobody should rely on these number they are just best effort.
          • in o.a.l.i.values.Bytes#getValues, o.a.l.i.StoredFieldsWriter#initFieldsWriter, o.a.l.i.SegmentMerger#mergeVectors, MemoryCodec#fieldsProducer, MockSingleIntIndexOutput#MockSingleIntIndexOutput you can remove the nocommit
          • the public members in FlushInfo should be final
          • the MergeInfo ctors javadoc should either describe the arguments or omit them. In this case I think you can simply omit them since they are pretty self explained. I think what we need on MergeInfo as well as on FlushInfo is a javadoc comment that says that the values are estimates not necessarily real / correct values.
          • on private IOContext (Context context, MergeInfo mergeInfo ) I don't understand the assert why is there a context != Flush?
          • in MockDirectoryWrapper there is a nocommit that says randomize the IOContext. Maybe we should put the newIOContext to _TestUtils and delegate to this from LuceneTestCase?

          Overall I think we are really close here. Once those comments are fixed I will commit the patch to the branch and we go through the remaining nocommits. Once this is done we should close this and create a new issue to fix all the buffer sizes just like LUCENE-3248.

          good job varun

          Show
          Simon Willnauer added a comment - Varun, patch looks great! here are some comments: in LuceneTestCase#newIOContext(Random) your switch statement misses necessary break; statements no matter what you will always get a flush context. I think you should also randominze the numbers for numDocs and sizeInBytes. nobody should rely on these number they are just best effort. in o.a.l.i.values.Bytes#getValues, o.a.l.i.StoredFieldsWriter#initFieldsWriter, o.a.l.i.SegmentMerger#mergeVectors, MemoryCodec#fieldsProducer, MockSingleIntIndexOutput#MockSingleIntIndexOutput you can remove the nocommit the public members in FlushInfo should be final the MergeInfo ctors javadoc should either describe the arguments or omit them. In this case I think you can simply omit them since they are pretty self explained. I think what we need on MergeInfo as well as on FlushInfo is a javadoc comment that says that the values are estimates not necessarily real / correct values. on private IOContext (Context context, MergeInfo mergeInfo ) I don't understand the assert why is there a context != Flush? in MockDirectoryWrapper there is a nocommit that says randomize the IOContext. Maybe we should put the newIOContext to _TestUtils and delegate to this from LuceneTestCase? Overall I think we are really close here. Once those comments are fixed I will commit the patch to the branch and we go through the remaining nocommits. Once this is done we should close this and create a new issue to fix all the buffer sizes just like LUCENE-3248 . good job varun
          Hide
          Varun Thacker added a comment -

          I also removed nocommits from
          o.a.l/index/codecs/DefaultDocValuesProducer.java
          o.a.l/index/values/VarStraightBytesImpl.java
          o.a.l/index/values/FixedStraightBytesImpl.java
          o.a.l/index/codecs/appending/AppendingCodec.java

          Show
          Varun Thacker added a comment - I also removed nocommits from o.a.l/index/codecs/DefaultDocValuesProducer.java o.a.l/index/values/VarStraightBytesImpl.java o.a.l/index/values/FixedStraightBytesImpl.java o.a.l/index/codecs/appending/AppendingCodec.java
          Hide
          Simon Willnauer added a comment -

          Varun this patch looks great. I am about to commit it. Can you now work through the nocommits, fix them or post questions here?

          simon

          Show
          Simon Willnauer added a comment - Varun this patch looks great. I am about to commit it. Can you now work through the nocommits, fix them or post questions here? simon
          Hide
          Simon Willnauer added a comment -

          Varun,

          the latest patch is committed. I added some minor cleanups and removed invalid nocommits. We are in good shape already

          Show
          Simon Willnauer added a comment - Varun, the latest patch is committed. I added some minor cleanups and removed invalid nocommits. We are in good shape already
          Hide
          Michael McCandless added a comment -

          To address the nocommits about losing the larger buffer size during merging, should we add set/getMergeBufferSize and set/getDefaultBufferSize to those Dir impls that do buffering? (And default to what they are today on trunk, I think 1 KB and 4 KB?)

          Show
          Michael McCandless added a comment - To address the nocommits about losing the larger buffer size during merging, should we add set/getMergeBufferSize and set/getDefaultBufferSize to those Dir impls that do buffering? (And default to what they are today on trunk, I think 1 KB and 4 KB?)
          Hide
          Varun Thacker added a comment -

          I am not sure on the way I have tried to correct the nocommits on how to use buffer sizes based on IOContext. Let me know if this is not the correct way of doing it or are there any changes required.

          Show
          Varun Thacker added a comment - I am not sure on the way I have tried to correct the nocommits on how to use buffer sizes based on IOContext. Let me know if this is not the correct way of doing it or are there any changes required.
          Hide
          Michael McCandless added a comment -

          I think BufferedIndexInput doesn't need a set/getMergeBufferSize? Ie, BII only knows its bufferSize, regardless of the context from its parent.

          Otherwise I think your patch is good: today on trunk we hardwire the 4 KB buffer size for merges, which is the same thing your patch is doing; the only difference is the constant MERGE_BUFFER_SIZE has moved from IW to BII, and each Dir impl now has the "if". As a future improvement we can add a set/getMergeBufferSize to each Dir impl...

          Show
          Michael McCandless added a comment - I think BufferedIndexInput doesn't need a set/getMergeBufferSize? Ie, BII only knows its bufferSize, regardless of the context from its parent. Otherwise I think your patch is good: today on trunk we hardwire the 4 KB buffer size for merges, which is the same thing your patch is doing; the only difference is the constant MERGE_BUFFER_SIZE has moved from IW to BII, and each Dir impl now has the "if". As a future improvement we can add a set/getMergeBufferSize to each Dir impl...
          Hide
          Varun Thacker added a comment -

          Made the necessary changes and hopefully addressed all the nocommits.

          Show
          Varun Thacker added a comment - Made the necessary changes and hopefully addressed all the nocommits.
          Hide
          Varun Thacker added a comment -

          Made some more changes...

          Show
          Varun Thacker added a comment - Made some more changes...
          Hide
          Simon Willnauer added a comment -

          Made the necessary changes and hopefully addressed all the nocommits.

          varun, I still see lots of nocommits here. Would be good if you could address them this week. You don't need to solve them but discuss them here with us. you can do that in a patch and add your comments to the parts where you are not sure how to resolve.

          I would like to commit the patches this week so we can merge to trunk soonish.

          Simon

          Show
          Simon Willnauer added a comment - Made the necessary changes and hopefully addressed all the nocommits. varun, I still see lots of nocommits here. Would be good if you could address them this week. You don't need to solve them but discuss them here with us. you can do that in a patch and add your comments to the parts where you are not sure how to resolve. I would like to commit the patches this week so we can merge to trunk soonish. Simon
          Hide
          Michael McCandless added a comment -

          +1

          Show
          Michael McCandless added a comment - +1
          Hide
          Varun Thacker added a comment -

          I removed all the remaining nocommits as I think all of them had been addressed to.

          Show
          Varun Thacker added a comment - I removed all the remaining nocommits as I think all of them had been addressed to.
          Hide
          Simon Willnauer added a comment -

          I took varuns patch and cleaned a couple of things up. I think this is ready, if nobody objects I will go ahead and commit this to the branch, merge up with trunk and upload a new patch to integrate this into trunk.

          Once this is on trunk we can follow up with native stuff etc.

          Thoughts?

          Show
          Simon Willnauer added a comment - I took varuns patch and cleaned a couple of things up. I think this is ready, if nobody objects I will go ahead and commit this to the branch, merge up with trunk and upload a new patch to integrate this into trunk. Once this is on trunk we can follow up with native stuff etc. Thoughts?
          Hide
          Simon Willnauer added a comment -

          s/4069/4096

          Show
          Simon Willnauer added a comment - s/4069/4096
          Hide
          Simon Willnauer added a comment -

          I committed the latest patch, merged the branch with trunk and created a final diff for review. I think this is ready and I would like to reintegrate rather sooner than later.

          reviews welcome

          Show
          Simon Willnauer added a comment - I committed the latest patch, merged the branch with trunk and created a final diff for review. I think this is ready and I would like to reintegrate rather sooner than later. reviews welcome
          Hide
          Michael McCandless added a comment -

          Looks good! +1 to land it!

          Just a few things:

          • Shouldn't WindowsDirectory also call BII.bufferSize(context) and
            do the same Math.max it used to do?
          • Should VarGapTermsIndexReader should pass READONCE context down when it
            opens/reads the FST? Hmm, though, it should just replace the ctx
            passed in, ie if we are merging vs reading we want to
            differentiate. Let's open separate issue for this and address
            post merge?
          • Can you open an issue for this one: "// TODO: context should be
            part of the key used to cache that reader in the pool."? This is
            pretty important, else you can get NRT readers with too-large
            buffer sizes because the readers had been opened for merging
            first.
          • Extra space in SegmentInfo.java: IOContext.READONCE );
          Show
          Michael McCandless added a comment - Looks good! +1 to land it! Just a few things: Shouldn't WindowsDirectory also call BII.bufferSize(context) and do the same Math.max it used to do? Should VarGapTermsIndexReader should pass READONCE context down when it opens/reads the FST? Hmm, though, it should just replace the ctx passed in, ie if we are merging vs reading we want to differentiate. Let's open separate issue for this and address post merge? Can you open an issue for this one: "// TODO: context should be part of the key used to cache that reader in the pool."? This is pretty important, else you can get NRT readers with too-large buffer sizes because the readers had been opened for merging first. Extra space in SegmentInfo.java: IOContext.READONCE );
          Hide
          Simon Willnauer added a comment -

          I fixed the two minor things from above, created two followup issues (LUCENE-3292 & LUCENE-3293) for the remaining TODOs and will go ahead reintegrating the branch now.

          Show
          Simon Willnauer added a comment - I fixed the two minor things from above, created two followup issues ( LUCENE-3292 & LUCENE-3293 ) for the remaining TODOs and will go ahead reintegrating the branch now.
          Hide
          Simon Willnauer added a comment -

          I reintegrated the branch and committed to trunk in revision 1144196. I will now go ahead and delete the branch. all further developments should happen on trunk. @Varun make sure you move you current work in progress to trunk and be careful with svn update on the branch since some of your changes might get lost.

          Thanks Varun... good job!

          Show
          Simon Willnauer added a comment - I reintegrated the branch and committed to trunk in revision 1144196. I will now go ahead and delete the branch. all further developments should happen on trunk. @Varun make sure you move you current work in progress to trunk and be careful with svn update on the branch since some of your changes might get lost. Thanks Varun... good job!

            People

            • Assignee:
              Varun Thacker
              Reporter:
              Michael McCandless
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development