Lucene - Core
  1. Lucene - Core
  2. LUCENE-2126

Split up IndexInput and IndexOutput into DataInput and DataOutput

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      I'd like to introduce the two new classes DataInput and DataOutput
      that contain all methods from IndexInput and IndexOutput that actually
      decode or encode data, such as readByte()/writeByte(),
      readVInt()/writeVInt().

      Methods like getFilePointer(), seek(), close(), etc., which are not
      related to data encoding, but to files as input/output source stay in
      IndexInput/IndexOutput.

      This patch also changes ByteSliceReader/ByteSliceWriter to extend
      DataInput/DataOutput. Previously ByteSliceReader implemented the
      methods that stay in IndexInput by throwing RuntimeExceptions.

      See also LUCENE-2125.

      All tests pass.

      1. lucene-2126.patch
        31 kB
        Michael Busch
      2. lucene-2126.patch
        30 kB
        Michael Busch

        Activity

        Hide
        Michael Busch added a comment -

        Btw: Please let me know if you prefer other class names. The java.io package also has interfaces with the same names DataInput and DataOutput, maybe that's confusing?

        Show
        Michael Busch added a comment - Btw: Please let me know if you prefer other class names. The java.io package also has interfaces with the same names DataInput and DataOutput, maybe that's confusing?
        Hide
        Michael McCandless added a comment -

        This sounds great! If/as we move to the intblock codecs, I wonder if we could put bulk int[] block reading/writing (like pfordelta) into DataInput/Output.

        Show
        Michael McCandless added a comment - This sounds great! If/as we move to the intblock codecs, I wonder if we could put bulk int[] block reading/writing (like pfordelta) into DataInput/Output.
        Hide
        Michael Busch added a comment -

        +1!

        Show
        Michael Busch added a comment - +1!
        Hide
        Marvin Humphrey added a comment -

        FWIW, this approach is sort of the inverse of where we've gone with Lucy.

        In Lucy, low-level unbuffered IO operations are abstracted into FileHandle,
        which is either a thin wrapper around a POSIX file descriptor (e.g.
        FSFileHandle under unixen), or a simulation thereof (e.g. FSFileHandle under
        Windows, RAMFileHandle).

        Then there are InStream and OutStream, which would be analogous to DataInput
        and DataOutput, in that they have all the Lucy-specific encoding/decoding
        methods. However, instead of requiring that subclasses implement the
        low-level IO operations, InStream "has a" FileHandle and OutStream "has a"
        FileHandle.

        The advantage of breaking out FileHandle as a separate class is that if e.g.
        you extend InStream by adding on PFOR encoding, you automatically get the
        benefit for all IO implementations. I think that under the
        DataInput/DataOutput model, that extension technique will only be available to
        core devs of Lucene, no?

        More info:

        Show
        Marvin Humphrey added a comment - FWIW, this approach is sort of the inverse of where we've gone with Lucy. In Lucy, low-level unbuffered IO operations are abstracted into FileHandle, which is either a thin wrapper around a POSIX file descriptor (e.g. FSFileHandle under unixen), or a simulation thereof (e.g. FSFileHandle under Windows, RAMFileHandle). Then there are InStream and OutStream, which would be analogous to DataInput and DataOutput, in that they have all the Lucy-specific encoding/decoding methods. However, instead of requiring that subclasses implement the low-level IO operations, InStream "has a" FileHandle and OutStream "has a" FileHandle. The advantage of breaking out FileHandle as a separate class is that if e.g. you extend InStream by adding on PFOR encoding, you automatically get the benefit for all IO implementations. I think that under the DataInput/DataOutput model, that extension technique will only be available to core devs of Lucene, no? More info: LUCY-58 FileHandle LUCY-63 InStream and OutStream
        Hide
        Michael Busch added a comment -

        Thanks for the input, Marvin.

        I can see the advantages of what you're proposing. With this patch it'd only be possible to benefit in all IndexInput/IndexOutput implementations from a new encoding/decoding method if you add it to the DataInput/Output class directly, which is only possible by changing the classes in Lucene, not from outside.

        The problem here, as so often, is backwards-compat. This patch here has no problems in that regard, as we just move the methods into new superclasses. If we'd want to implement what Lucy is doing, we'd have to deprecate all encoding/decoding methods in IndexInput/Output and add them to DataInput/Output. Then a DataInput would not be the superclass of IndexInput, but rather have an IndexInput. All users who call any of the encoding/decoding methods directly on IndexInput/Output would have to change their code to use the new classes.

        So I can certainly see the benefits, the question now is if they're at the moment important enough to justify dealing with the backwards-compat hassle?

        Show
        Michael Busch added a comment - Thanks for the input, Marvin. I can see the advantages of what you're proposing. With this patch it'd only be possible to benefit in all IndexInput/IndexOutput implementations from a new encoding/decoding method if you add it to the DataInput/Output class directly, which is only possible by changing the classes in Lucene, not from outside. The problem here, as so often, is backwards-compat. This patch here has no problems in that regard, as we just move the methods into new superclasses. If we'd want to implement what Lucy is doing, we'd have to deprecate all encoding/decoding methods in IndexInput/Output and add them to DataInput/Output. Then a DataInput would not be the superclass of IndexInput, but rather have an IndexInput. All users who call any of the encoding/decoding methods directly on IndexInput/Output would have to change their code to use the new classes. So I can certainly see the benefits, the question now is if they're at the moment important enough to justify dealing with the backwards-compat hassle?
        Hide
        Marvin Humphrey added a comment -

        I spent a long time today trying to understand why DataInput and
        DataOutput are justified so that I could formulate an intelligent reply,
        but I had to give up. :\ Please carry on.

        Show
        Marvin Humphrey added a comment - I spent a long time today trying to understand why DataInput and DataOutput are justified so that I could formulate an intelligent reply, but I had to give up. :\ Please carry on.
        Hide
        Michael Busch added a comment -

        The main reason why I'd like to separate DataInput/Output from IndexInput/Output now is LUCENE-2125. Users should be able to implement methods that serialize/deserialize attributes into/from a postinglist. These methods should only be able to call the read/write methods (which this issue moves to DataInput/Output), but not methods like close(), seek() etc..

        Thanks for spending time reviewing this and giving feedback from Lucy land, Marvin!
        I think I will go ahead and commit this, and once we see a need to allow users to extend DataInput/Output outside of Lucene we can go ahead and make the additional changes that are mentioned in your in my comments here.

        So I will commit this tomorrow if nobody objects.

        Show
        Michael Busch added a comment - The main reason why I'd like to separate DataInput/Output from IndexInput/Output now is LUCENE-2125 . Users should be able to implement methods that serialize/deserialize attributes into/from a postinglist. These methods should only be able to call the read/write methods (which this issue moves to DataInput/Output), but not methods like close(), seek() etc.. Thanks for spending time reviewing this and giving feedback from Lucy land, Marvin! I think I will go ahead and commit this, and once we see a need to allow users to extend DataInput/Output outside of Lucene we can go ahead and make the additional changes that are mentioned in your in my comments here. So I will commit this tomorrow if nobody objects.
        Hide
        Marvin Humphrey added a comment -

        > These methods should only be able to call the read/write methods (which this
        > issue moves to DataInput/Output), but not methods like close(), seek() etc..

        Ah, so that's what it is.

        In that case, let me vote my (non-binding) -1. I don't believe that the
        enforcement of such a restriction justifies the complexity cost of adding a
        new class to the public API.

        First, adding yet another class to the hierarchy steepens the learning curve
        for users and contributors. If you aren't in the rarefied echelon of
        exceptional brilliance occupied by people named Michael who work for IBM ,
        the gradual accumulation of complexity in the Lucene code base matters. Inch
        by inch, things move out of reach.

        Second, changing things now for what seems to me like a minor reason makes it
        harder to refactor the class hierarchy in the future when other, more
        important reasons are inevitably discovered.

        For LUCENE-2125, I recommend two possible options.

        • Do nothing and assume that the sort of advanced user who writes a posting
          codec won't do something incredibly stupid like call indexInput.close().
        • Add a note to the docs for writing posting codecs indicating which sort of
          of IO methods you ought not to call.

        > once we see a need to allow users to extend DataInput/Output outside of
        > Lucene we can go ahead and make the additional changes that are mentioned in
        > your in my comments here.

        In Lucy, there are three tiers of IO usage:

        • For low-level IO, use FileHandle.
        • For most applications, use InStream's encoder/decoder methods.
        • For performance-critical inner-loop material (e.g. posting decoders,
          SortCollector), access the raw memory-mapped IO buffer using
          InStream_Buf()/InStream_Advance_Buf() and use static inline functions
          such as NumUtil_decode_c32 (which does no bounds checking) from
          Lucy::Util::NumberUtils.

        While you can extend InStream to add a codec, that's not generally the best
        way to go about it, because adding a method to InStream requires that all of
        your users both use your InStream class and use a subclassed Folder which
        overrides the Folder_Open_In() factory method (analogous to
        Directory.openInput()). Better is to use the extension point provided by
        InStream_Buf()/InStream_Advance_Buf() and write a utility function which
        accepts an InStream as an argument.

        I don't expect and am not advocating that Lucene adopt the same IO hierarchy
        as Lucy, but I wanted to provide an example of other reasons why you might
        change things. (What I'd really like to see is for Lucene to come up with
        something better than the Lucy IO hierarchy.)

        One of the reasons Lucene has so many backwards compatibility headaches is
        because the public APIs are so extensive and thus constitute such an elaborate
        set of backwards compatibility promises. IMO, DataInput and DataOutput do
        not offer sufficient benefit to compensate for the increased intricacy they add
        to that backwards compatibility contract.

        Show
        Marvin Humphrey added a comment - > These methods should only be able to call the read/write methods (which this > issue moves to DataInput/Output), but not methods like close(), seek() etc.. Ah, so that's what it is. In that case, let me vote my (non-binding) -1. I don't believe that the enforcement of such a restriction justifies the complexity cost of adding a new class to the public API. First, adding yet another class to the hierarchy steepens the learning curve for users and contributors. If you aren't in the rarefied echelon of exceptional brilliance occupied by people named Michael who work for IBM , the gradual accumulation of complexity in the Lucene code base matters. Inch by inch, things move out of reach. Second, changing things now for what seems to me like a minor reason makes it harder to refactor the class hierarchy in the future when other, more important reasons are inevitably discovered. For LUCENE-2125 , I recommend two possible options. Do nothing and assume that the sort of advanced user who writes a posting codec won't do something incredibly stupid like call indexInput.close(). Add a note to the docs for writing posting codecs indicating which sort of of IO methods you ought not to call. > once we see a need to allow users to extend DataInput/Output outside of > Lucene we can go ahead and make the additional changes that are mentioned in > your in my comments here. In Lucy, there are three tiers of IO usage: For low-level IO, use FileHandle. For most applications, use InStream's encoder/decoder methods. For performance-critical inner-loop material (e.g. posting decoders, SortCollector), access the raw memory-mapped IO buffer using InStream_Buf()/InStream_Advance_Buf() and use static inline functions such as NumUtil_decode_c32 (which does no bounds checking) from Lucy::Util::NumberUtils. While you can extend InStream to add a codec, that's not generally the best way to go about it, because adding a method to InStream requires that all of your users both use your InStream class and use a subclassed Folder which overrides the Folder_Open_In() factory method (analogous to Directory.openInput()). Better is to use the extension point provided by InStream_Buf()/InStream_Advance_Buf() and write a utility function which accepts an InStream as an argument. I don't expect and am not advocating that Lucene adopt the same IO hierarchy as Lucy, but I wanted to provide an example of other reasons why you might change things. (What I'd really like to see is for Lucene to come up with something better than the Lucy IO hierarchy.) One of the reasons Lucene has so many backwards compatibility headaches is because the public APIs are so extensive and thus constitute such an elaborate set of backwards compatibility promises. IMO, DataInput and DataOutput do not offer sufficient benefit to compensate for the increased intricacy they add to that backwards compatibility contract.
        Hide
        Michael Busch added a comment - - edited

        I disagree with you here: introducing DataInput/Output makes IMO the API actually easier for the "normal" user to understand.

        I would think that most users don't implement IndexInput/Output extensions, but simply use the out-of-the-box Directory implementations, which provide IndexInput/Output impls. Also, most users probably don't even call the IndexInput/Output APIs directly.

        Do nothing and assume that the sort of advanced user who writes a posting
        codec won't do something incredibly stupid like call indexInput.close().

        Writing a posting code is much more advanced compared to using 2125's features. Ideally, a user who simply wants to store some specific information in the posting list, such as a boost, a part-of-speech identifier, another VInt, etc. should with 2125 only have to implement a new attribute including the serialize()/deserialize() methods. People who want to do that don't need to know anything about Lucene's API layer. They only need to know the APIs that DataInput/Output provide and will not get confused with methods like seek() or close(). For the standard user who only wants to write such an attribute it should not matter how Lucene's IO structure looks like - so even if we make changes that go into Lucy's direction in the future (IndexInput/Output owning a filehandle vs. the need to extend them) the serialize()/deserialize() methods of attribute would still work with DataInput/Output.

        I bet that a lot of people who used the payload feature before took a ByteArrayOutputStream together with DataOutputStream (which implements Java's DataOutput) to populate the payload byte array. With 2125 Lucene will provide an API that is similar to use, but more efficient as it remove the byte[] array indirection and overhead.

        I'm still +1 for this change. Others?

        Show
        Michael Busch added a comment - - edited I disagree with you here: introducing DataInput/Output makes IMO the API actually easier for the "normal" user to understand. I would think that most users don't implement IndexInput/Output extensions, but simply use the out-of-the-box Directory implementations, which provide IndexInput/Output impls. Also, most users probably don't even call the IndexInput/Output APIs directly. Do nothing and assume that the sort of advanced user who writes a posting codec won't do something incredibly stupid like call indexInput.close(). Writing a posting code is much more advanced compared to using 2125's features. Ideally, a user who simply wants to store some specific information in the posting list, such as a boost, a part-of-speech identifier, another VInt, etc. should with 2125 only have to implement a new attribute including the serialize()/deserialize() methods. People who want to do that don't need to know anything about Lucene's API layer. They only need to know the APIs that DataInput/Output provide and will not get confused with methods like seek() or close(). For the standard user who only wants to write such an attribute it should not matter how Lucene's IO structure looks like - so even if we make changes that go into Lucy's direction in the future (IndexInput/Output owning a filehandle vs. the need to extend them) the serialize()/deserialize() methods of attribute would still work with DataInput/Output. I bet that a lot of people who used the payload feature before took a ByteArrayOutputStream together with DataOutputStream (which implements Java's DataOutput) to populate the payload byte array. With 2125 Lucene will provide an API that is similar to use, but more efficient as it remove the byte[] array indirection and overhead. I'm still +1 for this change. Others?
        Hide
        Shai Erera added a comment -

        I bet that a lot of people who used the payload feature before took a ByteArrayOutputStream together with DataOutputStream

        I actually use ByteBuffer which has similar methods. That's good though if you know the size of the needed byte[] up front. Otherwise, you either code the extension of growing the ByteBuffer, or use DataOutputStream(ByteArrayOutputStream).

        Michael, I read through the patch (briefly though), and I was confused by the names DataInput/Ouput. Initially, when I read this issue, I thought you mean that IndexInput/Output should implement Java's DataInput/Output, but now I see you created two new such classes. So first, can we perhaps name them otherwise, like LuceneInput/Output or something similar, to not confuse w/ Java's? Second, why not have them implement Java's DataInput/Output, and add on top of them additional methods, like readVInt(), readVLong() etc.? You can keep the abstracts LuceneInput/Output to provide the common implementation.

        BTW, a small optimization that I think can be made in the classes is to introduce an internal ByteBuffer of size 8. In the methods like readInt(), you can read 4 bytes into the buffer, calling readBytes(buf.array(), 0, 4), and then buf.getInt(). That will save 4 calls to readByte(). Same will go for long, and the write variants. Doesn't work though w/ readVInt(), because we need to read 1-byte at-a-time to decode. Maybe if the use of these is usually through BufferedIndexInput/Output this does not matter much, but it will still save 2/4 method calls.

        Show
        Shai Erera added a comment - I bet that a lot of people who used the payload feature before took a ByteArrayOutputStream together with DataOutputStream I actually use ByteBuffer which has similar methods. That's good though if you know the size of the needed byte[] up front. Otherwise, you either code the extension of growing the ByteBuffer, or use DataOutputStream(ByteArrayOutputStream). Michael, I read through the patch (briefly though), and I was confused by the names DataInput/Ouput. Initially, when I read this issue, I thought you mean that IndexInput/Output should implement Java's DataInput/Output, but now I see you created two new such classes. So first, can we perhaps name them otherwise, like LuceneInput/Output or something similar, to not confuse w/ Java's? Second, why not have them implement Java's DataInput/Output, and add on top of them additional methods, like readVInt(), readVLong() etc.? You can keep the abstracts LuceneInput/Output to provide the common implementation. BTW, a small optimization that I think can be made in the classes is to introduce an internal ByteBuffer of size 8. In the methods like readInt(), you can read 4 bytes into the buffer, calling readBytes(buf.array(), 0, 4), and then buf.getInt(). That will save 4 calls to readByte(). Same will go for long, and the write variants. Doesn't work though w/ readVInt(), because we need to read 1-byte at-a-time to decode. Maybe if the use of these is usually through BufferedIndexInput/Output this does not matter much, but it will still save 2/4 method calls.
        Hide
        Marvin Humphrey added a comment -

        > I disagree with you here: introducing DataInput/Output makes IMO the API
        > actually easier for the "normal" user to understand.
        >
        > I would think that most users don't implement IndexInput/Output extensions,
        > but simply use the out-of-the-box Directory implementations, which provide
        > IndexInput/Output impls. Also, most users probably don't even call the
        > IndexInput/Output APIs directly.

        I agree with everything you say in the second paragraph, but I don't see how
        any of that supports the assertion you make in the first paragraph.

        Lucene's file system has a directory class, named "Directory", and a pair of
        classes which representing files, named "IndexInput" and "IndexOutput".
        Directories and files. Easy to understand.

        All common IO systems have entities which represent data streaming to/from a
        file. They might be called "file handles", "file descriptors", "readers" and
        "writers", "streams", or whatever, but they're all basically the same thing.

        What this patch does is fragment the pair of classes that representing file
        IO... why?

        What does a "normal" user do with a file?

        Step 1: Open the file.
        Step 2: Write data to the file.
        Step 3: Close the file.

        Then, later...

        Step 1: Open the file.
        Step 2: Read data from the file.
        Step 3: Close the file.

        You're saying that Lucene's file abstraction is easier to understand if you
        break that up?

        I grokked your first rationale – that you don't want people to be able to
        call close() on an IndexInput that they're essentially borrowing for a bit.
        OK, I think it's overkill to create an entire class to thwart something nobody
        was going to do anyway, but at least I understand why you might want to do
        that.

        But the idea that this strange fragmentation of the IO hierarchy makes things
        easier – I don't get it at all. And I certainly don't see how it's such an
        improvement over what exists now that it justifies a change to the public API.

        Show
        Marvin Humphrey added a comment - > I disagree with you here: introducing DataInput/Output makes IMO the API > actually easier for the "normal" user to understand. > > I would think that most users don't implement IndexInput/Output extensions, > but simply use the out-of-the-box Directory implementations, which provide > IndexInput/Output impls. Also, most users probably don't even call the > IndexInput/Output APIs directly. I agree with everything you say in the second paragraph, but I don't see how any of that supports the assertion you make in the first paragraph. Lucene's file system has a directory class, named "Directory", and a pair of classes which representing files, named "IndexInput" and "IndexOutput". Directories and files. Easy to understand. All common IO systems have entities which represent data streaming to/from a file. They might be called "file handles", "file descriptors", "readers" and "writers", "streams", or whatever, but they're all basically the same thing. What this patch does is fragment the pair of classes that representing file IO... why? What does a "normal" user do with a file? Step 1: Open the file. Step 2: Write data to the file. Step 3: Close the file. Then, later... Step 1: Open the file. Step 2: Read data from the file. Step 3: Close the file. You're saying that Lucene's file abstraction is easier to understand if you break that up? I grokked your first rationale – that you don't want people to be able to call close() on an IndexInput that they're essentially borrowing for a bit. OK, I think it's overkill to create an entire class to thwart something nobody was going to do anyway, but at least I understand why you might want to do that. But the idea that this strange fragmentation of the IO hierarchy makes things easier – I don't get it at all. And I certainly don't see how it's such an improvement over what exists now that it justifies a change to the public API.
        Hide
        Mark Miller added a comment -

        > I disagree with you here: introducing DataInput/Output makes IMO the API
        > actually easier for the "normal" user to understand.

        >> I agree with everything you say in the second paragraph, but I don't see how
        >> any of that supports the assertion you make in the first paragraph.

        Presumably, because the "normal" user won't touch/see the IndexInput/Output classes, but more likely may deal with DataInput/Output - and those classes
        being limited to what actually makes sense to use for them (only exposing methods they should use) - thats easier for them.

        I was leaning towards Marvin's arguments - it really seems that documentation should be enough to steer users against doing something stupid - there is no
        doubt that writing attributes into the posting list is a fairly advanced operation (though more "normal" than using IndexInput/Output). On the other hand though,
        I'm not really sold on the downsides longer term either. The complexity argument is a bit over blown. If you understand anything down to the level of these classes,
        this is a ridiculously simple change. The backcompat argument is not very persuasive either - not only does it look like a slim chance of any future issues - at this
        level we are fairly loose about back compat when something comes up. I think advanced users have already realized, the more you dig into Lucene's guts, the
        more likely you won't be able to count on jar drop in. Thats just the way things have gone. I don't see a looming concrete issue myself anyway. And if there is a
        hidden one, I don't think anyone is going to get in a ruffle about it.

        So net/net, I'm +1. Seems worth it to me to be able to give a user 2125 the correct API.

        I could go either way on the name change. Not a fan of LuceneInput/Output though.

        Show
        Mark Miller added a comment - > I disagree with you here: introducing DataInput/Output makes IMO the API > actually easier for the "normal" user to understand. >> I agree with everything you say in the second paragraph, but I don't see how >> any of that supports the assertion you make in the first paragraph. Presumably, because the "normal" user won't touch/see the IndexInput/Output classes, but more likely may deal with DataInput/Output - and those classes being limited to what actually makes sense to use for them (only exposing methods they should use) - thats easier for them. I was leaning towards Marvin's arguments - it really seems that documentation should be enough to steer users against doing something stupid - there is no doubt that writing attributes into the posting list is a fairly advanced operation (though more "normal" than using IndexInput/Output). On the other hand though, I'm not really sold on the downsides longer term either. The complexity argument is a bit over blown. If you understand anything down to the level of these classes, this is a ridiculously simple change. The backcompat argument is not very persuasive either - not only does it look like a slim chance of any future issues - at this level we are fairly loose about back compat when something comes up. I think advanced users have already realized, the more you dig into Lucene's guts, the more likely you won't be able to count on jar drop in. Thats just the way things have gone. I don't see a looming concrete issue myself anyway. And if there is a hidden one, I don't think anyone is going to get in a ruffle about it. So net/net, I'm +1. Seems worth it to me to be able to give a user 2125 the correct API. I could go either way on the name change. Not a fan of LuceneInput/Output though.
        Hide
        Michael Busch added a comment -

        What does a "normal" user do with a file?

        Step 1: Open the file.
        Step 2: Write data to the file.
        Step 3: Close the file.

        Then, later...

        Step 1: Open the file.
        Step 2: Read data from the file.
        Step 3: Close the file.

        You're saying that Lucene's file abstraction is easier to understand if you
        break that up?

        No, I'm saying "normal" users do not work directly with files, so they won't do any of your steps above. They don't need to know those I/O related classes (except Directory).

        DataInput/Output is about encoding/decoding of data, which is all a user of 2125 needs to worry about. The user doesn't have to know that the attribute is first serialized into byte slices in TermsHashPerField and then written into the file(s) the actual codec defines.

        But the idea that this strange fragmentation of the IO hierarchy makes things
        easier - I don't get it at all. And I certainly don't see how it's such an
        improvement over what exists now that it justifies a change to the public API.

        It makes it easier for a 2125 user. It does not make it harder for someone "advanced" who's dealing with IndexInput/Output already.

        It makes it also cleaner - look e.g. at ByteSliceReader/Writer: those classes just currently throw RuntimeExceptions in the methods that this patch leaves in IndexInput/Output. Why? Because they're not dealing with file I/O, but with data (de)serialization.

        Show
        Michael Busch added a comment - What does a "normal" user do with a file? Step 1: Open the file. Step 2: Write data to the file. Step 3: Close the file. Then, later... Step 1: Open the file. Step 2: Read data from the file. Step 3: Close the file. You're saying that Lucene's file abstraction is easier to understand if you break that up? No, I'm saying "normal" users do not work directly with files, so they won't do any of your steps above. They don't need to know those I/O related classes (except Directory). DataInput/Output is about encoding/decoding of data, which is all a user of 2125 needs to worry about. The user doesn't have to know that the attribute is first serialized into byte slices in TermsHashPerField and then written into the file(s) the actual codec defines. But the idea that this strange fragmentation of the IO hierarchy makes things easier - I don't get it at all. And I certainly don't see how it's such an improvement over what exists now that it justifies a change to the public API. It makes it easier for a 2125 user. It does not make it harder for someone "advanced" who's dealing with IndexInput/Output already. It makes it also cleaner - look e.g. at ByteSliceReader/Writer: those classes just currently throw RuntimeExceptions in the methods that this patch leaves in IndexInput/Output. Why? Because they're not dealing with file I/O, but with data (de)serialization.
        Hide
        Michael Busch added a comment -

        So first, can we perhaps name them otherwise, like LuceneInput/Output or something similar, to not confuse w/ Java's?

        Hmm, I was a bit concerned about confusion first too. But I'm, like Mark, not really liking LuceneInput/Output. I'd personally be ok with keeping DataInput/Output. But maybe we can come up with something better? Man, naming is always so hard...

        Second, why not have them implement Java's DataInput/Output, and add on top of them additional methods, like readVInt(), readVLong() etc.?

        I considered that, but Java's interfaces dictate what string encoding to use:
        (From java.io.DataInput's javadocs)

        Implementations of the DataInput and DataOutput interfaces represent Unicode strings in a format that is a slight modification of UTF-8.
        

        E.g. DataInput defines readChar(), which we'd have to implement. But in IndexInput we deprecated readChars(), because we don't want to use modified UTF-8 anymore.

        Show
        Michael Busch added a comment - So first, can we perhaps name them otherwise, like LuceneInput/Output or something similar, to not confuse w/ Java's? Hmm, I was a bit concerned about confusion first too. But I'm, like Mark, not really liking LuceneInput/Output. I'd personally be ok with keeping DataInput/Output. But maybe we can come up with something better? Man, naming is always so hard... Second, why not have them implement Java's DataInput/Output, and add on top of them additional methods, like readVInt(), readVLong() etc.? I considered that, but Java's interfaces dictate what string encoding to use: (From java.io.DataInput's javadocs) Implementations of the DataInput and DataOutput interfaces represent Unicode strings in a format that is a slight modification of UTF-8. E.g. DataInput defines readChar(), which we'd have to implement. But in IndexInput we deprecated readChars(), because we don't want to use modified UTF-8 anymore.
        Hide
        Michael Busch added a comment -

        There has been silence here, so I hope everyone is ok with this change now?

        I'll commit this in a day or two if nobody objects!

        Show
        Michael Busch added a comment - There has been silence here, so I hope everyone is ok with this change now? I'll commit this in a day or two if nobody objects!
        Hide
        Michael Busch added a comment -

        Updated patch to trunk.

        I'll have to make a change to the backwards-tests too, because moving the copyBytes() method from IndexOutput to DataOutput and changing its parameter from IndexInput to DataInput breaks drop-in compatibility.

        Show
        Michael Busch added a comment - Updated patch to trunk. I'll have to make a change to the backwards-tests too, because moving the copyBytes() method from IndexOutput to DataOutput and changing its parameter from IndexInput to DataInput breaks drop-in compatibility.
        Hide
        Michael McCandless added a comment -

        Michael will this land on flex or on trunk? Is it ready? I'm trying to wrapup flex

        Show
        Michael McCandless added a comment - Michael will this land on flex or on trunk? Is it ready? I'm trying to wrapup flex
        Hide
        Michael Busch added a comment -

        I'll try to commit tonight to flex, but it'll probably be tomorrow (I think I have to update the patch, cause there were some changes to IndexInput/Output). If you want to merge flex into trunk sooner I can also just commit this afterwards to trunk.

        Show
        Michael Busch added a comment - I'll try to commit tonight to flex, but it'll probably be tomorrow (I think I have to update the patch, cause there were some changes to IndexInput/Output). If you want to merge flex into trunk sooner I can also just commit this afterwards to trunk.
        Hide
        Michael Busch added a comment -

        Committed revision 929340.

        Show
        Michael Busch added a comment - Committed revision 929340.

          People

          • Assignee:
            Michael Busch
            Reporter:
            Michael Busch
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development