OpenNLP
  1. OpenNLP
  2. OPENNLP-99

EventStream should extend Iterator<Event>

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: maxent-3.0.0-sourceforge
    • Fix Version/s: None
    • Component/s: Machine Learning
    • Labels:
      None

      Description

      [As requested, brought over from Sourceforge.]

      Conceptually, EventStream is just an Iterator<Event>. You would get better interoperability with other Java libraries if EventStream were declared as such. If you didn't care about backwards compatibility, I'd say just get rid of EventStream entirely and use Iterator<Event> everywhere instead.

      If you care about backwards compatibility, you could at least declare AbstractEventStream as implementing Iterator<Event> - it declares all of hasNext(), next() and remove(). I believe that shouldn't break anything, and should make all the current EventStream implementations into Iterator<Event>s.

      Why do I want this? Because, when using OpenNLP maxent from Scala, if a RealValueFileEventStream were an Iterator<Event>, I could write:

      for (event <- stream)

      { ... }

      But since it's not, I instead have to wrap it in an Iterator:

      val events = new Iterator[Event] { def hasNext = stream.hasNext def next = stream.next }
      for (event <- events) { ... }

      Or write the while loop version:

      while (stream.hasNext)

      { val event = stream.next ... }

        Activity

        Hide
        Joern Kottmann added a comment -

        One of the big issues we have currently with the EventStream is that it cannot throw an exception when something goes wrong. Since it is a requirement to read the data from disk, there is always a potential for failures. For me it seems not possible to implement this nicely with an Iterator. In the opennlp tools project we introduced the ObjectStream to circumvent these issues.

        Show
        Joern Kottmann added a comment - One of the big issues we have currently with the EventStream is that it cannot throw an exception when something goes wrong. Since it is a requirement to read the data from disk, there is always a potential for failures. For me it seems not possible to implement this nicely with an Iterator. In the opennlp tools project we introduced the ObjectStream to circumvent these issues.
        Hide
        Steven Bethard added a comment -

        Ahh, the curse of checked exceptions bites again. I'm not sure how ObjectStream helps - could you explain? Looks like ObjectStream isn't an Iterator either.

        I see two relatively simple solutions:

        (1) Give up on checked exceptions and declare the interface with runtime exceptions instead:

        public class OpenNLPIOException extends java.lang.RuntimeException

        {...}

        public interface EventStream extends Iterator<Event>

        { public Event next() throws OpenNLPIOException; public boolean hasNext() throws OpenNLPIOException; ... }

        You still get all errors reported if something goes wrong, and people can still catch and handle those errors if they want to. Of course, Java will no longer force you to handle them at compile time. And you wouldn't be using a standard exception type.

        (2) Add a method AbstractEventStream.asIterator() that looks something like:

        public Iterator<Event> asIterator() {
        return new Iterator<Event>() {
        public boolean hasNext() {
        try {
        return this.hasNext();
        catch (IOException e)

        { throw new RuntimeException(e); }

        }
        public Event next()

        { ... }

        ...
        }

        Here, we're no longer using the standard Java Iterator interface, but this would allow you to keep EventStream as it is, but still make it easy for users to get an Iterator from most EventStreams if they wanted one.

        Show
        Steven Bethard added a comment - Ahh, the curse of checked exceptions bites again. I'm not sure how ObjectStream helps - could you explain? Looks like ObjectStream isn't an Iterator either. I see two relatively simple solutions: (1) Give up on checked exceptions and declare the interface with runtime exceptions instead: public class OpenNLPIOException extends java.lang.RuntimeException {...} public interface EventStream extends Iterator<Event> { public Event next() throws OpenNLPIOException; public boolean hasNext() throws OpenNLPIOException; ... } You still get all errors reported if something goes wrong, and people can still catch and handle those errors if they want to. Of course, Java will no longer force you to handle them at compile time. And you wouldn't be using a standard exception type. (2) Add a method AbstractEventStream.asIterator() that looks something like: public Iterator<Event> asIterator() { return new Iterator<Event>() { public boolean hasNext() { try { return this.hasNext(); catch (IOException e) { throw new RuntimeException(e); } } public Event next() { ... } ... } Here, we're no longer using the standard Java Iterator interface, but this would allow you to keep EventStream as it is, but still make it easy for users to get an Iterator from most EventStreams if they wanted one.
        Hide
        Joern Kottmann added a comment -

        ObjectStream has a read method which can retrieve an object from the stream. It also declares an exception which can be thrown in case of an error.

        In our experience over at the tools project it turned out that such a read method style interface seems to be easier to use and implement in our cases. When we made the decision we had a long discussion if we prefer an iterator like interface a stream like interface
        and we decided to take the stream like interface.
        It is also a question of taste of course. Since both interface styles can do the same.

        Lets say we want to read every line in a file. The line should be represented as a string.

        With the stream interface its something like this.
        In the read method we read until the end of the first line, and return it.
        It is repeated if it is called again, when the underlying stream is exhausted null is returned.
        User expects read method to maybe block shortly if underlying system is slow.

        Iterator like interface.
        User calls hasNext() to determine if there is one more item. Implementation must now read
        the first line inside hasNext() and then return the result from next() which seems less clear
        und less intuitive to me. User does not really know which of the two methods will block.
        InputStreams/Readers must be adapted to this interface, but are naturally easy to use with
        the stream style interface.

        In the end we had the feeling that stream style interface is a bit simpler and more intuitive to use.

        Show
        Joern Kottmann added a comment - ObjectStream has a read method which can retrieve an object from the stream. It also declares an exception which can be thrown in case of an error. In our experience over at the tools project it turned out that such a read method style interface seems to be easier to use and implement in our cases. When we made the decision we had a long discussion if we prefer an iterator like interface a stream like interface and we decided to take the stream like interface. It is also a question of taste of course. Since both interface styles can do the same. Lets say we want to read every line in a file. The line should be represented as a string. With the stream interface its something like this. In the read method we read until the end of the first line, and return it. It is repeated if it is called again, when the underlying stream is exhausted null is returned. User expects read method to maybe block shortly if underlying system is slow. Iterator like interface. User calls hasNext() to determine if there is one more item. Implementation must now read the first line inside hasNext() and then return the result from next() which seems less clear und less intuitive to me. User does not really know which of the two methods will block. InputStreams/Readers must be adapted to this interface, but are naturally easy to use with the stream style interface. In the end we had the feeling that stream style interface is a bit simpler and more intuitive to use.
        Hide
        Steven Bethard added a comment -

        I guess I don't understand what's unintuitive about the Iterator interface - your description of what to do for hasNext() and next() sound perfectly sensible to me. I can believe that the implementation might be more difficult for Iterator than ObjectStream, but lets put the implementation aside for a bit (either one is definitely possible) and talk just about how they would be used.

        I have three arguments why you should be using Iterator instead of (or at least in addition to) ObjectStream.

        (1) Iterator is a Java standard mechanism for iterating over a collection. People who have learned Iterators from other Java code can quickly understand your code because they already know how Iterators work. ObjectStream is a custom interface to OpenNLP and thus few people will have seen it before, thus it will take longer to learn.

        (2) Using an object with ObjectStream is no easier than using an object with Iterator:

        // using ObjectStream
        Event event = stream.read();
        while (event != null)

        { ... event = stream.read(); }

        // using Iterator
        Event event;
        while (iterator.hasNext())

        { event = iterator.next(); ... }

        It's basically the same amount of code either way. A little more using ObjectStream, but it could be a little less if you use an assignment in the while loop condition. An added benefit of using the Java standard Iterator interface is that you get much better interop with other JVM languages. For example, if you use the Iterator interface, you can write the following Scala code:

        for (event <- iterator)

        { ... }

        If your class implements ObjectStream instead of Iterator, then you have to go back to the while loop.

        (3) If you use Iterator, then you get Java collections and the enhanced for loop for free. For example, you could easily declare a RealValueFileEventIterable like this:

        public class RealValueFileEventIterable implements Iterable<Event> {
        private File file;
        public RealValueFileEventIterable(File file)

        { this.file = file; }

        public Iterator<Event> iterator()

        { return new RealValueFileEventStream(this.file); }

        }

        That's all you need, and then even in Java, people could write:

        for (Event event : new RealValueFileEventIterable(...))

        { ... }

        I guess may main argument is that using Iterator makes your code easier to use for Java programmers who are familiar with Java standard Interfaces. I think making OpenNLP easy to use for general Java programmers is a noble goal.

        Show
        Steven Bethard added a comment - I guess I don't understand what's unintuitive about the Iterator interface - your description of what to do for hasNext() and next() sound perfectly sensible to me. I can believe that the implementation might be more difficult for Iterator than ObjectStream, but lets put the implementation aside for a bit (either one is definitely possible) and talk just about how they would be used. I have three arguments why you should be using Iterator instead of (or at least in addition to) ObjectStream. (1) Iterator is a Java standard mechanism for iterating over a collection. People who have learned Iterators from other Java code can quickly understand your code because they already know how Iterators work. ObjectStream is a custom interface to OpenNLP and thus few people will have seen it before, thus it will take longer to learn. (2) Using an object with ObjectStream is no easier than using an object with Iterator: // using ObjectStream Event event = stream.read(); while (event != null) { ... event = stream.read(); } // using Iterator Event event; while (iterator.hasNext()) { event = iterator.next(); ... } It's basically the same amount of code either way. A little more using ObjectStream, but it could be a little less if you use an assignment in the while loop condition. An added benefit of using the Java standard Iterator interface is that you get much better interop with other JVM languages. For example, if you use the Iterator interface, you can write the following Scala code: for (event <- iterator) { ... } If your class implements ObjectStream instead of Iterator, then you have to go back to the while loop. (3) If you use Iterator, then you get Java collections and the enhanced for loop for free. For example, you could easily declare a RealValueFileEventIterable like this: public class RealValueFileEventIterable implements Iterable<Event> { private File file; public RealValueFileEventIterable(File file) { this.file = file; } public Iterator<Event> iterator() { return new RealValueFileEventStream(this.file); } } That's all you need, and then even in Java, people could write: for (Event event : new RealValueFileEventIterable(...)) { ... } I guess may main argument is that using Iterator makes your code easier to use for Java programmers who are familiar with Java standard Interfaces. I think making OpenNLP easy to use for general Java programmers is a noble goal.
        Hide
        Steven Bethard added a comment -

        Here's two discussions of the issues around Iterator and checked exceptions:

        http://stackoverflow.com/questions/2205773/what-to-do-of-exceptions-when-implementing-java-lang-iterator
        http://stackoverflow.com/questions/2346978/how-should-iterator-implementation-deal-with-checked-exceptions

        In both cases, the suggestion is basically the same as my OpenNLPIOException solution - wrap the checked exceptions with well named RuntimeExceptions and declare those as being thrown.

        Show
        Steven Bethard added a comment - Here's two discussions of the issues around Iterator and checked exceptions: http://stackoverflow.com/questions/2205773/what-to-do-of-exceptions-when-implementing-java-lang-iterator http://stackoverflow.com/questions/2346978/how-should-iterator-implementation-deal-with-checked-exceptions In both cases, the suggestion is basically the same as my OpenNLPIOException solution - wrap the checked exceptions with well named RuntimeExceptions and declare those as being thrown.
        Hide
        Joern Kottmann added a comment -

        We still need to take the exception handling into account. Using a plain Iterator the way you described it would mean not to use checked exceptions (like you already agreed). In the ObjectStream the exception handling is
        done at the end of the stream and does not look ugly at all, so why not use checked exceptions ?

        The whole point I think is that our users are usually implementing the ObjectStream, but not using it to actually read the data them self. In OpenNLP the data indexers use the stream to pull in the data they need. So our users usually
        do not do something like you did in (3).

        In the end I still believe that it is easier to implement these streams when the underlying data source is an InputStream or Reader.

        Show
        Joern Kottmann added a comment - We still need to take the exception handling into account. Using a plain Iterator the way you described it would mean not to use checked exceptions (like you already agreed). In the ObjectStream the exception handling is done at the end of the stream and does not look ugly at all, so why not use checked exceptions ? The whole point I think is that our users are usually implementing the ObjectStream, but not using it to actually read the data them self. In OpenNLP the data indexers use the stream to pull in the data they need. So our users usually do not do something like you did in (3). In the end I still believe that it is easier to implement these streams when the underlying data source is an InputStream or Reader.
        Hide
        Joern Kottmann added a comment -

        And ObjectStream has also a reset method which is used for example by the Parser to read the data in again, we would need this method on an Iterator like interface too. And the remove method does not really makes sense,
        and would not be used.

        Show
        Joern Kottmann added a comment - And ObjectStream has also a reset method which is used for example by the Parser to read the data in again, we would need this method on an Iterator like interface too. And the remove method does not really makes sense, and would not be used.
        Hide
        Steven Bethard added a comment -

        > We still need to take the exception handling into account

        Ok, rewritten with exception handling:

        // using ObjectStream
        try {
        Event event = stream.read();
        while (event != null)

        { ... event = stream.read(); }

        } catch (IOException e)

        { ... }

        // using Iterator
        Event event;
        try {
        while (iterator.hasNext()) { event = iterator.next(); ... }
        } catch (OpenNLPIOException e) { ... }

        Looks about the same to me. But maybe you had something else in mind? If you could write the code that you're concerned about, that would help me understand better.

        > our users are usually implementing the ObjectStream,
        > but not using it to actually read the data them self

        Perhaps it would help to see the use case where I wanted this? Here's the code:

        val testStream = new RealValueFileEventStream(testPath)
        val events = // wrapper for testStream that makes it into a real Iterator<Event>
        for (event <- events)

        { val prediction = model.getBestOutcome(model.eval(event.getContext)) ... }

        Basically, I take each event from a test file, and ask the model to predict an outcome. Is this an atypical use case?

        > I still believe that it is easier to implement these streams
        > when the underlying data source is an InputStream or Reader.

        Yes, I agreed to this earlier. But as a user, I care less about how the streams are implemented, and more about the API I have to use when calling them.

        Or are you suggesting that most users are going to be implementing custom ObjectStreams? If that's the case, you could easily define ObjectStream as an abstract class that implements Iterator<T> e.g.:

        public abstract class ObjectStream<T> implements Iterator<T> {
        public abstract T read();
        ...
        private T next = this.read();
        public boolean hasNext()

        { return this.next != null }

        public T next()

        { T result = this.next; this.next = this.read(); return result; }

        }

        That way, if you think it's easier to implement the ObjectStream API, you just subclass from ObjectStream and you get an Iterator for free.

        > And ObjectStream has also a reset method ...
        > we would need this method on an Iterator like interface too

        Well, the standard Java solution to this would be to declare it as an Iterable<T> instead. Then code like:

        Event event = stream.read();
        while (event != null)

        { ... event = stream.read(); }
        stream.reset()
        while (event != null) { ... event = stream.read(); }

        would instead be written as:

        for (Event event : iterable)

        { ... }
        for (Event event: iterable) { ... }

        Basically, anything that can be reset is an Iterable, and any single pass through it is an Iterator.

        I also agree that remove() isn't necessary but the Iterable interface declares it as optional, so I don't think that's a problem. Just throw an UnsupportedOperationException like it says to in the Iterator javadocs. (Also, note that your AbstractEventStream already has a remove() method that does exactly this.)

        Show
        Steven Bethard added a comment - > We still need to take the exception handling into account Ok, rewritten with exception handling: // using ObjectStream try { Event event = stream.read(); while (event != null) { ... event = stream.read(); } } catch (IOException e) { ... } // using Iterator Event event; try { while (iterator.hasNext()) { event = iterator.next(); ... } } catch (OpenNLPIOException e) { ... } Looks about the same to me. But maybe you had something else in mind? If you could write the code that you're concerned about, that would help me understand better. > our users are usually implementing the ObjectStream, > but not using it to actually read the data them self Perhaps it would help to see the use case where I wanted this? Here's the code: val testStream = new RealValueFileEventStream(testPath) val events = // wrapper for testStream that makes it into a real Iterator<Event> for (event <- events) { val prediction = model.getBestOutcome(model.eval(event.getContext)) ... } Basically, I take each event from a test file, and ask the model to predict an outcome. Is this an atypical use case? > I still believe that it is easier to implement these streams > when the underlying data source is an InputStream or Reader. Yes, I agreed to this earlier. But as a user, I care less about how the streams are implemented, and more about the API I have to use when calling them. Or are you suggesting that most users are going to be implementing custom ObjectStreams? If that's the case, you could easily define ObjectStream as an abstract class that implements Iterator<T> e.g.: public abstract class ObjectStream<T> implements Iterator<T> { public abstract T read(); ... private T next = this.read(); public boolean hasNext() { return this.next != null } public T next() { T result = this.next; this.next = this.read(); return result; } } That way, if you think it's easier to implement the ObjectStream API, you just subclass from ObjectStream and you get an Iterator for free. > And ObjectStream has also a reset method ... > we would need this method on an Iterator like interface too Well, the standard Java solution to this would be to declare it as an Iterable<T> instead. Then code like: Event event = stream.read(); while (event != null) { ... event = stream.read(); } stream.reset() while (event != null) { ... event = stream.read(); } would instead be written as: for (Event event : iterable) { ... } for (Event event: iterable) { ... } Basically, anything that can be reset is an Iterable, and any single pass through it is an Iterator. I also agree that remove() isn't necessary but the Iterable interface declares it as optional, so I don't think that's a problem. Just throw an UnsupportedOperationException like it says to in the Iterator javadocs. (Also, note that your AbstractEventStream already has a remove() method that does exactly this.)
        Hide
        Joern Kottmann added a comment -
        Show
        Joern Kottmann added a comment - Let us continue the discussion on the dev list, here is my answer: http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201101.mbox/%3C4D4202A4.90304@gmail.com%3E
        Hide
        Steven Bethard added a comment -

        Ok, here's a documentation patch summarizing the two main objections to Iterator.

        Show
        Steven Bethard added a comment - Ok, here's a documentation patch summarizing the two main objections to Iterator.
        Hide
        Joern Kottmann added a comment -

        The discussion continued on the linked mailing list thread and resulted in a documentation of the design decision in the ObjectStream interface. The design will not be changed.

        Show
        Joern Kottmann added a comment - The discussion continued on the linked mailing list thread and resulted in a documentation of the design decision in the ObjectStream interface. The design will not be changed.

          People

          • Assignee:
            Joern Kottmann
            Reporter:
            Steven Bethard
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development