Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1
    • Component/s: None
    • Labels:
      None

      Description

      Implemented an Iterator interface for the Vector classes. Was necessary for porting from Float[] used in some parts of the code.

      1. VectorIterator.patch.tar.bz2
        3 kB
        Samee Zahur
      2. VectorIterator.patch.2.tar.bz2
        4 kB
        Samee Zahur
      3. VectorIterator.3.patch.bz2
        4 kB
        Samee Zahur

        Issue Links

          Activity

          Hide
          Karl Wettin added a comment -

          I just committed a bug fix in VectorView.Iterator, the index was off by offset positions. Also added some more tests.

          Show
          Karl Wettin added a comment - I just committed a bug fix in VectorView.Iterator, the index was off by offset positions. Also added some more tests.
          Hide
          Karl Wettin added a comment -

          VectorView#iterator was an eternal loop. Fixed and committed.

          I managed to write MAHOUT-26 in the svn commend by misstake.

          Show
          Karl Wettin added a comment - VectorView#iterator was an eternal loop. Fixed and committed. I managed to write MAHOUT-26 in the svn commend by misstake.
          Hide
          Karl Wettin added a comment -

          Committed revision 647469, thanks Samee!

          The only change to the patch was

          //@Override JDK 1.6

          in two places, and I removed an author tag in there too.

          Show
          Karl Wettin added a comment - Committed revision 647469, thanks Samee! The only change to the patch was //@Override JDK 1.6 in two places, and I removed an author tag in there too.
          Hide
          Karl Wettin added a comment -

          I'm also thinking we should perhaps re-use the Element instance in iterator.next?

          Yes, this is specially true when we go through the codes of DenseVector. But the case I came up was this:

          Element sum;
          boolean first=true;
          for(Element e : vec) {
            if(first) {
            sum=e;
            first=false;
            }else sum.set(e.get()+sum.get()); 
          }
          

          In the 'else' part above, if we reuse elements, both sum and e are actually referring to the same object, and if vector size is more than 2, sum will always just hold 2*lastelement rather than the intended value. These kind of errors might be harder to detect in more complicated cases.

          If this turns out to be a big performance issue people we just need to tell people that this is the way it is and that code should instead be implemented like this:

          int firstIndex = -1;
          float sum = 0f;
          for(Element e : vec) {
            if(firstIndex == -1)  {
              firstIndex = e.index();
            }
            sum += e.get();
          }
          vec.set(firstIndex, sum);
          

          If people have a hard time with that we could implement the iterator as an ad hoc class, similar to TermEnum and TermDocs classes of Lucene to really point out what's going on. But that would cripple the nice and compact Iterable code.

          Show
          Karl Wettin added a comment - I'm also thinking we should perhaps re-use the Element instance in iterator.next? Yes, this is specially true when we go through the codes of DenseVector. But the case I came up was this: Element sum; boolean first= true ; for (Element e : vec) { if (first) { sum=e; first= false ; } else sum.set(e.get()+sum.get()); } In the 'else' part above, if we reuse elements, both sum and e are actually referring to the same object, and if vector size is more than 2, sum will always just hold 2*lastelement rather than the intended value. These kind of errors might be harder to detect in more complicated cases. If this turns out to be a big performance issue people we just need to tell people that this is the way it is and that code should instead be implemented like this: int firstIndex = -1; float sum = 0f; for (Element e : vec) { if (firstIndex == -1) { firstIndex = e.index(); } sum += e.get(); } vec.set(firstIndex, sum); If people have a hard time with that we could implement the iterator as an ad hoc class, similar to TermEnum and TermDocs classes of Lucene to really point out what's going on. But that would cripple the nice and compact Iterable code.
          Hide
          Karl Wettin added a comment -

          I'll clean up the patch a bit and pop it here later today. It will most probably be committed today or tomorrow as I my self need the iterator right now for Tanimoto distance in MAHOUT-19.

          I think it is worth to continue discussing how to improve the Iterator code though. It might jst be me that like preemptive optimization. Have to benchmark and try some.

          Show
          Karl Wettin added a comment - I'll clean up the patch a bit and pop it here later today. It will most probably be committed today or tomorrow as I my self need the iterator right now for Tanimoto distance in MAHOUT-19 . I think it is worth to continue discussing how to improve the Iterator code though. It might jst be me that like preemptive optimization. Have to benchmark and try some.
          Hide
          Samee Zahur added a comment -

          I'm also thinking we should perhaps re-use the Element instance in iterator.next?

          Yes, this is specially true when we go through the codes of DenseVector. But the case I came up was this:

          Element sum;
          boolean first=true;
          for(Element e : vec) {
            if(first) {
            sum=e;
            first=false;
            }else sum.set(e.get()+sum.get()); 
          }
          

          In the 'else' part above, if we reuse elements, both sum and e are actually referring to the same object, and if vector size is more than 2, sum will always just hold 2*lastelement rather than the intended value. These kind of errors might be harder to detect in more complicated cases.

          Show
          Samee Zahur added a comment - I'm also thinking we should perhaps re-use the Element instance in iterator.next? Yes, this is specially true when we go through the codes of DenseVector. But the case I came up was this: Element sum; boolean first= true ; for (Element e : vec) { if (first) { sum=e; first= false ; } else sum.set(e.get()+sum.get()); } In the 'else' part above, if we reuse elements, both sum and e are actually referring to the same object, and if vector size is more than 2, sum will always just hold 2*lastelement rather than the intended value. These kind of errors might be harder to detect in more complicated cases.
          Hide
          Samee Zahur added a comment -

          Well the reason I added VectorPair is that I was going through the MAHOUT-20 codes and sought to remove every single for loop there I'm not sure what you mean by "static inner classes of the test" but yes, the thought of putting VectorPairElement and VectorPairIterator as inner classes to VectorPair did occur to me. At the time, I guess I just wanted to keep each file short and simple. And the thing is I really didn't see anything gained by making VectorPairElement inner. As for VectorPairIterator, we can simply make it a package private rather than inner, as users are supposed to access it as Iterator<VectorPairElement> anyway. That's pretty much all the factoring I could come up with.

          And yes, in fact more generic solutions are possible. But the most elegant ones I could come up with entailed modifying the Vector interface by making it use Java generics, like java.lang do. Presently the Vectors are tied down to Double values (might even be useful if we need complex-valued vectors later). That would allow the users to use a more or less uniform interface, and might even allow the "Element" class to be factored out. Problems with such a change might be:

          The Iterator logics used in each (VectorPair, SparseVector, DenseVector) are quite distinct and hard to specify from a generic class given that Java do not support the kind of template specialization features of C++. But still, I guess doable by implementing a generic interface by concrete classes. Would enable the users to write codes like:

          VectorIterator<Double> it = sparsevec.iterator();
          VectorIterator<Complex> jt = cmplxdensevec.iterator();
          VectorIterator<Pair<Double,Complex>> kt = Vector.pairiterator(sparsevec,cmplxdensevec);

          Could design it this way if you want me to. But is there any specific redundancy in these Iterator classes that you have in mind? And do bear in mind this would probably mean changing the Vector interface to a generic one - meaning other classes that depend on it will have to be tweaked accordingly (probably just by simply replacing Vector by Vector<Double>). This is why I opted in for the simpler designs here.

          Show
          Samee Zahur added a comment - Well the reason I added VectorPair is that I was going through the MAHOUT-20 codes and sought to remove every single for loop there I'm not sure what you mean by "static inner classes of the test" but yes, the thought of putting VectorPairElement and VectorPairIterator as inner classes to VectorPair did occur to me. At the time, I guess I just wanted to keep each file short and simple. And the thing is I really didn't see anything gained by making VectorPairElement inner. As for VectorPairIterator, we can simply make it a package private rather than inner, as users are supposed to access it as Iterator<VectorPairElement> anyway. That's pretty much all the factoring I could come up with. And yes, in fact more generic solutions are possible. But the most elegant ones I could come up with entailed modifying the Vector interface by making it use Java generics, like java.lang do. Presently the Vectors are tied down to Double values (might even be useful if we need complex-valued vectors later). That would allow the users to use a more or less uniform interface, and might even allow the "Element" class to be factored out. Problems with such a change might be: The Iterator logics used in each (VectorPair, SparseVector, DenseVector) are quite distinct and hard to specify from a generic class given that Java do not support the kind of template specialization features of C++. But still, I guess doable by implementing a generic interface by concrete classes. Would enable the users to write codes like: VectorIterator<Double> it = sparsevec.iterator(); VectorIterator<Complex> jt = cmplxdensevec.iterator(); VectorIterator<Pair<Double,Complex>> kt = Vector.pairiterator(sparsevec,cmplxdensevec); Could design it this way if you want me to. But is there any specific redundancy in these Iterator classes that you have in mind? And do bear in mind this would probably mean changing the Vector interface to a generic one - meaning other classes that depend on it will have to be tweaked accordingly (probably just by simply replacing Vector by Vector<Double>). This is why I opted in for the simpler designs here.
          Hide
          Samee Zahur added a comment -

          I.e. @Override is not valid for a top level interface method implementations, nor in interfaces that extend other interfaces.

          Thanks for the pointers, and more importantly, thanks for repeatedly going through my codes like this. So do you want me to submit another patch with @Overrides removed from those places?

          Show
          Samee Zahur added a comment - I.e. @Override is not valid for a top level interface method implementations, nor in interfaces that extend other interfaces. Thanks for the pointers, and more importantly, thanks for repeatedly going through my codes like this. So do you want me to submit another patch with @Overrides removed from those places?
          Hide
          Karl Wettin added a comment -

          I'm also thinking we should perhaps re-use the Element instance in iterator.next? Iterator a matrix of 5000 instance with 40000 features each would be instantiating 200,000,000 Elements. I'm sure the JVM pools that, but we can always help it.

          I thinks it's OK if it is a documented caveat.

          Show
          Karl Wettin added a comment - I'm also thinking we should perhaps re-use the Element instance in iterator.next? Iterator a matrix of 5000 instance with 40000 features each would be instantiating 200,000,000 Elements. I'm sure the JVM pools that, but we can always help it. I thinks it's OK if it is a documented caveat.
          Hide
          Karl Wettin added a comment -

          I understand what VectorPair does, but I'm not sure if there is any other use case than the test case? I want to refactor VectorPair, VectorPairElement and VectorPairIterator as static inner classes of the test, or alternatively introduce some sort of more generic ArrayIterator in the utillity package.

          Does that makes sense?

          Show
          Karl Wettin added a comment - I understand what VectorPair does, but I'm not sure if there is any other use case than the test case? I want to refactor VectorPair, VectorPairElement and VectorPairIterator as static inner classes of the test, or alternatively introduce some sort of more generic ArrayIterator in the utillity package. Does that makes sense?
          Hide
          Karl Wettin added a comment -

          And Override, although not a big issue, is something I use with this understanding:

          If I ever add a new method to an interface, the compiler can help me add it to every subclass by generating appropriate errors and pointing me to those missing places. However, if I remove a function from an interface, the compiler can no longer help me. I'll have to manully find the subclasses implementing that method and see if it should be removed. So for me, the @Override annotation gives the compiler a mechanism to complain if that method is missing from the interfaces/superclass, like the overrides keyword in C# and other languages. This is why I always use @Override even if I am just implementing a method, as long as a superclass/interface names it. If that's not what the annotation was designed for, I'll remove it.

          This is exactly how JDK1.6 works. JDK1.5 that Mahout use allows @Override in generalized super classes but not too much in the polymorphic scope.

          I.e. @Override is not valid for a top level interface method implementations, nor in interfaces that extend other interfaces.

          Or at least, that is how I understand the situation.

          http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5008260

          Show
          Karl Wettin added a comment - And Override, although not a big issue, is something I use with this understanding: If I ever add a new method to an interface, the compiler can help me add it to every subclass by generating appropriate errors and pointing me to those missing places. However, if I remove a function from an interface, the compiler can no longer help me. I'll have to manully find the subclasses implementing that method and see if it should be removed. So for me, the @Override annotation gives the compiler a mechanism to complain if that method is missing from the interfaces/superclass, like the overrides keyword in C# and other languages. This is why I always use @Override even if I am just implementing a method, as long as a superclass/interface names it. If that's not what the annotation was designed for, I'll remove it. This is exactly how JDK1.6 works. JDK1.5 that Mahout use allows @Override in generalized super classes but not too much in the polymorphic scope. I.e. @Override is not valid for a top level interface method implementations, nor in interfaces that extend other interfaces. Or at least, that is how I understand the situation. http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5008260
          Hide
          Samee Zahur added a comment -

          Hate it when I keep doing this!!!! Argghhh!!

          Patch version 3 up.

          Show
          Samee Zahur added a comment - Hate it when I keep doing this!!!! Argghhh!! Patch version 3 up.
          Hide
          Samee Zahur added a comment -

          The 2nd patch (attachment uploaded) has VectorPair included.

          And Override, although not a big issue, is something I use with this understanding:

          If I ever add a new method to an interface, the compiler can help me add it to every subclass by generating appropriate errors and pointing me to those missing places. However, if I remove a function from an interface, the compiler can no longer help me. I'll have to manully find the subclasses implementing that method and see if it should be removed. So for me, the @Override annotation gives the compiler a mechanism to complain if that method is missing from the interfaces/superclass, like the overrides keyword in C# and other languages. This is why I always use @Override even if I am just implementing a method, as long as a superclass/interface names it. If that's not what the annotation was designed for, I'll remove it.

          Samee

          Show
          Samee Zahur added a comment - The 2nd patch (attachment uploaded) has VectorPair included. And Override, although not a big issue, is something I use with this understanding: If I ever add a new method to an interface, the compiler can help me add it to every subclass by generating appropriate errors and pointing me to those missing places. However, if I remove a function from an interface, the compiler can no longer help me. I'll have to manully find the subclasses implementing that method and see if it should be removed. So for me, the @Override annotation gives the compiler a mechanism to complain if that method is missing from the interfaces/superclass, like the overrides keyword in C# and other languages. This is why I always use @Override even if I am just implementing a method, as long as a superclass/interface names it. If that's not what the annotation was designed for, I'll remove it. Samee
          Hide
          Karl Wettin added a comment -

          VectorPairIterator?

          Show
          Karl Wettin added a comment - VectorPairIterator?
          Hide
          Samee Zahur added a comment -

          Sorry missed that VectorPair - forgot to add it under svn.

          Show
          Samee Zahur added a comment - Sorry missed that VectorPair - forgot to add it under svn.
          Hide
          Karl Wettin added a comment -

          Thanks Samee, this looks nice. I've actually been needing this too and was looking at implementing it my self.

          However

          VectorPair in TestVectorIterator.java:113 is missing.

          @Override in AbstractVector.java:123 doesn't really override anything, it implements it. Compilers might throw an error there.

          Show
          Karl Wettin added a comment - Thanks Samee, this looks nice. I've actually been needing this too and was looking at implementing it my self. However VectorPair in TestVectorIterator.java:113 is missing. @Override in AbstractVector.java:123 doesn't really override anything, it implements it. Compilers might throw an error there.
          Hide
          Samee Zahur added a comment -

          This issue was created in response to MAHOUT-20

          Show
          Samee Zahur added a comment - This issue was created in response to MAHOUT-20
          Hide
          Samee Zahur added a comment -

          Complete with comments and unit tests. If you get any ideas for improvement, please suggest here.

          Samee

          Show
          Samee Zahur added a comment - Complete with comments and unit tests. If you get any ideas for improvement, please suggest here. Samee

            People

            • Assignee:
              Karl Wettin
              Reporter:
              Samee Zahur
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development