Mahout
  1. Mahout
  2. MAHOUT-379

SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.4
    • Fix Version/s: 0.4
    • Component/s: Math
    • Labels:
      None

      Description

      When a SequentialAccessSparseVector is serialized and deserialized using VectorWritable, the result vector and the original vector are equivalent, yet equals returns false.

      The following unit-test reproduces the problem:

      @Test
      public void testSequentialAccessSparseVectorEquals() throws Exception {
          final Vector v = new SequentialAccessSparseVector(1);
          final VectorWritable vectorWritable = new VectorWritable(v);
          final VectorWritable vectorWritable2 = new VectorWritable();
          writeAndRead(vectorWritable, vectorWritable2);
          final Vector v2 = vectorWritable2.get();
      
          assertTrue(AbstractVector.equivalent(v, v2));
          assertEquals(v, v2); // This line fails!
      }
      
      private void writeAndRead(Writable toWrite, Writable toRead) throws IOException {
          final ByteArrayOutputStream baos = new ByteArrayOutputStream();
          final DataOutputStream dos = new DataOutputStream(baos);
          toWrite.write(dos);
      
          final ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
          final DataInputStream dis = new DataInputStream(bais);
          toRead.readFields(dis);
      }
      
      

      The problem seems to be that the original vector name is null, while the new vector's name is an empty string. The same issue probably also happens with RandomAccessSparseVector.

      SequentialAccessSparseVectorWritable (line 40):

      dataOutput.writeUTF(getName() == null ? "" : getName());
      

      RandomAccessSparseVectorWritable (line 42):

      dataOutput.writeUTF(this.getName() == null ? "" : this.getName());
      

      The simplest fix is probably to change the default Vector's name from null to the empty string.

      1. ASF.LICENSE.NOT.GRANTED--MAHOUT-379.patch
        30 kB
        Sean Owen
      2. ASF.LICENSE.NOT.GRANTED--MAHOUT-379.patch
        91 kB
        Sean Owen
      3. MAHOUT-379.patch
        137 kB
        Sean Owen
      4. MAHOUT-379-lucene.patch
        2 kB
        Grant Ingersoll

        Activity

        Hide
        Sean Owen added a comment -

        Yeah let's take some time to get this right. At the moment I see four notions of equivalence in Vector (which is down from five!), and this seems like one too many:

        ==: of course
        equals(): compares values, names, not implementation
        equivalent(): compares values only
        strictEquivalence(): compares values, names, implementation

        equals() ought to be strict-ish. Its current implementation is fine, though conventional wisdom is that it's better to only consider instances of the same class equals() in order to avoid transitivity problems. I think that's a valid concern here.

        Therefore I submit that equals() should be replaced with what strictEquivalence() does.

        (And then, of course, fix the underlying issue that was raised too!)

        Show
        Sean Owen added a comment - Yeah let's take some time to get this right. At the moment I see four notions of equivalence in Vector (which is down from five!), and this seems like one too many: ==: of course equals(): compares values, names, not implementation equivalent(): compares values only strictEquivalence(): compares values, names, implementation equals() ought to be strict-ish. Its current implementation is fine, though conventional wisdom is that it's better to only consider instances of the same class equals() in order to avoid transitivity problems. I think that's a valid concern here. Therefore I submit that equals() should be replaced with what strictEquivalence() does. (And then, of course, fix the underlying issue that was raised too!)
        Hide
        Grant Ingersoll added a comment -

        I think we probably should have a discussion on the dev list, given we've been over this one several times now.

        Show
        Grant Ingersoll added a comment - I think we probably should have a discussion on the dev list, given we've been over this one several times now.
        Hide
        Sean Owen added a comment -

        This is a pre-patch, per discussion on the mailing list. Is this too much iterator rearrangement or does it clarify?

        Show
        Sean Owen added a comment - This is a pre-patch, per discussion on the mailing list. Is this too much iterator rearrangement or does it clarify?
        Hide
        Danny Leshem added a comment -

        Sean, your patch neither fixes the original issue raised, nor contains the unit-test that exposes it.

        Show
        Danny Leshem added a comment - Sean, your patch neither fixes the original issue raised, nor contains the unit-test that exposes it.
        Hide
        Sean Owen added a comment -

        Yup, this is why I said "pre-patch", in reference to a mailing list thread perhaps you didn't see. This proposes reshuffling the iterator/element stuff, which is marginally related, in the course of actually addressing the issue.

        Show
        Sean Owen added a comment - Yup, this is why I said "pre-patch", in reference to a mailing list thread perhaps you didn't see. This proposes reshuffling the iterator/element stuff, which is marginally related, in the course of actually addressing the issue.
        Hide
        Sean Owen added a comment -

        Here's another patch, which builds on the previous changes, which did not directly relate to this issue but go along with them.

        This patch does include the test case, and more, which passes.

        it resolves the issue by removing Vector names, and removing both equivalent() and strictEquivalence(). equals() now compares only values.

        Further, equals() and hashCode() are paired up in AbstractVector to do more to ensure they are consistent, and correct. We had, for instance, a transitivity problem with VectorView.equals(). Subclasses may specialize for performance; DenseVector still does this for instance and we can revisit this with care later.

        Good news is this patch net-net removes 170 lines of code, even with new unit tests.

        It builds on the previous patch's refactoring of iterator methods.

        It's a big patch so deserves some looking.

        Show
        Sean Owen added a comment - Here's another patch, which builds on the previous changes, which did not directly relate to this issue but go along with them. This patch does include the test case, and more, which passes. it resolves the issue by removing Vector names, and removing both equivalent() and strictEquivalence(). equals() now compares only values. Further, equals() and hashCode() are paired up in AbstractVector to do more to ensure they are consistent, and correct. We had, for instance, a transitivity problem with VectorView.equals(). Subclasses may specialize for performance; DenseVector still does this for instance and we can revisit this with care later. Good news is this patch net-net removes 170 lines of code, even with new unit tests. It builds on the previous patch's refactoring of iterator methods. It's a big patch so deserves some looking.
        Hide
        Sean Owen added a comment -

        Hmm my second patch didn't attach

        Show
        Sean Owen added a comment - Hmm my second patch didn't attach
        Hide
        Robin Anil added a comment -

        If the id from the vector is removed, I believe it will affect all clustering algorithms. The final stage is generating the vector_id, cluster_id pair. will have to verify if this doesn't affect that step

        Show
        Robin Anil added a comment - If the id from the vector is removed, I believe it will affect all clustering algorithms. The final stage is generating the vector_id, cluster_id pair. will have to verify if this doesn't affect that step
        Hide
        Sean Owen added a comment -

        Another update to this huge patch. As promised, *Writable no longer extend Vector classes. NamedVector is in place.

        Show
        Sean Owen added a comment - Another update to this huge patch. As promised, *Writable no longer extend Vector classes. NamedVector is in place.
        Hide
        Sean Owen added a comment -

        I'd like to commit this patch as it addresses a couple issues, but it's a big one. Deserves some looking if you have a moment. We can retroactively undo some elements later, but best to have a glance now.

        Show
        Sean Owen added a comment - I'd like to commit this patch as it addresses a couple issues, but it's a big one. Deserves some looking if you have a moment. We can retroactively undo some elements later, but best to have a glance now.
        Hide
        Grant Ingersoll added a comment -

        This fix broke the LuceneIterable functionality by removing the ability to label Lucene vectors.

        Show
        Grant Ingersoll added a comment - This fix broke the LuceneIterable functionality by removing the ability to label Lucene vectors.
        Hide
        Grant Ingersoll added a comment -

        Here's an untested fix for this.

        Show
        Grant Ingersoll added a comment - Here's an untested fix for this.
        Hide
        Sean Owen added a comment -

        Yes that sounds about right, the intent is to use NamedVector instead.

        Show
        Sean Owen added a comment - Yes that sounds about right, the intent is to use NamedVector instead.
        Hide
        Sean Owen added a comment -

        Am I right this is re-resolved?

        Show
        Sean Owen added a comment - Am I right this is re-resolved?

          People

          • Assignee:
            Sean Owen
            Reporter:
            Danny Leshem
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development