Solr
  1. Solr
  2. SOLR-1870

Binary Update Request (javabin) fails when the field type of a multivalued SolrInputDocument field is a Set (or any type that is identified as an instance of iterable)

    Details

      Description

      When the field type of a field in a SolrInputDocument is a Collection based on the Set interface, the JavaBinUpdate request fails. It works when sending the document data over XML.

      1. SOLR-1870.patch
        9 kB
        Hoss Man
      2. SOLR-1870.patch
        0.9 kB
        Noble Paul
      3. SOLR-1870.patch
        6 kB
        Hoss Man
      4. SOLR-1870-test.patch
        7 kB
        Hoss Man
      5. SOLR-1870-test.patch
        2 kB
        Hoss Man

        Activity

        Hide
        Hoss Man added a comment -

        Moving Prasanna's original stack trace and verbose details into comment and reformatting...

        I encountered this error when updating my indexer to write documents in the binary format. I initially got a ClassCastException and after digging into it found the likely cause for it.

        The following piece of code added to 'public void simple() throws IOException' method in TestUpdateRequestCodec.java illustrates the error.

        doc = new SolrInputDocument(); 
        Collection<String> foobar = new HashSet<String>();
        foobar.add("baz1");
        foobar.add("baz2");    
        doc.addField("foobar",foobar);
        updateRequest.add(doc);
        

        The test fails for any Collection derived from the Set interface but will work if the Collection is a List / Array or other types. The stack trace is as follows:

         java.lang.ClassCastException: java.lang.String
        	at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readIterator(JavaBinUpdateRequestCodec.java:121)
        	at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:210)
        	at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readNamedList(JavaBinUpdateRequestCodec.java:107)
        	at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:175)
        	at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:405)
        	at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:171)
        	at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readIterator(JavaBinUpdateRequestCodec.java:119)
        	at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:210)
        	at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readNamedList(JavaBinUpdateRequestCodec.java:107)
        	at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:175)
        	at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:101)
        	at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:126)
        	at org.apache.solr.client.solrj.request.TestUpdateRequestCodec.simple(TestUpdateRequestCodec.java:82)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        	at org.junit.internal.runners.TestMethodRunner.executeMethodBody(TestMethodRunner.java:99)
        	at org.junit.internal.runners.TestMethodRunner.runUnprotected(TestMethodRunner.java:81)
        	at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
        	at org.junit.internal.runners.TestMethodRunner.runMethod(TestMethodRunner.java:75)
        	at org.junit.internal.runners.TestMethodRunner.run(TestMethodRunner.java:45)
        	at org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod(TestClassMethodsRunner.java:66)
        	at org.junit.internal.runners.TestClassMethodsRunner.run(TestClassMethodsRunner.java:35)
        	at org.junit.internal.runners.TestClassRunner$1.runUnprotected(TestClassRunner.java:42)
        	at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
        	at org.junit.internal.runners.TestClassRunner.run(TestClassRunner.java:52)
        	at com.intellij.rt.junit4.Junit4ClassSuite.run(Junit4ClassSuite.java:99)
        	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:40)
        

        The error most likely is in the way javabin codec encodes Sets. In the public boolean writeKnownType(Object val) method in JavaBinCodec.java, a Set is recognized as an instance of iterable and the writeIterator method is called.

        I briefly looked at the writeIterator and readIterator methods and was unable to pinpoint the error. I also looked at the javabin encoded byte stream (not exactly a good way to debug..) and it did not look like it was right.

        Show
        Hoss Man added a comment - Moving Prasanna's original stack trace and verbose details into comment and reformatting... I encountered this error when updating my indexer to write documents in the binary format. I initially got a ClassCastException and after digging into it found the likely cause for it. The following piece of code added to 'public void simple() throws IOException' method in TestUpdateRequestCodec.java illustrates the error. doc = new SolrInputDocument(); Collection< String > foobar = new HashSet< String >(); foobar.add( "baz1" ); foobar.add( "baz2" ); doc.addField( "foobar" ,foobar); updateRequest.add(doc); The test fails for any Collection derived from the Set interface but will work if the Collection is a List / Array or other types. The stack trace is as follows: java.lang.ClassCastException: java.lang.String at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readIterator(JavaBinUpdateRequestCodec.java:121) at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:210) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readNamedList(JavaBinUpdateRequestCodec.java:107) at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:175) at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:405) at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:171) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readIterator(JavaBinUpdateRequestCodec.java:119) at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:210) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readNamedList(JavaBinUpdateRequestCodec.java:107) at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:175) at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:101) at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:126) at org.apache.solr.client.solrj.request.TestUpdateRequestCodec.simple(TestUpdateRequestCodec.java:82) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at org.junit.internal.runners.TestMethodRunner.executeMethodBody(TestMethodRunner.java:99) at org.junit.internal.runners.TestMethodRunner.runUnprotected(TestMethodRunner.java:81) at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34) at org.junit.internal.runners.TestMethodRunner.runMethod(TestMethodRunner.java:75) at org.junit.internal.runners.TestMethodRunner.run(TestMethodRunner.java:45) at org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod(TestClassMethodsRunner.java:66) at org.junit.internal.runners.TestClassMethodsRunner.run(TestClassMethodsRunner.java:35) at org.junit.internal.runners.TestClassRunner$1.runUnprotected(TestClassRunner.java:42) at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34) at org.junit.internal.runners.TestClassRunner.run(TestClassRunner.java:52) at com.intellij.rt.junit4.Junit4ClassSuite.run(Junit4ClassSuite.java:99) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:40) The error most likely is in the way javabin codec encodes Sets. In the public boolean writeKnownType(Object val) method in JavaBinCodec.java, a Set is recognized as an instance of iterable and the writeIterator method is called. I briefly looked at the writeIterator and readIterator methods and was unable to pinpoint the error. I also looked at the javabin encoded byte stream (not exactly a good way to debug..) and it did not look like it was right.
        Hide
        Hoss Man added a comment -

        patch based on Prasanna's comment which illustrates the bug ... still investigating.

        Show
        Hoss Man added a comment - patch based on Prasanna's comment which illustrates the bug ... still investigating.
        Hide
        Hoss Man added a comment -

        Tracing through this code, i don't think the bug is happening when the stream is being written ... writeIterator is fairly straightforward.

        What makes no sense to me is that in JavaBinUpdateRequestCodec, the unmarshall function declares an anonymous subclass of JavaBinCodec where the readIterator has a hardcoded assumption that every object it encounters in an "ITERATOR" sequence will be a List<NamedList> which it can then convert into a SolrInputDocument.

        based on the stack trace, these seems like an abuse of the codec format – it knows that at the "top level" there will be an Iterator of docs, so it overwrites that method in this way, ignorant of the possibility that there might be other purposes for an Iterator in the stream (like a Set of values for a field)

        Show
        Hoss Man added a comment - Tracing through this code, i don't think the bug is happening when the stream is being written ... writeIterator is fairly straightforward. What makes no sense to me is that in JavaBinUpdateRequestCodec, the unmarshall function declares an anonymous subclass of JavaBinCodec where the readIterator has a hardcoded assumption that every object it encounters in an "ITERATOR" sequence will be a List<NamedList> which it can then convert into a SolrInputDocument. based on the stack trace, these seems like an abuse of the codec format – it knows that at the "top level" there will be an Iterator of docs, so it overwrites that method in this way, ignorant of the possibility that there might be other purposes for an Iterator in the stream (like a Set of values for a field)
        Hide
        Hoss Man added a comment -

        Based largely on guessing what it was that this Update codec was trying to do, i created this patch – it's based on the theory that the "outermost" Iterator is expected to be the stream of documents, and any other Iterator should be parsed in the same way as the superclass. It uses a simple boolean to keep track of whether or not the "first" (ie: outermost) Iterator has been encountered yet (which is only feasible because it's an anonymous inner class used for only one stream)

        This patch causes all tests to pass – but that doesn't necessarily convince me that it's correct – i would definitely appreciate it if someone who has actually looked at the update codec code prior to today would review this.

        (Note: i did have to modify the document equality assertions to understand that since we will never have a Set in an unmarshaled doc it needs to convert the corrisponding Collection to a set in order for assertEquals to do a non-ordered comparison)

        Show
        Hoss Man added a comment - Based largely on guessing what it was that this Update codec was trying to do, i created this patch – it's based on the theory that the "outermost" Iterator is expected to be the stream of documents, and any other Iterator should be parsed in the same way as the superclass. It uses a simple boolean to keep track of whether or not the "first" (ie: outermost) Iterator has been encountered yet (which is only feasible because it's an anonymous inner class used for only one stream) This patch causes all tests to pass – but that doesn't necessarily convince me that it's correct – i would definitely appreciate it if someone who has actually looked at the update codec code prior to today would review this. (Note: i did have to modify the document equality assertions to understand that since we will never have a Set in an unmarshaled doc it needs to convert the corrisponding Collection to a set in order for assertEquals to do a non-ordered comparison)
        Hide
        Noble Paul added a comment -

        "top level" there will be an Iterator of docs, so it overwrites that method in this way, ignorant of the possibility that there might be other purposes for an Iterator in the stream (like a Set of values for a field)

        Iterator was created as a special type in javabin codec so that items can be streamed. Any collection should have been written as a List of specific size. When unmarshalled , the items always come out as a List unless we override the readIterator.
        .I guess that a better fix would be to write Collection as List I shall give a patch

        Show
        Noble Paul added a comment - "top level" there will be an Iterator of docs, so it overwrites that method in this way, ignorant of the possibility that there might be other purposes for an Iterator in the stream (like a Set of values for a field) Iterator was created as a special type in javabin codec so that items can be streamed. Any collection should have been written as a List of specific size. When unmarshalled , the items always come out as a List unless we override the readIterator. .I guess that a better fix would be to write Collection as List I shall give a patch
        Hide
        Noble Paul added a comment -

        fixing JavabinCodec to write collection as array

        Show
        Noble Paul added a comment - fixing JavabinCodec to write collection as array
        Hide
        Hoss Man added a comment -

        Iterator was created as a special type in javabin codec so that items can be streamed. Any collection should have been written as a List of specific size.

        I'm confused ... if Iterator support was only ever ment to be "special" for streaming items, then why did writeKnownType have support for Iterator? and why did JavaBinUpdateRequestCodec override the default behavior of readIterator to treat it special?

        As far as your patch goes: instead of adding a new "if (val instanceof Collection)" test to writeKnownType, shouldn't you replace the existing "instanceof List" with "instanceof Collection" ?

        I'm still not understanding all of this, but it also seems like both patches would be a good idea – your change ensures that all Collections are serialized as an Array, but it still leaves open the possibility of a bug if someone tries to use the codec to stream something which is not a Collection but is Iterable. perhaps that was not originally ment ot be supported, but is there any harm in it? is the special case behavior for Iterators for streaming used in a way besides the "top level" docs iterator that i mentioned?

        Show
        Hoss Man added a comment - Iterator was created as a special type in javabin codec so that items can be streamed. Any collection should have been written as a List of specific size. I'm confused ... if Iterator support was only ever ment to be "special" for streaming items, then why did writeKnownType have support for Iterator? and why did JavaBinUpdateRequestCodec override the default behavior of readIterator to treat it special? As far as your patch goes: instead of adding a new "if (val instanceof Collection)" test to writeKnownType, shouldn't you replace the existing "instanceof List" with "instanceof Collection" ? I'm still not understanding all of this, but it also seems like both patches would be a good idea – your change ensures that all Collections are serialized as an Array, but it still leaves open the possibility of a bug if someone tries to use the codec to stream something which is not a Collection but is Iterable. perhaps that was not originally ment ot be supported, but is there any harm in it? is the special case behavior for Iterators for streaming used in a way besides the "top level" docs iterator that i mentioned?
        Hide
        Hoss Man added a comment -

        Noble: the newly updated SOLR-1870-test.patch demonstrates the concern i have for your fix: if the JavaBinCodec has support for Iterator and Iterable, but the JavaBinUpdateRequestCodec makes assumptions about Iterators only being used for streaming docs, then if people add Field values containing Custom objects that implement Iterable but are not actually Collection then the JavaBinUpdateRequestCodec will marshal them correctly, but it will have a ClassCastException when unmarshaling them – ditto for people who want to add lazy Iterators as field values.

        I don't disagree that making JavaBinCodec support Collection is a good idea in general, but it doesn't fix the root problem – i think we need both changes.

        The latest SOLR-1870.patch incorporates both my suggested fix for JavaBinUpdateRequestCodec, as well as your change to JavaBinCodec (with my suggested tweak of replacing List with Collection in the if tree), and all of the tests i've previously posted (ie: SOLR-1870-test.patch is for illustrative purposes only, it's not needed)

        what do you think?

        Show
        Hoss Man added a comment - Noble: the newly updated SOLR-1870 -test.patch demonstrates the concern i have for your fix: if the JavaBinCodec has support for Iterator and Iterable, but the JavaBinUpdateRequestCodec makes assumptions about Iterators only being used for streaming docs, then if people add Field values containing Custom objects that implement Iterable but are not actually Collection then the JavaBinUpdateRequestCodec will marshal them correctly, but it will have a ClassCastException when unmarshaling them – ditto for people who want to add lazy Iterators as field values. I don't disagree that making JavaBinCodec support Collection is a good idea in general, but it doesn't fix the root problem – i think we need both changes. The latest SOLR-1870 .patch incorporates both my suggested fix for JavaBinUpdateRequestCodec, as well as your change to JavaBinCodec (with my suggested tweak of replacing List with Collection in the if tree), and all of the tests i've previously posted (ie: SOLR-1870 -test.patch is for illustrative purposes only, it's not needed) what do you think?
        Hide
        Hoss Man added a comment -

        we should try to get this into 1.4.1 if we can get consensus on the fix.

        Show
        Hoss Man added a comment - we should try to get this into 1.4.1 if we can get consensus on the fix.
        Hide
        Noble Paul added a comment - - edited

        Hoss , the fix is good. Treating all collections as type ARR instead of just collection.
        Moreover we do not need to support Iterable.
        +1

        Show
        Noble Paul added a comment - - edited Hoss , the fix is good. Treating all collections as type ARR instead of just collection. Moreover we do not need to support Iterable. +1
        Hide
        Yonik Seeley added a comment -

        concerning the "seenOuterMostDocIterator",
        Is it the case that there can be only one DocList in the update request?

        Show
        Yonik Seeley added a comment - concerning the "seenOuterMostDocIterator", Is it the case that there can be only one DocList in the update request?
        Hide
        Hoss Man added a comment -

        concerning the "seenOuterMostDocIterator", Is it the case that there can be only one DocList in the update request?

        My reading of JavaBinUpdateRequestCodec is yes, there is only one list. the marshal method takes in an updateRequest, and then writes all the documents in that updateRequest as a single Iterator to the output stream using a custom javaBinCodec that treats the doclist special – but it doesn't change anything about any other iteratros or iterables that may be field values in the DocList, so JavaBinCodec.marshal happily writes then out successfully.

        I'll i've done is update JavaBinUpdateRequestCodec.unmarshal so it only applies it's special logic to this one special top-level iterator – and will now delegate to JavaBinCodec.unmarshal for any other Iterators or Iterables.

        Even if the same InputStream/OutputStream is reused for multiple updateRequest objects using multiple marshal/unmarshal calls (something i don't think is supported in the calling code, the "seenOuterMostDocIterator" is an instance variable inside the unmarshal method, so it will be reset for each call.

        Hoss , the fix is good. Treating all collections as type ARR instead of just collection.

        Moreover we do not need to support Iterable.

        +1

        I'm confused – it seems like you are in favor of my patch, but you also say we do not need to support Iterable ... Iterable (and Iterator) are already supported by the marshalling method w/o any errors or warnings, i didn't add that, all i did was make sure Iterable (and Iterator) would be supported on the unmarshalling side as well.

        If it was never intended for the JavaBinCodec to support Iterator/Iterable except in the special case "streaming" situations like what JavaBinUpdateRequestCodec does then that's fine, but in that case we should change JavaBinCodec so it errors during marshalling – the client shouldn't be able to successfully serialize some data if the server is then going to complain that it's an incorrect format.

        Show
        Hoss Man added a comment - concerning the "seenOuterMostDocIterator", Is it the case that there can be only one DocList in the update request? My reading of JavaBinUpdateRequestCodec is yes, there is only one list. the marshal method takes in an updateRequest, and then writes all the documents in that updateRequest as a single Iterator to the output stream using a custom javaBinCodec that treats the doclist special – but it doesn't change anything about any other iteratros or iterables that may be field values in the DocList, so JavaBinCodec.marshal happily writes then out successfully. I'll i've done is update JavaBinUpdateRequestCodec.unmarshal so it only applies it's special logic to this one special top-level iterator – and will now delegate to JavaBinCodec.unmarshal for any other Iterators or Iterables. Even if the same InputStream/OutputStream is reused for multiple updateRequest objects using multiple marshal/unmarshal calls (something i don't think is supported in the calling code, the "seenOuterMostDocIterator" is an instance variable inside the unmarshal method, so it will be reset for each call. Hoss , the fix is good. Treating all collections as type ARR instead of just collection. Moreover we do not need to support Iterable. +1 I'm confused – it seems like you are in favor of my patch, but you also say we do not need to support Iterable ... Iterable (and Iterator) are already supported by the marshalling method w/o any errors or warnings, i didn't add that, all i did was make sure Iterable (and Iterator) would be supported on the unmarshalling side as well. If it was never intended for the JavaBinCodec to support Iterator/Iterable except in the special case "streaming" situations like what JavaBinUpdateRequestCodec does then that's fine, but in that case we should change JavaBinCodec so it errors during marshalling – the client shouldn't be able to successfully serialize some data if the server is then going to complain that it's an incorrect format.
        Hide
        Noble Paul added a comment - - edited

        I'm confused - it seems like you are in favor of my patch, but you also say we do not need to support Iterable ...

        Oh I misread our patch and thought you removed support for Iterable .

        In the encoded form there is only iterator/arr. So client does not send any unsupported type. for instance Set is an unsupported type in the encoded form but is supported by the library (It is lossy encoding)

        anyway your patch is fine.

        Show
        Noble Paul added a comment - - edited I'm confused - it seems like you are in favor of my patch, but you also say we do not need to support Iterable ... Oh I misread our patch and thought you removed support for Iterable . In the encoded form there is only iterator/arr. So client does not send any unsupported type. for instance Set is an unsupported type in the encoded form but is supported by the library (It is lossy encoding) anyway your patch is fine.
        Hide
        Mark Miller added a comment -

        This is marked 1.4.1 - someone want to finish it off?

        Show
        Mark Miller added a comment - This is marked 1.4.1 - someone want to finish it off?
        Hide
        Noble Paul added a comment -

        The latest patch by Hoss can be committed. I can do it if Hoss does not wish to add anything more

        Show
        Noble Paul added a comment - The latest patch by Hoss can be committed. I can do it if Hoss does not wish to add anything more
        Hide
        Yonik Seeley added a comment -

        I just reviewed this - seems fine to me.

        Show
        Yonik Seeley added a comment - I just reviewed this - seems fine to me.
        Hide
        Hoss Man added a comment -

        Committed revision 954336. ... trunk

        Committed revision 954338. ... 3x branch

        Committed revision 954340. ... 1.4 branch

        Show
        Hoss Man added a comment - Committed revision 954336. ... trunk Committed revision 954338. ... 3x branch Committed revision 954340. ... 1.4 branch

          People

          • Assignee:
            Hoss Man
            Reporter:
            Prasanna Ranganathan
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development