Solr
  1. Solr
  2. SOLR-2034

javabin should use UTF-8, not modified UTF-8

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1
    • Component/s: None
    • Labels:
      None

      Description

      for better interoperability, javabin should use standard UTF-8 instead of modified UTF-8 (http://www.unicode.org/reports/tr26/)

      1. SOLR-2034.patch
        6 kB
        Robert Muir
      2. SOLR-2034.patch
        4 kB
        Robert Muir
      3. SOLR-2034.patch
        7 kB
        Robert Muir
      4. SOLR-2034.patch
        9 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Jan Høydahl added a comment -

          Adding correct fixed version

          Show
          Jan Høydahl added a comment - Adding correct fixed version
          Hide
          Shawn Heisey added a comment -

          So it requires modifying the code? I am not a Java programmer, nor am I familiar with Solr internals and objects. I do understand programming principles and know a few other languages, so I can sometimes muddle my way through the unfamiliar.

          I took a look at ReplicationHandler.java and RequestHandlerBase.java and was not able to find anything having to do with javabin or looking remotely like what you mentioned. I did find a reference to BinaryQueryResponseWriter, but the section of code that uses it also has this in it, suggesting that replication and javabin might be heavily intertwined:

          throw new RuntimeException("This is a binary writer , Cannot write to a characterstream");

          Show
          Shawn Heisey added a comment - So it requires modifying the code? I am not a Java programmer, nor am I familiar with Solr internals and objects. I do understand programming principles and know a few other languages, so I can sometimes muddle my way through the unfamiliar. I took a look at ReplicationHandler.java and RequestHandlerBase.java and was not able to find anything having to do with javabin or looking remotely like what you mentioned. I did find a reference to BinaryQueryResponseWriter, but the section of code that uses it also has this in it, suggesting that replication and javabin might be heavily intertwined: throw new RuntimeException("This is a binary writer , Cannot write to a characterstream");
          Hide
          Ryan McKinley added a comment -

          try:

          server.setParser(new XMLResponseParser()); 
          

          You may also have to set:

           
          server.setRequestWriter(new RequestWriter());
          
          Show
          Ryan McKinley added a comment - try: server.setParser( new XMLResponseParser()); You may also have to set: server.setRequestWriter( new RequestWriter());
          Hide
          Shawn Heisey added a comment -

          I would be happy to take that route. You'll have to tell me how to switch to XML. I have already tried a couple of things that didn't work.

          Show
          Shawn Heisey added a comment - I would be happy to take that route. You'll have to tell me how to switch to XML. I have already tried a couple of things that didn't work.
          Hide
          Ryan McKinley added a comment -

          Can you switch to xml format? then when everythign is upgraded, switch back?

          Show
          Ryan McKinley added a comment - Can you switch to xml format? then when everythign is upgraded, switch back?
          Hide
          Shawn Heisey added a comment -

          I started upgrading my slave shards today from 1.4.1 to 3.1 and ran into a bit of a showstopper:

          SEVERE: Master at: http://HOST:8983/solr/live/replication is not available. Index fetch failed. Exception: Invalid version or the data in not in 'javabin' format

          What advice do you have? I can't upgrade both master and slave at the same time - we have to be able to fully test against the new version. I would rather not do the testing with completely static indexes, it would be better to have data being added and deleted normally.

          Show
          Shawn Heisey added a comment - I started upgrading my slave shards today from 1.4.1 to 3.1 and ran into a bit of a showstopper: SEVERE: Master at: http://HOST:8983/solr/live/replication is not available. Index fetch failed. Exception: Invalid version or the data in not in 'javabin' format What advice do you have? I can't upgrade both master and slave at the same time - we have to be able to fully test against the new version. I would rather not do the testing with completely static indexes, it would be better to have data being added and deleted normally.
          Hide
          Robert Muir added a comment -

          ok. I was wondering if we are planning to implement javabin in any other languages

          I can't speak for this, except to say that if we wanted to, I think implementing the encoding
          portion would now be significantly easier, especially for languages that don't use an
          internal UTF-16 representation like Java does.

          Show
          Robert Muir added a comment - ok. I was wondering if we are planning to implement javabin in any other languages I can't speak for this, except to say that if we wanted to, I think implementing the encoding portion would now be significantly easier, especially for languages that don't use an internal UTF-16 representation like Java does.
          Hide
          Noble Paul added a comment -

          ok. I was wondering if we are planning to implement javabin in any other languages

          Show
          Noble Paul added a comment - ok. I was wondering if we are planning to implement javabin in any other languages
          Hide
          Robert Muir added a comment -

          Noble, please see my comment: https://issues.apache.org/jira/browse/SOLR-2034?focusedCommentId=12898405&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12898405

          thats an example of another client trying to implement javabin, and wrongly implementing the modified-utf8 conversion... but there doesn't need to be any justification to not use modified utf-8 over the wire, its just wrong.

          Show
          Robert Muir added a comment - Noble, please see my comment: https://issues.apache.org/jira/browse/SOLR-2034?focusedCommentId=12898405&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12898405 thats an example of another client trying to implement javabin, and wrongly implementing the modified-utf8 conversion... but there doesn't need to be any justification to not use modified utf-8 over the wire, its just wrong.
          Hide
          Noble Paul added a comment -

          Is interoperability the issue? The question is , is there any other client using javabin format? Or is it just to be standards compliant?

          Show
          Noble Paul added a comment - Is interoperability the issue? The question is , is there any other client using javabin format? Or is it just to be standards compliant?
          Hide
          Robert Muir added a comment -

          Committed revision 990180 (trunk), 990183 (3x), and updated the 'javabin' page on the wiki.

          I tried to make the change easy to understand in CHANGES/wiki, if you have improvements to
          the wording please do not hesitate.

          Show
          Robert Muir added a comment - Committed revision 990180 (trunk), 990183 (3x), and updated the 'javabin' page on the wiki. I tried to make the change easy to understand in CHANGES/wiki, if you have improvements to the wording please do not hesitate.
          Hide
          Robert Muir added a comment -

          if no one objects to the latest patch, i'd like to commit in a day or two.

          Show
          Robert Muir added a comment - if no one objects to the latest patch, i'd like to commit in a day or two.
          Hide
          Robert Muir added a comment -

          well, i object to my own patch, because i think it would suck to have solrj depend on the lucene jar.

          here's a modified version with its own utf-8 conversion and no bytesref/unicodeutil

          Show
          Robert Muir added a comment - well, i object to my own patch, because i think it would suck to have solrj depend on the lucene jar. here's a modified version with its own utf-8 conversion and no bytesref/unicodeutil
          Hide
          Robert Muir added a comment -

          thanks hoss, ill commit tomorrow if there are no objections.

          Show
          Robert Muir added a comment - thanks hoss, ill commit tomorrow if there are no objections.
          Hide
          Hoss Man added a comment -

          beautiful.

          +1 commit

          Show
          Hoss Man added a comment - beautiful. +1 commit
          Hide
          Robert Muir added a comment -

          OK, i bumped the byte version as Yonik suggested, and tried to use an old client.

          Here's the exception:

          java.lang.RuntimeException: Invalid version or the data in not in 'javabin' format
          	at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
          	at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
          	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:477)
          	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
          	at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
          	at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
          
          Show
          Robert Muir added a comment - OK, i bumped the byte version as Yonik suggested, and tried to use an old client. Here's the exception: java.lang.RuntimeException: Invalid version or the data in not in 'javabin' format at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99) at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:477) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244) at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89) at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
          Hide
          Robert Muir added a comment -

          JavaBinCodec.VERSION should be bumped.
          It is serialized and verified when decoding, and currently an exception is thrown if it does not match the current version.

          Ahhh, I totally missed that version byte. So did I bump the wrong version in the patch (BinaryResponseParser's)? I'll fix

          Show
          Robert Muir added a comment - JavaBinCodec.VERSION should be bumped. It is serialized and verified when decoding, and currently an exception is thrown if it does not match the current version. Ahhh, I totally missed that version byte. So did I bump the wrong version in the patch (BinaryResponseParser's)? I'll fix
          Hide
          Robert Muir added a comment -

          Hoss man, I hear your concerns but i don't understand how we can address any of this.

          This is really one of the problems of modified-UTF8, and really my big concern with using it (that clients will be wrongly written, see my example above). Its not really possible or reasonable to address it at the javabin layer: it needs to be done at a higher protocol level.

          of course, if we can figure this out, then maybe it would be easy to provide back compat too, but i didnt see any obvious places in the code where any versioning is written over the wire.

          Show
          Robert Muir added a comment - Hoss man, I hear your concerns but i don't understand how we can address any of this. This is really one of the problems of modified-UTF8, and really my big concern with using it (that clients will be wrongly written, see my example above). Its not really possible or reasonable to address it at the javabin layer: it needs to be done at a higher protocol level. of course, if we can figure this out, then maybe it would be easy to provide back compat too, but i didnt see any obvious places in the code where any versioning is written over the wire.
          Hide
          Yonik Seeley added a comment -

          Seems OK. I think modified UTF-8 was originally used so that the string chars could be directly written to the output stream instead of to a temp buffer. But copying to a temp buffer first shouldn't have that much overhead.

          JavaBinCodec.VERSION should be bumped.
          It is serialized and verified when decoding, and currently an exception is thrown if it does not match the current version.

          Show
          Yonik Seeley added a comment - Seems OK. I think modified UTF-8 was originally used so that the string chars could be directly written to the output stream instead of to a temp buffer. But copying to a temp buffer first shouldn't have that much overhead. JavaBinCodec.VERSION should be bumped. It is serialized and verified when decoding, and currently an exception is thrown if it does not match the current version.
          Hide
          Hoss Man added a comment -

          I don't think adding many hoops for back compatibility is worth the trouble. Note that that does not mean people can not use solrj to talk across different versions - they may have to use xml though....

          Agreed, my chief concern is what happens when someone tries to use SolrJ 1.4 to talk to Solr 3.1 w/javabin (or vice versa).

          A) If they get an error: great, i'm totaly fine with that – we just document that they should use XML in this case.

          B) If the commands succeed, but the string data is always corrupted, that's not ideal – but not totally horrible since the probably should be immediately obvious and should have read the documentation and known not to do that.

          C) if the commands succeed, but the string data is sometimes corrupted (as i recall, not every character is different in UTF8 vs Java's modified UTF8, correct?) then that seems really bad ... people may start using javabin to update their index and not notice for quite some time that big hard to identify chunks of their data are corrupted.

          as long a someone sanity checks that the situation is either #A or #B before committing, i'm totally cool with it ... but #C scares the bejesus out of me.

          (i'll try to run some tests myself in the next few days if no one else gets a chance)

          Show
          Hoss Man added a comment - I don't think adding many hoops for back compatibility is worth the trouble. Note that that does not mean people can not use solrj to talk across different versions - they may have to use xml though.... Agreed, my chief concern is what happens when someone tries to use SolrJ 1.4 to talk to Solr 3.1 w/javabin (or vice versa). A) If they get an error: great, i'm totaly fine with that – we just document that they should use XML in this case. B) If the commands succeed, but the string data is always corrupted, that's not ideal – but not totally horrible since the probably should be immediately obvious and should have read the documentation and known not to do that. C) if the commands succeed, but the string data is sometimes corrupted (as i recall, not every character is different in UTF8 vs Java's modified UTF8, correct?) then that seems really bad ... people may start using javabin to update their index and not notice for quite some time that big hard to identify chunks of their data are corrupted. as long a someone sanity checks that the situation is either #A or #B before committing, i'm totally cool with it ... but #C scares the bejesus out of me. (i'll try to run some tests myself in the next few days if no one else gets a chance)
          Hide
          Robert Muir added a comment -

          Thanks Ryan, I'll wait a few days before committing to see if there are any objections.

          If there aren't, i'll update wiki / CHANGES loud and clear that the binary format has changed.

          Show
          Robert Muir added a comment - Thanks Ryan, I'll wait a few days before committing to see if there are any objections. If there aren't, i'll update wiki / CHANGES loud and clear that the binary format has changed.
          Hide
          Ryan McKinley added a comment -

          I don't think adding many hoops for back compatibility is worth the trouble. Note that that does not mean people can not use solrj to talk across different versions – they may have to use xml though.

          Show
          Ryan McKinley added a comment - I don't think adding many hoops for back compatibility is worth the trouble. Note that that does not mean people can not use solrj to talk across different versions – they may have to use xml though.
          Hide
          Robert Muir added a comment -

          I bumped the BinaryResponseParser version (only version i can find here).

          its not really obvious to me if this is actually written over the wire / how to conditionalize modified-UTF-8 based on it, and seems risky.

          I think its best to just go to UTF-8 and never look back (but if someone knows how to support modified UTF-8 when version=1, thats great, I just don't have the heart)

          Show
          Robert Muir added a comment - I bumped the BinaryResponseParser version (only version i can find here). its not really obvious to me if this is actually written over the wire / how to conditionalize modified-UTF-8 based on it, and seems risky. I think its best to just go to UTF-8 and never look back (but if someone knows how to support modified UTF-8 when version=1, thats great, I just don't have the heart)
          Hide
          Robert Muir added a comment -

          Hoss, thanks, I agree with regards to backwards compat, unfortunately its not immediately obvious to me how to implement the versioning (seamless like you said, would be preferable).

          the only thing i see is the version in the response parser, but i will play some and see if i can do it in a versioned way (any more pointers would be very helpful).

          ultimately the goal would be to make it easier for non-java clients to implement this protocol. although the wiki says only the java client implements this, i found an issue for the .NET client here: http://code.google.com/p/solrnet/issues/detail?id=71

          I took a look at the github source code (http://github.com/mausch/SolrNet/blob/javabin/SolrNet/Impl/JavaBinCodec.cs) and was a little concerned to see writeChars implemented with Encoding.UTF8.GetBytes... I know its likely a work in progress etc, but I think it illustrates the benefits of standard UTF-8.

          Show
          Robert Muir added a comment - Hoss, thanks, I agree with regards to backwards compat, unfortunately its not immediately obvious to me how to implement the versioning (seamless like you said, would be preferable). the only thing i see is the version in the response parser, but i will play some and see if i can do it in a versioned way (any more pointers would be very helpful). ultimately the goal would be to make it easier for non-java clients to implement this protocol. although the wiki says only the java client implements this, i found an issue for the .NET client here: http://code.google.com/p/solrnet/issues/detail?id=71 I took a look at the github source code ( http://github.com/mausch/SolrNet/blob/javabin/SolrNet/Impl/JavaBinCodec.cs ) and was a little concerned to see writeChars implemented with Encoding.UTF8.GetBytes... I know its likely a work in progress etc, but I think it illustrates the benefits of standard UTF-8.
          Hide
          Hoss Man added a comment -

          +1 to the patch

          My only concern is upgrade compatibility – it would be preferable if people upgrading either Solr or their SolrJ client (but not both at the exact same moment) would still have a functioning system.

          As i recall, the BinaryResponseWriter / Parser use a version param and version metadata in the response (just like the XmlResponseWriter) to indicate the codec version requested and the code version returned – this seems like the kind of thing that should probably warrant a new coden impl with a new version number.

          that said: i didn't follow the details of the binary response writer / parser / codec implementation very closely, so i have no idea how hard it will be to make it all work smoothly for people: if it's a pain in the ass then i'm totally fine with saying that SolrJ 3.x can't talk to Solr 1.x (and vice versa) ... but we should still probably update the binary code version info to make it clear there is a difference

          Show
          Hoss Man added a comment - +1 to the patch My only concern is upgrade compatibility – it would be preferable if people upgrading either Solr or their SolrJ client (but not both at the exact same moment) would still have a functioning system. As i recall, the BinaryResponseWriter / Parser use a version param and version metadata in the response (just like the XmlResponseWriter) to indicate the codec version requested and the code version returned – this seems like the kind of thing that should probably warrant a new coden impl with a new version number. that said: i didn't follow the details of the binary response writer / parser / codec implementation very closely, so i have no idea how hard it will be to make it all work smoothly for people: if it's a pain in the ass then i'm totally fine with saying that SolrJ 3.x can't talk to Solr 1.x (and vice versa) ... but we should still probably update the binary code version info to make it clear there is a difference

            People

            • Assignee:
              Robert Muir
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development