Avro
  1. Avro
  2. AVRO-668

Java: Streamline writing of Strings for Encoders and GenericDatumWriter

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.1
    • Component/s: java
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We can streamline writing of strings to minimize object creation during writes.

      We can avoid converting a String into Utf8 for Json output, and for Binary output we can avoid a Utf8 (but still create a byte[]).

      1. AVRO-668.patch
        5 kB
        Scott Carey
      2. AVRO-668.patch
        7 kB
        Scott Carey
      3. AVRO-668.patch
        7 kB
        Scott Carey
      4. AVRO-668.patch
        8 kB
        Scott Carey

        Activity

        Hide
        Scott Carey added a comment -

        I forgot ValidatingEncoder in the last patch file. Oops.

        Show
        Scott Carey added a comment - I forgot ValidatingEncoder in the last patch file. Oops.
        Hide
        Scott Carey added a comment -

        We could also change this signature to take CharSequence instead of String. But that would break the user-facing API. This only requires those who have extended Encoder to implement the string variation, and that could be backed out and simply overriden in the child classes.

        Show
        Scott Carey added a comment - We could also change this signature to take CharSequence instead of String. But that would break the user-facing API. This only requires those who have extended Encoder to implement the string variation, and that could be backed out and simply overriden in the child classes.
        Hide
        Doug Cutting added a comment -

        Scott, have you benchmarked this? If so, how much does it improve things?

        Some problems with the current patch:

        • it doesn't compile, failing with: ValidatingEncoder is not abstract and does not override abstract method writeString(java.lang.String)
        • once that's fixed, some tests fail for me (e.g., TestBlockIO2)
        • the utf8cs constant should be named UTF8_CS. this fails checkstyle.

        Should we also add Encoder#writeString(CharSequence s) as a convenience? This could be implemented as:

        if (s instanceof Utf8) 
          writeString((Utf8)s);
        else
           writeString(s.toString());
        

        That would move this logic from GenericDatumWriter to Encoder. What do you think? Are there other places that method might be useful?

        Show
        Doug Cutting added a comment - Scott, have you benchmarked this? If so, how much does it improve things? Some problems with the current patch: it doesn't compile, failing with: ValidatingEncoder is not abstract and does not override abstract method writeString(java.lang.String) once that's fixed, some tests fail for me (e.g., TestBlockIO2) the utf8cs constant should be named UTF8_CS. this fails checkstyle. Should we also add Encoder#writeString(CharSequence s) as a convenience? This could be implemented as: if (s instanceof Utf8) writeString((Utf8)s); else writeString(s.toString()); That would move this logic from GenericDatumWriter to Encoder. What do you think? Are there other places that method might be useful?
        Hide
        Scott Carey added a comment -

        As far as performance goes, I have not set up a benchmark. I did note the String > Utf8 > byte[] > outputStream garbage creation in a profiler of my app. But reducing garbage almost never does much in a microbenchmark, it is more of a system level optimization when you've got lots of code churning the GC. It might be useful for:
        http://wiki.github.com/eishay/jvm-serializers/ though. Anything that makes String > out and in > String more streamlined will help that benchmark.

        • The most recent patch compiles. I failed to include it the first time.
        • I'll change it to UTF8_CS

        I can't run all java tests from ant on trunk. Not sure if its my fault or not but it hangs forever in:
        [junit] Running org.apache.avro.TestProtocolHttp
        [junit] 5455 [main] INFO org.apache.avro.ipc.DatagramTransceiver - sent to /127.0.0.1:15276

        The test prior fails too:
        [junit] Running org.apache.avro.TestProtocolGenericMeta
        [junit] 5427 [SocketServer on 0.0.0.0/0.0.0.0:0] INFO org.apache.avro.ipc.SocketTransceiver - open to /10.0.0.231:65043
        [junit] 5429 [SocketServer on 0.0.0.0/0.0.0.0:0] INFO org.apache.avro.ipc.SocketServer - stopping /0.0.0.0
        [junit] 5427 [main] INFO org.apache.avro.ipc.SocketTransceiver - open to 0.0.0.0/0.0.0.0:65040
        [junit] 5441 [Connection to /10.0.0.231:65043] INFO org.apache.avro.TestProtocolGeneric - hello: bob
        [junit] 5441 [main] INFO org.apache.avro.ipc.SocketTransceiver - closing to 0.0.0.0/0.0.0.0:65040
        [junit] Tests run: 6, Failures: 0, Errors: 5, Time elapsed: 0.071 sec

        with 5 errors similar to:
        Testcase: testEcho took 0 sec
        Caused an ERROR
        null
        java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.avro.ipc.SocketTransceiver.writeLength(SocketTransceiver.java:94)
        at org.apache.avro.ipc.SocketTransceiver.writeBuffers(SocketTransceiver.java:84)
        at org.apache.avro.ipc.Transceiver.transceive(Transceiver.java:38)
        at org.apache.avro.ipc.Requestor.request(Requestor.java:123)
        at org.apache.avro.generic.GenericRequestor.request(GenericRequestor.java:43)
        at org.apache.avro.TestProtocolGeneric.testEcho(TestProtocolGeneric.java:132)

        Either that is my setup, or a different JIRA to make that work. Or its just more stupid Mac crap causing trouble.

        If I run TestBlockIO2 by itself, I can see the error you mention but I'm puzzled by what that might be at the moment. Maybe its because I switched from the deprecated Utf8 getLength() to getByteLength()?

        I think providing both writeString(Utf8) and writeString(CharSequence) at the Encoder level would be useful. However thats an API change that would have to wait until 1.5. That is not so bad I guess this isn't a huge performance benefit, just something that combined with 4 or 5 other little things would add up.

        Show
        Scott Carey added a comment - As far as performance goes, I have not set up a benchmark. I did note the String > Utf8 > byte[] > outputStream garbage creation in a profiler of my app. But reducing garbage almost never does much in a microbenchmark, it is more of a system level optimization when you've got lots of code churning the GC. It might be useful for: http://wiki.github.com/eishay/jvm-serializers/ though. Anything that makes String > out and in > String more streamlined will help that benchmark. The most recent patch compiles. I failed to include it the first time. I'll change it to UTF8_CS I can't run all java tests from ant on trunk. Not sure if its my fault or not but it hangs forever in: [junit] Running org.apache.avro.TestProtocolHttp [junit] 5455 [main] INFO org.apache.avro.ipc.DatagramTransceiver - sent to /127.0.0.1:15276 The test prior fails too: [junit] Running org.apache.avro.TestProtocolGenericMeta [junit] 5427 [SocketServer on 0.0.0.0/0.0.0.0:0] INFO org.apache.avro.ipc.SocketTransceiver - open to /10.0.0.231:65043 [junit] 5429 [SocketServer on 0.0.0.0/0.0.0.0:0] INFO org.apache.avro.ipc.SocketServer - stopping /0.0.0.0 [junit] 5427 [main] INFO org.apache.avro.ipc.SocketTransceiver - open to 0.0.0.0/0.0.0.0:65040 [junit] 5441 [Connection to /10.0.0.231:65043] INFO org.apache.avro.TestProtocolGeneric - hello: bob [junit] 5441 [main] INFO org.apache.avro.ipc.SocketTransceiver - closing to 0.0.0.0/0.0.0.0:65040 [junit] Tests run: 6, Failures: 0, Errors: 5, Time elapsed: 0.071 sec with 5 errors similar to: Testcase: testEcho took 0 sec Caused an ERROR null java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324) at org.apache.avro.ipc.SocketTransceiver.writeLength(SocketTransceiver.java:94) at org.apache.avro.ipc.SocketTransceiver.writeBuffers(SocketTransceiver.java:84) at org.apache.avro.ipc.Transceiver.transceive(Transceiver.java:38) at org.apache.avro.ipc.Requestor.request(Requestor.java:123) at org.apache.avro.generic.GenericRequestor.request(GenericRequestor.java:43) at org.apache.avro.TestProtocolGeneric.testEcho(TestProtocolGeneric.java:132) Either that is my setup, or a different JIRA to make that work. Or its just more stupid Mac crap causing trouble. If I run TestBlockIO2 by itself, I can see the error you mention but I'm puzzled by what that might be at the moment. Maybe its because I switched from the deprecated Utf8 getLength() to getByteLength()? I think providing both writeString(Utf8) and writeString(CharSequence) at the Encoder level would be useful. However thats an API change that would have to wait until 1.5. That is not so bad I guess this isn't a huge performance benefit, just something that combined with 4 or 5 other little things would add up.
        Hide
        Doug Cutting added a comment -

        > I can't run all java tests from ant on trunk.

        That's a problem. Do you have a '127.0.0.1 localhost' line in /etc/hosts? How about a '0.0.0.0 localhost' line? I have both, but don't know if either are required.

        > I think providing both writeString(Utf8) and writeString(CharSequence) at the Encoder level would be useful. However thats an API change that would have to wait until 1.5.

        Why? I'm don't see how adding that method with a default implementation would be an incompatible change.

        Show
        Doug Cutting added a comment - > I can't run all java tests from ant on trunk. That's a problem. Do you have a '127.0.0.1 localhost' line in /etc/hosts? How about a '0.0.0.0 localhost' line? I have both, but don't know if either are required. > I think providing both writeString(Utf8) and writeString(CharSequence) at the Encoder level would be useful. However thats an API change that would have to wait until 1.5. Why? I'm don't see how adding that method with a default implementation would be an incompatible change.
        Hide
        Scott Carey added a comment -

        Patch includes suggestion by Doug to use default implementations in Encoder for writeString(String) and writeString(CharSequence) and move logic out of GenericDatumWriter.

        TestBlockIO2 still fails, but feedback on this variant would be useful regardless (or ideas on why these changes would cause such a failure).

        Doug:
        My hosts file has:

        127.0.0.1	localhost
        255.255.255.255	broadcasthost
        ::1             localhost 
        fe80::1%lo0	localhost
        

        Which looks like a typical BSD/Mac hosts file.
        I've got several interfaces in ifconfig; one of those might be causing confusion.

        Show
        Scott Carey added a comment - Patch includes suggestion by Doug to use default implementations in Encoder for writeString(String) and writeString(CharSequence) and move logic out of GenericDatumWriter. TestBlockIO2 still fails, but feedback on this variant would be useful regardless (or ideas on why these changes would cause such a failure). Doug: My hosts file has: 127.0.0.1 localhost 255.255.255.255 broadcasthost ::1 localhost fe80::1%lo0 localhost Which looks like a typical BSD/Mac hosts file. I've got several interfaces in ifconfig; one of those might be causing confusion.
        Hide
        Scott Carey added a comment -

        The last patch conflicts with AVRO-667, I'll make a patch that works on top of AVRO-667 if we commit it.

        Show
        Scott Carey added a comment - The last patch conflicts with AVRO-667 , I'll make a patch that works on top of AVRO-667 if we commit it.
        Hide
        Scott Carey added a comment -

        Tests in o.a.a.io pass.

        Doug's suggestions incorporated.

        Show
        Scott Carey added a comment - Tests in o.a.a.io pass. Doug's suggestions incorporated.
        Hide
        Doug Cutting added a comment -

        Did you mean to use 'Utf8.class.equals(charSequence.getClass())'? I think 'charSequence instanceof Utf8' reads much better, and, as we discussed in AVRO-667, the performance improvement is perhaps not significant and this might be risky if Utf8 is not final.

        (I wonder whether HotSpot doesn't already optimize instanceof for classes with no subclasses yet defined to be ==. It could, couldn't it?)

        Did we also want to update GenericDatumWriter#writeString()? Maybe not...

        Other than that, +1

        Show
        Doug Cutting added a comment - Did you mean to use 'Utf8.class.equals(charSequence.getClass())'? I think 'charSequence instanceof Utf8' reads much better, and, as we discussed in AVRO-667 , the performance improvement is perhaps not significant and this might be risky if Utf8 is not final. (I wonder whether HotSpot doesn't already optimize instanceof for classes with no subclasses yet defined to be ==. It could, couldn't it?) Did we also want to update GenericDatumWriter#writeString()? Maybe not... Other than that, +1
        Hide
        Scott Carey added a comment -

        Did you mean to use 'Utf8.class.equals(charSequence.getClass())'? I think 'charSequence instanceof Utf8' reads much better, and, as we discussed in AVRO-667, the performance improvement is perhaps not significant and this might be risky if Utf8 is not final.

        At first I did this defensively. It is not risky for a subclass, as that would still work, via the CharSequence contract on toString().

        But thinking about it a bit more, this only requires that a sublcass implement getByteLength() and getBytes() properly, and these are unlikely to be done improperly.
        Unlike in other places where equals() and hashCode() have to remain consistent with Utf8 and it is easy for someone to write a subclass that is self-consistent with Object.equals() and Object.hashCode() but break Utf8's variations.

        Therefore instanceof has little value from the defensive side here, and not much performance-wise either. Writing is far more expensive than the difference between instanceof and class.equals() here.

        Did we also want to update GenericDatumWriter#writeString()? Maybe not

        Oops, I did intend to include that!

        Updated patch on the way.

        Show
        Scott Carey added a comment - Did you mean to use 'Utf8.class.equals(charSequence.getClass())'? I think 'charSequence instanceof Utf8' reads much better, and, as we discussed in AVRO-667 , the performance improvement is perhaps not significant and this might be risky if Utf8 is not final. At first I did this defensively. It is not risky for a subclass, as that would still work, via the CharSequence contract on toString(). But thinking about it a bit more, this only requires that a sublcass implement getByteLength() and getBytes() properly, and these are unlikely to be done improperly. Unlike in other places where equals() and hashCode() have to remain consistent with Utf8 and it is easy for someone to write a subclass that is self-consistent with Object.equals() and Object.hashCode() but break Utf8's variations. Therefore instanceof has little value from the defensive side here, and not much performance-wise either. Writing is far more expensive than the difference between instanceof and class.equals() here. Did we also want to update GenericDatumWriter#writeString()? Maybe not Oops, I did intend to include that! Updated patch on the way.
        Hide
        Doug Cutting added a comment -

        I just committed this. Thanks, Scott!

        Show
        Doug Cutting added a comment - I just committed this. Thanks, Scott!

          People

          • Assignee:
            Scott Carey
            Reporter:
            Scott Carey
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development