Avro
  1. Avro
  2. AVRO-860

Invalid JSON when printing out records with unicode

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Duplicate
    • Affects Version/s: 1.5.1
    • Fix Version/s: None
    • Component/s: java
    • Labels:

      Description

      I have an avro file, that when printed returns invalid JSON.
      The code for iterating and printing is:

      
                  DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
                  DataFileReader<GenericRecord> dataFileReader =
                      new DataFileReader<GenericRecord>(data, reader);
      
                  while (dataFileReader.hasNext()) {
                      System.out.println(dataFileReader.next().toString());
                  }
      

      and the relevant JSON snippet is

          "description": "Move™ offers advertisers the opportunity to deliver messages to consumers at a time when consumers are making the biggest purchases of their lives\uMOVE™ OFFERS ADVERTISERS THE OPPORTUNITY TO DELIVER MESSAGES TO CONSUMERS AT A TIME WHEN CONSUMERS ARE MAKING THE BIGGEST PURCHASES OF THEIR LIVES—OR REMODELING, REDECORATING AND MAINTAINING THEIR MOST IMPORTANT ASSETS.or remodeling, redecorating and maintaining their most important assets.",
      

      (The \uMOVE is the problematic part).

      However if I do:

                      GenericRecord record = dataFileReader.next();
                      Utf8 desc = (Utf8)record.get("description");
                      System.out.println(desc);
      

      Then I get

      Move™ offers advertisers the opportunity to deliver messages to consumers at a time when consumers are making the biggest purchases of their lives—or remodeling, redecorating and maintaining their most important assets.
      
      1. AVRO-860.diff
        6 kB
        Miki Tebeka
      2. m.avro
        0.4 kB
        Miki Tebeka
      3. AVRO-860.diff
        7 kB
        Miki Tebeka

        Issue Links

          Activity

          Miki Tebeka created issue -
          Hide
          Doug Cutting added a comment -

          The problem looks to be in GenericData#writeEscapedString(), added in AVRO-713.

          Show
          Doug Cutting added a comment - The problem looks to be in GenericData#writeEscapedString(), added in AVRO-713 .
          Hide
          Miki Tebeka added a comment -

          Any reason the JSON is constructed manually and not using jackson? (which is already an requirement).

          Show
          Miki Tebeka added a comment - Any reason the JSON is constructed manually and not using jackson? (which is already an requirement).
          Hide
          Doug Cutting added a comment -

          No good reason. History. Schema.toString() uses Jackson.

          Show
          Doug Cutting added a comment - No good reason. History. Schema.toString() uses Jackson.
          Hide
          Miki Tebeka added a comment -

          Patch to have toString use jackson

          Show
          Miki Tebeka added a comment - Patch to have toString use jackson
          Miki Tebeka made changes -
          Field Original Value New Value
          Attachment AVRO-860.diff [ 12486665 ]
          Hide
          Scott Carey added a comment -

          Looks good, is there a unit test that shows this error before the patch, but works after? If not, we should add it to this patch.

          Show
          Scott Carey added a comment - Looks good, is there a unit test that shows this error before the patch, but works after? If not, we should add it to this patch.
          Hide
          Miki Tebeka added a comment -

          OK, I'll work on that. Note the Java is not my strong side (I'm a Python developer). Will try to dig out an offending avro file.

          Show
          Miki Tebeka added a comment - OK, I'll work on that. Note the Java is not my strong side (I'm a Python developer). Will try to dig out an offending avro file.
          Hide
          Doug Cutting added a comment -

          I think the "™" above is what triggered the bug and that any string with this would be mis-encoded.

          Show
          Doug Cutting added a comment - I think the "™" above is what triggered the bug and that any string with this would be mis-encoded.
          Hide
          Scott Carey added a comment -

          It could be as simple as creating a very simple GenercData.Record with a string field set to have the "™" in there (you can place unicode utf8 directly in the source, or use a \u literal).

          Schema s = Schema.parse("{\"type\":\"record\", \"fields\": [{\"name":\"bar\", \"type\":\"string\"}]}");
          GenericData.Record foo = new GenericData.Record(s);
          foo.put(0, "utf8 trademark char-> ™ <-");
          Assert.assertEquals(expected, foo.toString());
          
          Show
          Scott Carey added a comment - It could be as simple as creating a very simple GenercData.Record with a string field set to have the "™" in there (you can place unicode utf8 directly in the source, or use a \u literal). Schema s = Schema.parse( "{\" type\ ":\" record\ ", \" fields\ ": [{\" name ":\" bar\ ", \" type\ ":\" string\ "}]}" ); GenericData.Record foo = new GenericData.Record(s); foo.put(0, "utf8 trademark char -> ™ <-" ); Assert.assertEquals(expected, foo.toString());
          Hide
          Miki Tebeka added a comment -

          Avro file that is encoded to invalid JSON

          Show
          Miki Tebeka added a comment - Avro file that is encoded to invalid JSON
          Miki Tebeka made changes -
          Attachment m.avro [ 12486689 ]
          Hide
          Scott Carey added a comment -

          I tried adding the below to TestGenericData.java:

            @Test
            public void testUtf8String() {
              Schema s = Schema.parse("{\"type\":\"record\", \"name\":\"foo\", \"fields\": [{\"name\":\"bar\", \"type\":[\"null\",\"string\"]}]}");
              GenericRecord foo = new GenericData.Record(s);
              foo.put(0, new Utf8("utf8 trademark char-> ™ <-"));
              System.out.println(foo);
            }
          

          But it did not print out anything suspicious. I have not tried it using your data file. This is with trunk – are you using trunk or 1.5.1?

          Show
          Scott Carey added a comment - I tried adding the below to TestGenericData.java: @Test public void testUtf8String() { Schema s = Schema.parse( "{\" type\ ":\" record\ ", \" name\ ":\" foo\ ", \" fields\ ": [{\" name\ ":\" bar\ ", \" type\ ":[\" null \ ",\" string\ "]}]}" ); GenericRecord foo = new GenericData.Record(s); foo.put(0, new Utf8( "utf8 trademark char -> ™ <-" )); System .out.println(foo); } But it did not print out anything suspicious. I have not tried it using your data file. This is with trunk – are you using trunk or 1.5.1?
          Hide
          Miki Tebeka added a comment -

          Adding the test Scott suggested.

          Show
          Miki Tebeka added a comment - Adding the test Scott suggested.
          Miki Tebeka made changes -
          Attachment AVRO-860.diff [ 12486694 ]
          Hide
          Scott Carey added a comment -

          Sorry, I should have noticed this earlier: This was fixed in AVRO-851

          The test below fails if I revert AVRO-851 on trunk. AVRO-851 likely fixes the issue you see too. I am not sure if AVRO-851 made it into 1.5.2's release candidate.

          AVRO-851 did not switch out to Jackson, I think that is still a valuable improvement.

          However, it appears that the patch here alters the output – it does not escape the character: '\u2013', leaving it as a literal utf8 char ('–'). Is it required to escape unicode characters in this range? Jackson apparently does not in the default configuration.

            @Test
            public void testUtf8StringPrint() {
              Schema s = Schema.parse("{\"type\":\"record\", \"name\":\"foo\", \"fields\": [{\"name\":\"bar\", \"type\":[\"null\",\"string\"]}]}");
              GenericRecord foo = new GenericData.Record(s);
              foo.put(0, new Utf8("unicode char-> \u2013 <-"));
              assertEquals("{\"bar\": \"unicode char-> \\u2013 <-\"}", foo.toString());
            }
          
          Show
          Scott Carey added a comment - Sorry, I should have noticed this earlier: This was fixed in AVRO-851 The test below fails if I revert AVRO-851 on trunk. AVRO-851 likely fixes the issue you see too. I am not sure if AVRO-851 made it into 1.5.2's release candidate. AVRO-851 did not switch out to Jackson, I think that is still a valuable improvement. However, it appears that the patch here alters the output – it does not escape the character: '\u2013', leaving it as a literal utf8 char ('–'). Is it required to escape unicode characters in this range? Jackson apparently does not in the default configuration. @Test public void testUtf8StringPrint() { Schema s = Schema.parse( "{\" type\ ":\" record\ ", \" name\ ":\" foo\ ", \" fields\ ": [{\" name\ ":\" bar\ ", \" type\ ":[\" null \ ",\" string\ "]}]}" ); GenericRecord foo = new GenericData.Record(s); foo.put(0, new Utf8( "unicode char -> \u2013 <-" )); assertEquals( "{\" bar\ ": \" unicode char -> \\u2013 <-\ "}" , foo.toString()); }
          Hide
          Miki Tebeka added a comment -

          I'm not an expert on the subject, but IMO JSON is utf-8.

          Show
          Miki Tebeka added a comment - I'm not an expert on the subject, but IMO JSON is utf-8.
          Hide
          Scott Carey added a comment -

          http://www.json.org/
          indicates that all Unicode characters that are not control characters or " or \ are valid. It does not specify what encoding is valid, just that it is Unicode. So I assume that it must be consistent with whatever encoding the entire document is in.

          http://www.ietf.org/rfc/rfc4627.txt

          Is more precise, and Jackson seems to be implementing that. In that case, only the control characters between 00 and 1F inclusive are required to be encoded, along with \ and ".

          The old code encoded more code points, which will print out more cleanly in some cases but has nothing to do with JSON compliance.

          I think we can safely delegate this to Jackson and trust it outputs valid JSON string encodings.

          Show
          Scott Carey added a comment - http://www.json.org/ indicates that all Unicode characters that are not control characters or " or \ are valid. It does not specify what encoding is valid, just that it is Unicode. So I assume that it must be consistent with whatever encoding the entire document is in. http://www.ietf.org/rfc/rfc4627.txt Is more precise, and Jackson seems to be implementing that. In that case, only the control characters between 00 and 1F inclusive are required to be encoded, along with \ and ". The old code encoded more code points, which will print out more cleanly in some cases but has nothing to do with JSON compliance. I think we can safely delegate this to Jackson and trust it outputs valid JSON string encodings.
          Doug Cutting made changes -
          Link This issue duplicates AVRO-851 [ AVRO-851 ]
          Hide
          Doug Cutting added a comment -

          So we have two different patches for this, one here and one in AVRO-851. This one has the advantage that it uses Jackson, and is thus more likely to produce valid JSON. However it makes a deep copy of data structures, which probably adversely affects performance. Performance here is probably important.

          We could develop an implementation that, instead of Jackson's ObjectMapper, uses Jackson's lower-level JsonGenerator API, as is done in Schema.java. That might both perform well and delegate JSON details to Jackson. On the other hand, JSON is simple enough that the approach in AVRO-851 might be less code and work well enough.

          Thoughts?

          Show
          Doug Cutting added a comment - So we have two different patches for this, one here and one in AVRO-851 . This one has the advantage that it uses Jackson, and is thus more likely to produce valid JSON. However it makes a deep copy of data structures, which probably adversely affects performance. Performance here is probably important. We could develop an implementation that, instead of Jackson's ObjectMapper, uses Jackson's lower-level JsonGenerator API, as is done in Schema.java. That might both perform well and delegate JSON details to Jackson. On the other hand, JSON is simple enough that the approach in AVRO-851 might be less code and work well enough. Thoughts?
          Doug Cutting made changes -
          Assignee Miki Tebeka [ tebeka ]
          Fix Version/s 1.6.0 [ 12316198 ]
          Doug Cutting made changes -
          Fix Version/s 1.7.0 [ 12318848 ]
          Fix Version/s 1.6.0 [ 12316198 ]
          Doug Cutting made changes -
          Fix Version/s 1.7.0 [ 12318848 ]
          Hide
          Doug Cutting added a comment -

          Closing this as a duplicate of AVRO-851.

          Show
          Doug Cutting added a comment - Closing this as a duplicate of AVRO-851 .
          Doug Cutting made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Duplicate [ 3 ]
          Doug Cutting made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Miki Tebeka
              Reporter:
              Miki Tebeka
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development