AVRO-672: Convert JSON Text Input to Avro Tool

    Details

    • Type: New Feature
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: java
    • Labels: None

      Description

      The attached patch allows reading in a JSON-formatted text file, one record per line, and converting it to a conforming Avro file. For example, it can read this input file:

      {"intval":12} {"intval":-73,"strval":"hello, there!!"}

      with this schema:
      { "type":"record", "name":"TestRecord", "fields": [

      {"name":"intval","type":"int"}

      ,

      {"name":"strval","type":["string", "null"]}

      ]}

      returning valid Avro. This is different from the DataFileWriteTool, which would instead read in the following internal encoding:

      {"intval":12,"strval":null}

      {"intval":-73,"strval":{"string":"hello, there!!"}}

      In general, the internal encodings used by Avro aren't natural for JSON text that appears in the wild. Likewise, this utility can change characters that are invalid in Avro identifiers into underscores, again to tolerate JSON that wasn't designed to be readable by Avro.
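
      For illustration, the core of such a conversion might look like the Java sketch below. This is not the attached patch: the class and method names are invented, it handles only the types in the example above, and it uses the current com.fasterxml Jackson packages rather than the org.codehaus ones of that era.

      import org.apache.avro.Schema;
      import org.apache.avro.generic.GenericData;
      import org.apache.avro.generic.GenericRecord;

      import com.fasterxml.jackson.databind.JsonNode;
      import com.fasterxml.jackson.databind.ObjectMapper;

      public class JsonLineToAvro {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Parse one line of plain JSON and build a record against the schema.
        public static GenericRecord convert(String jsonLine, Schema schema) throws Exception {
          JsonNode node = MAPPER.readTree(jsonLine);
          GenericRecord record = new GenericData.Record(schema);
          for (Schema.Field field : schema.getFields())
            record.put(field.name(), toAvro(node.get(field.name()), field.schema()));
          return record;
        }

        private static Object toAvro(JsonNode value, Schema schema) {
          if (schema.getType() == Schema.Type.UNION) {
            // Choose the union branch from the JSON value itself, so the input
            // needs no {"string": ...} wrapper as the internal encoding would.
            for (Schema branch : schema.getTypes()) {
              boolean isNull = value == null || value.isNull();
              if (branch.getType() == Schema.Type.NULL ? isNull : !isNull)
                return toAvro(value, branch);
            }
            throw new IllegalArgumentException("no union branch for: " + value);
          }
          if (value == null || value.isNull()) return null;
          switch (schema.getType()) {
            case INT:     return value.intValue();
            case LONG:    return value.longValue();
            case DOUBLE:  return value.doubleValue();
            case BOOLEAN: return value.booleanValue();
            case STRING:  return value.textValue();
            default: throw new IllegalArgumentException("unhandled: " + schema.getType());
          }
        }
      }

      With the TestRecord schema above, convert("{\"intval\":12}", schema) would yield a record whose strval field takes the null branch of the union, matching the first input line.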

      Attachments

      1. AVRO-672.patch
        9 kB
        Doug Cutting
      2. AVRO-672.patch
        18 kB
        Ron Bodkin

        Activity

        Doug Cutting added a comment -

        Leith, is the tool that Ron provided here the one you need? If so, then we can probably resuscitate this patch and get it committed. If not, is there a specific tool you need (e.g., CSV or TSV)? Thanks!

        Leith Shabbot added a comment -

        I see that this feature is unscheduled. The feature Philip mentioned above is one that we are looking for with respect to Avro. I am just curious whether this feature will be part of the Avro tool set and, if so, whether you have a good idea of when it may be targeted?

        Doug Cutting added a comment -

        > I like the idea of having tools that manipulate "traditional" data formats into avro records, including guessing at the schema.

        Do you think Ron's patch here is a good example of this that we should commit?

        I worry that such tools might do 90% of what each application wants and require constant tweaking. And each tweak might break other users. So a tool has to either have lots of flexibility or be lossless. But perhaps I'm just paranoid...

        Philip Zeyliger added a comment -

        I like the idea of having tools that manipulate "traditional" data formats into avro records, including guessing at the schema. CSV and TSV and one-json-per-line are obvious candidates here.

        Doug Cutting added a comment -

        I am not sure whether the tool you need is a general-purpose tool that others will use, or whether it might be better to keep this in your application. Avro's existing JSON encoding is primarily a tool for debugging. Tools that can losslessly import and export JSON data into and out of Avro might also be generally useful. A tool that adapts JSON data to pre-existing schemas could be generally useful if it permitted enough control of how the adaptation is done, but might also be rather application-specific. What do you think?

        Ron Bodkin added a comment -

        The use case I'm most interested in supporting is converting from JSON data to a previously-defined Avro schema, either in a batch file conversion, or in memory (for use with map-reduce).

        This newer patch emits the output in a standard, different schema, and conversion to a previously-defined (custom) schema seems to be a problem that would require code like what I wrote in my patch. Also, it'd be nice to be able to read a value like "1" into a double or a long field, even though it'd be parsed as a JSON integer node.

        Also, I have found it valuable to transform names that contain invalid characters, since there's lots of valid JSON with identifiers that don't conform to the Avro identifier grammar. That would be pretty easy to add to this patch (although the regexp I used before was far too slow, so I have a newer version that's efficient).

        To allow reading in JSON text and creating objects in memory that conform to that schema, I think it'd be necessary to have hints for the type of data that arrays contain (e.g., in generated code or in runtime annotations if using a reflective style). That is something that I already ran into in trying to get the reflection reader to work with specific data (on AVRO-669).
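
        (A minimal sketch of the kind of character-by-character sanitizer described above, assuming the Avro name grammar [A-Za-z_][A-Za-z0-9_]*; the method name and exact replacement rule are assumptions, not the patch's code:)

        // Replace characters that are illegal in Avro names with '_',
        // in a single pass rather than via a regular expression.
        static String sanitizeName(String name) {
          StringBuilder sb = new StringBuilder(name.length());
          for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean legal = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_'
                || (i > 0 && c >= '0' && c <= '9');  // digits may not lead a name
            sb.append(legal ? c : '_');
          }
          return sb.toString();
        }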

        Doug Cutting added a comment -

        It might be confusing to provide two different JSON encodings for Avro data. Also, the encoding in your patch is indeed simpler, but can lose information. For example, a string that looks like base64-encoded binary data would be assumed by Jackson to be binary data, which might not always be the case. Schemas that include fixed or enum values are not supported by this encoding, nor are many unions.

        If reading and writing arbitrary JSON is a priority, then the approach taken in AVRO-251 might be of interest. Here's a patch that provides a DatumReader and DatumWriter for Jackson's JsonNode. This uses a schema that permits arbitrary JSON data. Would this be useful to you? If so, we could provide it as a tool.
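
        (For reference, a schema that permits arbitrary JSON data can be expressed as a recursive record along the following lines; this is a sketch, not necessarily the schema in that patch:)

        { "type": "record", "name": "Json", "fields": [
            {"name": "value", "type": [
                "null", "boolean", "long", "double", "string",
                {"type": "array", "items": "Json"},
                {"type": "map", "values": "Json"}
            ]}
        ]}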

        Ron Bodkin added a comment -

        Patch with implementation of this feature, including a test.


          People

          • Assignee: Unassigned
          • Reporter: Ron Bodkin
          • Votes: 1
          • Watchers: 3
