Avro / AVRO-458

add tools that read/write CSV records from/to avro data files

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: java
    • Labels:

      Description

      It might be useful to have command-line tools that can read & write arbitrary CSV data from & to Avro data files.

        Issue Links

          Activity

          Doug Cutting added a comment -

          This is similar to AVRO-456 and AVRO-457.

          Perhaps all three could even be a single tool that supports a variety of formats.

          Harsh J added a comment -

          I've put together a simple CLI tool in Python that does the following (with some tunable options):

          CSV to Avro ->
          1. Pass a schema file; otherwise it generates one from the CSV header, with every field typed as string.
          2. Read and split each CSV record (from a list of input files) on the given delimiter (default ',') and convert each value to the type its schema field declares.
          P.S. If a data-type conversion raises an exception (say, a null where the CSV is supposed to hold a float), check whether the schema passed in has a default for that field and use it. Otherwise, throw an informative exception. I know this stretches the meaning of 'default' in the schema, but it's a great feature to have!
          3. Write these records down into a data file.
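
A minimal sketch of steps 1 and 2 above, using only the Python standard library. It is not the code from the avroutils repository: the names `CONVERTERS`, `schema_from_header`, `convert_row`, and `csv_to_records` are hypothetical, the input is assumed to always carry a header row, and only primitive Avro type names are handled. Writing the resulting records to an Avro data file (step 3) is left out because it needs an Avro library.

```python
import csv
import io

# Hypothetical mapping from Avro primitive type names to Python converters.
CONVERTERS = {
    "string": str,
    "int": int,
    "long": int,
    "float": float,
    "double": float,
}


def schema_from_header(header, name="Row"):
    """Build an Avro record schema (as a dict) with every field typed as string."""
    return {
        "type": "record",
        "name": name,
        "fields": [{"name": col, "type": "string"} for col in header],
    }


def convert_row(row, schema):
    """Convert one CSV row (a list of strings) to a dict matching the schema.

    If a value cannot be converted, fall back to the field's "default";
    if the field has none, raise an error that names the offending field.
    """
    record = {}
    for field, raw in zip(schema["fields"], row):
        try:
            record[field["name"]] = CONVERTERS[field["type"]](raw)
        except ValueError as exc:
            if "default" in field:
                record[field["name"]] = field["default"]
            else:
                raise ValueError(
                    f"cannot convert {raw!r} to {field['type']} "
                    f"for field {field['name']!r}"
                ) from exc
    return record


def csv_to_records(text, schema=None, delimiter=","):
    """Parse CSV text; infer an all-string schema unless one is passed in."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)  # assumption: the first row is always a header
    schema = schema or schema_from_header(header)
    return schema, [convert_row(row, schema) for row in reader]
```

With no schema, `csv_to_records("name,score\nann,1.5\n")` keeps `score` as the string `"1.5"`. With a schema that declares `score` as `float` with a default, an empty cell falls back to that default instead of failing.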

          Avro to CSV ->
          1. Pass a schema to read only selected data; otherwise it reads the file with its full schema.
          2. Read each record [only works with record types for now] and convert all data to strings. Can read from many Avro files into one CSV file.
          3. Write to a CSV file with an optional header.
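
The reverse direction can be sketched the same way, again with the standard library only. Here the records are assumed to arrive as Python dicts, as a reader from an Avro library might yield them; `records_to_csv` and its parameters are hypothetical names, and the `fields` list stands in for the selective schema of step 1.

```python
import csv
import io
from itertools import chain


def records_to_csv(sources, fields, header=True, delimiter=","):
    """Render dict records from several sources as one CSV text.

    sources: iterable of iterables of dicts (one inner iterable per input file).
    fields:  names of the fields to emit, in column order.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    if header:
        writer.writerow(fields)
    for record in chain.from_iterable(sources):
        # Every value is written as a string; None becomes an empty cell.
        writer.writerow(
            ["" if record[name] is None else str(record[name]) for name in fields]
        )
    return out.getvalue()
```

Passing a shorter `fields` list drops the other columns, and `header=False` omits the first row.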

          Currently the code (WIP) resides on GitHub at: http://github.com/QwertyManiac/avroutils but I'll submit the stuff as a formal patch once it feels complete.

          I'm posting this comment to gather suggestions: what to extend, etc.

          Philip Zeyliger added a comment -

          Harsh,

          I took a quick look at the GitHub repository, and this looks like great stuff. Looking forward to more tools in the ecosystem.

          The one thing obviously missing from the patch is some tests.

          Cheers,

          – Philip

          Patrick Linehan added a comment -

          Good stuff. The feature list matches what I've been thinking about implementing myself. The only additional feature I would like to see would be the ability to somehow "embed" more complex data types inside a row. Perhaps as a JSON string? This requires some thought, though.
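
One way the "JSON string in a row" idea might look, as a rough sketch rather than a worked-out design: the complex value is serialized with `json.dumps` into the last CSV cell and parsed back with `json.loads`. The helper names `write_row_with_json` and `read_row_with_json`, and the choice of the last column, are assumptions made for illustration.

```python
import csv
import io
import json


def write_row_with_json(simple_values, complex_value):
    """Write one CSV line whose last cell holds a JSON-encoded value."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerow(
        list(simple_values) + [json.dumps(complex_value)]
    )
    return out.getvalue()


def read_row_with_json(line):
    """Parse one CSV line and decode its last cell from JSON."""
    *simple, encoded = next(csv.reader(io.StringIO(line)))
    return simple, json.loads(encoded)
```

The `csv` module quotes the JSON cell, so commas and double quotes inside it survive the round trip.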

          I'm not familiar with the Python tools in Avro, as I tend to work with Java. Would there be interest in a Java version of this tool, for inclusion in the avro-tools.jar?

          Doug Cutting added a comment -

          > The only additional feature I would like to see would be the ability to somehow "embed" more complex data types inside a row. Perhaps as a JSON string?

          AVRO-456 might be the place to address that: if you need complex data types then perhaps each line should be a JSON value? Note that I just added a patch to AVRO-672 that efficiently reads and writes JSON as Avro data that could be used for this.

          > Would there be interest in a Java version of this tool, for inclusion in the avro-tools.jar?

          Yes, that'd be great!

          Harsh J added a comment -

          Resolved via AVRO-836.

          Marking as duplicate. Sorry about not having had the time to continue on this earlier!


            People

            • Assignee: Unassigned
            • Reporter: Doug Cutting
            • Votes: 0
            • Watchers: 5
