Avro / AVRO-457

add tools that read/write xml records from/to avro data files

    Details

    • Type: New Feature
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.7.8
    • Fix Version/s: None
    • Component/s: java
    • Labels:
    • Tags:
      gsoc

      Description

      It might be useful to have command-line tools that can read & write arbitrary XML data from & to Avro data files.

      1. AVRO-457.patch
        327 kB
        Michael Pigott
      2. AVRO-457.patch
        326 kB
        Michael Pigott
      3. AVRO-457.patch
        311 kB
        Michael Pigott
      4. AVRO-457.patch
        720 kB
        Michael Pigott

          Activity

          Michael Pigott added a comment -

          Thanks Ryan Blue! That was one part, and I also needed to run mvn install in order to make all of the dependencies available.

          I attached the corrected AVRO-457.patch.

          Ryan Blue added a comment -

          Michael Pigott, I think you want to use . instead of : for the main class argument: -Dexec.mainClass="..."

          Michael Pigott added a comment -

          Sorry for the long delay. I added two tools: FromXmlTool.java and ToXmlTool.java. Unfortunately I cannot seem to test this locally; I cannot run the command

          mvn exec:java -Dexec:mainClass="org.apache.avro.tool.Main"

          without getting the error

          org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.4.0:java (default-cli) on project avro-toplevel: The parameters 'mainClass' for goal org.codehaus.mojo:exec-maven-plugin:1.4.0:java are missing or invalid
          

          Any help is appreciated.

          Doug Cutting added a comment -

          I am unfamiliar with XML-Schema and probably not qualified to review this. Can someone who uses XML-Schema please have a look at this? It would be best to know that others think this would be useful before adding it to Avro. It might also be useful to add command-line tools that convert an Avro data file to an XML file and vice-versa, so that folks can easily see by example how it works.

          Michael Pigott added a comment -

          AVRO-457.patch, take 2.

          Apache XML Schema v2.2.0 is released, so I removed all of the XML-Schema-specific code that went with it. This patch also includes support for AVRO-739, and does a better job of creating valid Java package names. Likewise the Avro Schema Compiler can now generate Java code for generated schemas.

          Feedback is appreciated!

          Thanks,
          Mike

          Michael Pigott added a comment -

          I'm backporting the XML-Schema-specific code to XMLSCHEMA-36.

          Regards,
          Mike

          Michael Pigott added a comment -

          Uploading AVRO-457.patch.

          This change is quite large, so I thought I'd stop here and submit, even if it doesn't completely follow the spec I proposed five weeks ago.

          Notable Differences:

          • I did not add support to automatically generate an XML Schema from an Avro Schema. XmlDatumWriter encodes the XML Schema locations, and XmlDatumReader uses that to reconstruct the XML document.
          • Avro maps are automatically created if an element has exactly one non-optional ID attribute. (I did not make it an optional feature.)
          • Enums are automatically created if all of the enumeration values are valid Avro enum symbols. (This was not in the spec.)
          • I unfortunately did not have much success with JRegex for evaluating XML regular expressions. As a result, regular-expression validation is not part of this release.
          • I added XMLUnit to the dependencies for validating generated XML documents.
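
The enum and map rules in the list above can be sketched in a few lines. This is only an illustration, not the code in the patch: the helper names and the dictionary layout for attributes are hypothetical, and the map rule is read here as "exactly one attribute of type ID that is required". The symbol pattern and the uniqueness requirement come from the Avro specification's rules for names and enum symbols.

```python
import re

# Avro names, including enum symbols, must match this pattern (Avro spec).
AVRO_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_avro_symbol(value):
    """Return True if value could be used as an Avro enum symbol."""
    return AVRO_NAME.fullmatch(value) is not None


def enum_or_string(name, xml_enumeration):
    """Hypothetical helper: build an Avro enum schema when every XML
    enumeration value is a valid, unique symbol; otherwise fall back
    to a plain string schema."""
    values = list(xml_enumeration)
    all_valid = bool(values) and all(is_avro_symbol(v) for v in values)
    if all_valid and len(set(values)) == len(values):
        return {"type": "enum", "name": name, "symbols": values}
    return {"type": "string"}


def maps_to_avro_map(attributes):
    """Hypothetical helper: attributes is a list of dicts with 'type' and
    'use' keys, loosely mirroring XML Schema attribute declarations.
    Returns True when exactly one attribute is a required ID."""
    required_ids = [
        a for a in attributes if a["type"] == "ID" and a["use"] == "required"
    ]
    return len(required_ids) == 1
```

With these definitions, an enumeration such as RED/GREEN becomes an enum, while values like "1st" fall back to a string because they do not start with a letter or underscore.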

          I saw there is a "Submit Patch" button ... is that what I should use instead? I tried it but I did not see a way to upload the patch file.

          Michael Pigott added a comment -

          Thanks for the insight! I have modified the proposal accordingly. If we have the URL to the XML Schema, we can encode that in the Avro schema. If we don't, your recommendation makes a lot of sense. It is a bit more complicated for complex XML types, as all, choice, and sequence groups may contain more groups internally.

          I propose to store group metadata as JSON objects, each with a "type" field naming the kind of node: "all," "choice," "sequence," or "element." The other fields are "minOccurs" and "maxOccurs", which give the minimum and maximum number of occurrences, and a "value" field. For groups, the "value" field is an array of the members of that group. For elements, the "value" field is the element's fully-qualified XML name. Here is an example:

          {
            "type": "sequence",
            "minOccurs": 0,
            "maxOccurs": "unbounded",
            "value": [
              {
                "type": "element",
                "minOccurs": 1,
                "maxOccurs": 1,
                "value": {
                  "namespace": "http://www.w3.org/2001/XMLSchema",
                  "localPart": "complexType"
                }
              }
            ]
          }
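
Because groups can nest, a consumer of this metadata would need to walk it recursively. The following is a minimal sketch of such a walk, assuming the field names from the example above; the function name `element_names` is hypothetical and nothing here is taken from the patch itself.

```python
import json

GROUP_TYPES = {"all", "choice", "sequence"}


def element_names(node):
    """Yield (namespace, localPart) for every element under node,
    depth-first, descending through nested groups."""
    if node["type"] == "element":
        qname = node["value"]
        yield (qname["namespace"], qname["localPart"])
    elif node["type"] in GROUP_TYPES:
        for child in node["value"]:
            yield from element_names(child)
    else:
        raise ValueError("unknown node type: %r" % node["type"])


# The example metadata from the comment above.
metadata = json.loads("""
{ "type": "sequence", "minOccurs": 0, "maxOccurs": "unbounded",
  "value": [
    { "type": "element", "minOccurs": 1, "maxOccurs": 1,
      "value": { "namespace": "http://www.w3.org/2001/XMLSchema",
                 "localPart": "complexType" } }
  ] }
""")

names = list(element_names(metadata))
```

The same traversal shape would apply to any other per-node processing, such as rebuilding the XML Schema particles with their occurrence bounds.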
          

          This isn't a perfect solution - attributes, elements, groups, and types can be abstracted to separate sections of an XML Schema for reusability across the document. In addition, multiple schemas can be referenced when describing an XML document. I think the only true way to support lossless Avro Schema -> XML Schema conversion would be to encode the entire XML Schema in JSON in the Avro schema. That said, the updated proposal will allow us to create an XML Schema that validates the same documents that the original schema would, so I think it is a reasonable compromise.

          Doug Cutting added a comment -

          This would be a great addition to Avro.

          A few quick comments on the proposal:

          • you might use AVRO-1402 to support XML Schema decimal types
          • when mapping from an XML schema to an Avro schema we might add attributes to the Avro schema indicating the XML schema. E.g., XML's unsignedShort type might map to an Avro schema like
            {"type":"int", "xml-schema":"unsignedShort"}.

            Then conversion back to an XML schema might be done losslessly.
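
The reason the extra attribute makes the round trip lossless is that several XML Schema types collapse onto the same Avro type. A small sketch of the idea follows; the table is an illustrative subset chosen for this example (an assumption, not the mapping used by any patch), and `to_avro` and `to_xsd` are hypothetical helper names. Here the "xml-schema" attribute records the original XML Schema type name.

```python
# Illustrative subset only. The value ranges are facts about the types,
# but the table itself is an assumption made for this sketch.
XSD_TO_AVRO = {
    "short": "int",          # 16-bit signed fits in Avro's 32-bit int
    "unsignedShort": "int",  # 0..65535 also fits in a 32-bit int
    "int": "int",
    "unsignedInt": "long",   # 0..4294967295 needs Avro's 64-bit long
}


def to_avro(xsd_type):
    """Map an XML Schema type name to an Avro schema that remembers it."""
    return {"type": XSD_TO_AVRO[xsd_type], "xml-schema": xsd_type}


def to_xsd(avro_schema):
    """Recover the original XML Schema type from the recorded attribute."""
    return avro_schema["xml-schema"]
```

Without the attribute, an Avro "int" could have come from short, unsignedShort, or int, so the original type could not be recovered. Avro permits attributes it does not define as schema metadata, which is what makes this approach possible.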

          Michael Pigott added a comment -

          Hi,
          I created the following proposal for this project: https://docs.google.com/document/d/1BkuMPplmgd4imrU-Fv9RhVsubWMFtmI0mcRzcDAWXD4/edit?usp=sharing
          Comments are welcome!

          Thanks,
          Mike

          Doug Cutting added a comment -

          This is similar to AVRO-456.


            People

            • Assignee:
              Unassigned
              Reporter:
              Doug Cutting
            • Votes:
              0
              Watchers:
              10
