Avro / AVRO-457

add tools that read/write xml records from/to avro data files

    Details

    • Type: New Feature
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.7.8
    • Fix Version/s: None
    • Component/s: java
    • Labels:
    • Tags:
      gsoc

      Description

      It might be useful to have command-line tools that can read & write arbitrary XML data from & to Avro data files.

      1. AVRO-457.patch
        327 kB
        Michael Pigott
      2. AVRO-457.patch
        326 kB
        Michael Pigott
      3. AVRO-457.patch
        311 kB
        Michael Pigott
      4. AVRO-457.patch
        720 kB
        Michael Pigott
      5. ebucore.json
        254 kB
        Bram Biesbrouck

          Activity

          Doug Cutting added a comment -

          This is similar to AVRO-456.

          Michael Pigott added a comment -

          Hi,
          I created the following proposal for this project: https://docs.google.com/document/d/1BkuMPplmgd4imrU-Fv9RhVsubWMFtmI0mcRzcDAWXD4/edit?usp=sharing
          Comments are welcome!

          Thanks,
          Mike

          Doug Cutting added a comment -

          This would be a great addition to Avro.

          A few quick comments on the proposal:

          • you might use AVRO-1402 to support XML Schema decimal types
          • when mapping from an XML schema to an Avro schema we might add attributes to the Avro schema indicating the XML schema. E.g., XML's unsignedShort type might map to an Avro schema like
            {"type":"int", "xml-schema":"unsignedShort"}.

            Then conversion back to an XML schema might be done losslessly.
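A minimal sketch of the annotation idea above, using only the JDK so it runs without Avro on the classpath. The class name, the mapping table, and the helper names are hypothetical; only the "xml-schema" attribute name comes from the suggestion. Each Avro type is picked so that the XML Schema value range fits (for example, unsignedInt goes to "long" because its maximum exceeds a signed 32-bit int).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: maps a few XML Schema integer types to annotated Avro schema JSON. */
final class XsdToAvroPrimitive {
    // Hypothetical mapping table, not taken from any patch on this issue.
    private static final Map<String, String> AVRO_TYPE = new LinkedHashMap<>();
    static {
        AVRO_TYPE.put("unsignedByte", "int");   // 0..255
        AVRO_TYPE.put("unsignedShort", "int");  // 0..65535
        AVRO_TYPE.put("unsignedInt", "long");   // 0..4294967295
        AVRO_TYPE.put("int", "int");
        AVRO_TYPE.put("long", "long");
    }

    /** Builds an Avro schema whose extra "xml-schema" attribute records the original type. */
    static String toAvroSchemaJson(String xsdType) {
        String avroType = AVRO_TYPE.get(xsdType);
        if (avroType == null) {
            throw new IllegalArgumentException("No mapping for XSD type: " + xsdType);
        }
        return "{\"type\":\"" + avroType + "\",\"xml-schema\":\"" + xsdType + "\"}";
    }

    /** Recovers the XSD type from the annotation. Plain string search, not a JSON parser. */
    static String xsdTypeOf(String avroSchemaJson) {
        String key = "\"xml-schema\":\"";
        int start = avroSchemaJson.indexOf(key);
        if (start < 0) {
            throw new IllegalArgumentException("No xml-schema attribute present");
        }
        start += key.length();
        return avroSchemaJson.substring(start, avroSchemaJson.indexOf('"', start));
    }

    public static void main(String[] args) {
        // prints {"type":"int","xml-schema":"unsignedShort"}
        System.out.println(toAvroSchemaJson("unsignedShort"));
    }
}
```

Because the annotation carries the original type name, the reverse step does not have to guess between the several XML Schema types that share one Avro type.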

          Michael Pigott added a comment -

          Thanks for the insight! I have modified the proposal accordingly. If we have the URL to the XML Schema, we can encode that in the Avro schema. If we don't, your recommendation makes a lot of sense. It is a bit more complicated for complex XML types, as all, choice, and sequence groups may contain more groups internally.

          I propose to store group metadata as JSON objects, each of which has a "type" field containing the child type: “all,” “choice,” “sequence,” or “element.” Other fields define the minimum and maximum number of occurrences, and a value field. For groups, the “value” field is an array of the members of that group. For elements, the “value” field is the element’s fully-qualified XML name. Here is an example:

          {
            "type": "sequence",
            "minOccurs": 0,
            "maxOccurs": "unbounded",
            "value": [
              {
                "type": "element",
                "minOccurs": 1,
                "maxOccurs": 1,
                "value": {
                  "namespace": "http://www.w3.org/2001/XMLSchema",
                  "localPart": "complexType"
                }
              }
            ]
          }
          

          This isn't a perfect solution - attributes, elements, groups, and types can be abstracted to separate sections of an XML Schema for reusability across the document. In addition, multiple schemas can be referenced when describing an XML document. I think the only true way to support lossless Avro Schema -> XML Schema conversion would be to encode the entire XML Schema in JSON in the Avro schema. That said, the updated proposal will allow us to create an XML Schema that validates the same documents that the original schema would, so I think it is a reasonable compromise.
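As a rough illustration of the group metadata proposed in this comment, the structure could be modeled as a small tree of particles that serialize themselves to JSON. This is a sketch under assumptions: the class names, the UNBOUNDED sentinel, and the hand-rolled serialization (which does no string escaping) are all hypothetical and are not taken from the patch.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative model of the proposed group metadata; all names here are hypothetical. */
final class XmlGroupMetadata {
    static final long UNBOUNDED = Long.MAX_VALUE; // sentinel rendered as "unbounded"

    interface Particle { String toJson(); }

    // Shared rendering of the occurrence bounds.
    static String occurs(long min, long max) {
        String maxText = (max == UNBOUNDED) ? "\"unbounded\"" : String.valueOf(max);
        return "\"minOccurs\":" + min + ",\"maxOccurs\":" + maxText;
    }

    /** A leaf: "value" is the element's fully-qualified XML name. */
    static final class Element implements Particle {
        final String namespace; final String localPart; final long min; final long max;
        Element(String namespace, String localPart, long min, long max) {
            this.namespace = namespace; this.localPart = localPart; this.min = min; this.max = max;
        }
        public String toJson() {
            return "{\"type\":\"element\"," + occurs(min, max)
                + ",\"value\":{\"namespace\":\"" + namespace
                + "\",\"localPart\":\"" + localPart + "\"}}";
        }
    }

    /** An all/choice/sequence group: "value" is the array of its members. */
    static final class Group implements Particle {
        final String kind; final long min; final long max; final List<Particle> members;
        Group(String kind, long min, long max, List<Particle> members) {
            this.kind = kind; this.min = min; this.max = max; this.members = members;
        }
        public String toJson() {
            StringBuilder sb = new StringBuilder();
            sb.append("{\"type\":\"").append(kind).append("\",").append(occurs(min, max));
            sb.append(",\"value\":[");
            for (int i = 0; i < members.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(members.get(i).toJson());
            }
            return sb.append("]}").toString();
        }
    }

    /** Rebuilds the example from the comment: an unbounded sequence with one element. */
    static Particle example() {
        Particle element = new Element("http://www.w3.org/2001/XMLSchema", "complexType", 1, 1);
        return new Group("sequence", 0, UNBOUNDED, Arrays.asList(element));
    }

    public static void main(String[] args) {
        System.out.println(example().toJson());
    }
}
```

Since a group's members are themselves particles, nested all, choice, and sequence groups fall out of the same two classes.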

          Michael Pigott added a comment -

          Uploading AVRO-457.patch.

          This change is quite large, so I thought I'd stop here and submit, even if it doesn't completely follow the spec I proposed five weeks ago.

          Notable Differences:

          • I did not add support to automatically generate an XML Schema from an Avro Schema. XmlDatumWriter encodes the XML Schema locations, and XmlDatumReader uses that to reconstruct the XML document.
          • Avro maps are automatically created if an element has exactly one non-optional ID attribute. (I did not make it an optional feature.)
          • Enums are automatically created if all of the enumeration values are valid Avro enum symbols. (This was not in the spec.)
          • I unfortunately did not have much success with JRegex for evaluating XML regular expressions. As a result, regular-expression validation is not part of this release.
          • I added XMLUnit to the dependencies for validating generated XML documents.
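The enum rule in the list above can be sketched as a simple check. Avro names, including enum symbols, must match [A-Za-z_][A-Za-z0-9_]*, and the symbols of one enum must be unique. The class and method names below are hypothetical, and the patch may implement the check differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Illustrative check: can a list of XSD enumeration values become Avro enum symbols? */
final class EnumSymbolCheck {
    // Avro name rule: first character [A-Za-z_], remaining characters [A-Za-z0-9_].
    private static final Pattern AVRO_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static boolean canBeAvroEnum(List<String> enumerationValues) {
        if (enumerationValues.isEmpty()) {
            return false;
        }
        Set<String> seen = new HashSet<>();
        for (String value : enumerationValues) {
            // Reject values that are not valid names, and reject duplicates.
            if (!AVRO_NAME.matcher(value).matches() || !seen.add(value)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(canBeAvroEnum(Arrays.asList("RED", "GREEN"))); // prints true
        System.out.println(canBeAvroEnum(Arrays.asList("4K", "HD")));     // prints false
    }
}
```

When the check fails, a converter would presumably fall back to a plain string type for that restriction.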

          I saw there is a "Submit Patch" button ... is that what I should use instead? I tried it but I did not see a way to upload the patch file.

          Michael Pigott added a comment -

          I'm backporting the XML-Schema-specific code to XMLSCHEMA-36.

          Regards,
          Mike

          Michael Pigott added a comment -

          AVRO-457.patch, take 2.

          Apache XML Schema v2.2.0 is released, so I removed all of the XML-Schema-specific code that went with it. This patch also includes support for AVRO-739, and does a better job of creating valid Java package names. Likewise the Avro Schema Compiler can now generate Java code for generated schemas.
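For readers wondering what deriving a valid Java package name from an XML namespace can involve, here is one possible approach. It is an assumption for illustration, not the algorithm in the patch: reverse the host labels, drop a leading "www", append the path segments, and sanitize each piece. Java reserved words are not handled, and all names are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative only: derives a Java package name from an XML namespace URI. */
final class NamespaceToPackage {

    static String toPackageName(String namespaceUri) {
        URI uri = URI.create(namespaceUri);
        List<String> parts = new ArrayList<>();
        if (uri.isOpaque()) {
            // URN-style namespaces such as urn:a:b have no host or path.
            for (String piece : uri.getSchemeSpecificPart().split(":")) {
                if (!piece.isEmpty()) parts.add(sanitize(piece));
            }
        } else {
            String host = uri.getHost();
            if (host != null) {
                String[] labels = host.split("\\.");
                for (int i = labels.length - 1; i >= 0; i--) {
                    if (i == 0 && labels[i].equals("www")) continue; // drop leading "www"
                    parts.add(sanitize(labels[i]));
                }
            }
            String path = uri.getPath();
            if (path != null) {
                for (String segment : path.split("/")) {
                    if (!segment.isEmpty()) parts.add(sanitize(segment));
                }
            }
        }
        return String.join(".", parts);
    }

    /** Lower-cases, replaces illegal characters with '_', and guards a leading digit. */
    static String sanitize(String piece) {
        String id = piece.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9_]", "_");
        if (id.isEmpty() || Character.isDigit(id.charAt(0))) {
            id = "_" + id;
        }
        return id;
    }

    public static void main(String[] args) {
        // prints org.w3._2001.xmlschema
        System.out.println(toPackageName("http://www.w3.org/2001/XMLSchema"));
    }
}
```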

          Feedback is appreciated!

          Thanks,
          Mike

          Doug Cutting added a comment -

          I am unfamiliar with XML-Schema and probably not qualified to review this. Can someone who uses XML-Schema please have a look at this? It would be best to know that others think this would be useful before adding it to Avro. It might also be useful to add command-line tools that convert an Avro data file to an XML file and vice-versa, so that folks can easily see by example how it works.

          Michael Pigott added a comment -

          Sorry for the long delay. I added two tools: FromXmlTool.java and ToXmlTool.java. Unfortunately I cannot seem to test this locally; I cannot run the command

          mvn exec:java -Dexec:mainClass="org.apache.avro.tool.Main"

          without getting the error

          org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.4.0:java (default-cli) on project avro-toplevel: The parameters 'mainClass' for goal org.codehaus.mojo:exec-maven-plugin:1.4.0:java are missing or invalid
          

          Any help is appreciated.

          Ryan Blue added a comment -

          Michael Pigott, I think you want to use . instead of : for the main class argument: -Dexec.mainClass="..."

          Michael Pigott added a comment -

          Thanks Ryan Blue! That was one part, and I also needed to run mvn install in order to make all of the dependencies available.

          I attached the corrected AVRO-457.patch.

          Bram Biesbrouck added a comment -

          Please allow me to comment on this after having used Michael's project (from https://github.com/mikepigott/xml-to-avro) on the official (and fairly complex) ebucore.xsd schema version 1.6 (see https://tech.ebu.ch/MetadataEbuCore and https://www.ebu.ch/metadata/schemas/EBUCore/ebucore.zip)

          To me, from a developer point of view, the need for the tool Michael has written is very high; nearly all official ontologies release their versions using XML schema (XSD) files. Just like the XJC (and by extension the JAXB) project, it's important to have de facto standard projects to convert them to working memory models. Having a reliable XSD->AVSC converter would be awesome.

          I've played around with Michael's code and got it to successfully generate an avro schema from the ebucore.xsd file. However, I had to make a lot of modifications to the original file because not all standards are implemented in xml-to-avro (for one, elements with default, empty types crash the converter).

          After having tried four solutions:
          1) https://github.com/stealthly/xml-avro
          2) https://github.com/mikepigott/xml-to-avro
          3) https://github.com/nokia/Avro-Schema-Generator
          4) https://github.com/FasterXML/jackson-dataformat-avro

          I conclude that solution 1 is the best for now, because it works out of the box without modifications and generates a more type-safe schema (than Michael's converter), although for complex schemas like ebucore, double types are introduced (e.g., Double1, Double2, ...).

          All this to make a point: I, together with a lot of other developers, truly see the need for an official XSD->AVSC converter, so please consider it. I can help with testing, but I'm no XSD expert.
          You might want to contact the folks at https://github.com/stealthly/xml-avro

          bram

          Michael Pigott added a comment -

          Bram,
          Thank you for taking a look at this! I'm sorry you've had trouble with my implementation - I was unaware of the EBU family of XML documents when I was writing the code. My testing primarily focused on XBRL documents[1], which I'm sure is why you've had trouble. Unfortunately, I'm sure you noticed that this project has not gained a lot of traction since I proposed to work on it a year and a half ago, primarily due to a lack of XML / XML Schema experience among the project's contributors.
          That said, I'm happy to look at the schema you provided and see what I can do to correct the bugs you found in the coming weeks. (Unfortunately I have not looked at this code in a long time, and I do not even program in Java regularly anymore.) If you have more information you can provide to help me get started, feel free to file an issue on GitHub[2]!

          Regards,
          Mike

          [1] http://www.sec.gov/info/edgar/edgartaxonomies.shtml
          [2] https://github.com/mikepigott/xml-to-avro/issues

          Ryan Blue added a comment -

          Bram Biesbrouck, thanks for posting your summary of the current state of this. I agree with Michael's assessment that it isn't a lack of interest in having something like this, it is that we're not XSD experts either. That said, if we can get the right people together to collaborate around this, like Michael Pigott and the Stealth.ly team that put together option #1, then I can take care of the commit part. I don't think we all have to be experts if there's a portion of the community that is interested in looking at this, updating Michael's latest work, and helping us review to get it in.

          Bram Biesbrouck added a comment -

          The result of a first try to convert ebucore.xsd to a json schema

          Bram Biesbrouck added a comment -

          Hi Ryan Blue and Michael Pigott,

          I think I might have found a better approach to this...
          To parse XSD schemas, 99% of Java users use XJC to convert an XSD to POJOs. The results of this tool are very good, since it's a mature tool.
          Because it makes sense to reuse a common POJO codebase to (de)serialize to JSON/XML/AVRO, this might be a better start to investigate a robust XSD->AVRO parser. Also because raw XSD parsing/understanding is quite error prone.

          Fortunately, a lot of work has been done already. Take a look at this project.
          It generates a JSON Schema from a POJO class (and recursively all its members). The result is a JSON schema.
          Now the best part: the same developers also wrote this project that converts a JSON schema to an AVRO schema. However, the json->avro converter is not production ready yet. But it has a very nice codebase to start with. This class is a good entry point to its inner workings.

          I'm currently trying to find some time to work on it, but it's slow. I successfully managed to convert the EBUCore XSD schema to a JSON schema though. The next step (JSON->AVRO) is more difficult I'm afraid. Hence: do the AVRO developers have any experience with converting JSON schemas into (the more narrow) AVRO schema structure? Would be interesting to investigate in general because JSON validation is becoming more and more relevant these days.

          b.

          Ryan Blue added a comment -

          Bram Biesbrouck, it may be easier than going through a JSON schema. Avro's reflect support will analyze Java classes to produce Avro schemas and can also serialize instances of those classes. You may not need to use JSON at all, just try ReflectData.get().getSchema(MyXMLObject.class).

          Bram Biesbrouck added a comment -

          I wish it was that easy. When I run this code

          Schema avroSchema = ReflectData.get().getSchema(EbuCoreMainType.class);
          

          I get this exception:

          org.apache.avro.AvroTypeException: Unknown type: T

          because JAXBElement uses generic types and one of the members in the tree is:

          protected List<JAXBElement<Object>> audioContentIDRef;
          

          and generics don't seem to work well.

          Any suggestions to get around this?
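The failure described above can be reproduced in miniature with plain Java reflection. The sketch below uses hypothetical stand-ins (Wrapper for JAXBElement, Programme for the generated class), since JAXB is not assumed to be on the classpath. It shows that a field declared as T inside a generic class is reported as a type variable named T, not as a concrete class, which is plausibly what a reflection-based schema builder trips over when it descends into the wrapper's own fields.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.lang.reflect.TypeVariable;
import java.util.List;

/** Illustrative probe of what reflection reports for generic fields. */
final class GenericFieldProbe {
    /** Hypothetical stand-in for JAXBElement<T>: its field has the type variable T. */
    static class Wrapper<T> {
        protected T value;
    }

    /** Hypothetical stand-in for the generated class holding a list of wrappers. */
    static class Programme {
        protected List<Wrapper<Object>> audioContentIDRef;
    }

    /** Returns the generic type of a declared field, as a reflective walker would see it. */
    static Type fieldType(Class<?> owner, String fieldName) {
        try {
            return owner.getDeclaredField(fieldName).getGenericType();
        } catch (NoSuchFieldException e) {
            throw new IllegalStateException(e);
        }
    }

    static String describe(Type type) {
        if (type instanceof TypeVariable) {
            return "type variable " + ((TypeVariable<?>) type).getName();
        }
        if (type instanceof ParameterizedType) {
            return "parameterized type " + type.getTypeName();
        }
        return "class " + type.getTypeName();
    }

    public static void main(String[] args) {
        // The list field keeps its type arguments...
        System.out.println(describe(fieldType(Programme.class, "audioContentIDRef")));
        // ...but the wrapper's own field is only known as T.
        System.out.println(describe(fieldType(Wrapper.class, "value"))); // prints type variable T
    }
}
```

A workaround would need to carry the type argument (here Object) down into the wrapper while walking its fields, or avoid the wrapper class altogether.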

          b.

          Ryan Blue added a comment -

          Is that generic allowed in your original XSD or is that introduced when you convert to JAXB objects? If it is the latter, then I think we would have to get around that with a direct conversion to avoid losing what the type contained in that list is.

          Bram Biesbrouck added a comment -

          Hi Ryan,

          (sorry for the delay)

          The relevant pieces in the XSD schema are the following:

          <complexType name="audioProgrammeType">
              <sequence>
                  <element name="audioContentIDRef" type="IDREF" minOccurs="0" maxOccurs="unbounded">
                      <annotation>
                          <documentation>A list of reference to audioContents, each defining one component
                              of an audioProgramme (e.g. background music), its association with an
                              audioPack (e.g. a 2.0 audioPack of audioChannels for stereo reproduction),
                              its association with a audioStream, and its set of loudness parameters.
                              Notice that loudness values of a program are dependent of the associated
                              audioPack mixReproductionFormat.
                          </documentation>
                      </annotation>
                  </element>
                  <element name="loudnessMetadata" type="ebucore:loudnessMetadataType" minOccurs="0">
                      <annotation>
                          <documentation>A set of loudness parameters proper to the audio content of the
                              whole programme.
                          </documentation>
                      </annotation>
                  </element>
                  ...
              </sequence>
          </complexType>
          

          I'm no JAXB (nor XSD) expert, but I assume JAXB doesn't really know what to do with the IDREF type, and just defaults to using an Object type. What do you think?

          Ryan Blue added a comment -

          Sounds reasonable to me. I don't know what an IDREF is either. What do you think the output should be instead?
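One possible answer, offered only as an assumption and not something settled in this thread: in XML Schema, an IDREF value is lexically a name that must match some ID in the same document, so the converted data could keep each reference as a plain string and leave resolution to the reader. The sketch below shows that resolution step with hypothetical names and made-up sample IDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: resolves IDREF strings against an index of ID-bearing objects. */
final class IdRefResolver {

    /** Returns the referenced targets in order, failing on a reference with no matching ID. */
    static <T> List<T> resolve(Map<String, T> byId, List<String> idRefs) {
        List<T> targets = new ArrayList<>();
        for (String ref : idRefs) {
            T target = byId.get(ref);
            if (target == null) {
                throw new IllegalArgumentException("Dangling IDREF: " + ref);
            }
            targets.add(target);
        }
        return targets;
    }

    public static void main(String[] args) {
        // Made-up sample data: two objects indexed by their ID attribute.
        Map<String, String> audioContents = new LinkedHashMap<>();
        audioContents.put("ac1", "background music");
        audioContents.put("ac2", "dialogue");
        System.out.println(resolve(audioContents, Arrays.asList("ac2", "ac1")));
    }
}
```

Under this approach the list field from the schema excerpt would become an array of strings on the Avro side, rather than an array of untyped objects.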


            People

            • Assignee:
              Unassigned
            • Reporter:
              Doug Cutting
            • Votes:
              1
            • Watchers:
              11