Avro
  1. Avro
  2. AVRO-1188

External Schema Imports via AVSC Schema

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7.3
    • Component/s: java
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Tags:
      Schema, Import, Imports, Importing, Types, Reuse, Share

      Description

      There is no way for ".avsc" schema files to import types (i.e records, enums, etc) in external schema files. There's tremendous benefit in being able to do this as it would allow the sharing of common types between multiple schema files. Here's a use case that illustrates the typical usecase of this feature request.

      Suppose we have an enum called "Privacy" that we would like to share between multiple schemas:

      //privacy.avsc
      { 
        "type": "enum",
        "name": "Privacy",
        "symbols" : ["Public", "Private"]
      }
      

      Now, if this feature was implemented one could import the above type into other schema files by doing something like this:

      //the post.avsc 
      {
        "type": "record", 
        "name": "Post",
        "imports": ["privacy.avsc"] //you can import one or more schemas
        "fields": [{
                    "name": "privacy", 
                    "type": [ "null", "Privacy"], //use imported Privacy type
                    "default": "Private"
                  }
        ]
      }
      

      Here's another schema file that also has a similar privacy concern:

      //the event.avsc is another schema that also imports the privacy type
      {
        "type": "record", 
        "name": "Event",
        "imports": ["privacy.avsc"] //it also imports the privacy schemas
        "fields": [{
                    "name": "privacy", 
                    "type": [ "null", "Privacy"], //use imported Privacy type
                    "default": "Public"
                  }
        ]
      }
      

      IDL files are able to import external schemas and protocols and it would be very beneficial if schema files could import other schema files.

      1. Avro-1188.tar.gz
        4 kB
        Sharmarke Aden
      2. vcs-diff6739872835137179667.patch
        5 kB
        Sharmarke Aden
      3. vcs-diff1160361655737792386.patch
        5 kB
        Sharmarke Aden
      4. vcs-diff2916139350460140957.patch
        5 kB
        Sharmarke Aden
      5. Avro-1188.tar.gz
        20 kB
        Sharmarke Aden
      6. vcs-diff4277815358664835838.patch
        5 kB
        Sharmarke Aden

        Issue Links

          Activity

          Sharmarke Aden created issue -
          Hide
          Doug Cutting added a comment -

          In many cases we require that schemas are standalone, that they do not have external references. For example, the schema included in a data file or transmitted to a remote server in RPC must be standalone. For this reason we've generally suggested that folks use some sort of pre-processor when they need to include schemas. The IDL compiler is one such pre-processor but one might also reasonably use cpp, m4 or some other pre-processor for this.

          Note however that the Java schema parser remembers all named schemas previously parsed by the same schema. Thus if one first parses your privacy.avsc above then post.avsc and/or event.avsc with the same parser, then the effect you desire is achieved.

          http://avro.apache.org/docs/current/api/java/org/apache/avro/Schema.Parser.html

          This facility is used, e.g., by the command-line schema compiler. Schemas are processed in the order listed on the command line so that later schemas may depend on earlier schemas. AVRO-983 changes Avro's Maven plugin so that schemas are processed by a single parser in directory-traversal order, permitting dependencies.

          Would either of these features work for you? If not, can you tell more about your use case?

          Show
          Doug Cutting added a comment - In many cases we require that schemas are standalone, that they do not have external references. For example, the schema included in a data file or transmitted to a remote server in RPC must be standalone. For this reason we've generally suggested that folks use some sort of pre-processor when they need to include schemas. The IDL compiler is one such pre-processor but one might also reasonably use cpp, m4 or some other pre-processor for this. Note however that the Java schema parser remembers all named schemas previously parsed by the same schema. Thus if one first parses your privacy.avsc above then post.avsc and/or event.avsc with the same parser, then the effect you desire is achieved. http://avro.apache.org/docs/current/api/java/org/apache/avro/Schema.Parser.html This facility is used, e.g., by the command-line schema compiler. Schemas are processed in the order listed on the command line so that later schemas may depend on earlier schemas. AVRO-983 changes Avro's Maven plugin so that schemas are processed by a single parser in directory-traversal order, permitting dependencies. Would either of these features work for you? If not, can you tell more about your use case?
          Hide
          Sharmarke Aden added a comment -

          Looking at AVRO-983 it seems that this issue is somewhat of a duplicate issue. I searched for a similar issue but didn't have luck finding it. Yes, utilization of the order technique will work for us though I wonder if there should be better way of schema imports based on previously loaded/compiled schemas without having to rely on the order of files/directorys. What do you think of the idea of having an explicit imports directory that's passed via command-line/maven which is processed/compiled before any other schemas?

          In maven one would define a configuration property called <importsDirectory/> that points to a directory containing schemas that should be processed first. Perhaps this directory could default to "src/main/avro/imports" which is treated as a special reserved directory by the maven plugin. The maven plugin declaration would look like this:

          <plugin>
          <groupId>org.apache.avro</groupId>
          <artifactId>avro-maven-plugin</artifactId>
          <executions>
              <execution>
          	<goals>
          	    <goal>schema</goal>
          	</goals>
              </execution>
          </executions>
          <configuration>
              <importsDirectory>${basedir}/src/main/avro/imports</importsDirectory>
          </configuration>
          </plugin>
          

          I am not familiar with the command-line tool but I would imagine a command-line option called "-imports" can be added that allows a user to specify an import directory that gets processed before other schemas.

          java -jar avro-tools-XXX.jar -imports "/schema/import/directory" ...
          
          Show
          Sharmarke Aden added a comment - Looking at AVRO-983 it seems that this issue is somewhat of a duplicate issue. I searched for a similar issue but didn't have luck finding it. Yes, utilization of the order technique will work for us though I wonder if there should be better way of schema imports based on previously loaded/compiled schemas without having to rely on the order of files/directorys. What do you think of the idea of having an explicit imports directory that's passed via command-line/maven which is processed/compiled before any other schemas? In maven one would define a configuration property called <importsDirectory/> that points to a directory containing schemas that should be processed first. Perhaps this directory could default to "src/main/avro/imports" which is treated as a special reserved directory by the maven plugin. The maven plugin declaration would look like this: <plugin> <groupId>org.apache.avro</groupId> <artifactId>avro-maven-plugin</artifactId> <executions> <execution> <goals> <goal>schema</goal> </goals> </execution> </executions> <configuration> <importsDirectory>${basedir}/src/main/avro/imports</importsDirectory> </configuration> </plugin> I am not familiar with the command-line tool but I would imagine a command-line option called "-imports" can be added that allows a user to specify an import directory that gets processed before other schemas. java -jar avro-tools-XXX.jar -imports "/schema/ import /directory" ...
          Hide
          Doug Cutting added a comment -

          > What do you think of the idea of having an explicit imports directory that's passed via command-line/maven which is processed/compiled before any other schemas?

          I think that would be a good approach.

          Show
          Doug Cutting added a comment - > What do you think of the idea of having an explicit imports directory that's passed via command-line/maven which is processed/compiled before any other schemas? I think that would be a good approach.
          Doug Cutting made changes -
          Field Original Value New Value
          Link This issue is related to AVRO-983 [ AVRO-983 ]
          Tom White made changes -
          Link This issue relates to AVRO-983 [ AVRO-983 ]
          Sharmarke Aden made changes -
          Description There is no way for ".avsc" schema files to import types (i.e records, enums, etc) in external schema files. There's tremendous benefit in being able to do this as it would all for the sharing of common types between multiple schema files. Here's a use case that illustrates the typical usecase of this feature request.


          Suppose we have an enum called "Privacy" that we would like to share between multiple schemas:

          {code}
          //privacy.avsc
          {
            "type": "enum",
            "name": "Privacy",
            "symbols" : ["Public", "Private"]
          }
          {code}

          Now, if this feature was implemented one could import the above type into other schema files by doing something like this:

          {code}
          //the post.avsc
          {
            "type": "record",
            "name": "Post",
            "imports": ["privacy.avsc"] //you can import one or more schemas
            "fields": [{
                        "name": "privacy",
                        "type": [ "null", "Privacy"], //use imported Privacy type
                        "default": "Private"
                      }
            ]
          }
          {code}

          Here's another schema file that also has a similar privacy concern:

          {code}
          //the event.avsc is another schema that also imports the privacy type
          {
            "type": "record",
            "name": "Event",
            "imports": ["privacy.avsc"] //it also imports the privacy schemas
            "fields": [{
                        "name": "privacy",
                        "type": [ "null", "Privacy"], //use imported Privacy type
                        "default": "Public"
                      }
            ]
          }
          {code}


          IDL files are able to import external schemas and protocols and it would be very beneficial if schema files could import other schema files.
          There is no way for ".avsc" schema files to import types (i.e records, enums, etc) in external schema files. There's tremendous benefit in being able to do this as it would allow the sharing of common types between multiple schema files. Here's a use case that illustrates the typical usecase of this feature request.


          Suppose we have an enum called "Privacy" that we would like to share between multiple schemas:

          {code}
          //privacy.avsc
          {
            "type": "enum",
            "name": "Privacy",
            "symbols" : ["Public", "Private"]
          }
          {code}

          Now, if this feature was implemented one could import the above type into other schema files by doing something like this:

          {code}
          //the post.avsc
          {
            "type": "record",
            "name": "Post",
            "imports": ["privacy.avsc"] //you can import one or more schemas
            "fields": [{
                        "name": "privacy",
                        "type": [ "null", "Privacy"], //use imported Privacy type
                        "default": "Private"
                      }
            ]
          }
          {code}

          Here's another schema file that also has a similar privacy concern:

          {code}
          //the event.avsc is another schema that also imports the privacy type
          {
            "type": "record",
            "name": "Event",
            "imports": ["privacy.avsc"] //it also imports the privacy schemas
            "fields": [{
                        "name": "privacy",
                        "type": [ "null", "Privacy"], //use imported Privacy type
                        "default": "Public"
                      }
            ]
          }
          {code}


          IDL files are able to import external schemas and protocols and it would be very beneficial if schema files could import other schema files.
          Sharmarke Aden made changes -
          Attachment vcs-diff4277815358664835838.patch [ 12553042 ]
          Hide
          Sharmarke Aden added a comment -

          This patch adds the ability to import directories or files to the maven plugin

          Show
          Sharmarke Aden added a comment - This patch adds the ability to import directories or files to the maven plugin
          Hide
          Sharmarke Aden added a comment -

          Here is a simple project that demonstrates the import feature.

          Show
          Sharmarke Aden added a comment - Here is a simple project that demonstrates the import feature.
          Sharmarke Aden made changes -
          Attachment Avro-1188.tar.gz [ 12553043 ]
          Sharmarke Aden made changes -
          Attachment vcs-diff2916139350460140957.patch [ 12553051 ]
          Hide
          Sharmarke Aden added a comment -

          Updated patch which gaurds against NPE if importedFiles is not defined

          Show
          Sharmarke Aden added a comment - Updated patch which gaurds against NPE if importedFiles is not defined
          Hide
          Sharmarke Aden added a comment -

          This patch also displays the same symptoms as AVRO-983 patch. Having a single Schema.Parser causes the IPC module to fail with "Can't redefine: org.apache.avro.ipc.MD5"

          Show
          Sharmarke Aden added a comment - This patch also displays the same symptoms as AVRO-983 patch. Having a single Schema.Parser causes the IPC module to fail with "Can't redefine: org.apache.avro.ipc.MD5"
          Sharmarke Aden made changes -
          Attachment vcs-diff1160361655737792386.patch [ 12553104 ]
          Hide
          Sharmarke Aden added a comment -

          This patch makes everything hunky-dory and backward compatible by checking to see if the user opted in to the importing of files. If user specified importedFiles then a single schema parser is used otherwise we do things the old way and create a new schema parser every time.

          Show
          Sharmarke Aden added a comment - This patch makes everything hunky-dory and backward compatible by checking to see if the user opted in to the importing of files. If user specified importedFiles then a single schema parser is used otherwise we do things the old way and create a new schema parser every time.
          Hide
          Tom White added a comment -

          Thanks for the patch, Sharmarke. I tried using the patch with a small project I was testing AVRO-983 with and it worked fine. A few comments on the patch:

          • Using a static Schema.Parser could potentially cause problems if the plugin is used more than once in a Maven project, since only one parser instance is used and it would retain references to earlier schemas. Instead, make it a non-static instance variable.
          • If importing files then project.addCompileSourceRoot is called twice with the same output directory. I'm not sure if this is a problem, but it might be better to call it just once.
          • Nit: declaration of "File file" can be moved to the line where the object is first instantiated. This occurs in two places in the patch.
          Show
          Tom White added a comment - Thanks for the patch, Sharmarke. I tried using the patch with a small project I was testing AVRO-983 with and it worked fine. A few comments on the patch: Using a static Schema.Parser could potentially cause problems if the plugin is used more than once in a Maven project, since only one parser instance is used and it would retain references to earlier schemas. Instead, make it a non-static instance variable. If importing files then project.addCompileSourceRoot is called twice with the same output directory. I'm not sure if this is a problem, but it might be better to call it just once. Nit: declaration of "File file" can be moved to the line where the object is first instantiated. This occurs in two places in the patch.
          Tom White made changes -
          Assignee Sharmarke Aden [ saden1 ]
          Sharmarke Aden made changes -
          Attachment vcs-diff6739872835137179667.patch [ 12553130 ]
          Hide
          Sharmarke Aden added a comment -

          Updated the patch based on Tom's valuable feedback. Please note that I also changed the plugin coUpdated the patch based on Tom's feedback. Please note that I also changed the plugin "importedFiles" configuration to just "imports"

          Show
          Sharmarke Aden added a comment - Updated the patch based on Tom's valuable feedback. Please note that I also changed the plugin coUpdated the patch based on Tom's feedback. Please note that I also changed the plugin "importedFiles" configuration to just "imports"
          Hide
          Sharmarke Aden added a comment -

          Here is an updated project that demonstrates usage of the import functionality. The plugin deceleration essentially looks like this:

          <plugin>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-maven-plugin</artifactId>
            <executions>
              <execution>
                <goals>
                  <goal>schema</goal>
                </goals>
              </execution>
            </executions>
            <configuration>
              <imports>
                <!-- import a directory -->
                <import>${basedir}/src/main/avro/imports</import>
                <!-- import a file -->
                <import>${basedir}/src/main/avro/directImport/PrivacyDirectImport.avsc</import>
              </imports>
            </configuration>
          </plugin>
          
          Show
          Sharmarke Aden added a comment - Here is an updated project that demonstrates usage of the import functionality. The plugin deceleration essentially looks like this: <plugin> <groupId>org.apache.avro</groupId> <artifactId>avro-maven-plugin</artifactId> <executions> <execution> <goals> <goal>schema</goal> </goals> </execution> </executions> <configuration> <imports> <!-- import a directory --> < import >${basedir}/src/main/avro/imports</ import > <!-- import a file --> < import >${basedir}/src/main/avro/directImport/PrivacyDirectImport.avsc</ import > </imports> </configuration> </plugin>
          Sharmarke Aden made changes -
          Attachment Avro-1188.tar.gz [ 12553131 ]
          Hide
          Tom White added a comment -

          +1 I just committed this. Thanks, Sharmarke!

          Show
          Tom White added a comment - +1 I just committed this. Thanks, Sharmarke!
          Tom White made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Hadoop Flags Reviewed [ 10343 ]
          Fix Version/s 1.7.3 [ 12323298 ]
          Resolution Fixed [ 1 ]
          Hide
          Sharmarke Aden added a comment -

          My pleasure. Thanks you guys for your guidance.

          Show
          Sharmarke Aden added a comment - My pleasure. Thanks you guys for your guidance.
          Hide
          Doug Cutting added a comment -

          Shouldn't we add a test? I didn't see one in the commit.

          Show
          Doug Cutting added a comment - Shouldn't we add a test? I didn't see one in the commit.
          Hide
          Sharmarke Aden added a comment -

          I would very much like to add unit tests but I'm not sure where they should be added.

          Show
          Sharmarke Aden added a comment - I would very much like to add unit tests but I'm not sure where they should be added.
          Hide
          Doug Cutting added a comment -

          We should perhaps add directories lang/java/maven-plugin/src/test/avro/

          {input,output,include}

          then write a test that's like TestIdl.java and TestSpecificCompilerTool.java, which compiles schemas in the input directory, specifying the include directory, and compares the output to what's in the output directory.

          Show
          Doug Cutting added a comment - We should perhaps add directories lang/java/maven-plugin/src/test/avro/ {input,output,include} then write a test that's like TestIdl.java and TestSpecificCompilerTool.java, which compiles schemas in the input directory, specifying the include directory, and compares the output to what's in the output directory.
          Hide
          Tom White added a comment -

          Yes, we should have a test for this - I was a bit hasty in my commit, sorry! I didn't think there were any tests for the avro-maven-plugin, but in retrospect I realise that the ipc and mapred modules are using it, so they are tests of sorts for the plugin already. Sharmarke, are you able take a look at this?

          Show
          Tom White added a comment - Yes, we should have a test for this - I was a bit hasty in my commit, sorry! I didn't think there were any tests for the avro-maven-plugin, but in retrospect I realise that the ipc and mapred modules are using it, so they are tests of sorts for the plugin already. Sharmarke, are you able take a look at this?
          Hide
          Sharmarke Aden added a comment -

          Yes, I will take a look at it.

          On a side note, is there any reason why there aren't any tests for the plugin Mojos?

          Show
          Sharmarke Aden added a comment - Yes, I will take a look at it. On a side note, is there any reason why there aren't any tests for the plugin Mojos?
          Hide
          Doug Cutting added a comment -

          > is there any reason why there aren't any tests for the plugin Mojos?

          No good reason. They were implemented as a part of the conversion from Ant to Maven and no tests were added then.

          Show
          Doug Cutting added a comment - > is there any reason why there aren't any tests for the plugin Mojos? No good reason. They were implemented as a part of the conversion from Ant to Maven and no tests were added then.
          Hide
          Sharmarke Aden added a comment -

          I have created AVRO-1207 and submitted a patch for it. It adds tests for all the mojos and it also tests schema imports.

          Show
          Sharmarke Aden added a comment - I have created AVRO-1207 and submitted a patch for it. It adds tests for all the mojos and it also tests schema imports.
          Doug Cutting made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Sharmarke Aden
              Reporter:
              Sharmarke Aden
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development