Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7.7
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Currently, Avro does not seem to support a DECIMAL type or equivalent.

      http://avro.apache.org/docs/1.7.5/spec.html#schema_primitive

      Adding DECIMAL support would be particularly interesting when converting types from Avro to Hive, since DECIMAL is already a supported data type in Hive (0.11.0).

      Attachments

      1. AVRO-1402.patch
        3 kB
        Doug Cutting
      2. AVRO-1402.patch
        13 kB
        Doug Cutting
      3. AVRO-1402.patch
        14 kB
        Tom White
      4. AVRO-1402.patch
        11 kB
        Tom White
      5. AVRO-1402.patch
        13 kB
        Tom White
      6. AVRO-1402.patch
        7 kB
        Tom White
      7. AVRO-1402-logical-type-spec.patch
        2 kB
        Tom White
      8. AVRO-1402-logical-type-spec-2.patch
        3 kB
        Ryan Blue
      9. AVRO-1402-logical-type-spec-3.patch
        3 kB
        Ryan Blue
      10. AVRO-1402-logical-type-spec-4.patch
        3 kB
        Ryan Blue
      11. AVRO-1402-logical-type-spec-5.patch
        4 kB
        Ryan Blue
      12. AVRO-1402-logical-type-spec-6.patch
        14 kB
        Ryan Blue
      13. AVRO-1402-logical-type-spec-7.patch
        29 kB
        Ryan Blue
      14. UnixEpochRecordMapping.patch
        3 kB
        Tom White

          Activity

          Tom White added a comment -

          Following Doug's subType suggestion on AVRO-739 I wrote a patch to support a decimal type.

          The schema declaration looks like this:

          {"type":"bytes", "subType":"decimal"}

          The encoding is an int to represent the scale followed by a byte array containing the unscaled integer, in the (language-neutral) format described here: http://docs.oracle.com/javase/6/docs/api/java/math/BigInteger.html#toByteArray%28%29.
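
          For illustration, here is a minimal Java sketch (an editorial example, not from the patch) of how the scale/unscaled-integer pair described above can be obtained from and restored to a BigDecimal:

          import java.math.BigDecimal;
          import java.math.BigInteger;

          public class DecimalEncodingSketch {
            public static void main(String[] args) {
              BigDecimal value = new BigDecimal("12.45");
              int scale = value.scale();                              // 2, written first as an Avro int
              byte[] unscaled = value.unscaledValue().toByteArray();  // big-endian two's-complement bytes of 1245

              // Reading reverses the steps: scale + unscaled bytes -> BigDecimal
              BigDecimal roundTrip = new BigDecimal(new BigInteger(unscaled), scale);
              System.out.println(roundTrip); // 12.45
            }
          }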

          One thing to notice is that in this patch the type does not define the precision and scale as a part of the type. This means that there is no restriction in Avro on the decimal that may be written. Instead, the burden of limiting the precision and scale falls on the application. Hive, for example, already has logic for ensuring that the precision and scale of a decimal value are consistent with the precision and scale values set as a part of the type definition for that decimal column. (There is more discussion on this point on HIVE-3976, and in particular in the functional spec.)

          It would be good if we could get some agreement on i) whether this is a good approach for adding new (optional) types to Avro and ii) what the binary encoding should look like. Thoughts?

          In this initial patch I added the decimal subtype to GenericDatumWriter/Reader. I did this since Hive uses generic, but there is a potential compatibility issue where code using a GenericDatumReader receives a BigDecimal instead of a ByteBuffer when reading a new schema with a decimal subtype. Any thoughts on how to tackle this would be gratefully received too.

          Doug Cutting added a comment -

          For greatest interoperability we might use strings instead of bytes, with scientific notation. This would be bigger and slower to parse but would also greatly simplify implementation and usability. For example, so far as I can tell, Python's decimal facility does not support a binary constructor, so we'd need to (carefully) implement our own there. C# has a binary constructor, but it's an array of ints, not bytes. And so on.

          Also:

          • GenericData#isString() or isBytes() should handle subtypes too, so unions will work;
          • GenericData#induce() should infer decimal subtypes;
          • ReflectData#createSchema should infer decimal subtypes.

          With regard to the incompatibility, I don't see how this breaks an application that currently works. Another part of the application would need to be updated to generate data in the new format, so the Hive part would need to be updated to handle that change, not because of Avro changes alone. The Avro feature would permit an incompatible change, but wouldn't create it. Or am I missing something?

          Jarek Jarcec Cecho added a comment -

          Thank you guys for working on this one! I'm following this JIRA as I would love to see the ability to represent decimals in Avro in a way that works across all the processing engines in Hadoop. I have two small observations that I would like to share.

          1) It seems that the proposal is to serialize the scale with every record (row). I would like to mention that, from a database perspective, the type of a column always carries a scale and precision. E.g. the type that Hive or other SQL engines on top of Hadoop store in the file will always be "decimal(5,2)" and never just "decimal". Databases do not allow different rows of the same column to have different scale or precision, so encoding the scale with every record carries a lot of redundant information for this use case. I fully understand that Avro, being a generic format, might want to enable it, but I wanted to point it out explicitly.

          2) I can see that we are serializing only the scale, and not the precision, to disk. I understand that from a storage perspective this is fully sufficient; however, I see a possible problem with follow-up processing. I believe that in order to execute the math in the same way in all the processing engines, one needs to know the precision as well. This does not seem to be a problem for projects that have a separate service keeping the metadata (such as Hive and its metastore), but I assume it can become an issue for projects that don't have that ability (such as Pig). Hence I'm wondering if it would make sense to store the precision in the Avro metadata as well?

          Sean Busbey added a comment -

          re: #1, what are the use cases we enable by setting precision and scale on a per-record basis rather than setting it in the schema?

          Tom White added a comment -

          > (Doug) For greatest interoperability we might use strings instead of bytes, with scientific notation.

          Hive would probably prefer a binary representation for performance, but perhaps others can comment on that.

          Thanks for the pointers on supporting subtypes more fully - I can create another patch for those. Regarding the incompatibility, I was thinking of an application that processes Avro files that it didn't generate - if it upgraded to the new version of Avro and read a file with a decimal subtype it would receive a BigDecimal when it was only expecting a ByteBuffer.

          > (Jarcec) Databases do not allows different rows for the same column to have different scale or precision.

          The scale and precision specified in the column definition are maximums. My understanding is that decimals with smaller scale and precision may be stored in such columns. For example, in Hive, if I have a DECIMAL(5,3) column then 12.4 is stored as [124, 1] as the unscaled-integer/scale pair (so precision 3, scale 1), and not as [12400, 3]. So in effect there are per-record precision and scale values, unless I'm missing something.
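
          To illustrate the unscaled/scale pair above with plain BigDecimal (an editorial example, independent of Hive's actual serialization code):

          import java.math.BigDecimal;

          public class UnscaledScalePairExample {
            public static void main(String[] args) {
              BigDecimal value = new BigDecimal("12.4");   // a value stored in a DECIMAL(5,3) column
              System.out.println(value.unscaledValue());   // 124
              System.out.println(value.scale());           // 1, not the column's declared scale of 3

              // Forcing the declared scale would store [12400, 3] instead:
              BigDecimal rescaled = value.setScale(3);
              System.out.println(rescaled.unscaledValue() + ", " + rescaled.scale()); // 12400, 3
            }
          }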

          > (Jarcec) I'm wondering if it would make sense to store the precision in the Avro metadata as well?

          Applications could set scale and precision attributes as metadata to describe the (maximum) scale and precision of the decimals written to that field.

          Would that be sufficient? I'm not sure if we want to define and implement a policy for how to handle decimals that don't fit a max precision/scale in Avro.

          Doug Cutting added a comment -

          > Hive would probably prefer a binary representation for performance [ ... ]

          It might be useful to quantify the performance difference, perhaps by benchmarking writing and reading a snappy-compressed file that contains a decimal field represented either as bytes or as a string.

          A faster alternative to a subtype might be to use a record, e.g.:

          {"type":"record","org.apache.avro.Decimal","fields":[
            {"name":"scale","type":"int"},
            {"name":"value","type":"bytes"}
          ]}
          

          If we still changed GenericData to implement this directly then there would be no overhead and implementation would be easier & faster, since it wouldn't need a temporary buffer. It wouldn't be very useful to implementations that don't yet know about it, but neither would the binary subtype. We could add this type to the specification as something that implementations might optimize, just like a subtype. So this might be something to benchmark too.

          > if it upgraded to the new version of Avro and read a file with a decimal subtype it would receive a BigDecimal when it was only expecting a ByteBuffer.

          Today if an application using specific or reflect uses BigDecimal then it will be read as BigDecimal, since that's currently encoded using the schema:

          {"type":"string", "java-class":"java.math.BigDecimal"}

          So the schema would change when they upgrade, but the object would not. That seems compatible to me. You?

          If the application is using Generic to write, then BigDecimal will currently fail.

          I assume that existing applications are not currently using "subType":"decimal", so no application should start receiving BigDecimal that wasn't before. If the write path is upgraded before the read path then the application will start seeing bytes where before it saw either BigDecimal or nothing. This is a potential compatibility problem, but not the one you seem to describe.

          Ryan Blue added a comment -

          i) whether this is a good approach for adding new (optional) types to Avro

          +1 for the sub-types. I think that's a good way to provide backward-compat. My only reservation is that calling it a "subType" makes me think there is an "is a" relationship. What about calling it a "logicalType"?

          For existing applications, isn't compatibility covered by the read schema? If an application isn't setting the read schema and a file is written with a changed schema, then there isn't a strong compatibility guarantee. But setting a read schema that doesn't include the subtype annotation should guarantee compatibility.

          What are the use cases we enable by setting precision and scale on a per-record basis rather than setting it in the schema?

          The one I came up with is storing decimals with the right number of significant figures. The example on Wikipedia for multiplication is 8.02 * 8.02 = 64.3 (p=3, s=1). BigDecimal will carry out the exact multiplication and produce 64.3204 (p=6, s=4), and the fixed-scale representation will produce 64.32 (p=4, s=2) with one extra, non-significant figure. But I'm not sure that this is actually something people would (or do) use.
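
          For the record, the arithmetic above in plain Java (an editorial example, no Avro involved):

          import java.math.BigDecimal;
          import java.math.RoundingMode;

          public class SignificantFiguresExample {
            public static void main(String[] args) {
              BigDecimal measurement = new BigDecimal("8.02");
              BigDecimal exact = measurement.multiply(measurement);
              System.out.println(exact);                                    // 64.3204 (p=6, s=4)
              System.out.println(exact.setScale(2, RoundingMode.HALF_UP));  // 64.32 (p=4, s=2)
              // With three significant figures the result should be 64.3, so both
              // representations carry at least one non-significant digit.
            }
          }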

          smaller scale and precision may be stored in such columns

          If it is the case that smaller scales can be stored, then I think we are forced into storing the scale as data.

          Ryan Blue added a comment -

          From looking around at other database docs, it looks like most use a fixed scale (or don't specify), and SQL Server even states that different scales are considered different types. But in Postgres, the scale is the maximum scale for all values in the column, which may make a case for per-value scales. I think there is enough precedent behind fixing the scale for a column that it's reasonable to have a type that fixes the scale like this:

          {"type":"bytes", "logicalType": "decimal", "scale": 2}

          Then we could include in the specification how to determine the scale from data if it is missing.

          Doug Cutting added a comment -

          > If an application isn't setting the read schema and a file is written with a changed schema, then there isn't a strong compatibility guarantee.

          In this case, for Java applications that use BigDecimal, upgrading Avro alone will change the schema. So the writing application's schema may change without any other change to the application. That alone may be a compatibility problem.

          BTW, we may also need to remove much of the stringable handling for BigDecimal in SpecificData. We don't want both isString() and isBytes() to be true for the same object. However we do still need to be able to read instances of the old "java-class" schema for BigDecimal, and perhaps even continue to write these. Ugh.

          I wonder if we instead should use something like serializers, e.g.,:

          interface TypeMapping<T> {
            boolean isInstance(Object o);
            Schema getSchema();
            void write(T instance, Schema schema, Encoder encoder) throws IOException;
            T read(Schema schema, Decoder decoder) throws IOException;
          }
          public class DecimalMapping implements TypeMapping<BigDecimal> {
            private int scale;
            public DecimalMapping(int scale) { this.scale = scale; }
            ...
          }
          

          To opt into using this, you'd do something like:

          GenericData data = new GenericData();
          data.addTypeMapping(new DecimalMapping(10));
          ...
          

          An application would need to update both its write and read side to start using this. That way each time we add a feature like this we won't need to make an incompatible release.
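
          For illustration only, a DecimalMapping filling in the TypeMapping sketch above might look like this; it assumes the int-scale-then-bytes encoding discussed earlier and uses Avro's existing Encoder/Decoder and Schema APIs (TypeMapping itself is just the proposal above, not an existing Avro interface):

          import java.io.IOException;
          import java.math.BigDecimal;
          import java.math.BigInteger;
          import java.nio.ByteBuffer;
          import org.apache.avro.Schema;
          import org.apache.avro.io.Decoder;
          import org.apache.avro.io.Encoder;

          public class DecimalMapping implements TypeMapping<BigDecimal> {
            private final int scale;  // could serve as a declared/maximum scale

            public DecimalMapping(int scale) { this.scale = scale; }

            public boolean isInstance(Object o) { return o instanceof BigDecimal; }

            public Schema getSchema() {
              // The org.apache.avro.Decimal record proposed earlier in this thread.
              return new Schema.Parser().parse(
                  "{\"type\":\"record\",\"name\":\"org.apache.avro.Decimal\",\"fields\":["
                  + "{\"name\":\"scale\",\"type\":\"int\"},"
                  + "{\"name\":\"value\",\"type\":\"bytes\"}]}");
            }

            public void write(BigDecimal datum, Schema schema, Encoder encoder) throws IOException {
              encoder.writeInt(datum.scale());
              encoder.writeBytes(ByteBuffer.wrap(datum.unscaledValue().toByteArray()));
            }

            public BigDecimal read(Schema schema, Decoder decoder) throws IOException {
              int recordScale = decoder.readInt();
              ByteBuffer buffer = decoder.readBytes(null);
              byte[] unscaled = new byte[buffer.remaining()];
              buffer.get(unscaled);
              return new BigDecimal(new BigInteger(unscaled), recordScale);
            }
          }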

          Tom White added a comment -

          > I wonder if we instead should use something like serializers

          How about using the existing CustomEncoding class?

          GenericData data = new GenericData();
          data.addCustomEncoding(new DecimalEncoding(5, 3));
          ...
          
          Doug Cutting added a comment -

          CustomEncoding is close. It does not have an isInstance() method, since it's attached to a class via the @AvroEncode annotation. Mapping from class to encoding is more efficient than via an isInstance() predicate, so we might instead have a getInstanceClass() method. (This wouldn't let us use the same mechanism for a different generated record implementation, but that's probably okay.) GenericData could have a Map<Class,TypeMapping>. CustomEncoding might implement TypeMapping and be implemented in terms of the facility added to GenericData?

          Have you considered using a lexicographic-friendly format?

          http://www.zanopha.com/docs/elen.pdf

          That could help if these are ever, e.g., used as MapReduce keys. Perhaps not a priority.

          Tom White added a comment -

          Here's a patch that does roughly what you suggest. I used a name rather than a class, so that the reader can use the schema to see if a custom encoding is needed.

          BTW I think AVRO-1469 is related. Would the approach here satisfy the requirements of AVRO-1469?

          > Have you considered using a lexicographic-friendly format?

          I had a look, but I don't think that's a requirement for Hive for example.

          Xuefu Zhang added a comment -

          Looking at the patch, the schema and data encoding look good to me. One minor thought for consideration, though: (precision, scale), when mentioned at the schema level, already implies "max", so the property names could be just "precision" and "scale". Very insignificant though.

          Doug Cutting added a comment -

          Tom, this is looking good. A few issues:

          • Changing the CustomEncoding API makes this incompatible. Also, reflect is perhaps not the appropriate package for this. So, even though the API is almost identical to CustomEncoding, it might be preferable to add a new one, like RecordMapping.
          • It won't work to put these in unions without somehow modifying GenericData#getSchemaName. We should probably extend isRecord() to call some isRecordMapped(Object) predicate.
          • Do we need a "CustomEncoding" schema attribute? We need an isRecordMapped(Schema) predicate, but we might just key off of Schema#getFullName(). In the specification we can add a list of records that implementations might specially handle, like org.apache.avro.Decimal. Adding an attribute seems redundant.
          Tom White added a comment -

          Thanks for the feedback Doug. I’ve addressed these issues in the latest patch.

          Thanks for taking a look, Xuefu. I wanted to be explicit about the fact that precision and scale on the schema are maximum values, not the actual (fixed) values, since there has been some confusion on this (during the discussion here). I can change it though if there's consensus that the “maximum” qualifier is redundant.

          Xuefu Zhang added a comment -

          The scale and precision specified in the column definition are maximums. My understanding is that decimals with smaller scale and precision may be stored in such columns. For example, in Hive, if I have a DECIMAL(5,3) column then 12.4 is stored as [124, 1] as the unscaled-integer/scale pair (so precision 3, scale 1), and not as [12400, 3]. So in effect there are per-record precision and scale values, unless I'm missing something.

          Sorry for being late on this. While it's possible that a column of type decimal might have different scales from row to row, I'm not sure there is any use case for that. In the majority of use cases, all rows will have the same scale, so for those cases storage efficiency will suffer a little from storing the scale on a per-row basis. Most of the time the same application (such as the Hive AvroSerde) will be used to write and read the data, so the way data is stored can be controlled. Of course, storing the scale per row is more generic. If that's the case, the need for maxPrecision/maxScale in the schema is less meaningful, as the consumer of the decimal data will need to, and be able to, figure out the precision/scale on a per-row basis.

          If we do choose to store the scale per row, I'm wondering if a byte instead of an int can be used as the type, saving some storage space.

          Sean Busbey added a comment -

          If we do choose to store the scale per row, I'm wondering if a byte instead of an int can be used as the type, saving some storage space.

          Presuming we store the scale with standard Avro primitives, there is no byte type. And if the values are normally in the range of a byte, then we'll already save some space with the normal variable-length zig-zag encoding.
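
          To make the point concrete, a small editorial sketch of Avro's int encoding (zig-zag then variable-length base-128), showing that typical scale values already fit in a single byte:

          public class ZigZagExample {
            // Zig-zag maps small signed ints to small unsigned ones: 0->0, -1->1, 1->2, -2->3, ...
            static long zigZag(int n) {
              return ((n << 1) ^ (n >> 31)) & 0xffffffffL;
            }

            // Bytes used by the variable-length (7 bits per byte) encoding of the zig-zagged value.
            static int encodedLength(int n) {
              long v = zigZag(n);
              int bytes = 1;
              while (v > 0x7f) {
                v >>>= 7;
                bytes++;
              }
              return bytes;
            }

            public static void main(String[] args) {
              System.out.println(encodedLength(2));     // 1 byte for a typical scale of 2
              System.out.println(encodedLength(38));    // still 1 byte
              System.out.println(encodedLength(1000));  // 2 bytes
            }
          }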

          Tom White added a comment -

          Most of the time the same application (such as the Hive AvroSerde) will be used to write and read the data, so the way data is stored can be controlled.

          I agree, but as far as I can tell Hive and MySQL do store per-row scale values. E.g. https://github.com/apache/hive/blob/trunk/serde/src/java/org/apache/hadoop/hive/serde2/io/HiveDecimalWritable.java#L113.

          Another argument for storing the scale in the data is that if you have the scale as an Avro schema property, then you can change the values of the data simply by providing a different read schema, since schema properties are ignored during Avro schema resolution (http://avro.apache.org/docs/1.7.6/spec.html#Schema+Resolution). Compare this to Avro fixed, where you can't change the size field in a read schema.

          Doug Cutting added a comment -

          Tom, this is looking great. A few questions/concerns:

          • a few new public or protected methods need javadoc
          • should we add a constant or method that returns the decimal schema?
          • should we add this schema to the specification document?
          • as a test of this new abstraction, will it be easy to fix AVRO-739 using it?
          Xuefu Zhang added a comment -

          I can tell Hive and MySQL do store per-row scale values. E.g. https://github.com/apache/hive/blob/trunk/serde/src/java/org/apache/hadoop/hive/serde2/io/HiveDecimalWritable.java#L113.

          This is true. This class is used as the default LazySerde in Hive. However, other SerDes are free to choose their own implementation.

          I'm on board with storing the scale per row. My only concern is that the Hive ParquetSerde is choosing not to store the scale per row. Of course, this belongs to a different discussion.

          Tom White added a comment -

          Here's a new patch with javadoc and a section in the spec about record mappings.

          should we add a constant or method that returns the decimal schema?

          You can get the decimal schema with new DecimalRecordMapping().getSchema(), or new DecimalRecordMapping(5, 3).getSchema() - is that sufficient?

          as a test of this new abstraction, will it be easy to fix AVRO-739 using it?

          Yes, it should be. I'll try it out and report back.

          Doug Cutting added a comment -

          +1. This looks great to me! Thanks!

          Tom White added a comment -

          Here's a patch (not for commit) of what a Unix Epoch mapping might look like as discussed in AVRO-739.

          ASF subversion and git services added a comment -

          Commit 1584605 from tomwhite@apache.org in branch 'avro/trunk'
          [ https://svn.apache.org/r1584605 ]

          AVRO-1402. Support for DECIMAL type (as a record mapping).

          Tom White added a comment -

          I just committed this. Thanks for all the feedback and reviews.

          Hudson added a comment -

          SUCCESS: Integrated in AvroJava #448 (See https://builds.apache.org/job/AvroJava/448/)
          AVRO-1402. Support for DECIMAL type (as a record mapping). (tomwhite: rev 1584605)

          • /avro/trunk/CHANGES.txt
          • /avro/trunk/doc/src/content/xdocs/spec.xml
          • /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/generic/DecimalRecordMapping.java
          • /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java
          • /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/generic/GenericDatumReader.java
          • /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/generic/GenericDatumWriter.java
          • /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/generic/RecordMapping.java
          • /avro/trunk/lang/java/ipc/src/test/java/org/apache/avro/TestSchema.java
          Tom White added a comment -

          There has been some further (offline) discussion about whether it would be possible to store the scale in the Avro schema, and not in the data for efficiency reasons. Something like:

          {
            "type":"record",
            "name":"org.apache.avro.FixedDecimal",
            "fields": [{
              "name":"value",
              "type":"bytes"
            }],
            "scale":"2",
            "precision":"4"
          }
          

          In the implementation committed here the name does not uniquely determine the RecordMapping so a FixedDecimal(4, 2) has a different RecordMapping to a FixedDecimal(3, 0). GenericData has a map of name to RecordMappings, so org.apache.avro.FixedDecimal would map to either FixedDecimalRecordMapping(4, 2) or FixedDecimalRecordMapping(3, 0), but not both.

          We could solve this problem by having a stateless FixedDecimalRecordMapping and having the read and write methods pass through the record schema to get the scale. However, consider the case where there are multiple decimals (with different scales) in a single schema. Since you can’t redefine a type multiple times (http://avro.apache.org/docs/1.7.6/spec.html#Names), the first one serves as the definition, and later ones are just references:

          {"type":"record","name":"rec","fields":[
            {"name":"dec1","type":{"type":"record","name":"org.apache.avro.FixedDecimal","fields":[{"name":"value","type":"bytes"}],"scale":"2","precision":"4"}},
            {"name":"dec2","type":"org.apache.avro.FixedDecimal","precision":"3","scale":"0"}
          ]} 
          

          When GenericDatumReader/Writer is processing dec2, the value of scale seen is 2, not 0, since the read/write method sees the record schema, not the field-level schema. I can’t see a simple way around this.

          Note that in the Decimal schema committed in this JIRA we allow maxPrecision and maxScale values to be specified as JSON properties that are not interpreted by Avro. E.g.

          {"type":"record","name":"rec","fields":[
            {"name":"dec1","type":{"type":"record","name":”org.apache.avro.Decimal","fields":[{"name":"scale","type":"int"},{"name":"value","type":"bytes"}],"maxPrecision":"4","maxScale":"2"}},
            {"name":"dec2","type":"org.apache.avro.Decimal","maxPrecision":"3","maxScale":"0"}
          ]}
          

          As it stands an application using this extra metadata would have to be careful to read the JSON properties either from the field (if they are present there) or from the org.apache.avro.Decimal record type. This might be something we improve - e.g. by only having the metadata as field-level properties, not as part of the record definition. That would work for Hive.

          Tom White added a comment -

          There have been objections to the approach implemented here on the basis that there’s no clear use case for storing decimals with different scales in the same field. So here’s another idea: rather than adding new types into GenericData, provide helper code to make it easy for applications to use decimals with some agreed-upon schemas. To avoid the problems with unique names, we could go back to the logical type approach:

          {"type":"bytes", "logicalType":"decimal", "scale":"2”} 
          

          We’d then have code like

          // write
          Schema schema = DecimalHelper.schema(2);
          BigDecimal decimal = new BigDecimal("12.45");
          GenericDatumWriter<ByteBuffer> writer = new
              GenericDatumWriter<ByteBuffer>(schema, GenericData.get());
          writer.write(DecimalHelper.toByteBuffer(decimal), encoder);
          
          // read
          int scale = DecimalHelper.getScale(schema);
          GenericDatumReader<ByteBuffer> reader = new
            GenericDatumReader<ByteBuffer>(schema, schema, GenericData.get());
          ByteBuffer buffer = reader.read(null, decoder);
          BigDecimal readDecimal = DecimalHelper.fromByteBuffer(buffer, scale);  
          

          If we did this we would revert the patch in this JIRA, since there’s no benefit in having both approaches. Date/time could be implemented in the same way.

          Also, the work on interchangeable memory-models (AVRO-1469) might provide a more flexible approach for supporting more in-memory types (like BigDecimal) in the future.

          Ryan Blue added a comment -

          There have been objections to the approach implemented here on the basis that there’s no clear use case for storing decimals with different scales in the same field.

          I think there is a use case that Avro needs to care about: `BigDecimal` is used to avoid floating-point error, and not just for money, where the scale is almost always 2. If I'm storing precise measurements, I could end up with different incoming scales from different equipment used to take the measurements. This works just fine up until the `BigDecimal` is stored in Avro, at which point, if Avro forces data points to have the same scale, it would throw an exception and force the caller to mutate the data.

          There are real-world cases where this happens. GPS data comes to mind, where newer receivers have better location resolution and it's important to keep the most precise measurement taken.

          Nong Li added a comment -

          Ryan, your example seems like a schema evolution use case. Each file would have a fixed scale and we'd resolve the different schemas. Is it that compelling to have the different receivers write to the same file?

          Ryan Blue added a comment -

          I don't think evolution would cover the use case, because the file format would then require separate files for measurements where the only difference is the resolution of the measurement. That wouldn't work for cases where these are coming in from multiple devices simultaneously that won't be upgraded at the same time. I agree we should support the case where all of the scales are the same, but I don't think it's correct for Avro to force that to be the case.

          Nong Li added a comment -

          I see your use case as the same as going from int -> long or float -> double. Can you help me understand why it's different?

          Ryan Blue added a comment -

          No problem. Going from int to long or float to double is increasing the precision, which is fine. But changing the scale is actually changing the data.

          Say I have measurements coming in, and over time I'm updating the platform to get measurements with higher resolution. Using BigDecimal is the right choice because I want to be able to calculate the margin of error, so I need to know how many figures are significant. If we fix the scale at the resolution of the initial measurements, then the higher-resolution measurements are lost because I have to discard digits to get to the same scale (12.008 becomes 12.01). But if I start with a higher resolution, scale to 4 digits, then I have to store a separate value that says how many of those are significant (12.0080 is really 12.008). In other words: for measurements, scale matters. That's why I'm not using floating point because I don't want an approximation that is close, but not quite accurate:

          >> BigDecimal.new(12.100).to_s
          => "12.0999999999999996447286321199499070644378662109375"
          

          If we were to evolve the schema from scale=2 to scale=4, how do I know which values were accurate to 2 decimals and which were accurate to 4? If all BigDecimal values produced by the new schema had the read-time scale but were stored with different scales, then the file format would be changing data. BigDecimal(12.01) != BigDecimal(12.0100). For evolution with different scales, the maximum scale can increase, but we still have to return the scale the data was written with. For fixed-scale schemas, I don't think we should allow the scale to evolve because programs should expect objects with the fixed scale.
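
          A quick Java check of the equality claim above (an editorial example; BigDecimal.equals compares the scale as well as the value, unlike compareTo):

          import java.math.BigDecimal;

          public class ScaleEqualityExample {
            public static void main(String[] args) {
              BigDecimal twoDecimals = new BigDecimal("12.01");    // scale 2
              BigDecimal fourDecimals = new BigDecimal("12.0100"); // scale 4
              System.out.println(twoDecimals.equals(fourDecimals));    // false: same number, different scale
              System.out.println(twoDecimals.compareTo(fourDecimals)); // 0: numerically equal
            }
          }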

          Skye Wanderman-Milne added a comment -

          I strongly prefer the logical type approach, rather than a record mapping, due to the problem with multiple record mapping types in the same schema. Even if we stick with storing the scale in the data and not the schema, I don't think we should go down the record mapping path at all since it precludes any parameterized types, i.e. types that include schema metadata. We may want to introduce other types in the future that store extra metadata (e.g. a time with a timezone), and it seems unacceptable to not be able to include multiple instances of the same type with different metadata in the schema.

I've written a micro-benchmark to get an idea of the performance of storing scale in the data vs. only in the schema. I see pretty big differences in performance depending on whether I compile with gcc or clang, so I don't want to give definitive numbers yet, but I'm seeing performance hits of between 5% and 30% depending on how many different scales are stored in the data (compared to storing a single scale in the schema). If we feel nailing down this performance difference is important, I can dig deeper and try to determine what the most "representative" case is.

Talking offline with Ryan and others, we concluded that while it's possible to imagine scenarios where per-value scales are useful, it's not likely to be a concern in the large majority of cases (Ryan, please let me know if I'm mischaracterizing what you said). Given that we can't think of a very compelling use case for storing scales in the data, I think we should store the scale in the schema, using the logical type schema Tom suggested above. Storing the scale in the data has a definite performance impact, and keeping that implementation adds complexity for applications that do not take advantage of the flexibility. I also think there's the possibility of confusing users who may decide to write multiple-scale values to the same file, only to learn later that they can't easily be accessed as such.

          tl;dr: I like this one:

          {"type":"bytes", "logicalType":"decimal", "scale":"2”}
          
          Tom White added a comment -

I think the unique name limitation is a problem. At this point I think we should revert this patch, establish some shared logical type definitions (for decimal and date/time) that Hive can start using, and then, once they have become established, look at how to support them better through the Avro API as part of AVRO-1469.

We could also add them to the Avro spec and describe (and enforce in the implementation) the schema resolution rules: something like, both the reader's and writer's logicalTypes must match, and then each logical type has further rules, e.g. scale and precision must match for decimal.
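For concreteness, a sketch of the resolution rule described above (an assumption about how it could be checked, not part of any patch here), using the existing schema property mechanism; the helper names are made up for illustration.

import org.apache.avro.Schema;

public class LogicalTypeResolution {
  // Throws if the writer's and reader's logical type annotations are incompatible.
  static void checkCompatible(Schema writer, Schema reader) {
    String writerType = writer.getProp("logicalType");
    String readerType = reader.getProp("logicalType");
    if (writerType == null && readerType == null) {
      return; // plain Avro schema resolution applies
    }
    if (writerType == null || !writerType.equals(readerType)) {
      throw new IllegalArgumentException(
          "logicalType mismatch: " + writerType + " vs " + readerType);
    }
    if ("decimal".equals(writerType)
        && (!sameProp(writer, reader, "scale") || !sameProp(writer, reader, "precision"))) {
      throw new IllegalArgumentException("decimal scale/precision must match");
    }
  }

  private static boolean sameProp(Schema a, Schema b, String name) {
    String pa = a.getProp(name);
    return pa == null ? b.getProp(name) == null : pa.equals(b.getProp(name));
  }
}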

          Doug Cutting added a comment -

          > At this point I think we should revert this patch ...

          +1

          Jarek Jarcec Cecho added a comment -

          I really like the idea of the logical types as that concept can be easily reused for some other new types (such as timestamp, ...).

          ASF subversion and git services added a comment -

          Commit 1586265 from tomwhite@apache.org in branch 'avro/trunk'
          [ https://svn.apache.org/r1586265 ]

          Revert r1584605. AVRO-1402. Support for DECIMAL type (as a record mapping).

          Tom White added a comment -

          Here's what the spec for logical types might look like.

          Skye Wanderman-Milne added a comment -

          This looks great Tom! If I understand correctly, this also allows for a single logical type to have multiple possible encodings; for example, we could someday add an alternate decimal encoding that is backed by "fixed" rather than "bytes". One question: what do you mean by "Language implementations may choose to represent logical types with an appropriate native type, although this is not required"? Is this referring to logical types not explicitly defined in the spec?

          Ryan Blue added a comment -

          "Language implementations may choose to represent logical types with an appropriate native type, although this is not required"? Is this referring to logical types not explicitly defined in the spec?

          I think this defines backward-compatibility. It is fine if an implementation returns the underlying type (e.g. bytes) rather than translating to an object representation (e.g., BigDecimal). That way, this doesn't require implementation changes right away and we can determine the best way to transition to returning logical type objects.

          For types not defined in this spec, I think the library should support reading types by returning the underlying type's object representation. But to avoid incompatibilities, writers should reject any logical type annotations that are not defined in the spec.
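To make the fallback concrete, a sketch (illustrative only, not library code) of what an application could do today: a reader that does not understand the annotation returns the underlying bytes, and the application converts them itself using the documented encoding, a two's-complement big-endian unscaled integer with the scale taken from the schema ("scale": 2 here).

import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;

public class DecimalFromBytes {

  static BigDecimal toBigDecimal(ByteBuffer value, int scale) {
    byte[] unscaled = new byte[value.remaining()];
    value.duplicate().get(unscaled); // copy without moving the caller's position
    return new BigDecimal(new BigInteger(unscaled), scale);
  }

  public static void main(String[] args) {
    // unscaled 123456 with scale 2 represents the decimal 1234.56
    ByteBuffer bytes = ByteBuffer.wrap(BigInteger.valueOf(123456).toByteArray());
    System.out.println(toBigDecimal(bytes, 2)); // prints 1234.56
  }
}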

          Skye Wanderman-Milne added a comment -

          Ah, makes sense. Thanks for the clarification.

          I think we should allow writing unspecified logical types, so that new logical types can be implemented in the application when the spec is updated without upgrading the library.

          Ryan Blue added a comment -

          A logical type is always encoded in exactly the same way as the equivalent Avro type.

          Could we reword this to "A logical type is always serialized using its underlying Avro type." I'd like to avoid confusion because some logical types (like decimal) specify the "encoding" that determines how to convert between the object representation and the underlying Avro type.

          Last, I think that we should add that fixed may also be annotated with a decimal logical type, but I don't want to hold up this issue if people disagree – in that case we should discuss adding it in a separate issue/commit.

          Hudson added a comment -

          SUCCESS: Integrated in AvroJava #450 (See https://builds.apache.org/job/AvroJava/450/)
          Revert r1584605. AVRO-1402. Support for DECIMAL type (as a record mapping). (tomwhite: rev 1586265)

          • /avro/trunk/CHANGES.txt
          • /avro/trunk/doc/src/content/xdocs/spec.xml
          • /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/generic/DecimalRecordMapping.java
          • /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java
          • /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/generic/GenericDatumReader.java
          • /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/generic/GenericDatumWriter.java
          • /avro/trunk/lang/java/avro/src/main/java/org/apache/avro/generic/RecordMapping.java
          • /avro/trunk/lang/java/ipc/src/test/java/org/apache/avro/TestSchema.java
          Doug Cutting added a comment -

          > fixed may also be annotated with a decimal logical type

          Seems like a good idea. Let's add it to the spec if we implement it in this patch, otherwise in a separate issue.

          Xuefu Zhang added a comment -

          precision, a JSON integer representing the (maximum) precision of decimals stored in this type (optional)

Looking at the spec, I'm wondering what the precision is if it's not specified in the type definition. Shouldn't we have a default value, as we do for scale?

When we define a column of decimal type without giving precision/scale in a DB such as MySQL, "decimal" is the same as "decimal(10)" and "decimal(10,0)".

While precision isn't required to reconstruct the data, it's important for metadata operations. For instance, what's the type of "d1+d2", where d1 and d2 are two decimal columns? Without knowing the precision and scale of both d1 and d2, we will not be able to determine the result metadata for this calculation.

          Thus, I suggest we either have a default value (such as 10) as we do for scale, or make precision a required attribute. The former seems more reasonable.
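As an illustration of the metadata arithmetic Xuefu describes, using the convention common to Hive and many SQL engines for addition (this is not something Avro itself defines or enforces):

public class DecimalMetadata {

  // Result (precision, scale) of d1 + d2, given the (precision, scale) of each operand.
  static int[] addResultType(int p1, int s1, int p2, int s2) {
    int scale = Math.max(s1, s2);
    int integerDigits = Math.max(p1 - s1, p2 - s2);
    int precision = integerDigits + scale + 1; // +1 for a possible carry
    return new int[] { precision, scale };
  }

  public static void main(String[] args) {
    int[] result = addResultType(5, 2, 7, 4);
    System.out.println("decimal(" + result[0] + "," + result[1] + ")"); // decimal(8,4)
  }
}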

          Ryan Blue added a comment -

          This patch updates the last patch from Tom. It adds:

          • A minor clarification to the logical type encoding
• That readers should ignore unknown logical types and use the underlying Avro type
          • Fixed may be annotated with decimal, but this limits the maximum precision
          Ryan Blue added a comment -

          I think we should allow writing unspecified logical types, so that new logical types can be implemented in the application when the spec is updated without upgrading the library.

For now, I've just updated the spec for how readers should behave, not writers. This will probably depend on the level of support we end up building into Avro vs. on top of the format. Right now, with no implementation support, there is little need to update the library just to get a new version of the spec with a type. But in the future we may have consistency checks on logical types (like the Parquet implementation is adding), so we can revisit this at that point.

          I suggest we either have a default value (such as 10) as we do for scale, or make precision a required attribute

          I think the most reasonable implementation for a default precision is either 9 (fits in 4 bytes) or 18 (fits in 8 bytes), or the maximum number of digits that can be stored in a fixed array. (I think MySQL's default is 10 because it is based on a different encoding.) Because the maximum precision varies with the length of the fixed array, but that length can't be evolved, I think that this should be required rather than defaulted. That way, users must specify the expected maximum precision and we can validate that their expectation is correct, before writing data.
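For reference, the relationship between a fixed size and the number of decimal digits it can hold is floor(log10(2^(8 * size - 1) - 1)), where the -1 in the exponent accounts for the sign bit of the two's-complement representation. A small illustrative calculation (not library code):

public class MaxPrecision {

  static int maxPrecision(int fixedSize) {
    // floor(log10(2^(8*size - 1) - 1))
    return (int) Math.floor(Math.log10(Math.pow(2, 8 * fixedSize - 1) - 1));
  }

  public static void main(String[] args) {
    System.out.println(maxPrecision(4));  // 9  ("fits in 4 bytes")
    System.out.println(maxPrecision(8));  // 18 ("fits in 8 bytes")
    System.out.println(maxPrecision(16)); // 38
  }
}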

          Ryan Blue added a comment -

          Updated decimal spec:

          • Change precision to required
          • Fix maximum precision calculation (account for sign bit)
          Ryan Blue added a comment -

          I accidentally uploaded a bad patch, this one has the most recent changes that were missing from #3.

          Xuefu Zhang added a comment -

Because the maximum precision varies with the length of the fixed array, but that length can't be evolved, I think that this should be required rather than defaulted.

This seemingly suggests that for Avro decimal, the metadata has to match the data exactly, which isn't what Hive assumes. It's perfectly fine for Hive to read a data column from a file specified with schema decimal(5,2) into a table column specified with schema decimal(4,1). In other words, data conversion is possible when data and metadata don't match. Since Avro data and metadata are separate, a mismatch is possible regardless.

          I'm not sure if Avro wants to enforce this, but I'd just like to point this out.

          Ryan Blue added a comment -

          Xuefu Zhang, what I'm trying to say is that it's an error if the underlying type cannot fit the precision the user expects. And because that underlying type can't be changed, we should avoid mistakes by making the precision required so we can check as soon as possible and before data is written.

          To your point about changing the value to fit the storage, that may be fine for Hive but I think a data storage format shouldn't change the data. Storing a decimal(5,2) in a decimal(4,1) discards data and requires a rounding policy, so it should be treated as an error if Avro is handling the encoding. If Hive is encoding the decimal, then the decision is left up to Hive.

          Xuefu Zhang added a comment -

          Thanks for the clarification, Ryan Blue. That makes sense.

Speaking of enforcement, I am curious about how much Avro is able to validate against the spec. For instance, the latest spec requires big-endian encoding, but the decimal data the application gives Avro to store is just bytes, and I assume Avro isn't able to determine the endianness. Other rules, such as scale being less than or equal to precision, or scale being positive, aren't present in the spec. I guess the bigger question is what validation, if any, will be enforced by Avro, or whether it's completely up to the application and the spec is more of a guideline.

          Ryan Blue added a comment -

The rules that can be enforced, like consistent metadata (scale <= precision), should be checked at write time eventually, but the fallback is to ignore the logical type annotation because it is always okay to return the underlying Avro type (I'll add this to the spec). For the decimal encoding, we can't check that the bytes were encoded as big-endian, but I'm less concerned here because this isn't a user error; it would be a programming error.

Other rules, such as scale being less than or equal to precision, or scale being positive, aren't present in the spec

          You're right. I'll add these rules.
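A sketch of how those checks and the encoding could look on the write path (illustrative only; the method names are assumptions, not part of any patch). BigDecimal/BigInteger already expose the unscaled value as big-endian two's-complement bytes, so no byte swapping is needed.

import java.math.BigDecimal;
import java.nio.ByteBuffer;

public class DecimalWriteChecks {

  static ByteBuffer encode(BigDecimal value, int precision, int scale) {
    if (precision <= 0) {
      throw new IllegalArgumentException("precision must be > 0");
    }
    if (scale < 0 || scale > precision) {
      throw new IllegalArgumentException("scale must satisfy 0 <= scale <= precision");
    }
    if (value.scale() != scale) {
      throw new IllegalArgumentException(
          "value scale " + value.scale() + " does not match schema scale " + scale);
    }
    if (value.precision() > precision) {
      throw new IllegalArgumentException("value exceeds declared precision " + precision);
    }
    // big-endian two's-complement bytes of the unscaled value
    return ByteBuffer.wrap(value.unscaledValue().toByteArray());
  }

  public static void main(String[] args) {
    ByteBuffer encoded = encode(new BigDecimal("1234.56"), 6, 2);
    System.out.println(encoded.remaining() + " bytes"); // 3 bytes for unscaled 123456
  }
}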

          Ryan Blue added a comment -

          Updates:

          • Added that invalid logical types should be ignored.
          • Added that precision must be > 0
          • Added 0 <= scale <= precision
          Xuefu Zhang added a comment -

Since the logical type approach proposed here isn't really providing a native decimal type in Avro, it seems a bit of a stretch when trying to accommodate decimal concepts. I understand the intention of providing interoperability between applications for decimal data, but from the point of view of the storage layer, it shouldn't care. All it needs to provide is the capability of annotating the binary type so that the application can attach whatever it needs, such as scale/precision for decimal, timezone for datetime, etc. It's up to the application to do any validation. This certainly leaves interoperability to applications, but it relieves Avro from enforcing whatever it cannot guarantee anyway.

          Thus, I'd think that there are two possible approaches, one at each end:
          1. provide native type decimal in Avro, which can provide interoperability.
          2. provide just annotation capability for binary type, and leaves everything else to the application.

The proposal here seems to sit in the middle of the two, which makes me think twice when deciding if it's the best.

          Ryan Blue added a comment -

          Xuefu Zhang, you're right that this is between those two approaches. By working on the specification now, we're tackling the important part for interoperability, and leaving the details of Avro's implementation until later. This allows us to build decimal support in parallel, because we can all agree on how the encoding works and shouldn't be blocked on the Avro implementation details.

          ... relieves Avro from enforcing whatever it cannot guarantee anyway.

          What do you mean? It's normal for a spec to require something that it can't guarantee. If you decide not to put an IP address in network-byte-order, for example, nothing "enforces" the rule – but your packets go somewhere other than the intended address. Decimal is similar: everyone will interpret the bytes as big-endian two's complement. I think Avro (and similarly, parquet) is the right place for this spec because it is where the data resides. Otherwise, Pig could ignore Hive's implementation and build a competing encoding scheme. Putting this in Avro means there is only one correct way to share decimal numbers in Avro.

          Xuefu Zhang added a comment -

Fair enough. I just wanted to put all the options on the table, and the proposal sounds like a reasonable compromise. Let's proceed then.

          A minor question: do we put upper bounds for the precision/scale?

          Ryan Blue added a comment -

          [D]o we put upper bounds for the precision/scale?

          The scale is bounded by the precision, which is bounded by the storage array's length. If the underlying type is bytes then there is no length, but the precision is required so there must be a self-imposed upper bound. We could mandate something like 38 (16 bytes), but I don't think this is really necessary.

          Doug Cutting added a comment -

          The changes to the spec look fine, but I'd prefer we have an implementation too before we commit. In general I prefer we implement new features in at least one language before we add them to the specification, as often unforeseen issues arise during implementation that can impact the specification.

          Ryan Blue added a comment -

          Here is a patch with a basic implementation of the logical type enforcement in the Schema. It isn't quite done, but it is working and I wanted to get something out for feedback as soon as possible. Please let me know if this doesn't look right!

          Changes:

          • Add a LogicalType class, with "decimal" and "fromJsonNode" static factory methods
          • Add setLogicalType and getLogicalType to Schema. The setter calls "validate" and only sets a logical type if it is valid for the schema.
          • Update the parsing code to check for "logicalType" and use "fromJsonNode" to deserialize. Falls back to property representation if invalid.
          • Add private method "writeLogicalType" to Schema and call when properties are written to ensure logical types are in written Schemas.

          Still to do:

          • Update equals and hashCode
          • Testing
          Ryan Blue added a comment -

          I've pushed a branch to github, which may be more readable: https://github.com/rdblue/avro/compare/apache:trunk...logical-types

          Doug Cutting added a comment -

          Does this need to be in Schema.java? That class is already way too big. Might it instead be in LogicalType.java, with a subclass DecimalType?

          Also, the implementation needs to read and write Decimal instances, right?

          Jarek Jarcec Cecho added a comment -

Even though the code changes support the logical types introduced here, they seem quite independent of each other. Hence I'm wondering if it would make sense to use this JIRA for committing the spec changes and do the implementation in a follow-up JIRA?

          Ryan Blue added a comment -

I don't think we need to work with decimal objects yet; that part is optional. But I would like to get the specification updated so that part is finished and other projects can work in parallel. I also think it is a good idea to add the LogicalType class to Schema so that the spec is correctly enforced for decimal metadata.

          I'll move LogicalType to its own file, add equals/hash implementations, and test. Am I missing anything that needs to be done for this to be supported in the Schema?

          Ryan Blue added a comment -

          Latest patch with working logical types on Schema.

          Doug Cutting added a comment -

Ryanb: I'd prefer that Decimal is not a nested subclass of LogicalType. That way the pattern is that new logical types can be added by user code, without creating a new Avro release. Also, Schema.java need not be altered: a LogicalType can be a factory for, and a tool that manipulates, Schema instances. The existing schema property mechanism can be used to implement these. If another implementation adds a logical type that Java has not yet implemented, then no exceptions need be caught, since logical types are an optional layer atop schemas.

Jarek, you seem to feel that other projects (e.g., Hive or Impala) will be better able to make progress once a logical type specification is committed. If that's the case then perhaps we should commit just the specification change now and let implementations and uses proceed in parallel.
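A rough sketch of the direction described above (an interpretation, not the committed implementation): logical types live outside Schema.java and are applied through the existing property mechanism, so user code can add new ones without a new Avro release.

import org.apache.avro.Schema;

public abstract class LogicalType {

  // Annotate a schema with this logical type after validating it.
  public Schema addToSchema(Schema schema) {
    validate(schema);
    schema.addProp("logicalType", getName());
    return schema;
  }

  public abstract String getName();

  // Throw if this logical type cannot annotate the given schema.
  public abstract void validate(Schema schema);
}

// Defined separately (shown in the same file only for brevity), so user code could add
// new logical types the same way without changes to Avro itself. A real DecimalType
// would also carry precision and scale properties.
class DecimalType extends LogicalType {
  @Override public String getName() { return "decimal"; }
  @Override public void validate(Schema schema) {
    if (schema.getType() != Schema.Type.BYTES && schema.getType() != Schema.Type.FIXED) {
      throw new IllegalArgumentException(
          "decimal requires bytes or fixed, got " + schema.getType());
    }
  }
}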

          Jarek Jarcec Cecho added a comment -

Yes Doug Cutting, I do feel that it would be better to commit the spec here to unblock other projects and then follow up with the coding part on a separate JIRA.

          Ryan Blue added a comment -

          Doug Cutting, thanks for taking a look. I'll update the patch to address those problems. Should I also open a separate issue to track the LogicalType addition, so that we can get the spec updated for this one?

          Doug Cutting added a comment -

          Yes, it sounds like a separate issue for implementation would be best.

          Ryan Blue added a comment -

          Created AVRO-1497 to track the logical type support in schemas (Java).

          For just the spec addition, AVRO-1402-logical-type-spec-5.patch is up-to-date. The only changes since have been adding implementation code.

          Doug Cutting added a comment -

Forrest failed with the spec change because it didn't like a 'sup' tag nested within a 'code' tag. I changed the outer tag to 'em' in this case and now it compiles.

          Here's the modified patch.

          Doug Cutting added a comment -

          I'll commit this soon unless there are objections.

          Doug Cutting added a comment -

          Trim code changes from patch. Patch is spec only.

          ASF subversion and git services added a comment -

          Commit 1588497 from Doug Cutting in branch 'avro/trunk'
          [ https://svn.apache.org/r1588497 ]

          AVRO-1402. Add optional subtypes to specification. Contributed by tomwhite & Ryan Blue.

          Doug Cutting added a comment -

          I committed this. Thanks, Ryan!

          Hudson added a comment -

          FAILURE: Integrated in AvroJava #451 (See https://builds.apache.org/job/AvroJava/451/)
          AVRO-1402. Add optional subtypes to specification. Contributed by tomwhite & Ryan Blue. (cutting: rev 1588497)

          • /avro/trunk/CHANGES.txt
          • /avro/trunk/doc/src/content/xdocs/spec.xml

            People

            • Assignee:
              Ryan Blue
              Reporter:
              Mariano Dominguez
• Votes:
  6
  Watchers:
  17
