Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: spec
    • Labels: None
      Attachments

    1. AVRO-739.patch (8 kB, Doug Cutting)

      Issue Links

        Activity

        Ron Bodkin added a comment -

        From the discussion on the users list, I agree that it'd be great to start with a simple timestamp, which gets serialized as a long. Let's start with a simple feature, and future enhancements can be tracked separately.

        Doug proposed this design:

        I noted that it would be nice to allow some flexibility in the implementation
        classes for dates, e.g., letting Java users use Joda time classes as well
        as java.util.Date.

        Scott said:
        Absolutely. This is a per-language feature though, so it may not require
        much of the spec. For example, in Java it could simply be a configuration
        parameter passed to the DatumReader/Writers. It doesn't make a lot of
        sense to store metadata on the data that says "this is a Joda object, not
        java.util.Date" – that is a user choice and not intrinsic to describing
        the data.

        My input:
        I agree this shouldn't be part of the serialized format. It would be nice to
        have a clean way to specify the configuration/mappings used, one that allows
        specifying mappings for more such org.apache.avro data types. It should be
        supported for the reflection and code-generation approaches as well.

        Scott also said:
        There are other questions too – what are the timestamp units
        (milliseconds? configurable?), what is the origin (1970? 2010?
        configurable?) – these decisions affect the serialization size.

        My input:
        I would like to see a format that allows storing data at the precision of popular libraries and languages (java.util.Date, Joda time, Python datetime, etc.). Having a long representing microseconds since Jan. 1 1970 seems like a good compromise for general purpose use. It supports higher precision libraries and still allows representing a few hundred thousand years of data. Some libraries do allow nanosecond resolution - but limiting to 270 years seems like a bigger limitation than microsecond precision.
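
        A quick arithmetic check of that trade-off (an illustrative Java sketch, nothing Avro-specific; a signed 64-bit long holds about 9.22e18 units either side of zero):

        // Representable range of a signed 64-bit long at microsecond
        // vs. nanosecond precision around the 1970 epoch.
        public class EpochRange {
            public static void main(String[] args) {
                double secondsPerYear = 365.25 * 24 * 60 * 60;
                double microYears = Long.MAX_VALUE / 1e6 / secondsPerYear;
                double nanoYears = Long.MAX_VALUE / 1e9 / secondsPerYear;
                System.out.printf("microseconds: +/- %,.0f years%n", microYears); // ~292,000 years
                System.out.printf("nanoseconds:  +/- %,.0f years%n", nanoYears);  // ~292 years
            }
        }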

        Ron Bodkin added a comment -

        Sorry, I forgot to paste in Doug Cutting's design:
        The way that I have imagined doing this is to specify a standard schema
        for dates, then implementations can optionally map this to a native date
        type.

        The schema could be a record containing a long, e.g.:

        {"type": "record", "name":"org.apache.avro.lib.Date", "fields" : [

        {"name": "time", "type": "long"}

        ]
        }

        Java could read this into a java.util.Date, Python to a datetime, etc.
        Such conventions could be added to the Avro specification.

        Does this sound like a reasonable approach?
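
        (For illustration, a minimal sketch of how a Java reader might apply such a convention through the generic API. The schema name and "time" field follow the example above; the helper itself is hypothetical and assumes the long holds epoch milliseconds.)

        import java.util.Date;
        import org.apache.avro.generic.GenericRecord;

        public final class DateConvention {
            private static final String DATE_SCHEMA = "org.apache.avro.lib.Date";

            // Recognize the proposed date record by its full schema name and
            // convert its single long field to java.util.Date after reading.
            public static Object maybeToDate(Object datum) {
                if (datum instanceof GenericRecord) {
                    GenericRecord record = (GenericRecord) datum;
                    if (DATE_SCHEMA.equals(record.getSchema().getFullName())) {
                        return new Date((Long) record.get("time"));
                    }
                }
                return datum; // not a date record; pass through unchanged
            }
        }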

        And also this email thread -

        On 01/18/2011 09:19 AM, Jeremy Custenborder wrote:
        I agree with storing it as a long. How would you handle this in code
        generation and serialization? Would you envision hooks during code
        generation that would generate a member that is the native date time
        for the language?

        Yes. Just as "bytes" is represented in Java by java.nio.ByteBuffer,
        "org.apache.avro.lib.Date" could be represented by java.util.Date.

        Does the serializer handle a date object that is
        native to the language?

        Yes, serializers and deserializers would need to implement this mapping.

        Does this sound like a reasonable approach?

        I really like the idea of having a standard
        datetime as a supported type of avro. It's a problem that everyone has
        to solve on their own.

        Jeremy Custenborder added a comment -

        Were you thinking of a long with the number of milliseconds since 1970 UTC? If you need more precision than that, you are most likely going to make your own type. I really like the idea of getting something that can map to the native types in most of the languages. This would be a really cool feature.

        Colin Fletcher added a comment -

        The serialization of dates/times must incorporate the timezone. If it does not, then I will be unable to use it for the large-scale projects I am leading. It doesn't matter to me if the binary format is custom, but in JSON it must be JSON-compliant.

        Russell Jurney added a comment -

        PIG-1314 may be relevant. The ISO 8601 datetime format seemed convenient.

        Kenneth Baltrinic added a comment -

        I concur with Colin Fletcher that some consideration of timezones and daylight saving time is needed. At the very minimum, the spec would need to require that, in the absence of an explicit timezone, all times are in UTC.

        John A. De Goes added a comment -

        Adopting UTC milliseconds as the date/time format is fundamentally wrong and will render the type useless for any serious application. ISO8601 is the standard format for date/time. It preserves the critical notion of timezone and daylight savings time, and of course lets you express time in UTC as well if that's what you want. The binary encoding is only slightly bulkier than UTC milliseconds.
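
        (The fidelity argument in concrete terms, as a small illustrative sketch using java.time for brevity: collapsing to epoch milliseconds still identifies the instant, but the writer's offset is gone, whereas an ISO 8601 string keeps it.)

        import java.time.OffsetDateTime;
        import java.time.ZoneOffset;

        public class Iso8601Fidelity {
            public static void main(String[] args) {
                // A timestamp written by someone at UTC-7:
                OffsetDateTime written = OffsetDateTime.of(2011, 1, 18, 9, 19, 0, 0,
                                                           ZoneOffset.ofHours(-7));
                System.out.println(written);                            // 2011-01-18T09:19-07:00
                System.out.println(written.toInstant().toEpochMilli()); // 1295367540000 (offset lost)
            }
        }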

        Doug Cutting added a comment -

        The custom encoding feature added in AVRO-1341 might be a good way to implement this. SpecificData could have a table mapping classes to custom encodings.

        By default this would map java.util.Date to a standard schema that writes it as a long. My instinct is to use a record schema rather than a long schema, however.
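
        (A rough sketch of that direction, using the CustomEncoding hook from AVRO-1341; the class below is illustrative, not the attached patch.)

        import java.io.IOException;
        import java.util.Date;
        import org.apache.avro.Schema;
        import org.apache.avro.io.Decoder;
        import org.apache.avro.io.Encoder;
        import org.apache.avro.reflect.CustomEncoding;

        // Maps java.util.Date to a bare long holding epoch milliseconds.
        public class DateAsLongEncoding extends CustomEncoding<Date> {
            {
                schema = Schema.create(Schema.Type.LONG);
            }

            @Override
            protected void write(Object datum, Encoder out) throws IOException {
                out.writeLong(((Date) datum).getTime());
            }

            @Override
            protected Date read(Object reuse, Decoder in) throws IOException {
                return new Date(in.readLong());
            }
        }

        In the reflect API such an encoding can be attached to a field with the @AvroEncode annotation; the table on SpecificData mapping classes to default encodings would be the new piece.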

        Doug Cutting added a comment -

        Here's a patch that changes Java's specific & reflect to serialize and deserialize java.util.Date using the following schema:

        {"type":"record","name":"org.apache.avro.Datetime","fields":[{"name":"ms","type":"long"}]}"
        

        This is implemented by adding a custom encodings feature to SpecificData that permits a class to be mapped to a record schema. I had to modify reflect's CustomEncoding API. To make this back-compatible, we'll perhaps need to copy that API into specific, so this is not yet ready for commit.

        Do folks like this approach? We proclaim a language-independent schema for datetimes, then implementations can choose to map this into a native type or not.

        I did not extend Generic, since I believe there is value in keeping Generic's representations a closed set of classes. This permits applications to be sure they can process any data read using Generic. I might be convinced to add this to Generic, but that would make it an incompatible change.

        Scott Carey added a comment -

        There are two types of interest: one is a UTC coordinate – a long like this one, without any timezone or other 'date' information – and one is a date-time, which therefore must contain timezone information. The latter would probably best be some ISO 8601 subset. The former is a long (which is only 5 bytes for 'today', if in ms since 1970 UTC).

        We should decide on the names for these two things now. I think that "Datetime" is probably the thing that includes dates, times, and therefore time zones. The pure long universal time coordinate is perhaps "Instant" or "timestamp" – it has nothing to do with dates except that in Java the typical class used to hold such an instant is Date (or a long).

        I wish these were fundamental Avro primitive types. To work well with database systems we need these two types. The syntax as a special record in the schema is klunky, but more backwards compatible. If we assume that some version of Avro in the future requires all language implementations to support new primitive types for these, how would we migrate from this klunky form to "type":"instant"?

        This proposal isn't all that backwards compatible: If Python doesn't know what "org.apache.avro.Datetime" is, it won't be able to decode the type. Perhaps

        {"type":"instant"}

        is better – other than colliding with existing schemas with a custom type of that name. Perhaps

        {"type":"org.apache.avro.instant"}

        Rather than the record with nested field?

        Doug Cutting added a comment -

        > There are two types of interest [ ... ]

        These seem like two different external representations of the same thing. A time plus a timezone can be losslessly converted to a UTC time. You do lose the original timezone, but dates and times are usually displayed in the timezone of the displayer, not where the time was originally noted.

        Also note that SQL has five different variants (http://en.wikipedia.org/wiki/SQL#Date_and_time), which are interconvertible. I suggest all of these should be converted to a single type in Avro. Perhaps one could annotate the schema with resolution and/or timezone to improve fidelity, e.g.,

        {"type":"Datetime", "resolution":"date", "timezone":"PST"}

        but a long would always be written and used for comparison with other Datetime instances.

        > How would we migrate from this klunky form to "type":"instant"?

        We could simply treat instances of the klunky schema identically to "type":"instant".

        Scott Carey added a comment -

        > These seem like two different external representations of the same thing. A time plus a timezone can be losslessly converted to a UTC time. You do lose the original timezone, but dates and times are usually displayed in the timezone of the displayer, not where the time was originally noted.

        I completely agree for use cases where the time is being displayed to a user, but there are use cases where the loss of the original time zone is not acceptable. One could log another field with the timezone identifier for these. The use case for a UTC timestamp is more broadly applicable. I do not think we need to implement the one that also persists timezone now, but I do think we need to make sure that if we did implement such a thing in the future, the names for these two things would be consistent. If we name this "Datetime" we are implying it has a relation to dates, which implies a relationship to timezones.

        With respect to the SQL variants, I see only two that represent a single point in time. Three are either dates or times and not the combination (e.g. "January 7, 2100", representing a time with granularity of one day, or "5:01" – a time of day, respectively).

        The two SQL equivalents are TIMESTAMP and TIMESTAMP WITH TIMEZONE. This proposal covers TIMESTAMP, roughly. I am suggesting we reserve space for a future TIMESTAMP WITH TIMEZONE. We could adopt the names for consistency.

        "timestamp"
        and
        "timestamptz"

        There is also the question of serialization in JSON form. A long in binary form makes sense, but in JSON, an ISO8601 string might be more useful.
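
        (If the JSON encoding did diverge like that, the conversion itself is cheap. A hypothetical sketch, with java.time used for brevity; the binary form would stay a long.)

        import java.time.Instant;

        public class JsonRendering {
            public static void main(String[] args) {
                long stored = 1295367540000L; // what the binary encoding writes
                // What a JSON encoder could emit instead of the raw number:
                System.out.println("\"" + Instant.ofEpochMilli(stored) + "\"");
                // -> "2011-01-18T16:19:00Z"
            }
        }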

        Doug Cutting added a comment -

        > One could log another field with the timezone identifier for these.

        From my understanding of SQL, for TIMETZ and TIMESTAMPTZ columns, a separate timezone is not stored per row. Rather, the TZ in the schema only affects how dates are parsed and displayed. Am I wrong? If I am correct, then the timezone should not be a field but a schema attribute that's used by implementations when parsing and displaying values. In all cases I believe we should only store a single UTC timestamp per value. Adding a distinct primitive type for each parsing/display variant seems like a poor design choice.

        > A long in binary form makes sense, but in JSON, an ISO8601 string might be more useful.

        Special-casing this would rule out back-compatibility, no?

        Doug Cutting added a comment -

        Here's another approach. Instead of defining new record types (which would bloat schemas) or new primitives (which would be incompatible), might we standardize on some attributes?

        Thus we might use something like:

        {"type":"string", "subType":"ISO-8601-date"}

        This could be added to the specification, as an optional extension. If it's specified, then the string must be in ISO 8601 format.

        We could also have a type like:

        {"type":"long", "subType":"unix-epoch"}

        Note that with both of these formats, sorting by the primitive Avro type is consistent with sorting by time.

        Implementations can insert language-specific types for these at runtime.
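
        (A sketch of what that runtime substitution might look like, assuming the hypothetical "subType" attribute above. Avro preserves unknown schema attributes, and implementations can read them back with Schema.getProp.)

        import java.util.Date;
        import org.apache.avro.Schema;

        public final class SubTypeMapping {
            // Hypothetical hook: promote a decoded primitive to a native type
            // when its schema carries a recognized "subType" attribute.
            public static Object maybeSpecialize(Schema schema, Object datum) {
                String subType = schema.getProp("subType");
                if ("unix-epoch".equals(subType) && datum instanceof Long) {
                    return new Date((Long) datum); // long -> java.util.Date
                }
                // "ISO-8601-date" strings could be parsed to a date type similarly.
                return datum;
            }
        }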

        This approach might also be used to handle decimal values, using a lexicographic-friendly string format.

        http://www.zanopha.com/docs/elen.pdf

        Tom White added a comment -

        Using subtypes for optional extensions sounds like a good approach to me. We might promote them to primitive types in a future major version of Avro.

        I've posted a patch with a trial implementation of a decimal type in AVRO-1402.


          People

          • Assignee: Unassigned
          • Reporter: Jeff Hammerbacher
          • Votes: 7
          • Watchers: 18

            Dates

            • Created:
            • Updated:
