Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.8.0
    • Component/s: spec
    • Labels: None
    • Hadoop Flags: Reviewed
    Attachments

    1. AVRO-739.patch (8 kB, Doug Cutting)
    2. AVRO-739-datetime-spec.xml.patch (6 kB, Dmitry Kovalev)
    3. AVRO-739-datetime-spec.xml.patch (6 kB, Dmitry Kovalev)
    4. AVRO-739-update-spec.diff (3 kB, Ryan Blue)

        Activity

        Hudson added a comment -

        FAILURE: Integrated in AvroJava #476 (See https://builds.apache.org/job/AvroJava/476/)
        AVRO-739. Add date, time, timestamp, and duration binary types to specification. Contributed by Dmitry Kovalev and Ryan Blue. (tomwhite: rev 1625574)

        • /avro/trunk/CHANGES.txt
        • /avro/trunk/doc/src/content/xdocs/spec.xml
        Tom White added a comment -

        I just committed this. Thanks Dmitry and Ryan!

        ASF subversion and git services added a comment -

        Commit 1625574 from tomwhite@apache.org in branch 'avro/trunk'
        [ https://svn.apache.org/r1625574 ]

        AVRO-739. Add date, time, timestamp, and duration binary types to specification. Contributed by Dmitry Kovalev and Ryan Blue.

        Tom White added a comment -

        +1. I'd like to commit this soon.

        Tom White added a comment -

        I think we can go with the latest patch, which has binary date, time, timestamp, and duration (little-endian). Other types (timezone) or encodings (string) can be added separately.

        Matthew Willson added a comment -

On second thoughts, for time-of-day analyses I suppose I could just use the time-millis standard to serialize the local time of day alongside the UTC timestamp. I don't think you can always recover the timezone offset and the exact local timestamp from these two things though, since you don't know the local calendar date, and timezone offsets can range from UTC-12 to UTC+14 (a span of greater than 24 hours), so there could be multiple possibilities for this.
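
(An illustrative case of the ambiguity: a UTC timestamp whose time component is 02:00Z, paired with a local time-of-day of 14:00, could mean an offset of UTC-12 with the local date being the previous day, or UTC+12 with the same day - both are real offsets, so the pair cannot always be resolved.)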

        Matthew Willson added a comment -

        Hi all

Definitely concur that there should at least be a standard option for serializing timestamps in a compact form (e.g. epoch millis). The reason I'm using Avro is that it's an efficient binary format which can cut down on the IO bottleneck of running big analysis jobs.

        Having some standard for storing local timezone offset information alongside a timestamp would be useful for "local-time-of-day"-based analyses, e.g. of web traffic.

        Since the majority of analyses will not be local-time-of-day based though, I'd prefer to store all timestamps in UTC epoch millis, and store the local timezone offset in a separate field which can be used to correct it where required for time-of-day analysis.

I'd suggest storing a timezone offset in minutes, since it should then fit into 2 bytes. This is the choice made in the JavaScript date.getTimezoneOffset() API, for example, and it appears to be a safe assumption that all timezone boundaries in use are aligned to minute boundaries (in fact 15-minute boundaries as it stands). But I'm not too picky if someone has another sensible suggestion.

Note this would mean you lose information about a logical timezone name, e.g. BST for British Summer Time, or "Europe/London" for whatever timezone is in force in London at this point in local time. For most purposes this is a good thing I think, since the definitions of these things can shift over time, whereas a UTC offset is pretty unambiguous.
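
A sketch of how this pairing might look as an Avro record schema (record and field names are illustrative; UTC-12:00 to UTC+14:00 spans -720 to +840 minutes, so the offset fits comfortably in an int, or 2 bytes at the application level):

{ "type": "record", "name": "Event", "fields": [
    { "name": "eventTime", "type": { "type": "long", "logicalType": "timestamp-millis" } },
    { "name": "tzOffsetMinutes", "type": "int" }
] }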

        Doug Cutting added a comment -

        So the tradeoff is between having to byte-swap when moving between Avro & Parquet, versus never being able to make reasonable use of ordering in Avro. Meh. No clear winner.

        Ryan Blue added a comment -

        I asked around in the Parquet community if it was still possible to change the interval spec from little endian to big endian. Unfortunately, some downstream projects are already using the interval encoding with little endian, so there's a strong reason not to change the spec even though it isn't released. I'd still like for the Avro and Parquet specs to match, so I'd like to keep the Avro spec as it is in the proposed diff, using little endian. Does this sound reasonable?

        Ryan Blue added a comment -

        I looked into the Parquet side and it looks like the decision to use little-endian is based on not having to reorder the bytes to work with the integers. Big-endian would be better for encoding, but only a little because there are 3 numbers being stored and prefix encoding will stop working unless the values are identical. This hasn't been released yet, so there's still a chance to get it changed.

        Doug Cutting added a comment -

I'd expect many applications of durations to have similar forms, e.g., all with zero months and milliseconds, just days. In such applications sorting might sometimes be useful. With big-endian it could be; with little-endian it couldn't.

        Was there a particular reason that Parquet chose little-endian for this? Is it too late to change Parquet? Has this been released?

        I don't view this as a fatal misfeature, but it is a misfeature, one we may be stuck with, or one we may still be able to avoid.

        Ryan Blue added a comment -

        Doug Cutting, is there a specific use case you have in mind that will perform poorly when durations are encoded in little-endian and sorted byte-wise?

I think it should be okay to use little-endian because there isn't a well-defined sort order for durations. Each value is independent and there's no requirement for conversion. (1, 0, 0) and (0, 30, 0) are incomparable because sometimes 1 month is longer than 30 days and sometimes shorter, depending on the start time the interval is applied to. Big-endian would produce results that are generally grouped by similarity and size, but I think it's more important to match the format used elsewhere (if it's reasonable) and Parquet uses little-endian.

        Russell Jurney added a comment -

        Ints were used in Pig's datetime and it resulted in a bad situation where you can't read a timestamp from the raw data. ISO8601 strings are much better - any program and any person can read them.

        Doug Cutting added a comment -

Using little-endian in durations will cause them to sort poorly, since Avro defines sorting as byte-wise for fixed.
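
(An illustrative byte-level case: two durations with zero months and zero milliseconds, one with days = 1 and one with days = 256, encode their days field little-endian as 01 00 00 00 and 00 01 00 00 respectively, so a byte-wise comparison sorts 256 days before 1 day.)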

        Ryan Blue added a comment -

        I've edited the spec additions from Dmitry, removing the string representations. The new patch is AVRO-739-update-spec.diff.

        Ryan Blue added a comment -

Good point about not being able to use conversion methods in situations like debugging. But I think I'd rather not have those limitations dictate the possible representations, since we'd end up with more formats to support, some of them wasteful. You also mention using specific objects with ZonedDateTime fields – that addresses this problem by deserializing to a form that has a meaningful toString representation, right? Maybe we should encourage that approach.

        Having said that, I absolutely don't insist on including these into spec - just attempted to explain the reasons I am using them currently and have initially suggested them.

        Same here, I don't want to insist on anything. I just want to find a good solution.

        Comments about adding "local" date-time and "High-precision" time in addition to timestamp-millis are welcome.

        For high-precision, what granularity do you think needs to be supported? Nanos? Micros? We didn't have a clear answer on the Parquet side, which is why we pushed high-precision from the original spec – better to get some of the types in and expand later. Maybe we should open a follow-up issue to discuss these?

        Dmitry Kovalev added a comment -

        Attaching a revised patch which fixes timestamp sorting and duration endianness issues.

        With regard to keeping string representations/zoned types - if I'm not missing anyone, so far we have basically 1 vote for keeping them and 2 votes against.
        If nobody else votes, all that needs to be done is to remove the bits about string representations from this patch.

        Comments about adding "local" date-time and "High-precision" time in addition to timestamp-millis are welcome.

        Dmitry Kovalev added a comment -

        I think the right way to handle this is to use the zone-independent date/time types and an application-level zone implementation. These cases aren't very common, as you noted, and I think having a timestamp with zone logical type allows people to get around best practices and doesn't deliver a better solution for people that actually need to represent the zone. It may be slightly easier to represent the type in a single field, but size is significantly larger and the value only has significance when interpreted at the application layer anyway.

In environments providing "rich" support for date-time related types (such as Joda Time / Noda Time), this actually translates directly into the likes of ZonedDateTime, and can be handled at the Avro level, e.g. using specific records the generated objects can expose ZonedDateTime properties instead of strings. This is what I do, so it does deliver a better solution for me.

        Happy to drop it from the spec anyway.

        Dmitry Kovalev added a comment -

        But what I'm trying to get at is whether your IPC use case for the string representations could be solved another way.

        In short - of course it could, just in a more laborious way.

        Maybe I'm wrong about this, but it seems like using strings would probably be most helpful in debugging the application. And if that's the case, we can provide a few simple tools for working with these types rather than changing the representation to avoid the conversion. What about adding a set of helpers...

Having these in the Avro distribution would certainly encourage more people to go with binary representations if they are going to be the only standard, although this level of support is of course not the same - e.g. when you use toString() to dump the object as JSON, or introspect an object in a debugger, you will still see just a byte sequence. Other binary-encoded types are mostly first-class primitives which get translated to strings by standard tools, so this is not an issue.

However, I used debugging as just one illustration of why I thought it would be worth having standardised string representations where compactness and performance are not absolutely critical.
Another reason we have also touched on above is that there is a real lack of common binary representations (and platform support) for anything beyond simple timestamps and dates, and this is what made people misuse e.g. Date to confuse UTC/local/zoned time, fixed duration vs duration in months/days, etc.
Even in this spec we don't have a separate type/binary representation of "local" date+time - only separate types for each component - so undoubtedly some people will decide to use timestamp-millis, despite the spec explicitly saying that it represents a UTC date-time. And the representation specified for Duration may be the most efficient, but it is not something that can be called commonly used or easy to interpret. If you remember the issue of higher-precision time we have omitted from the spec - is it going to have a separate binary representation as well?
ISO-8601 provides a basis to represent all of these "naturally", in a way instantly understandable by a human, and makes it easy to standardise different types of date-time information, promote their correct usage, and also provide a "bridge" to binary representations.
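
(Illustrative ISO-8601 forms for the types discussed here: 2014-09-17 for a date, 14:30:05.250 for a time of day, 2014-09-17T14:30:05.250Z for a UTC timestamp, and P1M2DT0.250S for a months/days/millis duration.)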

        Having said that, I absolutely don't insist on including these into spec - just attempted to explain the reasons I am using them currently and have initially suggested them.

        Ryan Blue added a comment -

        The problem here is that it is going to be difficult to agree on what this support should constitute on each platform

I agree with you here, which is one reason why I like the logical types. It's valuable to have standard representations that aren't necessarily tied to the object model. But what I'm trying to get at is whether your IPC use case for the string representations could be solved another way. I know it's easier to debug with ISO-8601 strings rather than ints, but it seems like this probably doesn't apply to bytes on the wire, because the wire format is (usually) binary for everything else.

        Maybe I'm wrong about this, but it seems like using strings would probably be most helpful in debugging the application. And if that's the case, we can provide a few simple tools for working with these types rather than changing the representation to avoid the conversion. What about adding a set of helpers that works like this:

String date = Iso8601.dateAsString(record.get("date"));
String time = Iso8601.timeAsString(record.get("time"));
String timestamp = Iso8601.timestampAsString(record.get("timestamp"));

        Would that help you debug these applications without needing a different representation? Or do you need the wire protocol to include the date/time string for debugging? If so, why is it necessary to treat date/time differently than other types that are binary-encoded?
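
A minimal sketch of what such helpers could look like, assuming Java 8's java.time and the epoch-based encodings from the draft spec, and taking the underlying primitive values (the Iso8601 class itself is hypothetical, not an existing Avro API):

import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class Iso8601 {
  // timestamp-millis: always emit milliseconds, per the sort-order point raised below
  private static final DateTimeFormatter TIMESTAMP =
      DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSX").withZone(ZoneOffset.UTC);

  // date: an int holding days from the unix epoch, e.g. "2014-09-17"
  public static String dateAsString(int days) {
    return LocalDate.ofEpochDay(days).toString();
  }

  // time-millis: an int holding milliseconds after midnight, e.g. "14:30:05.250"
  public static String timeAsString(int millis) {
    return LocalTime.ofNanoOfDay(millis * 1_000_000L)
        .format(DateTimeFormatter.ofPattern("HH:mm:ss.SSS"));
  }

  // timestamp-millis: a long holding milliseconds from the unix epoch, UTC
  public static String timestampAsString(long millis) {
    return TIMESTAMP.format(Instant.ofEpochMilli(millis));
  }
}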

        Ryan Blue added a comment -

        On the endianness of numbers in the "interval" type:

        This array stores three little-endian unsigned integers that represent durations at different granularities of time. The first stores a number in months, the second stores a number in days, and the third stores a number in milliseconds. . . .

        I don't think we need to specify a JSON representation because this should use whatever the JSON representation of a fixed(12) is. The logical type just states how those bytes should be interpreted.
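
For concreteness, a sketch of producing those 12 bytes with Java's ByteBuffer (the method name is illustrative, not from the patch):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Pack the three components into the fixed(12) layout: three
// little-endian unsigned 32-bit integers, in months/days/millis order.
static byte[] encodeInterval(int months, int days, int millis) {
  return ByteBuffer.allocate(12)
      .order(ByteOrder.LITTLE_ENDIAN)
      .putInt(months)
      .putInt(days)
      .putInt(millis)
      .array();
}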

        Consider for example a financial product whose definition says that it "ceases to trade on YYYY-MM-DD hh:mm Moscow time". . .

        I think the right way to handle this is to use the zone-independent date/time types and an application-level zone implementation. These cases aren't very common, as you noted, and I think having a timestamp with zone logical type allows people to get around best practices and doesn't deliver a better solution for people that actually need to represent the zone. It may be slightly easier to represent the type in a single field, but size is significantly larger and the value only has significance when interpreted at the application layer anyway.

        Dmitry Kovalev added a comment -

        is this something that we could realistically accomplish with better support for logical types in the data models? If, for example, we added the conversion to a Calendar or a Date to deserialization rather than returning integers or longs, would that meet your use case?

The problem here is that it is going to be difficult to agree on what this support should constitute on each platform. For example, I don't care about either Date or Calendar (or .NET DateTime for that matter) - none of these can unambiguously represent all the different kinds of date/time data that I need to communicate. Joda Time / Noda Time provide all the types I need, but people would probably not be happy to add these libraries as firm dependencies in the Avro codebase. Newer versions of Java and .NET are going to try and address this problem by embracing core concepts from those libraries, but currently on these platforms there are just no standard types which we could safely translate to. So at the moment the best that could be done is to provide on each platform a framework for plugging in logical type "implementations", including the codegen for specific, and a means for the users to specify which one to use. Then on top of this core framework there could be "contributed" implementations for the most popular types.

        Dmitry Kovalev added a comment -

        The extra string encodings seem like an unreasonably high price to pay in development and test effort for human-readable output (I count 4 extra type/logicalType pairs that need to be implemented by every library/application implementing the spec). I think we'll end up with very few libraries/applications implementing all of the types and encodings, which is bad for interop whether you're using Avro for storage or IPC.

Well, from an IPC perspective the typical price to pay is 2-3 lines of code to convert to and from the chosen internal representation (see e.g. my initial comment on this topic), so I wouldn't fear that nobody will want to implement this, and having a service which accepts both does make clients' lives easier, especially for scripted clients etc. However, I have already agreed that in other scenarios it may be best to insist on just one representation.

        I think the way forward is to invite everyone who cares to "vote" for or against including string representations in the spec. Then if there are more votes "against" - please feel free to take my patch, remove the string bits and resubmit.

My only concern would be the datetime-timezone type, which had only a string representation, but I have a feeling that there won't be much interest in including it either way, because it is relatively rare compared to timestamps etc.

        Dmitry Kovalev added a comment -

        My vote is to require milliseconds in the string representations for both time-millis and timestamp-millis to solve the problem.

        Sure

        The interval type needs to specify the endianness of its components. Parquet uses little-endian, so I'd say we should specify that here also.

Could you suggest the wording, bearing in mind that Avro specifies binary and JSON encodings for each type?

        I'd rather not include representations that have a time zone because the logic is always tricky and changes. I think best practice is to convert to UTC and I'd like for people to do that rather than using an expensive representation to get around best practice.

I assume you are referring to the datetime-timezone type? This is not another representation of timestamp-millis (which already allows UTC only) - this is something different. Consider for example a financial product whose definition says that it "ceases to trade on YYYY-MM-DD hh:mm Moscow time". If you convert it to UTC and store it in a timestamp-millis, and on the next day the Russian authorities change the offset or daylight saving rules (which they did a few times in the last decade) - you will end up with the wrong expiration time (and potentially date). I think some zones like Israel adjust the rules every year. If you store the timezone id with the "local" date-time in that zone - you can use the open database maintained by IANA to adjust. So the point of this type is exactly that it is required when you cannot just convert to UTC and have to embrace the trickiness and mutability of timezones. Both components (date-time and timezone id) are pretty standard.

        Ryan Blue added a comment -

        I think Skye makes some good points. Dmitry, is this something that we could realistically accomplish with better support for logical types in the data models? If, for example, we added the conversion to a Calendar or a Date to deserialization rather than returning integers or longs, would that meet your use case? That would certainly make it an easier-to-implement spec and avoid performance problems.

        Skye Wanderman-Milne added a comment -

        I realize I'm chiming in a little late, but I would strongly prefer a single binary encoding for each type (rather than both a human-readable and binary encoding). The extra string encodings seem like an unreasonably high price to pay in development and test effort for human-readable output (I count 4 extra type/logicalType pairs that need to be implemented by every library/application implementing the spec). I think we'll end up with very few libraries/applications implementing all of the types and encodings, which is bad for interop whether you're using Avro for storage or IPC. And once it's in the spec, we won't be able to scale back to a smaller, easier-to-maintain option.

        From a more ideological perspective, I don't think it's always a good idea to offer more choices. Instead of having users choose between performance and human-readability (and realistically between which applications will be able to read their data), maybe it would make more sense to only use the binary encoding and provide a tool for dumping the data in a human-readable format.

(As to why I would prefer only the binary encoding and not only the string encoding: not only is the binary encoding more performant, but I also think we'll see fewer bugs/incompatibilities around what constitutes a valid date/time string.)

        Ryan Blue added a comment -

        Dmitry, thanks for doing this. It looks really good to me with just a couple of minor things:

        • If the timezone Z is required for timestamp-millis and the milliseconds are optional, then the naive sort order no longer works. My vote is to require milliseconds in the string representations for both time-millis and timestamp-millis to solve the problem.
          >> '10.123Z' < '10Z'
          => true
          
        • The interval type needs to specify the endianness of its components. Parquet uses little-endian, so I'd say we should specify that here also.
        • I'd rather not include representations that have a time zone because the logic is always tricky and changes. I think best practice is to convert to UTC and I'd like for people to do that rather than using an expensive representation to get around best practice.
        Dmitry Kovalev added a comment -

        Attaching a first draft - please review. My comments/issues:

        • in its current form, the spec only provides for a precision of up to a millisecond - this may save space and be the most universally used precision, but modern platforms and the ISO standard provide for better precision
        • we could support high-precision time as a separate type on the grounds that it is less frequently used, or we could introduce an optional "precision" annotation which would, say, define the number of decimal places in second fractions
        • also, whether we only support millis or also a higher precision, in either case your reasoning about simple names implying "canonical" use would arguably suggest something like "timestamp" instead of "timestamp-millis" and "time" instead of "time-millis"? Was there a specific reason for adding "millis" in Parquet, and is it important from an interop point of view if Avro adopts a different name (as long as the actual definition is the same)?
        • I didn't provide a binary representation for Timestamp-timezone as I'm not entirely sure what it would look like and whether it will be popular at all, compared to the string representation
        • finally, re the Parquet Interval type - I used to think (and ISO, Noda Time etc. seem to agree) that an "interval" means an interval on a global timeline, i.e. something with a start and end at specific instants in time, whereas what the current wording defines is actually better called a Duration. So the question is again - was there a specific reason to call it Interval in Parquet, and does naming it Duration in Avro impact Hadoop interop?
        Ryan Blue added a comment -

        Yes, that sounds like a good solution to me. For the spec part, we can use the same logical types for both the string / ISO representations and the numeric encoding. We would just state in the spec that the string encoding is ISO-8601 and the int encoding is days from unix epoch and that no other underlying types are allowed. That way we don't have more logical types, just different ways of representing them. That's what we ended up doing for decimal, which has an unscaled component that can be stored in an int, a long, binary, or fixed.

        Tom White added a comment -

        So what would you say to the following: adopt Parquet names/specs for binary representations, and add ISO-8601 string ones on top?

        That sounds reasonable to me.

        If you agree that this makes sense from both storage and IPC perspective then I could draft it and post here as a documentation patch.

        Please do.

        Dmitry Kovalev added a comment -

Epochs and short names - I wasn't suggesting supporting different ones; I just mentioned this to illustrate why I thought longer, more descriptive names were better. I still think they are, but having read about Parquet etc. I understand where you are coming from.

I believe the different views come from the usage context - if you view Avro as a storage format for Hadoop only, then I would agree that it makes sense to choose a single, most compact representation as standard, and this would improve interop in the sense that every part of that ecosystem will only have to support this single format.

However, I am using Avro as an IPC protocol in an application which exchanges complex data between services (it saves me a lot of manual work as it supports maps and unions and codegen on all platforms). From this perspective, I believe that having standard type names for ISO-8601-based representations of the above, in addition to binary ones, would actually improve interop. This is because people will want to use ISO-8601 in their protocols anyway (it is human-readable in JSON dumps, familiar to people with an XML background, etc.). So I think it is better to provide standard type names for these, rather than forcing people to either use the binary representation or custom names.

        So what would you say to the following: adopt Parquet names/specs for binary representations, and add ISO-8601 string ones on top?

        If you agree that this makes sense from both storage and IPC perspective then I could draft it and post here as a documentation patch.
        If you see ISO alternative representations as redundant or even evil then I guess it means I cannot contribute anything else to this topic and will leave it to you guys to sort out.

        Ryan Blue added a comment -

        I agree we should standardize on a single epoch. I've been working lately on high-level types across a variety of storage formats and I think we need to keep the specifications as small as possible to ensure people can actually implement them. A spec doesn't help much if it ends up being partially implemented and we have to worry about what parts of it different components implemented.

        I'm also in favor of simple names – "date", "time" and so on. These names imply that they are the canonical way to store the type, which is exactly what we want for interoperability.

For specifics on what each type means, here is what we added to Parquet:

        • date - an int, the number of days from the unix epoch, 1 January 1970 (no time component)
        • time-millis - an int, the number of milliseconds after midnight, 00:00:00.000 (no date component)
        • timestamp-millis - a long, the number of milliseconds from the unix epoch, 1 January 1970 00:00:00.000 UTC (combined date and time)
        • interval - 12-byte fixed, a 3-tuple of independent durations in months, days, milliseconds

        There are more specifics on the spec PR. I would really like to see the Avro and Parquet communities adopt the same logical type encodings. That would be much easier for applications to implement, which means fewer bugs and better compatibility.
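
For illustration (not from the attached patches), the same definitions written as Avro schema annotations, using the logicalType attribute discussed elsewhere in this thread (the fixed name is arbitrary):

 { "type": "int", "logicalType": "date" }
 { "type": "int", "logicalType": "time-millis" }
 { "type": "long", "logicalType": "timestamp-millis" }
 { "type": { "type": "fixed", "name": "interval", "size": 12 }, "logicalType": "interval" }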

        Tom White added a comment -

        My guess is that we should standardize on a single epoch and set of types compatible with SQL and Parquet, and use the simplest names that achieve that.

        +1

        Doug Cutting added a comment -

        I don't have a strong opinion on the single versus dual attribute approach. Are there features we want to support that will be substantially helped or hindered by one or the other? From a simplicity point of view, a single attribute is attractive.

        Whether we opt for simple names ("date", "time") or those that also include the encoding ("unix-epoch-millis") depends on what we wish to distinguish them from. If we think we'll primarily support only a single timestamp representation then a simple name like "timestamp" suffices. If we think we may need to support multiple kinds of epochs, then putting the epoch type in the name is probably wise. My guess is that we should standardize on a single epoch and set of types compatible with SQL and Parquet, and use the simplest names that achieve that.

        Dmitry Kovalev added a comment -

        Hi Tom,

So basically you suggest "mixing" type and encoding and having a logicalType for each combination, correct? From a pure design perspective, I still think my approach is better because it clearly separates the "what it is" from the "how it is represented", and allows for more powerful mapping logic in, for example, a codegen scenario.

Other than that, I agree that in general the single-attribute approach is more or less equivalent, so if it gets more "votes" here than my original proposal, then I am happy to adopt it. Even then, I would definitely change some of the type names you have suggested.

        For example:

         { "type": "int", "logicalType": "date" } // a bit vague - what exactly is a date here? 

- this one is vague as it doesn't tell you much about how the date is represented in the int - precisely the thing we want to address with the attributes. I would go for something like

         { "type": "int", "logicalType": "days-unix-epoch" } 

        Likewise,

         { "type": "long", "logicalType": "timestamp-millis" } //what is the epoch from which the millis are counted? 

doesn't tell you what epoch the millis are counted from. I know that in an open-source context one can assume it is the Unix epoch, but we also have some databases which use 1 January 1900, Excel/OLE automation uses 30 December 1899, etc. - so more generally I think it would improve clarity a lot if it was

         { "type": "long", "logicalType": "millis-unix-epoch" } 

as I suggested, or something similar.

How problematic would it be to rename stuff in your project? Or could you support both sets of names - one for backward compatibility, one to conform to whatever is agreed as the Avro standard?

        Doug, what is your opinion, both on single attribute vs two attributes and on the naming stuff?

        Tom White added a comment -

        Thanks Dmitry. The Parquet project recently added date and time types (PARQUET-12 and https://github.com/apache/incubator-parquet-format/pull/3/files) and I think it would be very useful to align the two where possible, since this would make Hive integration easier (for example).

        For DateTimeInstant I would propose the two types (one string, one int):

        { "type": "string", "logicalType": "ISO-8601-datetime-offset" }
        { "type": "long", "logicalType": "timestamp-millis" }
        

For LocalDate we could have two logical types, one for Avro string (ISO-8601 encoding), and one for int (days since Unix epoch):

        { "type": "string", "logicalType": "ISO-8601-date" }
        { "type": "int", "logicalType": "date" }
        

        We might also consider time and interval logical types.
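As a quick illustration of the int variant (a sketch with hypothetical helper names, using Joda-Time as elsewhere in this thread), the "date" logical type is just a day count from the Unix epoch:

import org.joda.time.Days;
import org.joda.time.LocalDate;

public class DateDaysDemo {
    private static final LocalDate EPOCH = new LocalDate(1970, 1, 1);

    // Encode a calendar date as days from the Unix epoch (the proposed int encoding).
    static int toEpochDays(LocalDate d) {
        return Days.daysBetween(EPOCH, d).getDays();
    }

    // Decode the int back into a calendar date.
    static LocalDate fromEpochDays(int days) {
        return EPOCH.plusDays(days);
    }

    public static void main(String[] args) {
        int encoded = toEpochDays(new LocalDate(2014, 9, 17));
        System.out.println(encoded + " -> " + fromEpochDays(encoded));
    }
}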

        Dmitry Kovalev added a comment -

Cool, I'll wait for a while in case there are more comments/suggestions, and then prepare a documentation patch for your review.

        Doug Cutting added a comment -

        These sound reasonable to me.

        Dmitry Kovalev added a comment -

        Hi guys,

I am currently using an attribute-based approach which builds upon this discussion plus the one about the decimal type - and extends it to support various types of date/time information, roughly following the JodaTime/NodaTime classification.

I distinguish a few logicalTypes (this defines what the value represents, e.g. just a date, an instant in time, a time in a specific zone, etc.), each of which can have one or more logicalEncodings (this defines which primitive type holds the info, and how it is encoded).

        For example (using IDL syntax, sorry I'm not good at emitting JSON):

        1) DateTimeInstant - represents exact moment in time - millis, UTC or specific offset are all equivalent representations

        @logicalType("DateTimeInstant") @logicalEncoding("millis-unix-epoch") long timestamp; //just a usual long with number of millis since unix epoch

        @logicalType("DateTimeInstant") @logicalEncoding("ISO8601-datetime-offset") string asOfDateTime; //must be ISO full date and time including offset or UTC symbol i.e. YYYY-MM-DDThh:mm:ss.xxx+-hh:mm or YYYY-MM-DDThh:mm:ss.xxxZ

        2) LocalDate - represents a calendar date, time is not important/not defined, so day start differences due to timezones are not directly applicable

        @logicalType("LocalDate") @logicalEncoding("ISO8601-date") string settlementDate; //must be ISO local date without any time, offset or UTC symbol, i.e. YYYY-MM-DD

        3) ZonedDateTime - represents not just a moment in time, but preserves the information about which time zone it was defined originally - which may be important

        @logicalType("ZonedDateTime") @logicalEncoding("ISO8601-datetime-timezone") string optionExpiration; //must be ISO local date and time without any offset or UTC symbol, followed by a space and either "UTC" or an IANA tzdb id such as "Europe/London", "Europe/Moscow", i.e. YYYY-MM-DDThh:mm:ss.xxx zzzzz/zzzzz

        All of the above logicalEncodings are sufficiently well-defined to be interpreted easily and unambiguously on any platform, for example I use the following:

        ISO8601-datetime-offset:
        in JodaTime: ISODateTimeFormat.dateTimeParser().parseDateTime(), ISODateTimeFormat.dateTime().print()
        in NodaTime: InstantPattern.ExtendedIsoPattern.Parse((string)inputValue).GetValueOrThrow(), InstantPattern.ExtendedIsoPattern.Format(instant)

        ISO8601-date:
        in JodaTime: ISODateTimeFormat.localDateParser().parseDateTime(), ISODateTimeFormat.date().print()
        in NodaTime: LocalDatePattern.IsoPattern.Parse((string)inputValue).GetValueOrThrow(), LocalDatePattern.IsoPattern.Format(localDate)

ISO8601-datetime-timezone:
in JodaTime, format: sb.append(zdt.toLocalDateTime().toString()).append(" ").append(zdt.getZone().getID())
in JodaTime, parse:
String[] parts = cs.toString().split(" ");
LocalDateTime localDt = ISODateTimeFormat.dateTimeParser().parseDateTime(parts[0].trim()).toLocalDateTime();
DateTimeZone tz = DateTimeZone.forID(parts[1].trim());
in NodaTime, format: ZonedDateTimePattern.ExtendedFormatOnlyIsoPattern.Format(zonedDateTime)
in NodaTime, parse:
var parts = (inputValue as string).Split(' ');
var localDateTime = LocalDateTimePattern.ExtendedIsoPattern.Parse(parts[0].Trim()).GetValueOrThrow();
DateTimeZone tz = NodaTime.DateTimeZoneProviders.Tzdb[parts[1].Trim()];
return localDateTime.InZoneStrictly(tz);
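(Stitched together, the Joda-Time fragments above give the following self-contained round trip. This is only a sketch of the recipe - the class is mine, and the final localDt.toDateTime(tz) step is an assumed Joda analogue of NodaTime's InZoneStrictly; it is not committed Avro code.)

import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.LocalDateTime;
import org.joda.time.format.ISODateTimeFormat;

public class ZonedEncodingDemo {
    // Encode: ISO local date-time, a space, then the IANA zone id.
    static String encode(DateTime zdt) {
        return zdt.toLocalDateTime().toString() + " " + zdt.getZone().getID();
    }

    // Decode: split on the space and reassemble the zoned value.
    static DateTime decode(String s) {
        String[] parts = s.split(" ");
        LocalDateTime localDt = ISODateTimeFormat.dateTimeParser()
            .parseDateTime(parts[0].trim()).toLocalDateTime();
        DateTimeZone tz = DateTimeZone.forID(parts[1].trim());
        return localDt.toDateTime(tz);
    }

    public static void main(String[] args) {
        DateTime original = new DateTime(2014, 9, 17, 12, 30, DateTimeZone.forID("Europe/London"));
        String wire = encode(original);
        System.out.println(wire + " -> " + decode(wire));
    }
}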

        This is backwards compatible as older code can ignore the attributes and assume the encoding, while newer code can make use of this metainformation - for example it can be used to detect the end type in generic code, or during codegen to generate properly typed properties.

If you like the general idea, then I would suggest just documenting these (happy to rename/amend/extend them) to provide a common basis, without any code for the moment - in the spirit of the decimal type.

        Any thoughts welcome.

        Thanks,
        Dmitry

        Tom White added a comment -

        Using subtypes for optional extensions sounds like a good approach to me. We might promote them to primitive types in a future major version of Avro.

        I've posted a patch with a trial implementation of a decimal type in AVRO-1402.

        Doug Cutting added a comment -

        Here's another approach. Instead of defining some new record types (which would bloat schemas), or some new primitives (which would be incompatible), might we instead standardize on some attributes?

        Thus we might use something like:

        {"type":"string", "subType":"ISO-8601-date"}

        This could be added to the specification, as an optional extension. If it's specified, then the string must be in ISO 8601 format.

        We could also have a type like:

        {"type":"long", "subType":"unix-epoch"}

        Note that with both of these formats, sorting by the primitive Avro type is consistent with sorting by time.

        Implementations can insert language-specific types for these at runtime.

        This approach might also be used to handle decimal values, using a lexicographic-friendly string format.

        http://www.zanopha.com/docs/elen.pdf
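The sort-order property is easy to demonstrate (the sample values below are mine): lexicographic order on ISO-8601 strings and numeric order on epoch millis both coincide with chronological order.

import java.util.Arrays;

public class SortOrderDemo {
    public static void main(String[] args) {
        // ISO-8601 date strings: lexicographic order == chronological order.
        String[] dates = {"2011-02-01", "2010-12-31", "2011-01-15"};
        Arrays.sort(dates);
        System.out.println(Arrays.toString(dates)); // [2010-12-31, 2011-01-15, 2011-02-01]

        // The same three dates as epoch millis: numeric order == chronological order.
        long[] times = {1296518400000L, 1293753600000L, 1295049600000L};
        Arrays.sort(times);
        System.out.println(Arrays.toString(times));
    }
}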

        Doug Cutting added a comment -

        > One could log another field with the timezone identifier for these.

        From my understanding of SQL, for TIMETZ and TIMESTAMPTZ columns, a separate timezone is not stored per row. Rather, the TZ in the schema only affects how dates are parsed and displayed. Am I wrong? If I am correct, then the timezone should not be a field but a schema attribute that's used by implementations when parsing and displaying values. In all cases I believe we should only store a single UTC timestamp per value. Adding a distinct primitive type for each parsing/display variant seems like a poor design choice.

        > A long in binary form makes sense, but in JSON, an ISO8601 string might be more useful.

        Special-casing this would rule out back-compatibility, no?

        Scott Carey added a comment -

        These seem like two different external representations of the same thing. A time plus a timezone can be losslessly converted to a UTC time. You do lose the original timezone, but dates and times are usually displayed in the timezone of the displayer, not where the time was originally noted.

I completely agree for use cases where the time is being displayed to a user, but there are use cases where the loss of the original time zone is not acceptable. One could log another field with the timezone identifier for these. The use case for a UTC timestamp is more broadly applicable. I do not think we need to implement the one that also persists the timezone now, but I do think we need to make sure that if we did implement such a thing in the future, the names for these two things would be consistent. If we name this "Datetime" we are implying it has a relation to dates, which implies a relationship to timezones.

        With respect to the SQL variants, I see only two that represent a single point in time. Three are either dates or times and not the combination (e.g. "January 7, 2100", representing a time with granularity of one day, or "5:01" – a time of day, respectively).

        The two SQL equivalents are TIMESTAMP and TIMESTAMP WITH TIMEZONE. This proposal covers TIMESTAMP, roughly. I am suggesting we reserve space for a future TIMESTAMP WITH TIMEZONE. We could adopt the names for consistency.

        "timestamp"
        and
        "timestamptz"

        There is also the question of serialization in JSON form. A long in binary form makes sense, but in JSON, an ISO8601 string might be more useful.

        Doug Cutting added a comment -

        > There are two types of interest [ ... ]

        These seem like two different external representations of the same thing. A time plus a timezone can be losslessly converted to a UTC time. You do lose the original timezone, but dates and times are usually displayed in the timezone of the displayer, not where the time was originally noted.

Also note that SQL has five different variants (http://en.wikipedia.org/wiki/SQL#Date_and_time) which are interconvertible. I suggest all of these should be converted to a single type in Avro. Perhaps one could annotate the schema with resolution and/or timezone to improve fidelity, e.g.,

        {"type":"Datetime", "resolution":"date", "timezone":"PST"}

        , but a long would always be written and used for comparison with other Datetime instances.

        > How would we migrate from this klunky form to "type":"instant"?

        We could simply treat instances of the klunky schema identically to "type":"instant".

        Scott Carey added a comment -

There are two types of interest: one that is a UTC coordinate - a long like this one, without any timezone or other 'date' information - and one that is a date-time and therefore must contain timezone information. The latter would probably best be some ISO8601 subset. The former is a long (which is only 5 bytes for 'today', if in ms since 1970 UTC).

        We should decide on the names for these two things now. I think that "Datetime" is probably the thing that includes dates, times, and therefore time zones. The pure long universal time coordinate is perhaps "Instant" or "timestamp" – it has nothing to do with dates except that in Java the typical class used to hold such an instant is Date (or a long).

I wish these were fundamental Avro primitive types. To work well with database systems we need these two types. The syntax as a special Record in the schema is klunky, but more backwards compatible. If we assume that some version of Avro in the future requires all language implementations to support new primitive types for these, how would we migrate from this klunky form to "type":"instant"?

        This proposal isn't all that backwards compatible: If Python doesn't know what "org.apache.avro.Datetime" is, it won't be able to decode the type. Perhaps

        {"type":"instant"}

        is better – other than colliding with existing schemas with a custom type of that name. Perhaps

        {"type":"org.apache.avro.instant"}

        Rather than the record with nested field?

        Doug Cutting added a comment -

        Here's a patch that changes Java's specific & reflect to serialize and deserialize java.util.Date using the following schema:

        {"type":"record","name":"org.apache.avro.Datetime","fields":[{"name":"ms","type":"long"}]}"
        

        This is implemented by adding a custom encodings feature to SpecificData that permits a class to be mapped to a record schema. I had to modify reflect's CustomEncoding API. To make this back-compatible, we'll perhaps need to copy that API into specific, so this is not yet ready for commit.

        Do folks like this approach? We proclaim a language-independent schema for datetimes, then implementations can choose to map this into a native type or not.

        I did not extend Generic, since I believe there is value in keeping Generic's representations a closed set of classes. This permits applications to be sure they can process any data read using Generic. I might be convinced to add this to Generic, but that would make it an incompatible change.
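To make the proposed mapping concrete, here is roughly what such a custom encoding does, sketched with the generic API rather than the actual patch (the helper names are mine):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

import java.util.Date;

public class DatetimeRecordDemo {
    private static final Schema DATETIME = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"org.apache.avro.Datetime\","
        + "\"fields\":[{\"name\":\"ms\",\"type\":\"long\"}]}");

    // Map the native type onto the language-independent record schema...
    static GenericRecord toRecord(Date d) {
        GenericRecord r = new GenericData.Record(DATETIME);
        r.put("ms", d.getTime());
        return r;
    }

    // ...and back again on the read side.
    static Date fromRecord(GenericRecord r) {
        return new Date((Long) r.get("ms"));
    }

    public static void main(String[] args) {
        Date now = new Date();
        System.out.println(fromRecord(toRecord(now)).equals(now)); // true
    }
}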

        Doug Cutting added a comment -

        The custom encoding feature added in AVRO-1341 might be a good way to implement this. SpecificData could have a table mapping classes to custom encodings.

        By default this would map java.util.Date to a standard schema that writes it as a long. My instinct is to use a record schema rather than a long schema, however.

        John A. De Goes added a comment -

        Adopting UTC milliseconds as the date/time format is fundamentally wrong and will render the type useless for any serious application. ISO8601 is the standard format for date/time. It preserves the critical notion of timezone and daylight savings time, and of course lets you express time in UTC as well if that's what you want. The binary encoding is only slightly bulkier than UTC milliseconds.

        Kenneth Baltrinic added a comment -

I concur with Colin Fletcher that some consideration of timezones and daylight saving time is needed. At the very minimum the spec would need to require that, in the absence of an explicit timezone, all times are in UTC.

        Russell Jurney added a comment -

        PIG-1314 may be relevant. ISO8601 datetime format seemed convenient.

        Colin Fletcher added a comment -

The serialization of date/times must incorporate the timezone. If it does not, then I will be unable to use it for the large-scale projects I am leading. It doesn't matter to me if the format is custom in binary mode, but in JSON it must be JSON-compliant.

        Jeremy Custenborder added a comment -

Were you thinking of a long with the number of milliseconds since 1970 UTC? If you need more precision than that, you are most likely going to make your own type. I really like the idea of getting something that can map to the native types in most of the languages. This would be a really cool feature.

        Ron Bodkin added a comment -

Sorry, I forgot to paste in Doug Cutting's design:
        The way that I have imagined doing this is to specify a standard schema
        for dates, then implementations can optionally map this to a native date
        type.

        The schema could be a record containing a long, e.g.:

        {"type": "record", "name":"org.apache.avro.lib.Date", "fields" : [

        {"name": "time", "type": "long"}

        ]
        }

        Java could read this into a java.util.Date, Python to a datetime, etc.
        Such conventions could be added to the Avro specification.

        Does this sound like a reasonable approach?

        And also this email thread -

        On 01/18/2011 09:19 AM, Jeremy Custenborder wrote:
        I agree with storing it as a long. How would you handle this in code
        generation and serialization? Would you envision hooks during code
        generation that would generate a member that is the native date time
        for the language?

        Yes. Just as "bytes" is represented in Java by java.nio.ByteBuffer,
        "org.apache.avro.lib.Date" could be represented by java.util.Date.

        Does the serializer handle a date object that is
        native to the language?

        Yes, serializers and deserializers would need to implement this mapping.

        Does this sound like a reasonable approach?

        I really like the idea of having a standard
        datetime as a supported type of avro. It's a problem that everyone has
        to solve on their own.

        Ron Bodkin added a comment -

        From the discussion on the users list, I agree that it'd be great to start with a simple timestamp, which gets serialized as a long. Let's start with a simple feature, and future enhancements can be tracked separately.

        Doug proposed this design:

        I noted that it would be nice to allow some flexibility in the implementation
        classes for dates, e.g., letting Java users use Joda time classes as well
        as java.util.Date

        Scott said:
        Absolutely. This is a per-language feature though, so it may not require
        much of the spec. For example, in Java it could simply be a configuration
        parameter passed to the DatumReader/Writers. It doesn't make a lot of
        sense to store metadata on the data that says "this is a Joda object, not
        java.util.Date" – that is a user choice and not intrinsic to describing
        the data.

        My input:
        I agree this shouldn't be part of the serialized format. It would be nice to
        have a clean way to specify the configuration/mappings used that allows
        for specifying the mappings for more such org.apache.avro data types. It
should also be supported for reflection and code generation approaches.

        Scott also said:
        There are other questions too – what are the timestamp units
        (milliseconds? configurable?), what is the origin (1970? 2010?
        configurable?) – these decisions affect the serialization size.

        My input:
        I would like to see a format that allows storing data at the precision of popular libraries and languages (java.util.Date, Joda time, Python datetime, etc.). Having a long representing microseconds since Jan. 1 1970 seems like a good compromise for general purpose use. It supports higher precision libraries and still allows representing a few hundred thousand years of data. Some libraries do allow nanosecond resolution - but limiting to 270 years seems like a bigger limitation than microsecond precision.
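As a back-of-the-envelope check of those ranges (my arithmetic, assuming a signed 64-bit long and ignoring leap years):

public class RangeCheck {
    public static void main(String[] args) {
        long secondsPerYear = 365L * 24 * 60 * 60;
        // Microseconds since the epoch in a signed long: about +/-292,000 years.
        System.out.println(Long.MAX_VALUE / (secondsPerYear * 1000000L)
            + " years at microsecond precision");
        // Nanoseconds: about +/-292 years, the same ballpark as the figure cited above.
        System.out.println(Long.MAX_VALUE / (secondsPerYear * 1000000000L)
            + " years at nanosecond precision");
    }
}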


People

    • Assignee: Dmitry Kovalev
    • Reporter: Jeff Hammerbacher
    • Votes: 8
    • Watchers: 28
