Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.11
    • Component/s: data
    • Labels:
    • Tags:
      pig date datetime time support primitive type

      Description

      Hadoop/Pig are primarily used to parse log data, and most logs have a timestamp component. Therefore Pig should support dates as a primitive.

      Can someone familiar with adding types to pig comment on how hard this is? We're looking at doing this, rather than use UDFs. Is this a patch that would be accepted?

      This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

      1. PIG-1314-7.patch
        296 kB
        Zhijie Shen
      2. PIG-1314-6.patch
        288 kB
        Zhijie Shen
      3. PIG-1314-5.patch
        286 kB
        Zhijie Shen
      4. PIG-1314-4.patch
        226 kB
        Zhijie Shen
      5. PIG-1314-3.patch
        150 kB
        Zhijie Shen
      6. PIG-1314-2.patch
        148 kB
        Zhijie Shen
      7. PIG-1314-1.patch
        115 kB
        Zhijie Shen
      8. joda_vs_builtin.zip
        4 kB
        Zhijie Shen

        Issue Links

          Activity

          Alan Gates added a comment -

           +1. Adding DateTime as a Pig primitive is definitely a good idea. It's on our list of things to do (http://wiki.apache.org/pig/PigJournal). A brief overview of the work to be done:

          1. Add support in parser, both for declaring an input to be of type datetime and datetime constants
          2. Add support in TypeChecker for datetime types, including any allowed type promotions (ie implicit casts)
          3. Change LoadCaster interface to include bytesToDateTime method, add method to default implementation
           4. Determine which builtin UDFs we want for datetime and get agreement from the community. Implement these UDFs.
          5. Implement any allowed cast operators for datetime (probably just string <-> datetime).
           6. Implement a datetime class that represents datetime in memory. This needs to implement WritableComparable so that it can be serialized and compared in Hadoop.
           7. Implement a raw comparator for the type so it can be used as a key in group bys and joins.
          8. Change physical operators and builtin UDFs to handle processing of datetime types.
          9. Change data conversion and type discovery routines in DataType
          10. And, of course, add prolific tests
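           Steps 6 and 7 above can be sketched as follows. This is a minimal illustration, not Pig's actual implementation: the class name `DateTimeSketch` is hypothetical, and Hadoop's WritableComparable is stood in for by plain java.io data streams plus Comparable. The key idea is an epoch-millis long whose serialized form can be ordered by raw byte comparison.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical sketch: a datetime value held as millis since the epoch,
// serialized to a fixed 8-byte big-endian form so a raw comparator can
// order keys by comparing bytes without deserializing.
// (In Pig proper this would implement Hadoop's WritableComparable.)
public class DateTimeSketch implements Comparable<DateTimeSketch> {
    private long millis;

    public DateTimeSketch(long millis) { this.millis = millis; }

    public void write(DataOutput out) throws IOException {
        // Flip the sign bit so unsigned byte order matches signed long order.
        out.writeLong(millis ^ Long.MIN_VALUE);
    }

    public void readFields(DataInput in) throws IOException {
        millis = in.readLong() ^ Long.MIN_VALUE;
    }

    public long getMillis() { return millis; }

    @Override
    public int compareTo(DateTimeSketch other) {
        return Long.compare(millis, other.millis);
    }
}
```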

           The other question is backward compatibility. I can think of only two backward incompatible changes:

          1. Addition of bytesToDateTime in the LoadCaster interface. Given that this will only require a change if people recompile their implementation, and AFAIK there are no implementations of LoadCaster before our default implementation, I think this is ok.
          2. Changes to Pig Latin to specify a field as of type date, plus however we denote datetime strings. We need to make these as unobtrusive as possible, but again I think it will be ok, though we'll need to get community buy in on it.

           Would such a patch be accepted? If it's of good quality and deals with backward compatibility concerns, certainly. In time for 0.8, I don't know. We try to do a release every three months, with a feature cutoff about a month before release (give or take). Branching and feature cutoff for 0.7 is today, so branching and feature cutoff for 0.8 will probably be in June.

           If you want to pursue this, the first step should be a brief design that says how you'll go about doing it. It should cover things like: which date format will you use (SQL, something else)? Which date functions do you think should be built in? How do you plan to store this type in memory? Are there existing datetime libraries you can leverage or incorporate to avoid rebuilding the wheel? It's easiest to write up the design on Pig's wiki and then link to it on this bug. This will give users and developers a chance to review your thoughts and give feedback.

          rjurney added a comment -

          Thanks, Alan. That is quite helpful. Let me look into it and see about feasibility.

          What about durations as well? http://en.wikipedia.org/wiki/ISO_8601#Durations ISO8601 durations would be very handy in enabling use of pig operators on datetimes via +/-, etc. This might be something to do later, though.

          Alan Gates added a comment -

          I think durations would be useful, and others have mentioned to me that they'd like to have them. As you note, this might be a good phase 2 addition, as getting datetime in alone will be a fair chunk of work.

          rjurney added a comment -

           The UDFs in PIG-1310 are a segue to full datetime support. They can be used until datetimes are supported in Pig.

          rjurney added a comment -

          I would not say this blocks PIG-1310 at all - the UDFs there simply treat ISO dates as strings, which works reasonably well. They should also handle Long unix times, and will in a next patch. In any case, this isn't a blocker to that ticket, for which a patch was just submitted.

          rjurney added a comment -

          Changing from blocks to related.

          Russell Jurney added a comment -

          As a first pass, I am going to add Boolean, which should be easier than DateTime, but will inform this implementation. See PIG-1429

          Russell Jurney added a comment -

          Ok, thinking about really doing this soon, after Boolean. I'd like to add two new primitives to Pig - DateTime and Duration.

          I'd do this on the wiki, but I don't have edit access. Can someone please grant the ability to make a new page to user RussellJurney on the Pig wiki?

          Design Notes:

          1) I'd like to use Jodatime for this, as I did in the DateTime UDFs. It is possible to use the Java date libs, but it would be painful to do so. Jodatime also performs better than Java's native date classes. It is Apache 2.0 licensed and is already pulled in via ivy in the DateTime UDFs - see PIG-1310

           2) Date Format for text/dumps: ISO8601. Looks like: [YYYY][MM][DD]T[hh][mm]Z. It is a human-readable, sortable/comparable, international standard. See http://en.wikipedia.org/wiki/ISO_8601#Dates

          2.5) In memory type: org.joda.time.DateTime. See http://joda-time.sourceforge.net/apidocs/org/joda/time/DateTime.html

          The internal format of jodatime is a Long epoch/Unix/POSIX time. See http://joda-time.sourceforge.net/faq.html#internalstorage
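           A minimal sketch of this representation, using the JDK's java.time (the standard-library successor to Joda-Time) in place of Joda itself; the class and method names here are hypothetical, for illustration only:

```java
import java.time.Instant;

// Hypothetical sketch: round-trip between the ISO 8601 text form used
// for dumps and the epoch-millis long used as the internal storage.
public class Iso8601Sketch {
    // Parse an ISO 8601 instant string into epoch millis
    // (the same internal representation Joda-Time uses).
    public static long parseIsoMillis(String iso) {
        return Instant.parse(iso).toEpochMilli();
    }

    // Render epoch millis back to an ISO 8601 string in UTC.
    public static String toIso(long millis) {
        return Instant.ofEpochMilli(millis).toString();
    }
}
```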

           3) Duration Format for text/dumps: ISO8601. Looks like: P[n]Y[n]M[n]DT[n]H[n]M[n]S. It is a human-readable, sortable/comparable, international standard. See http://en.wikipedia.org/wiki/ISO_8601#Durations

          3.5) In-memory format: org.joda.time.Duration. See http://joda-time.sourceforge.net/apidocs/org/joda/time/Duration.html
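           A sketch of how the two ISO 8601 duration flavors parse, again using java.time as a stand-in for Joda-Time (which draws the same exact-span vs. calendar-span distinction); the class name is hypothetical:

```java
import java.time.Duration;
import java.time.Period;

// Hypothetical sketch: ISO 8601 durations split into two kinds.
public class DurationSketch {
    // Time-based durations (the part after "T") parse to Duration,
    // an exact span of seconds/millis that supports arithmetic.
    public static long minutesOf(String iso) {
        return Duration.parse(iso).toMinutes();
    }

    // Calendar-based durations (years/months/days) parse to Period,
    // whose real length depends on the date it is applied to.
    public static int monthsOf(String iso) {
        return Period.parse(iso).getMonths();
    }
}
```

This split is also why the later question of "durations or periods, or both" arises: the two types are not interchangeable.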

          4) All date functions in PIG-1310 should be included, except those replaced by the use of operators on datetimes and durations. Adding/subtracting datetimes should result in a duration. Durations can be added/subtracted/divided/multiplied/negated.

          Date/Duration truncation, date differences, date parsing/conversion should be included. Conversion from int/long POSIX, SQL and datemonth should be included. Conversion from any string with a DateFormat string should be included.

          5) Casting to and from Integer and Long should be supported, as a Unix/POSIX time. Casting to/from chararray in ISO8601 format should be supported.

          Comments? Suggestions?

          Russell Jurney added a comment -

          Hmmm not sure if I should use durations or periods, or both. See http://joda-time.sourceforge.net/apidocs/org/joda/time/Period.html

          Russell Jurney added a comment -

          Been thinking about this... I don't think we should add a full datetime type at this time. See comments in PIG-1314 on alternative approach using builtins.

          Russell Jurney added a comment -

          I suck at JIRA. See proposal in PIG-1430.

          Olga Natkovich added a comment -

          Russell, are you still planning to finish this for Pig 0.8.0 release?

          Olga Natkovich added a comment -

          Unlinking from 0.8 since we are branching today

          Jeremy Hanna added a comment -

          I think this would be nice also when outputting from pig scripts using DBStorage to an RDBMS - to be able to serialize properly to the db's timestamp or date type (without extra UDF work).

          Zhijie Shen added a comment -

           I've solved the related issue PIG-1429. If nobody is currently working on this issue, I volunteer to investigate it.

          Daniel Dai added a comment -

          That will be great. Here is a specification I wrote: https://cwiki.apache.org/confluence/display/PIG/DateTime+type+specification. Take a look and we can discuss.

          Zhijie Shen added a comment -

           GSoC is back! I'd like to apply with this issue. A proposal draft will come in the following days.

          Daniel Dai added a comment -

          Looking forward to your proposal!

          Zhijie Shen added a comment -

          Hi folks,

          Below is my proposal draft. Any comments are welcome

          ==

          Proposal Title: Adding the Datetime Type as a Primitive for Pig

          Student Name: Zhijie Shen
          Student E-mail: zjshen14@gmail.com

          Organization/Project: Apache Software Foundation - Pig
          Assigned Mentor: Daniel Dai /Russell Jurney

          Proposal Abstract:

           Apache Pig is a platform for analyzing large data sets based on Hadoop. Currently Pig does not support a primitive datetime type [1], which is a desired feature. In this proposal, I explain my plan to implement the primitive datetime type, including the details of my solution and schedule. Additionally, I briefly introduce my background and my motivation for applying to GSoC'12.

          Detailed Description:

          1. Understanding of the Project

          1.1 What is Apache Pig?

           Apache Pig is a platform for analyzing large data sets. Notably, at Yahoo! 40% of all Hadoop jobs are run with Pig [5]. Pig has its own dataflow language, named Pig Latin, which encapsulates map/reduce jobs step by step and offers relational primitives such as LOAD, FOREACH, GROUP, FILTER and JOIN. Pig provides many built-in functions, but also allows users to write user-defined functions (UDFs) for particular purposes. There are more benefits: Pig can operate on plain files directly without any schema information; it has a flexible, nested data model, which is more compatible with those of major programming languages; and it provides a debugging environment.

           1.2 Why is a primitive datetime type required?

           Datetime is a conventional data type in many database management systems as well as programming languages. Within the Hadoop ecosystem, Hive, which is an analog of Pig, also supports a primitive datetime type (timestamp, actually). In contrast, Pig does not fully support this type. Currently, users can only use the string type for datetime data and rely on UDFs which take datetime strings. However, Pig is primarily used to parse log data, and most log data has datetime attributes.

           Consequently, it is desirable for Pig to support the datetime type as a primitive. By doing so, we can expect the following benefits: a more compact serialized format, support for conventional operators (+/-/==/!=/</>), a dedicated faster comparator, sortability, fewer runtime conversions from string, and relieving users from having to decide the input datetime string format.

          2. Roadmap of Implementing the New Feature

          2.1 To Do List

          2.1.1 Adding Support in Antlr Parser

           Pig Latin supports assigning data types explicitly, so the “datetime” keyword and some constants, such as “now()” and “today()”, must be recognized. The related syntax needs to be added to 5 antlr scripts: AliasMasker.g, AstPrinter.g, AstValidator.g, LogicalPlanGenerator.g, QueryParser.g.

          2.1.2 Adding Datetime as a Primitive

           The datetime type should be added to the DataType class, and the basic conversion between it and other data types needs to be defined. Previously, the internal data structure relied on the Joda datetime type, which is more powerful than java.util.Date but much easier to use than java.util.Calendar; hence it is wise to keep this convention.
           Moreover, note that implicit type casts from/to the datetime type are not allowed.

           I also need to change the LoadCaster and StoreCaster interfaces to include bytesToDateTime/toBytes(DateTime) methods, and fill in the classes that implement these two interfaces. In addition, I need to override the +/-/==/!=/</> operators for the datetime type, mapping them to some builtin EvalFuncs. The TypeCheckingExpVisitor class needs to be modified as well to support datetime type validation. One important issue: according to my previous experience, the data-type-related code in Pig is widely spread, so I need to be careful that all related parts are touched.

          2.1.3 Refactoring of the Datetime Related UDFs

           Thanks to Russell Jurney for implementing a number of useful datetime-related UDFs, which can be reused for the primitive datetime type. Part of the UDF classes located in the “org.apache.pig.piggybank.evaluation.datetime” package under the “contrib” folder need to be moved to the “org.apache.pig.builtin” package under the “src” folder. Below are the related UDFs:

          int DiffDate(DateTime d1, DateTime d2)
          int YearsBetween(DateTime d1, DateTime d2)
          int MonthsBetween(DateTime d1, DateTime d2)
          int DaysBetween(DateTime d1, DateTime d2)
          int HoursBetween(DateTime d1, DateTime d2)
          int MinutesBetween(DateTime d1, DateTime d2)
          int SecondsBetween(DateTime d1, DateTime d2)
          int GetYear(DateTime d1)
          int GetMonth(DateTime d1)
          int GetDate(DateTime d1)
          int GetHour(DateTime d1)
          int GetMinute(DateTime d1)
          int GetSecond(DateTime d1)
          DateTime DateAdd(DateTime d1)
          String ToString(DateTime d, String format)
          (Probably rename it DateTimeFormat)
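           As a hedged sketch of what the *Between and Get* families above compute, here is a minimal version using java.time's ChronoUnit rather than the actual piggybank implementations; the class name and d1/d2 argument order are assumptions for illustration:

```java
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch of the *Between and Get* UDF semantics:
// whole units elapsed between two datetimes, and field extraction.
public class BetweenSketch {
    // Whole years from d2 up to d1 (d1 assumed the later datetime).
    public static long yearsBetween(ZonedDateTime d1, ZonedDateTime d2) {
        return ChronoUnit.YEARS.between(d2, d1);
    }

    // Whole days from d2 up to d1.
    public static long daysBetween(ZonedDateTime d1, ZonedDateTime d2) {
        return ChronoUnit.DAYS.between(d2, d1);
    }

    // Extract the year field of a datetime.
    public static int getYear(ZonedDateTime d) {
        return d.getYear();
    }
}
```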

           The remaining UDFs can be eliminated, but their logic can be reused in the primitive type conversion described in the previous section. Below are the UDFs of this kind:

          DateTime ToDate(String s)
          DateTime ToDate(String s, String format)
          DateTime ToDate(String s, String format, String timezone)
          DateTime toDate(long t)
          String ToString(DateTime d)
          long ToUnixTime(DateTime d)
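           The ToUnixTime/ToDate(long) pair amounts to a round trip between epoch seconds and a datetime value. A sketch with java.time standing in for Joda-Time (class and method names are hypothetical):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical sketch of the Unix-time conversion UDFs.
public class ConvertSketch {
    // ToUnixTime: datetime -> seconds since the epoch.
    public static long toUnixTime(ZonedDateTime d) {
        return d.toInstant().getEpochSecond();
    }

    // ToDate(long): seconds since the epoch -> datetime in UTC.
    public static ZonedDateTime toDate(long unixSeconds) {
        return Instant.ofEpochSecond(unixSeconds).atZone(ZoneOffset.UTC);
    }
}
```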

           The following additional UDFs are probably also required; I need to discuss these with the community:

          DateTime Now()
          DateTime Today()
          bool IsDateTime(String s)

          2.1.4 Test Cases

           A large number of test cases are required to test the parser, the datetime operations and conversion, and loading from / storing into the secondary storage.

          2.1.5 Documentation

           A user manual is required to describe how to use the datetime primitive, including the input format and the supported built-in functions.

          2.2 Project Schedule

           During the summer, I will not have much workload other than writing my Ph.D. thesis, so it is possible for me to spend around 40 hours per week on this project. The concrete schedule is summarized as follows:

           Present - May 20 (before the official start of summer of code): Reading the related code in detail, and keeping in touch with the community to clarify some issues, such as the necessary built-in UDFs and the rules of data conversion.

           May 21 - Jun 3 (two weeks): Adding datetime to the primitive type list, and completing the functionality of parsing the datetime keyword and constants, such that a string representing a datetime can be recognized in Pig Latin scripts.

           Jun 4 - Jun 24 (three weeks): Implementing type conversion (from/to string) and loading/storing cast functionality. After this step, data of the datetime type can be correctly read from / stored into the secondary storage.

           Jun 25 - Jul 8 (two weeks until mid-term evaluation): Completing the remaining part of the type conversion (e.g., between the datetime type and the long type), dealing with issues that have not been foreseen yet, and preparing for the mid-term evaluation.

          Jul 9 - Jul 29 (three weeks): Refactoring the datetime related UDFs, adding new required UDFs, and overloading the primitive operators, such that all the defined operations on datetime values are supported after this step.

           Jul 30 - Aug 5 (one week): Writing the test cases to systematically verify the code and fixing any bugs found. After this step, the coding part is nearly done.

           Aug 6 - Aug 12 (one week until final evaluation): Documenting the user manual to show how to work with the datetime type, and preparing for the final evaluation.

          Additional Information:

           I am a Ph.D. student at the National University of Singapore. My research topics are large-scale multimedia systems, geo-referenced video systems and P2P video streaming. In addition to research, I love programming and have long-term experience in several languages, including Java. Moreover, I am quite interested in distributed systems and big data, and have acquired solid background knowledge. I took the course "Parallel and Distributed Databases", drafted a survey of cloud storage systems (including Pig) [4], and obtained an A+.

           Notably, I am an open source advocate and have contributed to open source to some extent. Last year, I participated in GSoC with a Pig project: I successfully implemented the nested cross feature [2], and I overfulfilled my proposed task by contributing one more patch adding the primitive boolean type [3], which is somewhat similar to the task proposed for this year's GSoC. Therefore, I am quite familiar with this task and confident of completing it on time. Last but not least, I enjoy long-term participation in the Pig community and am willing to keep contributing to it.

          Reference:

           [1] https://issues.apache.org/jira/browse/PIG-1314
          [2] https://issues.apache.org/jira/browse/PIG-1916
          [3] https://issues.apache.org/jira/browse/PIG-1429
          [4] http://www.comp.nus.edu.sg/~z-shen/survey.pdf
          [5] http://wiki.apache.org/pig/OldFrontPage

          Show
          Zhijie Shen added a comment - Hi folks, Below is my proposal draft. Any comments are welcome == Proposal Title: Adding the Datetime Type as a Primitive for Pig Student Name: Zhijie Shen Student E-mail: zjshen14@gmail.com Organization/Project: Apache Software Foundation - Pig Assigned Mentor: Daniel Dai /Russell Jurney Proposal Abstract: Apache Pig is a platform for analyzing large data sets based on Hadoop. Currently Pig does not support the primitive datetime type [1] , which is a desired feature to be implemented. In this proposal, I explain my plan to implement the primitive datetime type, including the details of my solution and schedule. Additionally, I briefly introduce my background and the motivation of applying GSoC'12. Detailed Description: 1. Understanding of the Project 1.1 What is Apache Pig? Apache Pig is a platform for analyzing large data sets. Notably, at Yahoo! 40% of all Hadoop jobs are run with Pig [5] . Pig has is own dataflow language, named Pig Latin, which encapsulates map/reduce jobs step-by-step, and offers the relational primitives such as LOAD, FOREACH, GROUP, FILTER and JOIN. Pig provides many built-in functions, but also allow users to define their user-defined functions (UDFs) to achieve particular purposes. There are more benefits: Pig can operates on the plain files directly without any schema information; it has a flexible, nested data model, which is more compatible with that of major programming languages; it provides a debugging environment. 1.2 Why primitive datetime type is required? Datetime is a conventional data type in many of database management systems as well as programming languages. Within the Hadoop ecosystem, Hive, which is an analog of Pig, also supports the primitive datetime type (timestamp actually). In contrast, Pig does not fully support this type. Currently, users can only use the string type for the datetime data, and rely on the UDF which takes datetime strings. 
However, Pig is supposed to primarily parse log data, and most log data has attributes in the datetime type. Consequently, it is desired for Pig to support the datetime type as a primitive. By doing so, we can expect the following benefits: a more compact serialized format, working with conventional operators (+/-/==/!=/</>), a dedicated faster comparator, being sortable, fewer times of runtime conversion from string, and relieving users from deciding the input datetime string format. 2. Roadmap of Implementing the New Feature 2.1 To Do List 2.1.1 Adding Support in Antlr Parser Pig Latin supports the assign data type explicitly, such that the “datetime” keyword and some constants, such as “now()” and “today()” can be recognized. The related syntax needs to be added into 5 antlr scripts: AliasMasker.g, AstPrinter.g, AstValidator.g, LogicalPlanGenerator.g, QueryParser.g. 2.1.2 Adding Datetime as a Primitive The dateime type should be added into the DataType class, and the basic conversion between it and other data types need to be defined. Previously, the internal data structure relies on Joda datetime data type, which is more powerful than java.util.DateTime, but much easier than java.util.Calendar. Hence it is wise to keep this convention. Moreover, be careful that implicit type cast from/to the datetime type is not allowed. I also need to change the LoadCaster and StoreCaster interfaces to include bytesToDateTime/toBytes(DateTime) method, and add details to the classes that implemented these two interfaces. In addition, I need override +/-/==/!=/</> operators for the datetime type, mapping the to some bulitin EvalFuncs. The TypeCheckingExpVisitor class needs to be modified as well to support the datetime type vailidation. One important issue is that according to my previous experience, the data type related code in Pig is widely spread, such that I need to be careful all the related parts are touched. 
2.1.3 Refactoring of the Datetime Related UDFs Thanks Russell Jurney for having implemented a number of useful datetime related UDFs, which can be utilized for the primitive datetime type as well. Part of the UDF Classes located in the “org.apache.pig.piggybank.evaluation.datetime” package under the “contrib” folder need to be move to the “org.apache.pig.builtin” package under the “src” folder. Below are the related UDFs: int DiffDate(DateTime d1, DateTime d2) int YearsBetween(DateTime d1, DateTime d2) int MonthsBetween(DateTime d1, DateTime d2) int DaysBetween(DateTime d1, DateTime d2) int HoursBetween(DateTime d1, DateTime d2) int MinutesBetween(DateTime d1, DateTime d2) int SecondsBetween(DateTime d1, DateTime d2) int GetYear(DateTime d1) int GetMonth(DateTime d1) int GetDate(DateTime d1) int GetHour(DateTime d1) int GetMinute(DateTime d1) int GetSecond(DateTime d1) DateTime DateAdd(DateTime d1) String ToString(DateTime d, String format) (Probably rename it DateTimeFormat) The remaining UDFs can be eliminated, while their logics can be used in the primitive type conversion part, which has been introduced in the previous section. Below are the UDFs of this kind: DateTime ToDate(String s) DateTime ToDate(String s, String format) DateTime ToDate(String s, String format, String timezone) DateTime toDate(long t) String ToString(DateTime d) long ToUnixTime(DateTime d) Probably the following additional UDFs are also required, I need to discuss these with the community: DateTime Now() DateTime Today() bool IsDateTime(String s) 2.1.4 Test Cases A large number of test cases are required to test the parser, the datatime operations and conversion, and loading from / storing into the secondary storage. 2.1.5 Documentation A user manual is required to describe how to use datetime primitive, such as the input format, the supported built-in functions. 2.2 Project Schedule During the summer, I will have not much workload except writing my Ph.D. thesis. 
Hence it is possible for me to spend around 40 hours per week on this project. The concrete schedule is summarized as follows:

Present - May 20 (before the official start of Summer of Code): Reading the related code in detail, and keeping in touch with the community to clarify some issues, such as the necessary built-in UDFs and the rules of data conversion.
May 21 - Jun 3 (two weeks): Adding datetime to the primitive type list, and completing the functionality of parsing the datetime keyword and constants, such that a string representing a datetime can be recognized in Pig Latin scripts.
Jun 4 - Jun 24 (three weeks): Implementing type conversion (from/to string) and the load/store cast functionality. After this step, data of the datetime type can be correctly read from / stored into secondary storage.
Jun 25 - Jul 8 (two weeks, until the mid-term evaluation): Completing the remaining part of the type conversion (e.g., between the datetime type and the long type), dealing with issues that have not been foreseen yet, and preparing for the mid-term evaluation.
Jul 9 - Jul 29 (three weeks): Refactoring the datetime-related UDFs, adding newly required UDFs, and overloading the primitive operators, such that all the defined operations on datetime values are supported after this step.
Jul 30 - Aug 5 (one week): Writing test cases to systematically verify the code and fixing any bugs found. After this step, the coding part is nearly done.
Aug 6 - Aug 12 (one week, until the final evaluation): Documenting the user manual to show how to work with the datetime type, and preparing for the final evaluation.

Additional Information:

I am a Ph.D. student at the National University of Singapore. My research topics are large-scale multimedia systems, geo-referenced video systems, and P2P video streaming. In addition to research, I love programming and have long-term experience in several languages, including Java.
Moreover, I am quite interested in distributed systems and big data, and have acquired solid background knowledge. I took the course "Parallel and Distributed Databases", drafted a survey of cloud storage systems (including Pig) [4], and obtained an A+ grade. Notably, I am an open source advocate and have contributed to open source to some extent. Last year, I participated in GSoC with a Pig project and successfully implemented the nested cross feature [2]. I also exceeded my proposed task by contributing one more patch adding the primitive boolean type [3], which is somewhat similar to the task proposed for this year's GSoC. Therefore, I am quite familiar with this task and confident of completing it on time. Last but not least, I enjoy long-term participation in the Pig community and am willing to keep contributing to it.

References:
[1] https://issues.apache.org/jira/browse/PIG-1314
[2] https://issues.apache.org/jira/browse/PIG-1916
[3] https://issues.apache.org/jira/browse/PIG-1429
[4] http://www.comp.nus.edu.sg/~z-shen/survey.pdf
[5] http://wiki.apache.org/pig/OldFrontPage
          Zhijie Shen added a comment -

          By the way, who would like to mentor this issue?

          Russell Jurney added a comment -

I am happy to help with questions about the DateTime UDFs, but I do not remember the internals of my attempt to add Boolean in preparation for DateTime. I suggest the committer who got Boolean working would be a good candidate?

          Zhijie Shen added a comment -

Coincidentally, I'm the person who got Boolean working.

Daniel helped me a lot with that issue; if he'd like to mentor this one, that would also be awesome.

          Daniel Dai added a comment -

          I would like to mentor this.

          Zhijie Shen added a comment -

          I've pasted the proposal to the official website: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/zjshen/21002

          Any comments are welcome, such that I can improve the proposal in the remaining days.

          Prashant Kommireddi added a comment -

          Thanks Zhijie. Can you please make it public?

          Zhijie Shen added a comment -

Ah, I forgot to do that. It's public now.

          Zhijie Shen added a comment -

As suggested by Thejas, I've done a performance comparison between JODA and the builtin datetime-related objects. For each function, I repeated the computation 100,000 times and measured the elapsed time. Please refer to the attachment for the code details. Below is a summary of the results (unit is milliseconds):

          ISOToSecond: JODA-958 Builtin-1326
          ISOToMinute: JODA-532 Builtin-850
          ISOToHour: JODA-414 Builtin-680
          ISOToDay: JODA-475 Builtin-685
          ISOToMonth: JODA-463 Builtin-692
          ISOToYear: JODA-462 Builtin-715
          ISOSecondsBetween: JODA-961 Builtin-968
          ISOMinutesBetween: JODA-734 Builtin-565
          ISOHoursBetween: JODA-596 Builtin-656
          ISODaysBetween: JODA-592 Builtin-555
          ISOMonthsBetween: JODA-586 Builtin-968
          ISOYearsBetween: JODA-654 Builtin-952
          ISOToUnix: JODA-678 Builtin-6965
          UnixToISO: JODA-225 Builtin-206
          Custom Format 1 [yyyy.MM.dd G 'at' HH:mm:ss.SSS Z]: JODA-596 Builtin-6914
          Custom Format 2 [yyyyy.MMMMM.dd GGG hh:mm aaa]: JODA-534 Builtin-425

Two major conclusions are as follows:
1. The datetime operations backed by JODA generally perform as well as those backed by the builtin data structures (according to my implementation), except for the operation of parsing a time string.
2. Based on my implementation, the builtin data structures need an order of magnitude more time to parse a time string when the format has a timezone component (i.e., "Z").

To sum up, my suggestion is that since JODA provides no worse performance and more trustworthy correctness, I vote for going with JODA when implementing the datetime primitive type.
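For reference, the shape of such a 100,000-iteration loop can be sketched as below. This is not the code in joda_vs_builtin.zip; it uses java.time's ISO parser, and the class/method names and the sample timestamp are illustrative.

```java
import java.time.ZonedDateTime;

public class ParseBench {
    // Time n parses of a fixed ISO-8601 string and return elapsed milliseconds.
    public static long timeParses(int n) {
        ZonedDateTime last = null;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            last = ZonedDateTime.parse("2012-03-26T09:30:00+08:00");
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // Use the result so the loop cannot be optimized away entirely.
        if (last == null) {
            throw new IllegalStateException("no parse happened");
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        System.out.println("100k parses took " + timeParses(100_000) + " ms");
    }
}
```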

          Russell Jurney added a comment -

          I concur about JODA. So far as I know you can't even parse ISO times with java builtins without using javax.xml.bind.DatatypeConverter, and it is ugly and slow.

          Zhijie Shen added a comment -

          One quick issue: we need to give a name to the new type. We are supposed to use "DATETIME", correct? Or "DATE", "TIMESTAMP"?

          Thejas M Nair added a comment -

          One quick issue: we need to give a name to the new type. We are supposed to use "DATETIME", correct? Or "DATE", "TIMESTAMP"?

          "datetime" makes sense when it has both date and time (hrs,mins,secs) parts to it. The problem with using (unix) timestamp, is that the date range is limited to 78 years. Using jodatime, we will be able to support much larger date range than timestamp.

          Russell Jurney added a comment -

          "DATETIME" makes sense, but "TIMESTAMP" is a good (simple) alias for DATETIME(NOW). "DATE" is a good alias for a date-truncated DATETIME.

          I'm not sure if you would want to implement these in Pig... as there is clearly less utility than in a database, where for instance a TIMESTAMP can be updated whenever a field is written or updated. Maybe "DATE" and not "TIMESTAMP," but only as an afterthought?

          Thejas M Nair added a comment -

          CURRENT_TIME() might be a more intuitive alias for DATETIME(NOW). I think we can consider adding support for DATE and CURRENT_TIMESTAMP() as a next step after adding DATETIME. We can focus on DATETIME in this jira.

I also had a look at the timestamp datatype that was added to Hive, to see if it will be interoperable (through HCatalog). The only difference is that the Hive timestamp type supports storing up to nanosecond precision, while JodaTime supports only up to milliseconds. Nanoseconds are not likely to be used in most cases, so losing that precision when converting a Hive timestamp to a Pig datetime should be OK in most cases. The range of years supported in both cases is also approximately the same.
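To make the precision note concrete, here is a tiny sketch (plain Java, hypothetical names and values) of what converting a nanosecond-precision Hive-style timestamp down to a millisecond-based datetime value loses:

```java
public class PrecisionLoss {
    // A Hive-style timestamp carries nanoseconds; a millisecond-based
    // datetime keeps only the first three sub-second digits.
    public static long nanosToMillis(long epochNanos) {
        return epochNanos / 1_000_000; // integer division discards sub-millisecond digits
    }

    public static void main(String[] args) {
        long hiveNanos = 1_330_560_000_123_456_789L; // ...123456789 ns of sub-second detail
        System.out.println(nanosToMillis(hiveNanos)); // only the ...123 millisecond part survives
    }
}
```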

          Zhijie Shen added a comment -

          When adding the DateTime type for Pig, we need to take care of the I/O with AVRO, which still doesn't support the Date/Time type.

          Zhijie Shen added a comment -

          One more issue needs to be clarified:

In the AugmentBaseDataVisitor class, there are two functions, Object GetSmallerValue(Object v) and Object GetLargerValue(Object v), where if v is a numeric value, it is increased or decreased by one, while if v is a byte array, it is increased or decreased by one byte. What should we do if v is a datetime? I vote for returning null, and am looking forward to the community's opinions.

By the way, what about the case where v is a boolean, which seems not to be handled?

          Thejas M Nair added a comment -

          When adding the DateTime type for Pig, we need to take care of the I/O with AVRO, which still doesn't support the Date/Time type.

StoreFuncs that write in Avro format will need to throw an exception if the schema being stored contains a datetime type. That will force users to serialize datetime as some other type. As long as we are not breaking existing Pig queries that don't use the datetime type, we should be fine. Avro is just one of many formats.

Regarding AugmentBaseDataVisitor, it is used for example generation (see the SIGMOD paper on the illustrate feature for details). For example, if there is no value of col1 in the sample that satisfies "col1 > 0", a value of col1 > 0 is generated. This will be useful for the datetime type as well.
To have a more realistic generated value (similar to the values in the input), I think we should increment/decrement the smallest field that is non-zero. For example, if the millisecond and second fields are 0 but the hour field is non-zero, increment that. If all time parts are 0 but the day of month is not, increment that.
In the case of boolean, since we don't support the > or < operations, these functions do not make sense.
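The increment heuristic described above could be sketched roughly as follows. This uses java.time in place of Joda, and the field ordering and the getLargerValue name follow the description in this comment, not Pig's actual code.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class AugmentSketch {
    // Bump the smallest non-zero field so the generated example value
    // stays close to the values seen in the input sample.
    public static ZonedDateTime getLargerValue(ZonedDateTime v) {
        if (v.getNano() != 0)   return v.plusNanos(1_000_000); // +1 millisecond
        if (v.getSecond() != 0) return v.plusSeconds(1);
        if (v.getMinute() != 0) return v.plusMinutes(1);
        if (v.getHour() != 0)   return v.plusHours(1);
        return v.plusDays(1); // all time parts are zero: bump the day of month
    }

    public static void main(String[] args) {
        ZonedDateTime d = ZonedDateTime.of(2012, 3, 26, 9, 0, 0, 0, ZoneOffset.UTC);
        System.out.println(getLargerValue(d)); // hour is the smallest non-zero field
    }
}
```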

Thanks for bringing this up. I had forgotten about this use case. We should add a few unit tests for example generation that involve datetime.

          Zhijie Shen added a comment -

I've modified the code in the src package related to the primitive DateTime (see the attached file). As the data-type-related code is widely spread across the project, I still need to go through it a few more times to find any missing parts.

Up till now, there are some more issues that need to be discussed:

1. Pig can also import into and export from HBase storage, which doesn't have a primitive DateTime either. Throw an exception in this case as well, correct?

2. For type casting between DateTime and other types of data, how about following the rules below:
a. Allow: DateTime <-- numeric value (converted to Long first)
b. Allow: DateTime <-- String
c. Not allowed: DateTime <-- Boolean
d. Only explicit casting allowed

          3. DateTime is serialized as a Long value (Unix timestamp) when it is necessary.
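Point 3 can be sketched as a plain DataOutput round-trip of the millisecond value. This is illustrative only (hypothetical class name, using java.io directly); note that the thread later concludes timezone information would also need to be written alongside the long.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class DateTimeSerSketch {
    // Serialize the datetime as its millisecond value only.
    public static byte[] write(long millis) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeLong(millis);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the millisecond value back.
    public static long read(byte[] b) {
        try {
            return new DataInputStream(new ByteArrayInputStream(b)).readLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long millis = 1330560000000L;
        System.out.println(read(write(millis)) == millis); // round-trips losslessly
    }
}
```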

          Russell Jurney added a comment -

          Avro might store DateTimes as an ISO string?

          Zhijie Shen added a comment -

          Avro might store DateTimes as an ISO string?

It's possible, but there seems to be one problem: if we store a datetime as an ISO string, how do we determine whether a string is just a string or a datetime when it is loaded?

One more point is that it's good to have all the IO targets that do not support datetime handle the IO process uniformly. Hence, once we settle the design for Avro, we should stick to it for the others.

          Zhijie Shen added a comment -

          I've updated the patch with the following changes:

1. Edited some code related to IO.
2. Implemented most of the UDFs listed at https://cwiki.apache.org/confluence/display/PIG/DateTime+type+specification, excluding DateAdd, whose functionality is not clear to me.
3. Corrected some errors when merging my modifications with the latest version in the repository.

The following issues are still to be discussed:
1. The output datatype of DiffDate(DateTime d1, DateTime d2) should be long instead of int, because the diff may be too large for the int range to cover.
2. What does DateTime DateAdd(DateTime d1) mean? Adding a datetime based on the current time?
3. We allow explicit casts between datetime and string, correct? Similarly, do we allow explicit casts between datetime and long/int (representing a Unix timestamp)?

          Thejas M Nair added a comment -

          1. Pig can also import into and export from HBase storage, which also doesn't have the primitive DataTime. Throw exception in this case as well, correct?

          Yes. The exception should be thrown from HBaseStorage.

          if we conclude the design for Avro, we should keep to it for the others.

Please note that Pig does not have a way of knowing whether a format will support datetime. The behavior will be controlled by the storage func implementation. But for the ones that are part of the Pig codebase, I think we should throw an exception.

          3. DateTime is serialized as a Long value (Unix timestamp) when it is necessary.

JodaTime supports milliseconds as well. Will we be able to convert all values within the limits of a JodaTime date into a long?

          the output datatype of DiffDate(DateTime d1, DateTime d2) should use long instead of int, because the diff may be too large for int range to conver.

          Makes sense, we should use a type that is appropriate for range.

          what does DateTime DateAdd(DateTime d1) mean? Adding datetime based on the current time?

          Not sure. Daniel, do you know ?

          we allow explicit cast between datetime and string, correct? Similarly, do we allow explicit cast between datetime and long/int (representing unix timestamp)?

          Yes, we should support explicit cast between these types. Though conversion to int might not be successful for all datetime values.

          Thejas M Nair added a comment -

          what does DateTime DateAdd(DateTime d1) mean? Adding datetime based on the current time?

Discussed this with Daniel. I think it makes sense to replace this with several functions -
// Add the number of years specified in the years param to the DateTime date.
// The years param can be positive or negative.
AddYears(DateTime date, int years);

Similarly we should have AddMonths, AddDays, AddHours ..

          Russell Jurney added a comment -

          A couple comments:

          1) Don't persist DateTimes as ints/longs unless you also persist a timezone offset with it somehow (is this possible?). Persisting timezones is one of the key benefits of a DateTime type in my opinion. At Hadoop scale you are often dealing with events from different sites/locations. DateTime needs timezone, or we can just use long/unix time.
2) Consider using jodatime/ISO8601 durations for date math, as a separate type. If this extends the scope too far, save it for later. http://en.wikipedia.org/wiki/ISO_8601#Durations

          Although it may be inefficient, I would encourage an ISO8601 string representation during serialization.

          Thejas M Nair added a comment -

          1) Don't persist DateTimes as ints/longs unless you also persist a timezone offset with it somehow (is this possible?).

I forgot about timezone. We need to serialize the timezone information as well, while supporting the same range of dates as JodaTime. With an int/long alone this will not be possible. (Zhijie, can you confirm?)

          2) Consider using jodatime/ISO8601 durations for date math, as a separate type. i.e. If this extends scope too far, save it for later. http://en.wikipedia.org/wiki/ISO_8601#Durations

+1. This is much cleaner. Let's replace the Add* functions with just AddDuration. For example, AddDuration(d1, "P3Y") would return d1 + 3 years.
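The AddDuration("P3Y") semantics can be sketched with java.time's ISO-8601 Period as a stand-in for Joda's period parsing. The class and method names are hypothetical; note that java.time.Period handles only date-based durations (years/months/weeks/days), so time-based parts like "PT2H" would need java.time.Duration instead.

```java
import java.time.Period;
import java.time.ZonedDateTime;

public class AddDurationSketch {
    // Parse an ISO-8601 date-based duration such as "P3Y" and add it.
    public static ZonedDateTime addDuration(ZonedDateTime d, String isoDuration) {
        return d.plus(Period.parse(isoDuration));
    }

    public static void main(String[] args) {
        ZonedDateTime d1 = ZonedDateTime.parse("2009-01-07T00:00:00Z");
        System.out.println(addDuration(d1, "P3Y")); // d1 + 3 years
    }
}
```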

          Zhijie Shen added a comment -

          Dear Thejas and Russell,

          1) Don't persist DateTimes as ints/longs unless you also persist a timezone offset with it somehow (is this possible?).
          I forgot about timezone. We need to serialize the timezone information as well, while supporting the same range of dates as JodaTime . With int/long this will not be possible. (Zhijie can you confirm ?)

As far as I know, both the Java builtin Date and the Joda DateTime use a millisecond shift (stored in a long integer variable) from midnight 1970-01-01 UTC, which is not exactly the Unix time. Importantly, the millisecond shift has nothing to do with the time zone. For example, both

          new DateTime(9223372017043199999L, DateTimeZone.UTC).getMillis();

          and

new DateTime(9223372017043199999L, DateTimeZone.forID("Asia/Singapore")).getMillis();

will return the same value, that is, 9223372017043199999L. The time zone only determines the ISO time string, such that the two DateTime objects will output different ISO time strings when toString() is called. Hence I think the long variable representing the millisecond shift is good for internal serialization. When we need to convert the DateTime object to a time string, we may use the default time zone of the Pig environment (I'm still working on this; please let me know how you think the Pig-wide time zone should be set) or a user-defined time zone (we probably need one more UDF: String ToString(DateTime d, String format, String timezone)).
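The zone-independence of the millisecond value can also be demonstrated with java.time (Joda behaves the same way here; the millisecond constant is arbitrary):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class MillisVsZone {
    public static void main(String[] args) {
        long millis = 1330560000000L; // 2012-03-01T00:00:00Z
        ZonedDateTime utc = Instant.ofEpochMilli(millis).atZone(ZoneId.of("UTC"));
        ZonedDateTime sgt = Instant.ofEpochMilli(millis).atZone(ZoneId.of("Asia/Singapore"));

        // The epoch-millisecond value is identical regardless of zone ...
        System.out.println(utc.toInstant().toEpochMilli() == sgt.toInstant().toEpochMilli()); // true

        // ... but the printed ISO strings differ.
        System.out.println(utc); // midnight in UTC
        System.out.println(sgt); // 08:00 the same day in Singapore (UTC+8)
    }
}
```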

As for the Pig DateTime, the internal Joda DateTime object is created either from the long millisecond shift or from an ISO time string. Initialization with a long (from Long.MIN_VALUE to Long.MAX_VALUE) has no range problem when getMillis() is called, which returns a result in the same range. When initialized with an ISO time string, the Joda DateTime object only accepts years in the range [-292275054, 292278993], such that the corresponding millisecond shift is also within [Long.MIN_VALUE, Long.MAX_VALUE]. In summary, the range will be fine when a long is used for serialization.

          Please correct me if I'm wrong. Thanks a lot!

          2) Consider using jodatime/ISO8601 durations for date math, as a separate type. i.e. If this extends scope too far, save it for later. http://en.wikipedia.org/wiki/ISO_8601#Durations
          +1 . This is much cleaner. Lets use replace the Add* functions with just AddDuration . For example AddDuration(d1, "P3Y"), would return d1 + 3 years.

+1. This way, it is more flexible for users to define the amount of time to add/subtract. Since an ISO duration is non-negative (please correct me if I'm wrong), we need SubtractDuration as well.

          Russell Jurney added a comment -

          Whatever the format is, I think we should serialize/persist DateTimes in a way that the timezone stays with the datetime.

          Thejas M Nair added a comment -

          As far as I know, either Java builtin Date or Joda DateTime uses millisecond-shift (stored in a long integer variable) from the midnight UTC, which is not exactly the Unix time.

          Yes, as you noted, the difference is that a unix timestamp can store up to +/- 292 billion years, while Joda DateTime supports only +/- 292 million years. That should be sufficient for most practical purposes!

          The time zone only determines the ISO time string,

          It also affects the field values (getDayOfWeek(), getHourOfDay(), etc.). In your data, you can have dates belonging to different timezones, and users might want to retain that information.
          An example of a use case where the timezone also needs to be stored: if you want to analyze how many people visit a global website during their morning hours, you want getHourOfDay() to return the hour in the local timezone.
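This point can be demonstrated with the builtin java.util.Calendar (standing in for Joda's getHourOfDay(); not code from the patch): the same millisecond instant yields different field values in different zones, while the stored long is unchanged.

```java
import java.util.Calendar;
import java.util.TimeZone;

public class HourOfDayDemo {
    public static void main(String[] args) {
        long instant = 1_000_000_000_000L; // 2001-09-09T01:46:40Z
        Calendar utc = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        Calendar sgt = Calendar.getInstance(TimeZone.getTimeZone("Asia/Singapore"));
        utc.setTimeInMillis(instant);
        sgt.setTimeInMillis(instant);
        // The stored instant is identical; only the zone changes the field value.
        System.out.println(utc.get(Calendar.HOUR_OF_DAY)); // 1
        System.out.println(sgt.get(Calendar.HOUR_OF_DAY)); // 9
    }
}
```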

          We need an efficient way to serialize the timezone along with the long. Can you propose something? (Maybe just make it efficient for the 256 most 'popular' timezones and store it in a byte. And not have the byte for UTC. For other timezones, add a timezone string?)

          When we need to convert the DateTime object to Unix time string, we may use the default time zone of the Pig environment

          If the date field has the timezone value in it, we don't have to rely on the default time zone to convert to a unix timestamp (assuming that is what you meant by 'unix time string').
          But udfs like DateTime ToDate(String s) where timezone might not be specified, we need a default timezone. I think we should use the default timezone on the pig client machine. Using the default time zone on each task tracker node can lead to a nightmare in debugging if one of the nodes happens to have a different timezone. We should allow the user to set a default timezone using a pig property.

          We probably need one more UDF String ToString(DateTime d, String format, String timezone)

          Having a timezone argument in this call is necessary only if the user wants to print the time for a different timezone. This is useful, but not mandatory.

          Since the ISO duration is non-negative (Please correct me if I'm wrong), we need to SubstractDuration as well.

          Yes, you are right. I could not find any references to negative values in ISO durations. Let's add SubstractDuration.

          Trivia from wikipedia: a 64-bit unix timestamp, in the negative direction, goes back more than twenty times the age of the universe.

          Russell Jurney added a comment -

          Jodatime seems to solve these problems. Deserializing from a string without a timezone, it behaves reasonably; deserializing from a string with a timezone, it also behaves reasonably.

          Are we discussing a user-facing API, or an internal storage mechanism? I'm not clear on which. Regarding the interface, presenting integers to a user as an interface seems wrong to me. Excluding certain timezones in the name of efficiency also seems wrong to me. The point of a datetime type is to add timezones, otherwise we can simply use longs.

          As an internal storage mechanism, I'm un-opinionated, so long as all timezones are retained at all times.

          Thejas M Nair added a comment -

          Are we discussing a user-facing API, or an internal storage mechanism?

          Some questions were about interface, some about internal storage.

          Regarding the interface, presenting integers to a user as an interface seems wrong to me.

          Converting dates to integers is something a user can optionally do; it is not expected to be a common use case. String representations of date literals will also be supported. Most operations will be on the date type itself, without converting it to int/string.

          Excluding certain timezones in the name of efficiency also seems wrong to me.

          All timezones supported by JodaTime will be supported. I was only proposing that we encode the timezone info efficiently, at least for the most likely used ones. I think converting the string timezone (location name) to a UTC offset in minutes is one possibility.
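The offset-in-minutes encoding proposed here can be sketched as follows (not code from the patch; the helper names are hypothetical). A whole-minute offset fits comfortably in a short, since the legal range is roughly -1439 to +1439 minutes:

```java
import java.util.TimeZone;

public class OffsetEncoding {
    // Hypothetical encoding: store a zone's UTC offset as whole minutes.
    static short encodeOffsetMinutes(int offsetMillis) {
        return (short) (offsetMillis / 60000);
    }

    static int decodeOffsetMillis(short minutes) {
        return minutes * 60000;
    }

    public static void main(String[] args) {
        int sgt = TimeZone.getTimeZone("Asia/Singapore").getRawOffset(); // +08:00
        short enc = encodeOffsetMinutes(sgt);
        System.out.println(enc);                      // 480
        System.out.println(decodeOffsetMillis(enc));  // 28800000
        // Lossy, as discussed below: "Asia/Singapore" and "+08:00" both
        // decode to the same offset, so the location name is not recoverable.
    }
}
```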

          Zhijie Shen added a comment -

          Hi Thejas and Russell,

          I'll do serialization for timezone as well.

          I think converting the string timezone (location name) to UTC offset in minutes, is one possibility.

          In my opinion, this kind of compression is lossy. Several time zones may share the same UTC offset, so when the reverse operation is performed, it is unknown which timezone the UTC offset should be converted to.

          We need an efficient way to serialize timezone along with the long. Can you propose something ? (Maybe, just make it efficient for 256 most 'popular' timezones and store it a byte. And not have the byte for UTC. For other timezones, add a timezone string ?)

          The time zone class in both the builtin and Joda libraries has the function "getAvailableIDs", which returns all the available time zone strings. On my machine, I got 616 from the builtin time zone class and 558 from the Joda one. We could probably have a one-to-one mapping between the time zone strings and integer ids stored in short variables. However, the "available" in "getAvailableIDs" sounds tricky: I'm not sure whether "getAvailableIDs" returns the same time zone list on all machines or is machine-dependent.
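The short-id lookup table suggested above could look roughly like this, using the builtin java.util.TimeZone (a sketch, not the patch's code; as noted, the list depends on the tz data shipped with the JDK or jar):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TimeZone;

public class ZoneIdTable {
    public static void main(String[] args) {
        // Hypothetical compact encoding: map each known zone ID to a short.
        String[] ids = TimeZone.getAvailableIDs();
        Map<String, Short> toCode = new HashMap<>();
        for (short i = 0; i < ids.length; i++) {
            toCode.put(ids[i], i);
        }
        // The count varies with the tz data version, roughly 600 on a JDK.
        System.out.println(ids.length > 0);
        System.out.println(toCode.containsKey("Asia/Singapore")); // true
    }
}
```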

          Thejas M Nair added a comment -

          Several time zones may share the same UTC offset, such that when the reverse operation is performed, it will be unknown which timezone the UTC offset should be converted to.

          Yes, it will be lossy, but the part that is important for date calculations is preserved. The ISO spec only has offset for timezone. I don't think we have to allow datetime field to be used for storing location information. Does JodaTime preserve the location string ?

          I'm not sure whether "getAvailableIDs" returns the same time zone list on all machines or is machine-dependent.

          It depends on the release/jar (http://joda-time.sourceforge.net/tz_update.html). As pig will be shipping this jar to the nodes, it is ok to assume that it will be the same across all nodes for a query. So it is safe to rely on the id for intermediate serialization.
          But won't jodatime support a timezone outside this list, if the user specifies a date using the UTC offset format?

          Zhijie Shen added a comment -

          Yes, it will be lossy, but the part that is important for date calculations is preserved. The ISO spec only has offset for timezone. I don't think we have to allow datetime field to be used for storing location information. Does JodaTime preserve the location string ?

          Yes, I think so. If I get a DateTimeZone object via DateTimeZone.forID("Asia/Singapore"), the returned DateTimeZone object doesn't change to "+08:00", but keeps "Asia/Singapore". We'd better preserve it, because when users want to output the time in a customized format that has "z" in the pattern string, the exact timezone can be output.

          But won't jodatime support a timezone outside this list, If the user specifies a date using the UTC offset format ?

          Yes, DateTimeZone.forID() also accepts a UTC offset string as input, such as "+08:00", though it is not in the list. However, the offset can be any value in the range [-23:59:59.999, +23:59:59.999], and the minimal granularity is the millisecond.

          Then, we are expected to have a combined lookup table that maps canonical timezone ids and UTC offsets to their concise representations. Do you have any suggestions here? Or we could temporarily set aside the performance issue for now, and move forward by making timezone serialization work by simply serializing the timezone id string.

          Thejas M Nair added a comment -

          Or we temporarily set aside the performance issue right now, and move forward to make timezone serialization work by simply serializing the timezone id string.

          We can add features later, but dropping features later won't be good. In my opinion, support for the long timezone name is not going to be needed by most people. I think we can support it only for creating a DateTime field, but state that pig will not preserve the long name. Pig will only retain the hour+minute offset (no seconds and milliseconds!). The hour+minute offset form is portable and more likely to be supported by other serialization formats.

          Russell Jurney added a comment -

          This sounds good to me.

          Zhijie Shen added a comment -

          Hi Thejas, I'll take your suggestions. Thanks!

          Zhijie Shen added a comment -

          There are some issues with loading/storing pig data. When storing a DateTime object with "Utf8StorageConverter", without using UDFs to convert it to a string, should we serialize it as a millis+timezone composite, or output an ISO-style datetime string (e.g., 2012-07-03T08:14:19.962+01:00)? The latter behaves the same as using "String ToString(DateTime d)" before storing the string. Personally, I like the latter choice, because the data is directly readable from the stored files.

          On the other hand, if a datetime object is stored in the file as a datetime string, when we load it again as a datetime object, should we use the default timezone or the one specified in the timezone string (e.g., +01:00 in the last example)? I again prefer the second choice. When we use Pig, it is common to do a series of store/loads to achieve some goal, and the timezone information needs to be preserved. For example, assume +08:00 is the default timezone. A datetime object whose individual timezone is -04:00 is stored as a string, which will have -04:00 as a suffix. When the string is loaded as a datetime object for further processing, we'd better keep the previously used timezone, -04:00, instead of the default one.

          What do you think about this? Thanks!

          Thejas M Nair added a comment -

          PigStorage is meant to be a human-readable format, so that is another reason to store the timestamp as an ISO string, as you suggested.
          Yes, if the timezone is specified in the string, pig should use that value. But the timezone part and the time part of the datetime string should be optional. Does jodatime support that?
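The agreed-on parsing behavior can be sketched with java.time (the patch itself uses Joda-Time; the default offset below is an assumed example value): a timezone present in the string is kept, and an absent one falls back to a configured default.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

public class IsoRoundTrip {
    public static void main(String[] args) {
        // Timezone present in the string: keep it, do not apply the default.
        OffsetDateTime withTz = OffsetDateTime.parse("2012-07-03T08:14:19.962+01:00");
        System.out.println(withTz); // 2012-07-03T08:14:19.962+01:00

        // Timezone absent: fall back to a (hypothetical) Pig-wide default.
        ZoneOffset defaultTz = ZoneOffset.of("+08:00");
        OffsetDateTime noTz = LocalDateTime.parse("2012-07-03T08:14:19.962").atOffset(defaultTz);
        System.out.println(noTz); // 2012-07-03T08:14:19.962+08:00
    }
}
```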

          Zhijie Shen added a comment -

          But the timezone part and time part of the datetime string should be optional. Does jodatime support that?

          Yes, these two parts are not mandatory. The default time value is "00:00:00.000" and the default timezone offset is "+00:00". When the datetime object is output as an ISO-format string, the default parts will be filled in (e.g., 2012-07-03T00:00:00.000Z).

          Zhijie Shen added a comment -

          Here's the newest patch for this issue, which contains the following changes since the last one:

          1. Including the timezone when serializing datetime objects.

          2. Implementing the additional UDFs that we have discussed.

          3. Updating my previous modifications to solve some conflicts with the PIG-2632 patch.

          4. Adding "timezone" configuration for Pig.

          Up to now, the patch can basically make the primitive datetime type work.

          However, I've not done thorough testing yet. Therefore, my next step (in the second half of GSoC) will focus on coding the test cases, fixing bugs, etc.

          Thejas M Nair added a comment -

          Zhijie,
          I have added comments on your latest patch in https://reviews.apache.org/r/5414/.
          Yes, lets focus on test cases now, so that we can get an initial version committed.

          Zhijie Shen added a comment -

          Hi Thejas,

          Thanks for your review. I'll check out your comments.

          Zhijie Shen added a comment -

          Hi Thejas,

          Here's my latest patch. Compared to the last one, there are the following modifications:

          1. I've modified the code according to many of the review comments.

          2. I've added many test cases, but some are still missing. I'll add more in the following days.

          3. I've fixed some bugs while running the newly added test cases.

          There's still some issues related to timezone I need to discuss with you:

          1. You've mentioned that we need to propagate the timezone from the client to the backend, where the UDFs get executed. How should the timezone be propagated to the backend, which I assume is the machine that runs the code? Previously I put the timezone setting in pig.properties, which is loaded when PigServer runs, so that the default timezone is set. Consequently, if a datetime object is created without specifying the timezone, the default one is used. However, did you mean some other way?

          2. According to our previous discussion, ToDate() can take different types of timezone input, either a location or a UTC offset. However, two timezones of the two different types may be treated as unequal even when the offset is the same. For example, new DateTime(0L, DateTimeZone.forID("Asia/Singapore")) and new DateTime(0L, DateTimeZone.forID("+08:00")) are not equal. As we previously chose the UTC offset as the basic timezone representation, I convert location-based timezones to UTC-offset ones and only use the UTC-offset style internally. Therefore, the two aforementioned equal datetime objects will not be mistreated.
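The same location-vs-offset inequality exists in java.time, which makes it a convenient stdlib illustration of the normalization described above (the patch does this with Joda's DateTimeZone, not java.time):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class ZoneNormalize {
    public static void main(String[] args) {
        Instant t = Instant.ofEpochMilli(1_000_000_000_000L);
        ZonedDateTime byName = t.atZone(ZoneId.of("Asia/Singapore"));
        ZonedDateTime byOffset = t.atZone(ZoneOffset.of("+08:00"));

        System.out.println(byName.equals(byOffset));                          // false: zone ids differ
        System.out.println(byName.toInstant().equals(byOffset.toInstant()));  // true: same instant

        // Normalizing the named zone to its offset makes the two compare equal.
        ZonedDateTime normalized = byName.withZoneSameInstant(byName.getOffset());
        System.out.println(normalized.equals(byOffset));                      // true
    }
}
```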

          Regards,
          Zhijie

          Thejas M Nair added a comment -

          1. You've mentioned that we need to propagate the timezone from the client to the backend, where the UDFs get executed. How should the timezone be propagated to the backend, which I assume is the machine that runs the code?

          Yes

          Previously I put the timezone setting in pig.properties, which is loaded when PigServer runs, so that the default timezone is set. Consequently, if a datetime object is created without specifying the timezone, the default one is used. However, did you mean some other way?

          It is possible that some of the task nodes might be misconfigured and have a different default time zone. In such cases, the results won't be what you want, and it will be very difficult to debug. So the default timezone on the client should be used on the nodes as well.

          I convert the location-based timezone to the utc-offset one and only use utc-offset style internally. Therefore, the aforementioned two equal datetime objects will not be mis-treated.

          Sounds good.

          Zhijie Shen added a comment -

          Hi Thejas,

          I attached my newest patch (the same as the one in my previous email to you). Compared to the last version, there are the following improvements:

          1. More test cases have been added, so the test suite is nearly complete.

          2. Fixed some bugs found via the test cases, including in the builtin functions (e.g., argToFuncMapping).

          3. I added some more builtin functions (MilliSecondsBetween, GetMilliSecond, ToMillSeconds), since the granularity of pig DateTime is set to the millisecond.

          I also have some comments:

          1. DiffDate behaves similarly to DaysBetween, except that the former returns opposite values if the two arguments are swapped.

          2. From your last response, I'm not clear how the default timezone of the client can be sent to the server with the code. In my opinion, the default timezone should be specified on the server side by configuration, which should be taken care of by administrators. What do you think about this?

          I think this patch is close to commit. Please check it out. Thanks!

          Thejas M Nair added a comment -

          2. According to your last response, I'm not clear how the default timezone of the client can be sent to the server in the code. In my opinion, the default timezone should be specified on the server side by configuration, which should be taken care of by administrators. What do you think about this?

          I believe you should be able to set the default timezone property in PigContext constructor, and also let user override the default. In backend, you can access the value using something like - PigMapReduce.sJobConfInternal.get().get("pig.datetime.default.tz").

          Russell Jurney added a comment -

          I agree with Thejas. The user will want to control the timezone of NOW() without having to reconfigure the hadoop cluster/contact the hadoop administrator. Setting this on the client is consistent with Pig as a client-side technology.

          Zhijie Shen added a comment -

          I believe you should be able to set the default timezone property in PigContext constructor, and also let user override the default. In backend, you can access the value using something like - PigMapReduce.sJobConfInternal.get().get("pig.datetime.default.tz").

          Thank you, Thejas! Let me investigate this issue.

          Zhijie Shen added a comment -

          Hi Thejas,

          I attached my latest patch. In this version, I fixed the default timezone issue. Pig can obtain the timezone string from PigContext, which can be loaded either from the default property files or from user-supplied sources. Instead of calling PigMapReduce.sJobConfInternal.get().get("pig.datetime.default.tz") every time no user-supplied timezone is specified for DateTime construction, I configure Joda's default timezone during the setup() stage of PigGenericMapBase and PigGenericMapReduce. Therefore, when no timezone is specified for DateTime construction, the created DateTime object automatically uses the default timezone. I think that this way, users writing DateTime-related UDFs do not need to touch low-level details (calling PigMapReduce.sJobConfInternal), and it avoids the ambiguity that PigMapReduce.sJobConfInternal.get().get("pig.datetime.default.tz") and DateTimeZone.getDefault().getID() may sometimes differ.

          Russell Jurney added a comment -

          I have one suggestion - add getWeeks and weeksBetween, if it isn't inconvenient. I think Jodatime can do this. It is useful when dealing in weeks.

          Zhijie Shen added a comment -

          I have one suggestion - add getWeeks and weeksBetween, if it isn't inconvenient. I think Jodatime can do this. It is useful when dealing in weeks.

          Yes, a week field should be useful. In addition, I think it's better to add getWeekYear as well, because using week-of-year alone can be ambiguous. For example, both "2008-12-31" and "2009-01-01" fall in week 1 of weekyear 2009, though the two dates are in two different calendar years.

          In addition, do you think it is better to rename some time UDFs as follows?

          getMonth -> getMonthOfYear
          getDay -> getDayOfMonth (do we need getDayOfWeek and getDayOfYear as well?)
          getHour -> getHourOfDay
          getMinute -> getMinuteOfHour
          getSecond -> getSecondOfMinute
          getMilliSecond -> getMilliOfSecond

          The changes would make the UDF names longer but clearer.
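The weekyear ambiguity in the example above can be checked with Python's ISO-8601 calendar, which follows the same week-numbering convention as Joda-Time's weekyear:

```python
from datetime import date

# isocalendar() returns (ISO week-based year, ISO week number, ISO weekday).
for d in (date(2008, 12, 31), date(2009, 1, 1)):
    iso_year, iso_week, _ = d.isocalendar()
    print(d, "-> weekyear", iso_year, "week", iso_week)

# Both dates fall in week 1 of weekyear 2009, so a GetWeek-style result alone
# cannot tell which calendar year a date belongs to; GetWeekYear resolves it.
```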

          Zhijie Shen added a comment -

          Hi Thejas,

          I've updated the patch again. I'm sorry if this disturbs your review of the code. In the latest version, I added three more datetime-related UDFs and corresponding test cases, following Russell's suggestion:

          1. WeeksBetween
          2. GetWeek
          3. GetWeekYear

          In addition, I modified the code of the XXXXBetween UDFs. Previously, all the UDFs in this category leveraged Joda to compute the interval. However, Joda can only return the interval as an int, while the actual interval may be so large that it has to be stored in a long variable. Therefore, for the datetime fields of fixed length:

          1. MilliSecondsBetween
          2. SecondsBetween
          3. MinutesBetween
          4. HoursBetween
          5. DaysBetween
          6. WeeksBetween

          I adopted my own computation methods. On the other hand, for the datetime fields of variable length:

          1. MonthsBetween
          2. YearsBetween

          I kept the Joda methods. We may improve this later.

          Lastly, I removed the DiffDate UDF, because it is the same as DaysBetween.
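The fixed-length computation can be sketched like this. It is a Python illustration of the idea, not the patch's Java code, and the `between` helper is hypothetical:

```python
from datetime import datetime

# Fixed-length fields: each unit is an exact number of milliseconds, so the
# interval can be computed directly from the millisecond difference and
# returned as an arbitrarily large integer (a Java long in the patch).
MILLIS_PER = {
    "milliseconds": 1,
    "seconds": 1000,
    "minutes": 60 * 1000,
    "hours": 60 * 60 * 1000,
    "days": 24 * 60 * 60 * 1000,
    "weeks": 7 * 24 * 60 * 60 * 1000,
}

def between(end, start, unit):
    """Signed whole units from start to end, truncated toward zero."""
    millis = int((end - start).total_seconds() * 1000)
    q = abs(millis) // MILLIS_PER[unit]
    return q if millis >= 0 else -q

a = datetime(2009, 1, 1)
b = datetime(2009, 1, 15, 12)   # 14.5 days later

print(between(b, a, "days"))    # 14
print(between(b, a, "weeks"))   # 2
print(between(a, b, "days"))    # -14
```

Months and years have no fixed millisecond length, which is why the calendar-aware Joda methods are kept for MonthsBetween and YearsBetween.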

          Thejas M Nair added a comment -

          PIG-1314-7.patch committed to trunk! Thanks Zhijie.

          We need to update the documentation regarding this change. Can you please upload a new patch for that? To see the generated docs, run: ant -Dforrest.home=<Forrest installation dir> docs. The files to be edited are under trunk/src/docs/src/documentation/.

          We should also add a few end to end test cases for datetime. See https://cwiki.apache.org/confluence/display/PIG/HowToTest#HowToTest-EndtoendTesting . We should have a few queries that do some of the basic operations on datetime, and queries that do order-by, group, and join on date fields.
          These can be submitted as multiple patches.

          Thejas M Nair added a comment -

          We also need to have some test cases that set the timezone property. This might not be easy to do in the e2e framework, so unit test cases are better candidates for this. Please let me know if you need any help.

          Zhijie Shen added a comment -

          Hi Thejas, let me do that.

          Julien Le Dem added a comment -

          Hi Thejas,
          this commit added JobControlCompiler.java.orig, which I suspect is not what you intended.
          http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java.orig?view=log&pathrev=1376800
          Could you double check?
          Thanks, Julien

          Thejas M Nair added a comment -

          Yes, that was not intentional. Deleted JobControlCompiler.java.orig in svn.

          Dmitriy V. Ryaboy added a comment -

          A chunk of this is committed, and it's not clear what's left to do. Can we close this and create a new ticket for the remaining work?

          Thejas M Nair added a comment -

          As Dmitriy suggested, closing this jira; opened new ones for the remaining work: PIG-2980, PIG-2981, PIG-2982.


            People

            • Assignee: Zhijie Shen
            • Reporter: Russell Jurney
            • Votes: 6
            • Watchers: 16

            Time Tracking

            • Original Estimate: 672h
            • Remaining Estimate: 672h
            • Time Spent: Not Specified