Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-4765

Missing, incorrect information in Drill data types page

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.6.0
    • None
    • Documentation
    • None

    Description

      Consider the Drill Supported Types page: https://drill.apache.org/docs/supported-data-types/

      A number of issues can be seen.

      For BIGINT, it would be clearer to express the range as: -2^63 to 2^63-1.

      For INTEGER, it would be clearer to express the range as: -2^31 to 2^31-1.

      DATE: The statement "in YYYY-MM-DD format" is wrong. The internal representation has no format, it is just a number representing the day count. The format is applied only on output and varies depending on the tool used. Perhaps for the Drill web UI it is in ISO format.

      DATE: Presumably the date is not time-zone specific. That is, 2016-07-01 is the first of July in both the US and India, though a given absolute time may be on two different dates in these locations.

      DATE: We use 4713 BC as a 0-point. But, the calendar system has changed many times since that date. (Indeed, the current system did not even exist on that date.) Is this a simple projection of the current system back in time, or does it adjust for the discontinuties in the Gregorian calendar? This should be stated as it is important for any data files that contain historical dates. (And is why choosing a 20th-century 0-point would have been better...)

      FLOAT, DOUBLE: presumably these are in the standard IEEE Standard 754 format? If so, let's state that.

      INTERVAL: there are many ways that intervals have been represented in DB systems. Parquet represents data as a triple: months, days and (milli)seconds. Does Drill use a similar format? If not, what is the format? A normal DB can declare the interval as part of the data declaration. How does Drill infer the format? How does the user access the parts of the range?

      INTERVAL: the footnote says, "Internally, INTERVAL is represented as INTERVALDAY or INTERVALYEAR." But, if so, then INTERVAL can't represent a time interval: a serious limitation. Also, we can't convert a Parquet Interval to a Drill interval since there is no mapping to Drill that includes months, days and seconds. This is a huge limitation and should be explained.

      SMALLINT: This is a supported types table, but the footnotes say SMALLINT is not supported. We also do not list the many internal Value Vector types we don't support (int8, uint8, int16, uint16, uint32 and so on.) Should we list SMALLINT if we don't actually support it?

      TIME: the format is acutally number of seconds since 2001-01-01. The "24-hour based time ... in hours, minutes, seconds format" confuses display format with internal representation. See DATE above.

      TIME: Presumably the time is in local time, not UTC. That is, the time is 12:34:56 with the time zone left unspecified.

      TIME: The example for TIME is, "22:55:55.23". But, note that the example shows milliseconds, but the description says the time unit is seconds. Which is right?

      TIME: The example shows just a time (seconds since midnight), but the description says that this is a timestamp: number of seconds since 2010-01-01. If so, then is TIME like TIMESTAMP (with a different basis)? Or is really a time-only value (so that the description is wrong?)

      TIMESTAMP: The description says "JDBC timestamp", but this is not accurate. JDBC is a layer on top of a DB. So, we could say, "JDBC timestamp format".

      TIMESTAMP: Explain the basis. A JDBC timestamp (https://docs.oracle.com/javase/8/docs/api/java/sql/Timestamp.html) expresses time in nanoseconds since 1970-01-01. So, does Drill also have nanosecond precision? The docs say, "optional milliseconds", so presumably Drill only keeps milliseconds, As a result, the Drill timestamp is NOT a JDBC timestamp.

      TIMESTAMP: JDBC timestamps are vague. They are based on a Java Date which is defined as milliseconds since 1970-01-01T00:00:00 UTC. But, it seems a JDBC timestamp is local (it has no implied timezone). Does Drill assume that a TIMESTAMP is UTC (like java.util.Date) or local (like java.sql.Date)?

      TIMESTAMP & DATE/TIME: We've created an incompatibility between the date & time format on the one hand, and TIMESTAMP on the other. We should explain how to convert between the two since it is non-obvious how this would be done without noodling out the conversion factor. Or, is a Drill timestamp also based on 2001-01-01 like a Drill date?

      TIMESTAMP: "format: yyyy-MM-dd HH:mm:ss.SSS". Again, the timestamp does not have a format, it is just a count of millis (or nanos, see above.) As explained above, formatting is done by each tool and can be whatever the user wants.

      CHAR: The description says, "The default limit is 1 character. The maximum character limit is 2,147,483,647." which seems to apply to the CHAR type: CHAR(1) to CHAR(2,147,483,647). But, the footnote says, "Currently, Drill supports only variable-length strings." So, the "default limit..." stuff does not actually apply, does it?

      General: In a full DB, types are important because the user declares columns of the given type. Thus, I can specify DECIMAL(10,2) or CHAR(5) and it means something. But, Drill is a query-only engine. So, how are the types used? In general, types have to be inferred from data (or defined as casts in SQL or in views). So, we need to describe the type inference for each input source. And, the semantic rules that apply when converting data inside a view or query. As an example, what happens when we convert the incompatible Parquet INTERVAL to a Drill Interval?

      General: Each issue should be discussed and resolved with development as some of the above may be more than just a documentation issue.

      Attachments

        Activity

          People

            bbevens Bridget Bevens
            paul-rogers Paul Rogers
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: