Kafka / KAFKA-4353

Add semantic types to Kafka Connect


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.10.0.1
    • Fix Version/s: None
    • Component/s: connect

    Description

      Kafka Connect's schema system defines several core types, consisting of these complex types:

      • STRUCT
      • ARRAY
      • MAP

      plus these primitive types:

      • INT8
      • INT16
      • INT32
      • INT64
      • FLOAT32
      • FLOAT64
      • BOOLEAN
      • STRING
      • BYTES

      The schemas for these core types define several attributes, but they do not have a name.

      Kafka Connect also defines several logical types, which are specializations of the primitive types that do have schema names and are automatically mapped to and from Java objects:

      Schema Name | Primitive Type | Java value class | Description
      o.k.c.d.Decimal | BYTES | java.math.BigDecimal | An arbitrary-precision signed decimal number.
      o.k.c.d.Date | INT32 | java.util.Date | A date representing a calendar day with no time of day or timezone. The java.util.Date value's hours, minutes, seconds, and milliseconds are set to 0. The underlying representation is an integer representing the number of standardized days (based on a number of milliseconds with 24 hours/day, 60 minutes/hour, 60 seconds/minute, 1000 milliseconds/second) since the Unix epoch.
      o.k.c.d.Time | INT32 | java.util.Date | A time representing a specific point in a day, not tied to any specific date. Only the java.util.Date value's hours, minutes, seconds, and milliseconds can be non-zero. This effectively makes it a point in time during the first day after the Unix epoch. The underlying representation is an integer representing the number of milliseconds after midnight.
      o.k.c.d.Timestamp | INT64 | java.util.Date | A timestamp representing an absolute time, without timezone information. The underlying representation is a long representing the number of milliseconds since the Unix epoch.

      where "o.k.c.d" is short for org.apache.kafka.connect.data. ewencp has stated in the past that adding more logical types is challenging and generally undesirable, since everyone who uses Kafka Connect values has to deal with every new logical type.
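
      The integer encodings described for Date, Time, and Timestamp in the table above can be sketched with plain java.util.Date arithmetic. This is only an illustration of the descriptions in the table, not Kafka Connect's own conversion code; the class and method names (LogicalTypeEncodings, dateToInt, timeToInt, timestampToLong) are hypothetical.

```java
import java.util.Date;

// Hypothetical helper illustrating the encodings described in the table;
// it does not use or reproduce the Kafka Connect API.
final class LogicalTypeEncodings {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Date: number of standardized days since the Unix epoch (INT32).
    static int dateToInt(Date value) {
        long millis = value.getTime();
        if (millis % MILLIS_PER_DAY != 0) {
            throw new IllegalArgumentException("Date value must not carry a time of day");
        }
        return (int) (millis / MILLIS_PER_DAY);
    }

    // Time: number of milliseconds after midnight (INT32).
    static int timeToInt(Date value) {
        long millis = value.getTime();
        if (millis < 0 || millis >= MILLIS_PER_DAY) {
            throw new IllegalArgumentException("Time value must fall within the first day after epoch");
        }
        return (int) millis;
    }

    // Timestamp: number of milliseconds since the Unix epoch (a long).
    static long timestampToLong(Date value) {
        return value.getTime();
    }

    public static void main(String[] args) {
        System.out.println(dateToInt(new Date(2 * MILLIS_PER_DAY)));
        System.out.println(timeToInt(new Date(3_600_000L)));
        System.out.println(timestampToLong(new Date(1_477_000_000_000L)));
    }
}
```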

      This proposal adds standard semantic types that sit somewhere between the core types and the logical types. They are simply predefined schemas that have names and are based on primitive types. Unlike logical types, however, their values are not mapped to any form other than the primitive.

      The purpose of semantic types is to provide hints as to how the values can be treated. Of course, clients are free to ignore the hints of some or all of the built-in semantic types, and in these cases would treat the values as the primitive value with no extra semantics. This behavior makes it much easier to add new semantic types over time without risking incompatibilities.
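
      As a rough sketch of the "hint" behavior described above, a client could check the schema name and fall back to the primitive value when it does not recognize the name or chooses to ignore it. The NamedSchema class below is a hypothetical stand-in for a Connect schema (a primitive type plus an optional name), and the Uuid schema name is the one proposed in this issue, not an existing class.

```java
import java.util.UUID;

// Hypothetical sketch; NamedSchema is a stand-in, not the Kafka Connect Schema interface.
final class SemanticHintDemo {
    // Schema name proposed in this issue (assumption: it is not part of any released API).
    static final String UUID_NAME = "org.apache.kafka.connect.data.Uuid";

    static final class NamedSchema {
        final String type; // primitive type, e.g. "STRING"
        final String name; // optional semantic name, may be null

        NamedSchema(String type, String name) {
            this.type = type;
            this.name = name;
        }
    }

    // A client that honors the Uuid hint and treats everything else as the primitive value.
    static Object interpret(NamedSchema schema, Object value) {
        if ("STRING".equals(schema.type) && UUID_NAME.equals(schema.name)) {
            return UUID.fromString((String) value);
        }
        return value; // unknown or ignored hint: the primitive value is used as-is
    }

    public static void main(String[] args) {
        String raw = "123e4567-e89b-12d3-a456-426614174000";
        System.out.println(interpret(new NamedSchema("STRING", UUID_NAME), raw).getClass());
        System.out.println(interpret(new NamedSchema("STRING", "some.unknown.Name"), raw).getClass());
    }
}
```

      A client that ignores the hint entirely would simply skip the name check, which is why new names can be added without breaking existing consumers.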

      Really, any source connector can define custom semantic types, but there is tremendous value in having a library of standard, well-known semantic types, including:

      Schema Name | Primitive Type | Description
      o.k.c.d.Uuid | STRING | A UUID in string form.
      o.k.c.d.Json | STRING | A JSON document, array, or scalar in string form.
      o.k.c.d.Xml | STRING | An XML document in string form.
      o.k.c.d.BitSet | STRING | A string of zero or more 0 or 1 characters.
      o.k.c.d.ZonedTime | STRING | An ISO-8601 formatted representation of a time (with fractional seconds) with timezone or offset from UTC.
      o.k.c.d.ZonedTimestamp | STRING | An ISO-8601 formatted representation of a timestamp with timezone or offset from UTC.
      o.k.c.d.EpochDays | INT64 | A date with no time or timezone information, represented as the number of days since (or before) the epoch (January 1, 1970, at 00:00:00 UTC).
      o.k.c.d.Year | INT32 | The year number.
      o.k.c.d.MilliTime | INT32 | Number of milliseconds past midnight.
      o.k.c.d.MicroTime | INT64 | Number of microseconds past midnight.
      o.k.c.d.NanoTime | INT64 | Number of nanoseconds past midnight.
      o.k.c.d.MilliTimestamp | INT64 | Number of milliseconds past epoch.
      o.k.c.d.MicroTimestamp | INT64 | Number of microseconds past epoch.
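
      To make the value representations in this table concrete, here is a minimal sketch of how a source connector might produce a few of them using java.time. The class and method names are hypothetical, and the choice of DateTimeFormatter.ISO_OFFSET_TIME and ISO_OFFSET_DATE_TIME is an assumption; the issue only says "ISO-8601 formatted".

```java
import java.time.Instant;
import java.time.LocalTime;
import java.time.OffsetDateTime;
import java.time.OffsetTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Hypothetical helper; none of these methods exist in Kafka Connect.
final class SemanticTypeEncodings {
    // MilliTime (INT32): milliseconds past midnight.
    static int milliTime(LocalTime t) {
        return (int) (t.toNanoOfDay() / 1_000_000L);
    }

    // MicroTime (INT64): microseconds past midnight.
    static long microTime(LocalTime t) {
        return t.toNanoOfDay() / 1_000L;
    }

    // NanoTime (INT64): nanoseconds past midnight.
    static long nanoTime(LocalTime t) {
        return t.toNanoOfDay();
    }

    // MicroTimestamp (INT64): microseconds past epoch.
    static long microTimestamp(Instant i) {
        return ChronoUnit.MICROS.between(Instant.EPOCH, i);
    }

    // ZonedTime (STRING): ISO-8601 time with offset (formatter choice is an assumption).
    static String zonedTime(OffsetTime t) {
        return DateTimeFormatter.ISO_OFFSET_TIME.format(t);
    }

    // ZonedTimestamp (STRING): ISO-8601 timestamp with offset (formatter choice is an assumption).
    static String zonedTimestamp(OffsetDateTime t) {
        return DateTimeFormatter.ISO_OFFSET_DATE_TIME.format(t);
    }

    public static void main(String[] args) {
        System.out.println(microTime(LocalTime.of(0, 0, 1)));
        System.out.println(zonedTimestamp(OffsetDateTime.now()));
    }
}
```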

          People

            Unassigned
            Randall Hauch (rhauch)
            Ewen Cheslack-Postava