Uploaded image for project: 'Kylin'
  1. Kylin
  2. KYLIN-1513

Time partitioning doesn't work across multiple days

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: v1.5.0
    • Fix Version/s: v1.5.1
    • Component/s: None
    • Labels:
      None

      Description

      I was attempting to use the new time partition functionality added in v1.5.0 (https://issues.apache.org/jira/browse/KYLIN-1427). I realize this hasn't been added to the interface yet (https://issues.apache.org/jira/browse/KYLIN-1441), so I'm not sure if this is feature is completely ready yet, but I thought I'd note the issue I ran into:

      • Using the API, I defined the `partition_time_column` and `partition_time_format` attributes on my model:
        {
          "partition_desc": {
            "partition_date_column": "DEFAULT.LOGS.CAL_DATE",
            "partition_date_format": "yyyy-MM-dd",
            "partition_date_column": "DEFAULT.LOGS.CAL_HOUR",
            "partition_date_format": "H",
            "partition_date_start": null,
            "partition_type": "APPEND"
          }
        }
            
      • I then attempted to build the cube for the first time over a multi-day duration (2010-08-16 to 2010-11-01).
      • The cube reported successfully building, but the resulting cube contained no data.

      When I looked at the Kylin logs, it looks like the SQL generated by the time partition won't work if the query spans multiple days. In my case, the WHERE clause used for 2010-08-16 to 2010-11-01 was:

      WHERE (LOGS.CAL_DATE >= '2010-08-16' AND LOGS.CAL_DATE < '2010-11-01' AND LOGS.CAL_HOUR >= '0' AND LOGS.CAL_HOUR < '0')
      

      The issues seems to be that the hour condition is ANDed to the end of the existing date range query: https://github.com/apache/kylin/blob/kylin-1.5.0/core-metadata/src/main/java/org/apache/kylin/metadata/model/PartitionDesc.java#L170-L174 However, this logic doesn't work when the date range spans multiple days. In this case, it's trying to match where the hour column is both >= 0 and < 0 (since I was processing midnight to midnight), which will never match any results. However, if I switched my cube build to end at 02:00, I believe this would lead to some results being built in the cube, but not what's expected (it would only process 00:00-02:00 on each day.

      So while I think the current implementation will work as long as you're building less than 24 hours of data at a time, it would be nice if this could still support multiple-day builds when this time partition is also present.

      I think when a separate time column is present, the SQL generated would need to be something more like:

      WHERE
        (LOGS.CAL_DATE = '2010-08-16' AND LOGS.CAL_HOUR >= '0') OR
        (LOGS.CAL_DATE > '2010-08-16' AND LOGS.CAL_DATE < '2010-11-01') OR
        (LOGS.CAL_DATE = '2010-11-01' AND LOGS.CAL_HOUR < '0')
      

        Attachments

          Activity

            People

            • Assignee:
              Shaofengshi Shaofeng SHI
              Reporter:
              nickm Nick Muerdter
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: