Uploaded image for project: 'Kylin'
  1. Kylin
  2. KYLIN-1527

Columns with all NULL values can't be queried

VotersWatch issueWatchersLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • v1.5.0
    • v1.5.1
    • None
    • None

    Description

      I'm running into some issues in Kylin 1.5 when trying to query columns that contain only NULL values. Here's a quick description of my setup and what I'm seeing:

      • I have a Hive ORC table with a "denied_reason" column.
      • The data is partitioned by day.
      • Before 2015-02-09, the values for "denied_reason" are all NULL.
      • Beginning on 2015-02-09, the values of "denied_reason" are a combination of NULLs and various string values.

      If I build a new data cube with the initial build going until 2015-02-08 (during which time the "denied_reason" column will always be NULL), and then I try to query the resulting cube, any query involving the "denied_reason" column will fail with this error:

      SELECT COUNT(*) FROM default.logs WHERE denied_reason IS NULL;
      > Error while executing SQL "SELECT COUNT(*) FROM default.logs WHERE denied_reason IS NULL LIMIT 50000": Dictionary for DEFAULT.LOGS.DENIED_REASON is not found
      

      If I then re-build the existing cube to include additional days after 2015-02-09 (during which time the "denied_reason" column has values), the query will still fail with the same "Dictionary for DEFAULT.LOGS.DENIED_REASON is not found" error.

      If I attempt to merge the segments from before 2015-02-09 with the segments afterwards, I get this error during the segment "Merge Cuboid Data" phase:

      Error: java.lang.NullPointerException at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:187) at com.google.common.cache.LocalCache.get(LocalCache.java:3964) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969) at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829) at org.apache.kylin.dict.DictionaryManager.getDictionaryInfo(DictionaryManager.java:114) at org.apache.kylin.dict.DictionaryManager.getDictionary(DictionaryManager.java:108) at org.apache.kylin.engine.mr.steps.MergeCuboidMapper.map(MergeCuboidMapper.java:183) at org.apache.kylin.engine.mr.steps.MergeCuboidMapper.map(MergeCuboidMapper.java:62) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
      

      If I purge the cube and perform the initial build with dates spanning both before and after 2015-02-09 (so during this time the "denied_reason" column would both have days containing all NULL values, as well as real values), then my queries work as expected:

      SELECT COUNT(*) FROM default.logs WHERE denied_reason IS NULL;
      > 2285375
      SELECT COUNT(*) FROM default.logs WHERE denied_reason IS NOT NULL;
      > 35936
      

      So it seems like the issue is that if the data cube initially gets built with a column only containing NULL values, then that column becomes unqueryable, and the segments can't be merged. However, if the cube is initially built with at least some data in each column, then all the NULL data works as expected.

      I hope that makes sense, but let me know if anything isn't clear. And if it helps, I'm pretty sure I had done similar builds with NULL data in Kylin 1.2 without issues, so this might be a new issue somewhere between Kylin 1.2 and 1.5 (however, I'm not totally sure about that, so let me know if it would help for me to get my Kylin 1.2 instance spun up again to verify).

      (And I was attempting to come up with a smaller, reproducible case to help demonstrate this issue, but I actually ended up encountering slightly different NULL issues when trying to build that test case. Basically, I was getting this error during the "Build Phase" of the initial cube build if my simpler example cube was being built against a date range with only NULL data: https://gist.github.com/GUI/1ddf3624fa1bd10e53cf I wasn't sure if that related to this same NULL issue I outlined above, so let me know if you'd like for me to provide more detail or open a separate issue for that.)

      Thanks you!

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            liyang.gmt8@gmail.com liyang
            nickm Nick Muerdter
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment