Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-3558

Functions are not being called if the use dirN functions that hva not beeni initialized

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.1.0
    • Future
    • SQL Parser
    • None

    Description

      If a function takes dir1 as a parameter (example: fn(dir0, dir1)) and there is no second level dir being traversed then the function is not called.

      See the following from the usergroup:

      ===================================================
      I'm still working on our evaluation and now focusing on directory based queries and mixing directory and parquet based partitions.
      This is also a continued trip down the UDF rabbit hole (some pun intended)

      I continue to come across things that surprise me and I would like to share them with both the developers, that might want to address some of them, and other newcomers that might benefit from them.

      The UDF code referenced here van be found at (https://github.com/acmeguy/asdrill) and the documents and the directory structure used in these examples are included in the tiny attachment.

      I will try to keep this as brief as possible.

      What you need to know is:
      there are 33 records in the 19 files in a mixed directory structure

      • see zip for details

      each record contains a date that is valid within that directory structure

      the dirInRange function is a UDF that takes a date range and directory information to determine if a directory contains target data for the provided date range

      • see github for details

      the dirInRange function should be able to accept all null values or missing parameters for everything other than the time range starts parameter

      the dirInRange function returns a string with a number that represents the number of parameters used (function variant)

      • has no other purpose/function at this point
      • will return the value of the last dirN paramater that is not null (dir0, dir1 or dir2)
        Observations

      1. The UDF function (dirInRange) is not called if dir0, dir1 or dir2 are missing (missing is not the same as null here)
      select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from dfs.tmp.`/analytics/processed/test/events` as t order by occurred_at;

      • return 33 records

      select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from dfs.tmp.`/analytics/processed/test/events` as t where dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as timestamp),COALESCE(dir0,''),COALESCE(dir1,''),COALESCE(dir2,'-')) > '0' order by occurred_at;

      • returns 33 record (Coalesce handles the missing values are replaces them with '-')

      select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from dfs.tmp.`/analytics/processed/test/events` as t where dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as timestamp),dir0,dir1,dir2) > '0' order by occurred_at;

      • returns 13 records (only those in the deepest directories where dir0, dir1, dir2 are all set

      select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay, dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as timestamp),dir0,dir1,dir2) inRange from dfs.tmp.`/analytics/processed/test/events` as t order by occurred_at;

      • return 33 records but 20 of the records will have inRange set to null (the UDF never returns null so it's being ignored completely)

      Lesson: It's not enough to use Nullable*Holder in UDF and have all permutations covered

      • Drill will not call the function and fails silently, evaluating the outcome of the function to null, if any of the dirN parameters are not initialized
        2. System.out.print out is the way to get information from within the Drillbit
        It would be good to know which Drillbit instance, if many, is responsible for the println
      • I don't know how to get the parent drillbit injected into the UDF
        3. If directories have numeric names then Drill starts to insist they are all numeric (in the where condition) event though dirInRange always returns a varchar.
        select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from dfs.tmp.`/analytics/processed/test/events` as t where dir0 = dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as timestamp),COALESCE(dir0,''),COALESCE(dir1,''),COALESCE(dir2,'-')) order by occurred_at;
        "2011,2012" is the name of the directory (same happens with directories ("Q1" and "W1" etc.)
        java.lang.RuntimeException: java.sql.SQLException: SYSTEM ERROR: NumberFormatException: 2011,2012
        Fragment 0:0
        [Error Id: 0c3e1370-ccc5-4288-b6c9-ea0ef4884f1e on localhost:31010]

      This seems to fail on the other side where Drill thinks that the outcome of the dirInRange function is numeric and that the "=" expression is a numerical one.

      this runs though: select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from dfs.tmp.`/analytics/processed/test/events` as t where dir0 = trim(dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as timestamp),COALESCE(dir2,''),COALESCE(dir2,''),COALESCE(dir2,'-'))) order by occurred_at;

      • here the trim() function takes care of making sure that the UDF returns a string (I think) even though the return type of the UDF is always a varcharholder.
        4. Directories do not fail and all files and all their records are evaluated
        select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from dfs.tmp.`/analytics/processed/test/events` as t where dir0 = trim(dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as timestamp),COALESCE(dir2,''),COALESCE(dir2,''),COALESCE(dir2,'-'))) order by occurred_at;
      • correctly returns no records (return value of dirInRange never matches dir0)

      This still evaluates all the records in all the files in all the directories

      • no partition pruning is available for selecting from directories
      • massive performance/optimization gain to be had
        5. File name is not available
        It might be good to have dynamic filename variable available (just as the directory variables)

      Attachments

        Activity

          People

            Unassigned Unassigned
            acmeguy Stefán Baxter
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: