DRILL-6168 (Apache Drill)

Table functions do not "inherit" default configuration


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.12.0
    • Fix Version/s: 1.18.0
    • None

    Description

      See DRILL-6167, which describes an attempt to use a table function with a regex format plugin.

      Consider the plugin configuration:

          RegexFormatConfig sampleConfig = new RegexFormatConfig();
          sampleConfig.extension = "log1";
          sampleConfig.regex = DATE_ONLY_PATTERN;
          sampleConfig.fields = Lists.newArrayList("year", "month", "day");
      

      (This plugin config is defined in test code rather than through the usual JSON in the Web console.)

      Run a test with the above. Things work fine.

      Now, try the plugin config with a table function as described in DRILL-6167:

            String sql = "SELECT * FROM table(cp.`regex/simple.log2`\n" +
                "(type => 'regex', regex => '(\\\\d\\\\d\\\\d\\\\d)-(\\\\d\\\\d)-(\\\\d\\\\d) .*'))";
            client.queryBuilder().sql(sql).printCsv();
      

      Because we are using a file with the suffix "log2", the query will match the format plugin config defined above. A query without the table function does, in fact, work using the defined config. With a table function, however, we get this warning from our regex code:

      13307 WARN [257590e1-e846-9d82-61d4-e246a4925ac3:frag:0:0] [org.apache.drill.exec.store.easy.regex.RegexRecordReader] - Column list has fewer
        names than the pattern has groups, filling extras with Column$n.
      

      (The warning comes from the custom plugin, not from Drill itself.) This is the plugin saying, "Hey! You didn't provide column names!" But the format definition did provide names, and if we run the query without a table function, we do see those names used.

      Result:

      3 row(s):
      Column$0<VARCHAR(OPTIONAL)>,Column$1<VARCHAR(OPTIONAL)>,Column$2<VARCHAR(OPTIONAL)>
      2017,12,17
      2017,12,18
      2017,12,19
      Total rows returned : 3.  Returned in 9072ms.
      

      Yes, indeed: the table function discarded the values defined in the format config and filled in blanks instead, including for the column names.
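
      The fallback naming seen in the result can be sketched as follows. This is only an illustration inferred from the warning text and the Column$0..Column$2 headers above; the class and method names (ColumnNameFill, fillNames) are hypothetical and are not taken from the plugin's source, and numbering the extras by column position is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnNameFill {

  // Hypothetical helper: returns one column name per regex group.
  // Configured names are kept as-is; any missing ones are filled
  // with "Column$n", where n is the zero-based column position.
  static List<String> fillNames(List<String> configured, int groupCount) {
    List<String> names = new ArrayList<>(configured);
    for (int i = names.size(); i < groupCount; i++) {
      names.add("Column$" + i);
    }
    return names;
  }

  public static void main(String[] args) {
    // No names supplied, as happens when the table function drops the config.
    System.out.println(fillNames(new ArrayList<>(), 3));
    // Names supplied by the format config.
    System.out.println(fillNames(Arrays.asList("year", "month", "day"), 3));
  }
}
```

      With an empty name list and three groups, the sketch yields the same three generated names that appear in the result above.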

      The expected behavior is that all properties defined in the config should remain unchanged except for those in the table function. Why? In order to know which format plugin to use, the code has to map from the suffix (".log2" here) to a format plugin config. (The config is the only thing that specifies a suffix.) Since we mapped to a config (not the unconfigured plugin), we'd expect the config properties to be used.
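
      The expected behavior can be sketched as a simple "defaults plus overrides" merge. This is a minimal illustration, not Drill's implementation: the config is modeled as a plain map, and the names TableFunctionMerge and merge are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class TableFunctionMerge {

  // Hypothetical model: a format config is a map of property name to value.
  // Start from the configured values, then apply only the properties
  // named in the table function. The configured map is not modified.
  static Map<String, Object> merge(Map<String, Object> configured,
                                   Map<String, Object> overrides) {
    Map<String, Object> effective = new LinkedHashMap<>(configured);
    effective.putAll(overrides);
    return effective;
  }

  public static void main(String[] args) {
    // Values from the format plugin config (placeholder pattern).
    Map<String, Object> configured = new LinkedHashMap<>();
    configured.put("regex", "<configured pattern>");
    configured.put("fields", Arrays.asList("year", "month", "day"));

    // Only the regex is given in the table function.
    Map<String, Object> overrides = new LinkedHashMap<>();
    overrides.put("regex", "(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) .*");

    // The regex comes from the table function; the fields survive.
    System.out.println(merge(configured, overrides));
  }
}
```

      Under this model, the query above would keep the configured column names and replace only the regex.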

      It is highly surprising that the only thing taken from the config is the suffix, while all other attributes are ignored. This seems very much in the "bug" category and not at all in the "feature" category.

          People

            Paul Rogers
            Arina Ielchiieva