Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-4264

Dots in identifier are not escaped correctly

    Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Execution - Codegen
    • Labels:
      None

      Description

      If you have some json data like this...

          {
            "0.0.1":{
              "version":"0.0.1",
              "date_created":"2014-03-15"
            },
            "0.1.2":{
              "version":"0.1.2",
              "date_created":"2014-05-21"
            }
          }
      

      ... there is no way to select any of the rows since their identifiers contain dots and when trying to select them, Drill throws the following error:

      Error: SYSTEM ERROR: UnsupportedOperationException: Unhandled field reference "0.0.1"; a field reference identifier must not have the form of a qualified name

      This must be fixed since there are many json data files containing dots in some of the keys (e.g. when specifying version numbers etc)

        Issue Links

          Activity

          Hide
          zfong Zelaine Fong added a comment -

          Although the error is slightly different, DRILL-3922 looks like it might be related.

          Show
          zfong Zelaine Fong added a comment - Although the error is slightly different, DRILL-3922 looks like it might be related.
          Hide
          arina Arina Ielchiieva added a comment -

          Some useful notes provided by Paul Rogers:

          1. Escape mechanism for dots in names. (See this for how it is done in JSON. Can we support table[“1.2.3”] syntax in Drill?)

          2. Dots in column names: Identify the issue (Drill attaches meaning to the dots.) Research SQL escape characters. How do other products handle this? Given that these names come from JSON, can we use Javascript syntax (table[“1.2.3”])? How do we ensure current behavior does not break? Experiment to see what works.

          Show
          arina Arina Ielchiieva added a comment - Some useful notes provided by Paul Rogers : 1. Escape mechanism for dots in names. (See this for how it is done in JSON . Can we support table [“1.2.3”] syntax in Drill?) 2. Dots in column names: Identify the issue (Drill attaches meaning to the dots.) Research SQL escape characters. How do other products handle this? Given that these names come from JSON, can we use Javascript syntax (table [“1.2.3”] )? How do we ensure current behavior does not break? Experiment to see what works.
          Hide
          vvysotskyi Volodymyr Vysotskyi added a comment - - edited

          Currently Drill has inconsistent behaviour when querying the file with quotes. Query

          select * from test_table t
          

          fails, but query

          select `rk.q` as `rk.q` from test_table t
          

          returns correct result for the file

          {"rk.q": "a", "m": {"a.b":"1", "a":{"b":"2"}, "c":"3"}}
          

          The difference between these two cases is that for the second case filed reference was created using the method FieldReference.getWithQuotedRef(field.getName()) which does not check the field name. In the first case constructor with check was used.

          Nested field may be selected by few ways:
          t.m.c or t.m['c'].
          Without checking the field name, query

          select t.m.`a.b`, t.m.a.b, t.m['a.b'] from test_table t
          

          returns correct result.
          Mysql, for example, also allows quoted field with dots.

          Preferred solution is to remove the check for field with dots.
          But user may forget to add quotes for the field with dots, so query may return result that does not expected by user.

          Other solution is to add session option that allows to use fields with dots and depending on this option check the field or not. By default the value of this option will be disabled (the same behaviour as now). So user will be responsible for the queries with forgotten quotes.

          Show
          vvysotskyi Volodymyr Vysotskyi added a comment - - edited Currently Drill has inconsistent behaviour when querying the file with quotes. Query select * from test_table t fails, but query select `rk.q` as `rk.q` from test_table t returns correct result for the file {"rk.q": "a", "m": {"a.b":"1", "a":{"b":"2"}, "c":"3"}} The difference between these two cases is that for the second case filed reference was created using the method FieldReference.getWithQuotedRef(field.getName()) which does not check the field name. In the first case constructor with check was used . Nested field may be selected by few ways: t.m.c or t.m['c'] . Without checking the field name, query select t.m.`a.b`, t.m.a.b, t.m['a.b'] from test_table t returns correct result. Mysql, for example, also allows quoted field with dots. Preferred solution is to remove the check for field with dots. But user may forget to add quotes for the field with dots, so query may return result that does not expected by user. Other solution is to add session option that allows to use fields with dots and depending on this option check the field or not. By default the value of this option will be disabled (the same behaviour as now). So user will be responsible for the queries with forgotten quotes.
          Hide
          paul-rogers Paul Rogers added a comment -

          In the example case, what happens for:

          SELECT `rk.q`, `m.a.b` FROM test_table;
          

          To work, is it necessary to do this:

          SELECT `rk.q`, m.`a.b` FROM test_table;
          

          That is, is a dot inside back-ticks considered part of the name, but those outside considered separators?

          I ask because I often do the following, and it works:

          select `name`, `monisid`, `validitydate` from `dfs.data`.`gen.json`
          

          That is, in the table name, dots inside backticks are, in fact, separators. So, do column and table names have different syntax rules?

          Can we spell out the syntax rules for these three cases:

          • Column names in the planner
          • Table names in the planner
          • Column names discovered at runtime

          All that said, we should certainly make SELECT * work as the name expansion is done in the execution engine, not the planner.

          Show
          paul-rogers Paul Rogers added a comment - In the example case, what happens for: SELECT `rk.q`, `m.a.b` FROM test_table; To work, is it necessary to do this: SELECT `rk.q`, m.`a.b` FROM test_table; That is, is a dot inside back-ticks considered part of the name, but those outside considered separators? I ask because I often do the following, and it works: select `name`, `monisid`, `validitydate` from `dfs.data`.`gen.json` That is, in the table name, dots inside backticks are, in fact, separators. So, do column and table names have different syntax rules? Can we spell out the syntax rules for these three cases: Column names in the planner Table names in the planner Column names discovered at runtime All that said, we should certainly make SELECT * work as the name expansion is done in the execution engine, not the planner.
          Hide
          vvysotskyi Volodymyr Vysotskyi added a comment -

          Drill handles field names and schema paths in different ways. For schema paths Drill uses Calcite. The required schema is returned by the method getSchema(). So for the input `dfs.data` needful chema will be returned from the first iteration of the cycle. For the case dfs.data in the first iteration will be assigned schema dfs and in the next iteration schema data.

          Query

          SELECT `rk.q`, `m.a.b` FROM test_table;
          

          returns

          ---------------------------------------------------------------------------------------------------------
          | rk.q<VARCHAR(OPTIONAL)>                           | m.a.b<INT(OPTIONAL)>                              |
          ---------------------------------------------------------------------------------------------------------
          | a                                                 | null                                              |
          ---------------------------------------------------------------------------------------------------------
          

          For this case Drill is looking for the field with name m.a.b.

          Query

          SELECT t.`rk.q`, t.m.`a.b` FROM test_table t;
          

          returns correct result:

          ---------------------------------------------------------------------------------------------------------
          | rk.q<VARCHAR(OPTIONAL)>                           | EXPR$1<VARCHAR(OPTIONAL)>                         |
          ---------------------------------------------------------------------------------------------------------
          | a                                                 | 1                                                 |
          ---------------------------------------------------------------------------------------------------------
          

          So yes, dot inside backticks is considered as a part of the field name and outside is considered as a separator.

          Show
          vvysotskyi Volodymyr Vysotskyi added a comment - Drill handles field names and schema paths in different ways. For schema paths Drill uses Calcite. The required schema is returned by the method getSchema() . So for the input `dfs.data` needful chema will be returned from the first iteration of the cycle. For the case dfs.data in the first iteration will be assigned schema dfs and in the next iteration schema data . Query SELECT `rk.q`, `m.a.b` FROM test_table; returns --------------------------------------------------------------------------------------------------------- | rk.q<VARCHAR(OPTIONAL)> | m.a.b<INT(OPTIONAL)> | --------------------------------------------------------------------------------------------------------- | a | null | --------------------------------------------------------------------------------------------------------- For this case Drill is looking for the field with name m.a.b . Query SELECT t.`rk.q`, t.m.`a.b` FROM test_table t; returns correct result: --------------------------------------------------------------------------------------------------------- | rk.q<VARCHAR(OPTIONAL)> | EXPR$1<VARCHAR(OPTIONAL)> | --------------------------------------------------------------------------------------------------------- | a | 1 | --------------------------------------------------------------------------------------------------------- So yes, dot inside backticks is considered as a part of the field name and outside is considered as a separator.
          Hide
          paul-rogers Paul Rogers added a comment -

          Putting on my user hat, I don't think users of Drill (or even non-planner developers like me) will understand that column names behave differently than table/schema names.

          For example, from the description, I cannot predict what happens here:

          Data: dfs.ds.foo.json, contents: { "a" : { "b.c": 10 } }
          
          SELECT `dfs`.`ds`.`foo.json`.`a`.`b.c` FROM `dfs`.`ds`.`foo.json` (1)
          SELECT `dfs.ds.foo.json.a`.`b.c` FROM `dfs.ds`.`foo.json` (2)
          SELECT `dfs.ds.foo.json.a.b.c` FROM `dfs.ds.foo.json` (3)
          

          Case (1) should work, right? Each component of the path names are enclosed in back-ticks, the dots are unquoted. `b.c` is quoted so is a complete field name. Similarly the file name, `foo.json` is quoted so clearly .json is part of the file name.

          By this logic, case (2) should not work. The quotes enclose parts of a path and so the dots in those names should not be treated as delimiters, but rather as part of the name. That is the SELECT list should be:

          Table: `dfs.ds.foo.a`
          Column: `b.c`
          Schema: `dfs.ds`
          Table: `foo.json`

          Since the two tables do not agree, the query should fail in the planner. Even if it didn't, it should return null because no table column matches `b.c`. And yet your explanation suggests that some part of the quoted name will be considered separate components.

          If so, then Drill is magic, it knows when dots are part of the name (file name) and so (3) should work also. But, it won't for the reasons you state.

          Ideally, we'd enforce form (2): dots inside quotes are part of the name; they are not separators. But, it seems if we do that we might break existing queries. Or, can this actually work?

          Let me throw in two more complications:

          • The directory containing foo.json has a dot: "my.files"
          • The workspace name itself contains a dot: "my.ws"

          Can I do the above? If not, why not? What SQL syntax would I use? Maybe:

          SELECT `my.ws`.`/my.files/foo.json`.`a`.`b.c` FROM `dfs`.`ds`.`foo.json` (4)
          

          Seems we've rather gotten ourselves into a muddle by allowing separator dots in names.

          About here I guess we should ask, what do Hive and Parquet do? They must have solved this issue.

          Show
          paul-rogers Paul Rogers added a comment - Putting on my user hat, I don't think users of Drill (or even non-planner developers like me) will understand that column names behave differently than table/schema names. For example, from the description, I cannot predict what happens here: Data: dfs.ds.foo.json, contents: { "a" : { "b.c" : 10 } } SELECT `dfs`.`ds`.`foo.json`.`a`.`b.c` FROM `dfs`.`ds`.`foo.json` (1) SELECT `dfs.ds.foo.json.a`.`b.c` FROM `dfs.ds`.`foo.json` (2) SELECT `dfs.ds.foo.json.a.b.c` FROM `dfs.ds.foo.json` (3) Case (1) should work, right? Each component of the path names are enclosed in back-ticks, the dots are unquoted. `b.c` is quoted so is a complete field name. Similarly the file name, `foo.json` is quoted so clearly .json is part of the file name. By this logic, case (2) should not work. The quotes enclose parts of a path and so the dots in those names should not be treated as delimiters, but rather as part of the name. That is the SELECT list should be: Table: `dfs.ds.foo.a` Column: `b.c` Schema: `dfs.ds` Table: `foo.json` Since the two tables do not agree, the query should fail in the planner. Even if it didn't, it should return null because no table column matches `b.c` . And yet your explanation suggests that some part of the quoted name will be considered separate components. If so, then Drill is magic, it knows when dots are part of the name (file name) and so (3) should work also. But, it won't for the reasons you state. Ideally, we'd enforce form (2): dots inside quotes are part of the name; they are not separators. But, it seems if we do that we might break existing queries. Or, can this actually work? Let me throw in two more complications: The directory containing foo.json has a dot: "my.files" The workspace name itself contains a dot: "my.ws" Can I do the above? If not, why not? What SQL syntax would I use? Maybe: SELECT `my.ws`.`/my.files/foo.json`.`a`.`b.c` FROM `dfs`.`ds`.`foo.json` (4) Seems we've rather gotten ourselves into a muddle by allowing separator dots in names. About here I guess we should ask, what do Hive and Parquet do? They must have solved this issue.
          Hide
          vvysotskyi Volodymyr Vysotskyi added a comment -

          On the one hand user knows that tmp workspace inside dfs plugin, so user expects that query

          use dfs.tmp;
          

          should work. On the other hand query

          show schemas;
          

          returns the schema name dfs.tmp. So user also expects that the query with such schema name should work.

          use `dfs.tmp`;
          

          Both these cases work since schema name and its path are the same.

          Schemas names with dots currently does not work in Drill. It is due to the handling schema paths in this way

          Queries

          SELECT `dfs`.`ds`.`foo.json`.`a`.`b.c` FROM `dfs`.`ds`.`foo.json` (1)
          SELECT `dfs.ds.foo.json.a`.`b.c` FROM `dfs.ds`.`foo.json` (2)
          SELECT `dfs.ds.foo.json.a.b.c` FROM `dfs.ds.foo.json` (3)
          

          will not work since Drill allows only table names or aliases before the field names.
          Considering only from clause, third case would not work, since Drill assumes that `dfs.ds.foo.json` is the schema name only.
          Queries on the directories with dots and table names with dots also works correctly.

          Hive has an option that allows dots in the columns names (by default dots in the columns is allowed).
          Parquet also allows field names with dots.
          Also current version of Drill can create parquet files with dots in field names, but Drill will fail when querying this file.

          Show
          vvysotskyi Volodymyr Vysotskyi added a comment - On the one hand user knows that tmp workspace inside dfs plugin, so user expects that query use dfs.tmp; should work. On the other hand query show schemas; returns the schema name dfs.tmp . So user also expects that the query with such schema name should work. use `dfs.tmp`; Both these cases work since schema name and its path are the same. Schemas names with dots currently does not work in Drill. It is due to the handling schema paths in this way Queries SELECT `dfs`.`ds`.`foo.json`.`a`.`b.c` FROM `dfs`.`ds`.`foo.json` (1) SELECT `dfs.ds.foo.json.a`.`b.c` FROM `dfs.ds`.`foo.json` (2) SELECT `dfs.ds.foo.json.a.b.c` FROM `dfs.ds.foo.json` (3) will not work since Drill allows only table names or aliases before the field names. Considering only from clause, third case would not work, since Drill assumes that `dfs.ds.foo.json` is the schema name only. Queries on the directories with dots and table names with dots also works correctly. Hive has an option that allows dots in the columns names (by default dots in the columns is allowed). Parquet also allows field names with dots. Also current version of Drill can create parquet files with dots in field names, but Drill will fail when querying this file.
          Hide
          paul-rogers Paul Rogers added a comment -

          Thanks for the explanation! Can you suggest a solution (from the user's view) that offers the best trade-off between a number of requirements?

          Requirements:

          • Allow dots in any name (schema, workspace, table, column, directory, file).
          • Consistent behavior for all names.
          • Compatible with existing queries.
          • Familiar to users of similar systems (Hive, Impala, etc.)

          Maybe propose a syntax and user-visible rules that achieve the above (as best we can.) Once we agree on those rules, we can move on to discuss how we'll implement the rules given Calcite and the Drill execution classes.

          Show
          paul-rogers Paul Rogers added a comment - Thanks for the explanation! Can you suggest a solution (from the user's view) that offers the best trade-off between a number of requirements? Requirements: Allow dots in any name (schema, workspace, table, column, directory, file). Consistent behavior for all names. Compatible with existing queries. Familiar to users of similar systems (Hive, Impala, etc.) Maybe propose a syntax and user-visible rules that achieve the above (as best we can.) Once we agree on those rules, we can move on to discuss how we'll implement the rules given Calcite and the Drill execution classes.
          Hide
          vvysotskyi Volodymyr Vysotskyi added a comment -

          I think we shouldn't allow dots in the names of plugins and workspaces since Drill user can control these names and schema names are stored without quotes. So for example when we will have a plugin with name `df.s`, a plugin with name `df` and workspace in it with name `s`, schema name df.s will refer to the two schemes.

          For the table, directory or file names, dots are allowed. The path to the table in the query is specifying using quotes. The same for table names with dots.

          Currently, dots are allowed for column aliases and they are quoted also.
          But Drill may create a table using the column alias with dots and will fail at the execution stage when it will try to do a select *.

          The solution that I am proposed in my comment does not change the current syntax of Drill queries, I think the current syntax is fine. The proposed solution allows Drill does not fail when it will discover such columns in the table.

          Show
          vvysotskyi Volodymyr Vysotskyi added a comment - I think we shouldn't allow dots in the names of plugins and workspaces since Drill user can control these names and schema names are stored without quotes. So for example when we will have a plugin with name `df.s` , a plugin with name `df` and workspace in it with name `s` , schema name df.s will refer to the two schemes. For the table, directory or file names, dots are allowed. The path to the table in the query is specifying using quotes. The same for table names with dots. Currently, dots are allowed for column aliases and they are quoted also. But Drill may create a table using the column alias with dots and will fail at the execution stage when it will try to do a select * . The solution that I am proposed in my comment does not change the current syntax of Drill queries, I think the current syntax is fine. The proposed solution allows Drill does not fail when it will discover such columns in the table.
          Hide
          paul-rogers Paul Rogers added a comment -

          Let's see if I understand the proposal:

          • Dots not allowed in storage plugin or workspace names.
          • As a result, does in quoted workspace names are assumed to be sparators: `dfs.x` = `dfs`.`x`
          • Dots in table names must be quoted. `dfs.x`.`my.table.json`
          • Dots in columns must be quoted. SELECT a, b.c, b.`d.e` ...

          Is this correct?

          If so, then it seems fine.

          Are you also proposing to support array syntax? SELECT a, b.c, b[`d.e`] ...?

          You mentioned we use one code for names in the planner, another (SchemaPath) for runtime. What rules to we use at runtime? Can that code handle column names with dots? That is, do we ever start with a map "b" and a column "d.e", combine them to get "a.d.e" and try to split them again to get "a", "d" and "e"? If so, how do we fix that?

          Once we get all these questions resolved, I'd suggest posting a summary of the rules to the dev list so that others can take a look.

          Show
          paul-rogers Paul Rogers added a comment - Let's see if I understand the proposal: Dots not allowed in storage plugin or workspace names. As a result, does in quoted workspace names are assumed to be sparators: `dfs.x` = `dfs`.`x` Dots in table names must be quoted. `dfs.x`.`my.table.json` Dots in columns must be quoted. SELECT a, b.c, b.`d.e` ... Is this correct? If so, then it seems fine. Are you also proposing to support array syntax? SELECT a, b.c, b [`d.e`] ... ? You mentioned we use one code for names in the planner, another ( SchemaPath ) for runtime. What rules to we use at runtime? Can that code handle column names with dots? That is, do we ever start with a map "b" and a column "d.e", combine them to get "a.d.e" and try to split them again to get "a", "d" and "e"? If so, how do we fix that? Once we get all these questions resolved, I'd suggest posting a summary of the rules to the dev list so that others can take a look.
          Hide
          vvysotskyi Volodymyr Vysotskyi added a comment -

          Yes, these statements are correct.

          Drill also supports array syntax b['d.e']. I used it as an example of using field names with dots in the query.

          No, I meant that we are using the same FieldReference class, but in planner we are using checkSimpleString() method and at the execution stage we are using constructor FieldReference() that checks input string for dots.
          The code does not combine the map name and nested field names. Nested field names are stored as children of the NameSegment rootSegment in SchemaPath (it is a superclass of FieldReference).

          So I propose to remove that check.

          Show
          vvysotskyi Volodymyr Vysotskyi added a comment - Yes, these statements are correct. Drill also supports array syntax b['d.e'] . I used it as an example of using field names with dots in the query. No, I meant that we are using the same FieldReference class, but in planner we are using checkSimpleString() method and at the execution stage we are using constructor FieldReference() that checks input string for dots. The code does not combine the map name and nested field names. Nested field names are stored as children of the NameSegment rootSegment in SchemaPath (it is a superclass of FieldReference). So I propose to remove that check.
          Hide
          paul-rogers Paul Rogers added a comment -

          On the planner side, we represent field references with the FieldReference class you mentioned. FieldReference extends SchemaPath. These classes break names down into one object per name part.

          Assume we have SELECT a.b.c, a."d.e" ...

          Within the FieldReference itself, we hold onto the name using a PathSegment which has two subclasses: ArraySegment and NameSegment. So, as you noted, in the planner, we can tell the difference between the two cases (using functional notation):

          a.b.c: FieldReference(NameSegment("a", NameSegment("b", NameSegment("c"))))
          a."d.e": FieldReference(NameSegment("a", NameSegment("d.e")))
          

          So far so good. Bug, SchemaPath provides the getAsUnescapedPath() method which concatenates the parts of the name using dots. We end up with two FieldReference instances. Calling getAsUnescapedPath() on each produces a.b.c and a.d.e. So, if anything actually uses this unescaped path, we end up with an ambiguity: does "a.d.e" represent one field, two fields or three fields? We cannot tell.

          Now, if this method was only used for debugging (line toString()), it would be fine. But, in fact, many operators refer to this method, especially when creating the run-time representation of a field schema: MaterializedField:

          From StreamingAggBatch:

                final MaterializedField outputField = MaterializedField.create(
                              ne.getRef().getAsUnescapedPath(), expr.getMajorType());
          

          In our examples, we end up with two materialized fields: one called "a.b.c", the other "a.d.e", so the ambiguity persists.

          As it turns out, each MaterializedField represents one field or structured column. So, our map "a" is represented by a MaterializedField, "b" by another, "c" by yet another and "d.e" by another. So, each should correspond to a single name part.

          But, the code doesn't work that way, it actually builds up the full unescaped name.

          Now, I suspect that the code here is old and inconsistent. It should be that creating a materialized field pulls out only one name part. But, the code actually concatenates. My suspicion increases when I see methods like these in MaterializedField:

            public String getPath() { return getName(); }
            public String getLastName() { return getName(); }
            public String getName() { return name; }
          

          That first one really worries me: it is asking for the "path", which means dotted name. There are many references to this name. Does this mean the code expects to get a string (not a NameSegment) that holds the composite name. If so, we are in trouble.

          Now, as it turns out, it seems that the "modern" form of MaterializedSchema is that each hold just one name part. So:

          MaterializedField(name="a", children = (
            MaterializedField(name="b", children = (
              MaterializedField(name = c))),
            MaterializedField(name="d.e")))
          

          I wonder, because the code appears to be written assuming that a MaterializedField had a path name, does any code still rely on this fact, then split the name at dots to get fields?

          If not, can we remove the getPath(), and getLastPath() methods to make it clear that each MaterializedField corresponds to a single NameSegment?

          And, if we do that, should we remove all calls to NameSegment.getAsUnescapedPath() to make clear that we never (except for display) want dotted, combined path name?

          By carefully looking at the above issues, we can be sure that no old code in Drill tries to concatenate "a" and "d.e" to get the name "a.d.e" which it then splits into "a", "d" and "e".

          A quick search for ".split(" found a number of places where we split names on a dot, including in the Parquet Metadata file:

                  public Object deserializeKey(String key, com.fasterxml.jackson.databind.DeserializationContext ctxt)
                      throws IOException, com.fasterxml.jackson.core.JsonProcessingException {
                    return new Key(key.split("\\."));
                  }
          

          Are there others? Do these need to be fixed?

          Show
          paul-rogers Paul Rogers added a comment - On the planner side, we represent field references with the FieldReference class you mentioned. FieldReference extends SchemaPath . These classes break names down into one object per name part. Assume we have SELECT a.b.c, a."d.e" ... Within the FieldReference itself, we hold onto the name using a PathSegment which has two subclasses: ArraySegment and NameSegment . So, as you noted, in the planner, we can tell the difference between the two cases (using functional notation): a.b.c: FieldReference(NameSegment( "a" , NameSegment( "b" , NameSegment( "c" )))) a. "d.e" : FieldReference(NameSegment( "a" , NameSegment( "d.e" ))) So far so good. Bug, SchemaPath provides the getAsUnescapedPath() method which concatenates the parts of the name using dots. We end up with two FieldReference instances. Calling getAsUnescapedPath() on each produces a.b.c and a.d.e . So, if anything actually uses this unescaped path, we end up with an ambiguity: does "a.d.e" represent one field, two fields or three fields? We cannot tell. Now, if this method was only used for debugging (line toString() ), it would be fine. But, in fact, many operators refer to this method, especially when creating the run-time representation of a field schema: MaterializedField : From StreamingAggBatch : final MaterializedField outputField = MaterializedField.create( ne.getRef().getAsUnescapedPath(), expr.getMajorType()); In our examples, we end up with two materialized fields: one called "a.b.c", the other "a.d.e", so the ambiguity persists. As it turns out, each MaterializedField represents one field or structured column. So, our map "a" is represented by a MaterializedField , "b" by another, "c" by yet another and "d.e" by another. So, each should correspond to a single name part. But, the code doesn't work that way, it actually builds up the full unescaped name. Now, I suspect that the code here is old and inconsistent. It should be that creating a materialized field pulls out only one name part. But, the code actually concatenates. My suspicion increases when I see methods like these in MaterializedField : public String getPath() { return getName(); } public String getLastName() { return getName(); } public String getName() { return name; } That first one really worries me: it is asking for the "path", which means dotted name. There are many references to this name. Does this mean the code expects to get a string (not a NameSegment ) that holds the composite name. If so, we are in trouble. Now, as it turns out, it seems that the "modern" form of MaterializedSchema is that each hold just one name part. So: MaterializedField(name= "a" , children = ( MaterializedField(name= "b" , children = ( MaterializedField(name = c))), MaterializedField(name= "d.e" ))) I wonder, because the code appears to be written assuming that a MaterializedField had a path name, does any code still rely on this fact, then split the name at dots to get fields? If not, can we remove the getPath() , and getLastPath() methods to make it clear that each MaterializedField corresponds to a single NameSegment ? And, if we do that, should we remove all calls to NameSegment.getAsUnescapedPath() to make clear that we never (except for display) want dotted, combined path name? By carefully looking at the above issues, we can be sure that no old code in Drill tries to concatenate "a" and "d.e" to get the name "a.d.e" which it then splits into "a", "d" and "e". A quick search for ".split(" found a number of places where we split names on a dot, including in the Parquet Metadata file: public Object deserializeKey( String key, com.fasterxml.jackson.databind.DeserializationContext ctxt) throws IOException, com.fasterxml.jackson.core.JsonProcessingException { return new Key(key.split( "\\." )); } Are there others? Do these need to be fixed?

            People

            • Assignee:
              vvysotskyi Volodymyr Vysotskyi
              Reporter:
              AleCaste Alex
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:

                Development