Uploaded image for project: 'ORC'
  1. ORC
  2. ORC-92

Support type id (column id) selection in ReaderOptions

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.1, 1.3.0
    • Component/s: C++
    • Labels:
      None

      Description

      Currently, in C++ version of orc. We can only select by filed id or field name. This works fine when data structure is flat such as struct<int1:int, s1:string, list1:array<int>>. But when we have a nested structure, struct<int1:int, struct1:struct<int2:int, long2:long>>. We still can only select the field of int1 and struct1. We can not directly select long2.

      We can select long2 by its column id. This can be achieved by updating include function in ReaderOptions.

        Issue Links

          Activity

          Hide
          chunyang-wen Chunyang Wen added a comment - - edited

          I have provided a pull request to partially solve this problem. It supports nested column id selection.

          https://github.com/apache/orc/pull/54

          For nested column name selection, we firstly have to decide how to describe it.

          <s1:struct<s2:struct<int1: int>>>

          int1 is described as s1.s2.int1? (which will be used in include function of ReaderOptions)

          ReaderOptions.include(std::list<std::string> include) function.

          Show
          chunyang-wen Chunyang Wen added a comment - - edited I have provided a pull request to partially solve this problem. It supports nested column id selection. https://github.com/apache/orc/pull/54 For nested column name selection, we firstly have to decide how to describe it. <s1:struct<s2:struct<int1: int>>> int1 is described as s1.s2.int1? (which will be used in include function of ReaderOptions) ReaderOptions.include(std::list<std::string> include) function.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/orc/pull/54

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/orc/pull/54
          Hide
          owen.omalley Owen O'Malley added a comment - - edited

          I just committed this. Thanks, Chunyang!

          I modified the patch so that it looks like options.includeTypes(cols) instead of options.include(cols, true) since I think that makes the intent a little clearer.

          Show
          owen.omalley Owen O'Malley added a comment - - edited I just committed this. Thanks, Chunyang! I modified the patch so that it looks like options.includeTypes(cols) instead of options.include(cols, true) since I think that makes the intent a little clearer.
          Hide
          leftylev Lefty Leverenz added a comment -

          Does this need to be documented?

          Show
          leftylev Lefty Leverenz added a comment - Does this need to be documented?
          Hide
          chunyang-wen Chunyang Wen added a comment -

          enum ColumnSelection

          { ColumnSelection_NONE = 0, ColumnSelection_FIELD_NAMES = 1, ColumnSelection_FIELD_IDS = 2, ColumnSelection_TYPE_IDS = 3 }

          ;

          how about TYPE_NAMES ?

          in my application, actually we specify columns by column names.
          <s1:struct<s2:struct<int1: int>>>

          we choose int1 by s1.s2.int1 which will be passed include(std::list<std:string>).

          We can get type id using tools provided by orc c++, but using names sometimes is a little convenient and more meaningful.

          Show
          chunyang-wen Chunyang Wen added a comment - enum ColumnSelection { ColumnSelection_NONE = 0, ColumnSelection_FIELD_NAMES = 1, ColumnSelection_FIELD_IDS = 2, ColumnSelection_TYPE_IDS = 3 } ; how about TYPE_NAMES ? in my application, actually we specify columns by column names. <s1:struct<s2:struct<int1: int>>> we choose int1 by s1.s2.int1 which will be passed include(std::list<std:string>). We can get type id using tools provided by orc c++, but using names sometimes is a little convenient and more meaningful.
          Hide
          owen.omalley Owen O'Malley added a comment -

          It is an API change and the patch has comments in the header file.

          We should probably write a documentation page on the site about how to use the C++ API.

          Show
          owen.omalley Owen O'Malley added a comment - It is an API change and the patch has comments in the header file. We should probably write a documentation page on the site about how to use the C++ API.
          Hide
          owen.omalley Owen O'Malley added a comment -

          I agree that would be really useful to be able to specify the lower types by name.

          You could even use virtual names for the other complex types:

          map type could use "key" and "value"
          union type could use the tag "0", "1", etc.
          array type could use "value"

          You could use the same Reader.include(list<string>) method because the proposed semantics are the same as the current ones if the names don't have a '.'. Please file a jira for the enhancement.

          The ColumnSelection enum is an internal detail, the code had gotten messy with booleans for each potential way of setting which columns should be read. An enum seemed like a more natural way to make sure that no more than one boolean was set.

          Show
          owen.omalley Owen O'Malley added a comment - I agree that would be really useful to be able to specify the lower types by name. You could even use virtual names for the other complex types: map type could use "key" and "value" union type could use the tag "0", "1", etc. array type could use "value" You could use the same Reader.include(list<string>) method because the proposed semantics are the same as the current ones if the names don't have a '.'. Please file a jira for the enhancement. The ColumnSelection enum is an internal detail, the code had gotten messy with booleans for each potential way of setting which columns should be read. An enum seemed like a more natural way to make sure that no more than one boolean was set.
          Hide
          owen.omalley Owen O'Malley added a comment -

          Currently, if the user picks an intermediate node in the type tree, it will enable the parents, but not the children of that node. So in the case of:

          struct<
            a:struct<
              a0:struct<
                a00:int,
                a01:int>,
              a1:struct<
                a10:int,
                a11:int>>>
          

          if the user selects 'a0' they will get 'a' and the root. They won't get 'a00' or 'a01'. I think that would surprise most users.

          What do you think?

          Show
          owen.omalley Owen O'Malley added a comment - Currently, if the user picks an intermediate node in the type tree, it will enable the parents, but not the children of that node. So in the case of: struct< a:struct< a0:struct< a00: int , a01: int >, a1:struct< a10: int , a11: int >>> if the user selects 'a0' they will get 'a' and the root. They won't get 'a00' or 'a01'. I think that would surprise most users. What do you think?
          Hide
          chunyang-wen Chunyang Wen added a comment - - edited

          Let's move to jira orc-97 for type name (column name) selection

          https://issues.apache.org/jira/browse/ORC-97

          Show
          chunyang-wen Chunyang Wen added a comment - - edited Let's move to jira orc-97 for type name (column name) selection https://issues.apache.org/jira/browse/ORC-97
          Hide
          chunyang-wen Chunyang Wen added a comment - - edited

          Owen O'Malley, I think your implementation has some problem.

          Yes, it will surprise users without selecting child types.

          1. includeTypes does Not select sub-types
          2. selectParents may stop early, so that some sub-types may not be selected when building columnvector.

          e.g.

          <a(1):int, 
            b(2):double,
            c(3):struct<
                   c20(4):int,
                   c21(5):double
                   c22(6):struct<
                       d0(7):int,
                       d1(8):double>>>
          

          number after field name is its column id.

          if we select type: 3,4,8, then 6 is not selected. It is wrong because selectParents will return when it meets 3.

          Show
          chunyang-wen Chunyang Wen added a comment - - edited Owen O'Malley , I think your implementation has some problem. Yes, it will surprise users without selecting child types. 1. includeTypes does Not select sub-types 2. selectParents may stop early, so that some sub-types may not be selected when building columnvector. e.g. <a(1): int , b(2): double , c(3):struct< c20(4): int , c21(5): double c22(6):struct< d0(7): int , d1(8): double >>> number after field name is its column id. if we select type: 3,4,8, then 6 is not selected. It is wrong because selectParents will return when it meets 3.
          Hide
          owen.omalley Owen O'Malley added a comment -

          How about if we select both:
          1. The parents all of the way to the root.
          2. All recursive children from a selected column.

          Thus, 3 in your example would read 0, 3, 4, 5, 6, 7, and 8.
          If you instead selected 4 and 6 it would read 0, 3, 4, 6, 7, and 8.

          I feel like that would be a better match for what users would expect. In other words, if they select c22, they expect both d0 and d1.

          What do you think?

          Show
          owen.omalley Owen O'Malley added a comment - How about if we select both: 1. The parents all of the way to the root. 2. All recursive children from a selected column. Thus, 3 in your example would read 0, 3, 4, 5, 6, 7, and 8. If you instead selected 4 and 6 it would read 0, 3, 4, 6, 7, and 8. I feel like that would be a better match for what users would expect. In other words, if they select c22, they expect both d0 and d1. What do you think?
          Hide
          chunyang-wen Chunyang Wen added a comment - - edited

          I agree with that. I am trying to provide a fix.
          https://github.com/apache/orc/pull/59

          Show
          chunyang-wen Chunyang Wen added a comment - - edited I agree with that. I am trying to provide a fix. https://github.com/apache/orc/pull/59
          Hide
          owen.omalley Owen O'Malley added a comment -

          Released as part of ORC 1.2.1.

          Show
          owen.omalley Owen O'Malley added a comment - Released as part of ORC 1.2.1.

            People

            • Assignee:
              chunyang-wen Chunyang Wen
              Reporter:
              chunyang-wen Chunyang Wen
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development