Hive
  1. Hive
  2. HIVE-2941

Hive should expand nested structs when setting the table schema from thrift structs

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels:
      None

      Description

      When setting a table serde, the deserializer is queried for its schema, which is used to set the metastore table schema. The current implementation uses the class name stored in the field as the field type.

      By storing the class name as the field type, users cannot see the contents of a struct with "describe tblname". Applications that query HiveMetaStore for the table schema (specifically HCatalog in this case) see an unknown field type, rather than a struct containing known field types.

      Hive should store the expanded schema in the metastore so users browsing the schema see expanded fields, and applications querying metastore see familiar types.

      DETAILS

      Set the table serde to something like this. This serde uses the built-in ThriftStructObjectInspector.

      alter table foo_test
        set serde "com.twitter.elephantbird.hive.serde.ThriftSerDe"
        with serdeproperties ("serialization.class"="com.foo.Foo");
      

      This causes a call to MetaStoreUtils.getFieldsFromDeserializer which returns a list of fields and their schemas. However, currently it does not handle nested structs, and if com.foo.Foo above contains a field com.foo.Bar, the class name com.foo.Bar would appear as the field type. Instead, nested structs should be expanded.

        Issue Links

          Activity

          Travis Crawford created issue -
          Hide
          Phabricator added a comment -

          travis requested code review of "HIVE-2941 [jira] Hive should expand nested structs when setting the table schema from thrift structs".
          Reviewers: JIRA

          Update ReflectionStructObjectInspector, when returning its type, to return an expanded struct containing fields and their types.

          When setting a table serde, the deserializer is queried for its schema, which is used to set the metastore table schema. The current implementation uses the class name stored in the field as the field type.

          By storing the class name as the field type, users cannot see the contents of a struct with "describe tblname". Applications that query HiveMetaStore for the table schema (specifically HCatalog in this case) see an unknown field type, rather than a struct containing known field types.

          Hive should store the expanded schema in the metastore so users browsing the schema see expanded fields, and applications querying metastore see familiar types.

          DETAILS

          Set the table serde to something like this. This serde uses the built-in ThriftStructObjectInspector.

          alter table foo_test
          set serde "com.twitter.elephantbird.hive.serde.ThriftSerDe"
          with serdeproperties ("serialization.class"="com.foo.Foo");

          This causes a call to MetaStoreUtils.getFieldsFromDeserializer which returns a list of fields and their schemas. However, currently it does not handle nested structs, and if com.foo.Foo above contains a field com.foo.Bar, the class name com.foo.Bar would appear as the field type. Instead, nested structs should be expanded.

          TEST PLAN
          Manually verified table schema is set correctly. Can improve testing after getting feedback on this approach.

          REVISION DETAIL
          https://reviews.facebook.net/D2721

          AFFECTED FILES
          serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ReflectionStructObjectInspector.java

          MANAGE HERALD DIFFERENTIAL RULES
          https://reviews.facebook.net/herald/view/differential/

          WHY DID I GET THIS EMAIL?
          https://reviews.facebook.net/herald/transcript/6213/

          Tip: use the X-Herald-Rules header to filter Herald messages in your client.

          Show
          Phabricator added a comment - travis requested code review of " HIVE-2941 [jira] Hive should expand nested structs when setting the table schema from thrift structs". Reviewers: JIRA Update ReflectionStructObjectInspector, when returning its type, to return an expanded struct containing fields and their types. When setting a table serde, the deserializer is queried for its schema, which is used to set the metastore table schema. The current implementation uses the class name stored in the field as the field type. By storing the class name as the field type, users cannot see the contents of a struct with "describe tblname". Applications that query HiveMetaStore for the table schema (specifically HCatalog in this case) see an unknown field type, rather than a struct containing known field types. Hive should store the expanded schema in the metastore so users browsing the schema see expanded fields, and applications querying metastore see familiar types. DETAILS Set the table serde to something like this. This serde uses the built-in ThriftStructObjectInspector. alter table foo_test set serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" with serdeproperties ("serialization.class"="com.foo.Foo"); This causes a call to MetaStoreUtils.getFieldsFromDeserializer which returns a list of fields and their schemas. However, currently it does not handle nested structs, and if com.foo.Foo above contains a field com.foo.Bar, the class name com.foo.Bar would appear as the field type. Instead, nested structs should be expanded. TEST PLAN Manually verified table schema is set correctly. Can improve testing after getting feedback on this approach. REVISION DETAIL https://reviews.facebook.net/D2721 AFFECTED FILES serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ReflectionStructObjectInspector.java MANAGE HERALD DIFFERENTIAL RULES https://reviews.facebook.net/herald/view/differential/ WHY DID I GET THIS EMAIL? https://reviews.facebook.net/herald/transcript/6213/ Tip: use the X-Herald-Rules header to filter Herald messages in your client.
          Phabricator made changes -
          Field Original Value New Value
          Attachment HIVE-2941.D2721.1.patch [ 12522159 ]
          Hide
          Travis Crawford added a comment -

          Here are some additional details about the issue. Consider the following create table statement. Columns will be discovered for the table by reflecting on the Person object (instead of explicitly specifying them).

          hive> create external table travis_test.person_test 
              >   partitioned by (part_dt string)
              >   row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe"
              >     with serdeproperties ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person")
              >   stored as
              >     inputformat "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat"
              >     outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
          

          Current behavior does not expand nested structures, listing the class name of nested structs as the field type. Users browsing the schema do not get a full definition of the table schema.

          hive> describe extended person_test;                                                                    
          OK
          name	com.twitter.elephantbird.examples.thrift.Name	from deserializer
          id	int	from deserializer
          email	string	from deserializer
          phones	array<com.twitter.elephantbird.examples.thrift.PhoneNumber>	from deserializer
          part_dt	string	
          

          This patch expands nested structures, showing the full table schema. Here's an example of what the table looks like with the patch:

          hive> describe extended person_test;
          OK
          name	struct<first_name:string,last_name:string>	from deserializer
          id	int	from deserializer
          email	string	from deserializer
          phones	array<struct<number:string,type:struct<value:int>>>	from deserializer
          part_dt	string	
          

          In both cases, the table storage descriptor is unchanged - both list the columns as cols:[].

          I believe the reflected table schema should be copied into the partition storage descriptor when adding a new partition, but that could be a separate change.

          Show
          Travis Crawford added a comment - Here are some additional details about the issue. Consider the following create table statement. Columns will be discovered for the table by reflecting on the Person object (instead of explicitly specifying them). hive> create external table travis_test.person_test > partitioned by (part_dt string) > row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" > with serdeproperties ( "serialization.class" = "com.twitter.elephantbird.examples.thrift.Person" ) > stored as > inputformat "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" > outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat" ; Current behavior does not expand nested structures, listing the class name of nested structs as the field type. Users browsing the schema do not get a full definition of the table schema. hive> describe extended person_test; OK name com.twitter.elephantbird.examples.thrift.Name from deserializer id int from deserializer email string from deserializer phones array<com.twitter.elephantbird.examples.thrift.PhoneNumber> from deserializer part_dt string This patch expands nested structures, showing the full table schema. Here's an example of what the table looks like with the patch: hive> describe extended person_test; OK name struct<first_name:string,last_name:string> from deserializer id int from deserializer email string from deserializer phones array<struct<number:string,type:struct<value: int >>> from deserializer part_dt string In both cases, the table storage descriptor is unchanged - both list the columns as cols:[] . I believe the reflected table schema should be copied into the partition storage descriptor when adding a new partition, but that could be a separate change.
          Travis Crawford made changes -
          Link This issue blocks HIVE-2950 [ HIVE-2950 ]
          Travis Crawford made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Travis Crawford added a comment -

          Would it be possible for someone to take a look at this?

          Show
          Travis Crawford added a comment - Would it be possible for someone to take a look at this?
          Hide
          Ashutosh Chauhan added a comment -

          +1 will commit if tests pass.

          Show
          Ashutosh Chauhan added a comment - +1 will commit if tests pass.
          Hide
          Ashutosh Chauhan added a comment -

          Good news : Your patch works. Bad news: Since now we have correct schema, following tests fail which were generating schema having class name as type. Can you update the patch with ant test -Dtestcase=TestCliDriver -Dqfile=case_sensitivity.q -Doverwrite=true and so on.

          • case_sensitivity.q
          • input17.q
          • input5.q
          • input_testxpath.q
          • input_testxpath2.q
          • inputddl8.q
          • join_thrift.q
          Show
          Ashutosh Chauhan added a comment - Good news : Your patch works. Bad news: Since now we have correct schema, following tests fail which were generating schema having class name as type. Can you update the patch with ant test -Dtestcase=TestCliDriver -Dqfile=case_sensitivity.q -Doverwrite=true and so on. case_sensitivity.q input17.q input5.q input_testxpath.q input_testxpath2.q inputddl8.q join_thrift.q
          Hide
          Travis Crawford added a comment -

          Looking... In general running the Hive tests are a hassle because they take so long. I poked around a bit and found https://buildhive.cloudbees.com/ which can build github projects. Ideally I could post the patch and link to CI showing tests pass.

          Show
          Travis Crawford added a comment - Looking... In general running the Hive tests are a hassle because they take so long. I poked around a bit and found https://buildhive.cloudbees.com/ which can build github projects. Ideally I could post the patch and link to CI showing tests pass.
          Hide
          Ashutosh Chauhan added a comment -

          Wow.. that service looks pretty neat. Would certainly try it out. I think you should send an email to dev-list so that other devs can also make use it.

          Show
          Ashutosh Chauhan added a comment - Wow.. that service looks pretty neat. Would certainly try it out. I think you should send an email to dev-list so that other devs can also make use it.
          Hide
          Travis Crawford added a comment -

          Diff updated:
          https://reviews.facebook.net/D3513

          CI failed due to {{org.apache.hadoop.hive.service.TestHiveServerSessions }} which I believe is unrelated:
          https://travis.ci.cloudbees.com/job/HIVE-2941/3/

          Show
          Travis Crawford added a comment - Diff updated: https://reviews.facebook.net/D3513 CI failed due to {{org.apache.hadoop.hive.service.TestHiveServerSessions }} which I believe is unrelated: https://travis.ci.cloudbees.com/job/HIVE-2941/3/
          Hide
          Ashutosh Chauhan added a comment -

          @Travis,
          Can you also upload the patch here on the jira?

          Show
          Ashutosh Chauhan added a comment - @Travis, Can you also upload the patch here on the jira?
          Travis Crawford made changes -
          Attachment HIVE-2941.D3513.1.patch [ 12531130 ]
          Hide
          Travis Crawford added a comment -

          Done.

          Show
          Travis Crawford added a comment - Done.
          Hide
          Ashutosh Chauhan added a comment -

          Committed to trunk. Thanks, Travis!

          Show
          Ashutosh Chauhan added a comment - Committed to trunk. Thanks, Travis!
          Ashutosh Chauhan made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Fix Version/s 0.10.0 [ 12320745 ]
          Resolution Fixed [ 1 ]
          Hide
          Hudson added a comment -

          Integrated in Hive-trunk-h0.21 #1470 (See https://builds.apache.org/job/Hive-trunk-h0.21/1470/)
          HIVE-2941: Hive should expand nested structs when setting the table schema from thrift structs (Travis Crawford via Ashutosh Chauhan) (Revision 1347212)

          Result = FAILURE
          hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347212
          Files :

          • /hive/trunk/ql/src/test/results/clientpositive/case_sensitivity.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/input17.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/input5.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/input_testxpath.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/input_testxpath2.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/inputddl8.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/join_thrift.q.out
          • /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ReflectionStructObjectInspector.java
          Show
          Hudson added a comment - Integrated in Hive-trunk-h0.21 #1470 (See https://builds.apache.org/job/Hive-trunk-h0.21/1470/ ) HIVE-2941 : Hive should expand nested structs when setting the table schema from thrift structs (Travis Crawford via Ashutosh Chauhan) (Revision 1347212) Result = FAILURE hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347212 Files : /hive/trunk/ql/src/test/results/clientpositive/case_sensitivity.q.out /hive/trunk/ql/src/test/results/clientpositive/input17.q.out /hive/trunk/ql/src/test/results/clientpositive/input5.q.out /hive/trunk/ql/src/test/results/clientpositive/input_testxpath.q.out /hive/trunk/ql/src/test/results/clientpositive/input_testxpath2.q.out /hive/trunk/ql/src/test/results/clientpositive/inputddl8.q.out /hive/trunk/ql/src/test/results/clientpositive/join_thrift.q.out /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ReflectionStructObjectInspector.java
          Bhushan Mandhani made changes -
          Link This issue relates to HIVE-3144 [ HIVE-3144 ]
          Hide
          Hudson added a comment -

          Integrated in Hive-trunk-hadoop2 #54 (See https://builds.apache.org/job/Hive-trunk-hadoop2/54/)
          HIVE-2941: Hive should expand nested structs when setting the table schema from thrift structs (Travis Crawford via Ashutosh Chauhan) (Revision 1347212)

          Result = ABORTED
          hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347212
          Files :

          • /hive/trunk/ql/src/test/results/clientpositive/case_sensitivity.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/input17.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/input5.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/input_testxpath.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/input_testxpath2.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/inputddl8.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/join_thrift.q.out
          • /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ReflectionStructObjectInspector.java
          Show
          Hudson added a comment - Integrated in Hive-trunk-hadoop2 #54 (See https://builds.apache.org/job/Hive-trunk-hadoop2/54/ ) HIVE-2941 : Hive should expand nested structs when setting the table schema from thrift structs (Travis Crawford via Ashutosh Chauhan) (Revision 1347212) Result = ABORTED hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347212 Files : /hive/trunk/ql/src/test/results/clientpositive/case_sensitivity.q.out /hive/trunk/ql/src/test/results/clientpositive/input17.q.out /hive/trunk/ql/src/test/results/clientpositive/input5.q.out /hive/trunk/ql/src/test/results/clientpositive/input_testxpath.q.out /hive/trunk/ql/src/test/results/clientpositive/input_testxpath2.q.out /hive/trunk/ql/src/test/results/clientpositive/inputddl8.q.out /hive/trunk/ql/src/test/results/clientpositive/join_thrift.q.out /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ReflectionStructObjectInspector.java
          Hide
          Ashutosh Chauhan added a comment -

          This issue is fixed and released as part of 0.10.0 release. If you find an issue which seems to be related to this one, please create a new jira and link this one with new jira.

          Show
          Ashutosh Chauhan added a comment - This issue is fixed and released as part of 0.10.0 release. If you find an issue which seems to be related to this one, please create a new jira and link this one with new jira.
          Ashutosh Chauhan made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Travis Crawford
              Reporter:
              Travis Crawford
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development