Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-1364

Increase the maximum length of various metastore fields, and remove TYPE_NAME from COLUMNS primary key

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5.0
    • Fix Version/s: 0.6.0
    • Component/s: Metastore
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Tags:
      mysql serdeproperties jpox length

      Description

      The value component of a SERDEPROPERTIES key/value pair is currently limited
      to a maximum length of 767 characters. I believe that the motivation for limiting the length to
      767 characters is that this value is the maximum allowed length of an index in
      a MySQL database running on the InnoDB engine: http://bugs.mysql.com/bug.php?id=13315

      • The Metastore OR mapping currently limits many fields (including SERDEPROPERTIES.PARAM_VALUE) to a maximum length of 767 characters despite the fact that these fields are not indexed.
      • The maximum length of a VARCHAR value in MySQL 5.0.3 and later is 65,535.
      • We can expect many users to hit the 767 character limit on SERDEPROPERTIES.PARAM_VALUE when using the hbase.columns.mapping serdeproperty to map a table that has many columns.

      I propose increasing the maximum allowed length of SERDEPROPERTIES.PARAM_VALUE to 8192.

      1. HIVE-1364.patch
        5 kB
        Carl Steinbach
      2. HIVE-1364.2.patch.txt
        4 kB
        Carl Steinbach
      3. HIVE-1364.3.patch.txt
        10 kB
        Carl Steinbach
      4. HIVE-1364.3.backport-060.patch.txt
        9 kB
        Carl Steinbach
      5. HIVE-1364.4.patch.txt
        7 kB
        Carl Steinbach
      6. HIVE-1364.4.backport-060.patch.txt
        6 kB
        Carl Steinbach

        Issue Links

          Activity

          Hide
          teabot Elliot West added a comment -

          I created a new issue relating to my previous comment: HIVE-12274

          Show
          teabot Elliot West added a comment - I created a new issue relating to my previous comment: HIVE-12274
          Hide
          teabot Elliot West added a comment -

          Echoing David Beveridge's comment, we too are hitting this limit (also with nested JSON) and are having to alter these tables by hand. Given that the purpose of these columns is to hold very loosely defined configuration values it seems rather limiting to impose such a relatively low length bound. Can we not use CLOB-like types as for example as used by TBLS.VIEW_EXPANDED_TEXT?

          I'd suggest that the candidates for type change are:

          • COLUMNS_V2.TYPE_NAME
          • TABLE_PARAMS.PARAM_VALUE
          • SERDE_PARAMS.PARAM_VALUE
          • SD_PARAMS.PARAM_VALUE

          Finally, will this limitation persist in the work resulting from HIVE-9452?

          Show
          teabot Elliot West added a comment - Echoing David Beveridge 's comment, we too are hitting this limit (also with nested JSON) and are having to alter these tables by hand. Given that the purpose of these columns is to hold very loosely defined configuration values it seems rather limiting to impose such a relatively low length bound. Can we not use CLOB-like types as for example as used by TBLS.VIEW_EXPANDED_TEXT ? I'd suggest that the candidates for type change are: COLUMNS_V2.TYPE_NAME TABLE_PARAMS.PARAM_VALUE SERDE_PARAMS.PARAM_VALUE SD_PARAMS.PARAM_VALUE Finally, will this limitation persist in the work resulting from HIVE-9452 ?
          Hide
          dumpster David Beveridge added a comment -

          I think this should be re-opened.

          The current limit is at 4000. This is proving to be insufficient for our own big-data work, while using nested/hierarchical JSON files (some nodes have 16K of structs within them).

          If possible, a more generous limit should be set--our minimum would be 6500 as of today. We modified it to be 32000 (via MySQL) and it appears works well so far.

          Cheers!

          Show
          dumpster David Beveridge added a comment - I think this should be re-opened. The current limit is at 4000. This is proving to be insufficient for our own big-data work, while using nested/hierarchical JSON files (some nodes have 16K of structs within them). If possible, a more generous limit should be set--our minimum would be 6500 as of today. We modified it to be 32000 (via MySQL) and it appears works well so far. Cheers!
          Hide
          namit Namit Jain added a comment -

          Committed to both 0.6 and trunk. Thanks Carl

          Show
          namit Namit Jain added a comment - Committed to both 0.6 and trunk. Thanks Carl
          Hide
          namit Namit Jain added a comment -

          We tested this on our production metastore db.
          The downtime should be acceptable - I will start the tests

          +1

          Show
          namit Namit Jain added a comment - We tested this on our production metastore db. The downtime should be acceptable - I will start the tests +1
          Hide
          jvs John Sichi added a comment -

          +1 from me pending metastore testing by Paul.

          Show
          jvs John Sichi added a comment - +1 from me pending metastore testing by Paul.
          Hide
          hbasereviewboard HBase Review Board added a comment -

          Message from: "Carl Steinbach" <carl@cloudera.com>

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.cloudera.org/r/895/
          -----------------------------------------------------------

          (Updated 2010-09-23 14:45:46.673298)

          Review request for Hive Developers and John Sichi.

          Summary
          -------

          The patch increases the length of various properties in the Metastore OR mapping. Properties which are currently indexed, or which we may want to index in the future were increased to a length of 767 bytes. Properties which are not indexed and which we are unlikely to ever want to index were increased to a max length of 4000 bytes. I also removed the PK constraint on the COLUMNS.TYPE_NAME field.

          This addresses bug HIVE-1364.
          http://issues.apache.org/jira/browse/HIVE-1364

          Diffs (updated)


          metastore/src/model/package.jdo 527f4b2

          Diff: http://review.cloudera.org/r/895/diff

          Testing
          -------

          Thanks,

          Carl

          Show
          hbasereviewboard HBase Review Board added a comment - Message from: "Carl Steinbach" <carl@cloudera.com> ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: http://review.cloudera.org/r/895/ ----------------------------------------------------------- (Updated 2010-09-23 14:45:46.673298) Review request for Hive Developers and John Sichi. Summary ------- The patch increases the length of various properties in the Metastore OR mapping. Properties which are currently indexed, or which we may want to index in the future were increased to a length of 767 bytes. Properties which are not indexed and which we are unlikely to ever want to index were increased to a max length of 4000 bytes. I also removed the PK constraint on the COLUMNS.TYPE_NAME field. This addresses bug HIVE-1364 . http://issues.apache.org/jira/browse/HIVE-1364 Diffs (updated) metastore/src/model/package.jdo 527f4b2 Diff: http://review.cloudera.org/r/895/diff Testing ------- Thanks, Carl
          Hide
          cwsteinbach Carl Steinbach added a comment -

          Updated version of the patch with changes requested by John.

          Show
          cwsteinbach Carl Steinbach added a comment - Updated version of the patch with changes requested by John.
          Hide
          jvs John Sichi added a comment -

          Per discussion in IRC, we should not change the precision for identifiers. Here's the revert list (plus one change for MStorageDescriptor's TYPE_NAME).

          MFieldSchema
          FNAME

          MType
          TYPE_NAME
          FIELD_NAME

          MTable
          TBL_NAME
          PKEY_NAME
          PARAM_KEY
          TBL_TYPE

          MSerDeInfo
          NAME
          PARAM_KEY

          MOrder
          COL_NAME

          MStorageDescriptor
          COLUMN_NAME (all instances)
          PARAM_KEY
          TYPE_NAME should actually be 4000 since it's really a type signature, not a type name, and we're getting rid of the indexing for it

          MPartition
          PARAM_KEY

          MIndex
          INDEX_NAME
          PARAM_KEY

          Show
          jvs John Sichi added a comment - Per discussion in IRC, we should not change the precision for identifiers. Here's the revert list (plus one change for MStorageDescriptor's TYPE_NAME). MFieldSchema FNAME MType TYPE_NAME FIELD_NAME MTable TBL_NAME PKEY_NAME PARAM_KEY TBL_TYPE MSerDeInfo NAME PARAM_KEY MOrder COL_NAME MStorageDescriptor COLUMN_NAME (all instances) PARAM_KEY TYPE_NAME should actually be 4000 since it's really a type signature, not a type name, and we're getting rid of the indexing for it MPartition PARAM_KEY MIndex INDEX_NAME PARAM_KEY
          Hide
          hbasereviewboard HBase Review Board added a comment -

          Message from: "Carl Steinbach" <carl@cloudera.com>

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          http://review.cloudera.org/r/895/
          -----------------------------------------------------------

          Review request for Hive Developers and John Sichi.

          Summary
          -------

          The patch increases the length of various properties in the Metastore OR mapping. Properties which are currently indexed, or which we may want to index in the future were increased to a length of 767 bytes. Properties which are not indexed and which we are unlikely to ever want to index were increased to a max length of 4000 bytes. I also removed the PK constraint on the COLUMNS.TYPE_NAME field.

          This addresses bug HIVE-1364.
          http://issues.apache.org/jira/browse/HIVE-1364

          Diffs


          metastore/src/model/package.jdo 527f4b2

          Diff: http://review.cloudera.org/r/895/diff

          Testing
          -------

          Thanks,

          Carl

          Show
          hbasereviewboard HBase Review Board added a comment - Message from: "Carl Steinbach" <carl@cloudera.com> ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: http://review.cloudera.org/r/895/ ----------------------------------------------------------- Review request for Hive Developers and John Sichi. Summary ------- The patch increases the length of various properties in the Metastore OR mapping. Properties which are currently indexed, or which we may want to index in the future were increased to a length of 767 bytes. Properties which are not indexed and which we are unlikely to ever want to index were increased to a max length of 4000 bytes. I also removed the PK constraint on the COLUMNS.TYPE_NAME field. This addresses bug HIVE-1364 . http://issues.apache.org/jira/browse/HIVE-1364 Diffs metastore/src/model/package.jdo 527f4b2 Diff: http://review.cloudera.org/r/895/diff Testing ------- Thanks, Carl
          Hide
          jvs John Sichi added a comment -

          Canceling patch since I think we should widen TYPE_NAME and also drop it from the PRIMARY KEY on the COLUMNS table.

          Show
          jvs John Sichi added a comment - Canceling patch since I think we should widen TYPE_NAME and also drop it from the PRIMARY KEY on the COLUMNS table.
          Hide
          jvs John Sichi added a comment -

          (Actually TYPE_NAME is what was needed for HIVE-1632, not FTYPE, but I think we should address both.)

          Show
          jvs John Sichi added a comment - (Actually TYPE_NAME is what was needed for HIVE-1632 , not FTYPE, but I think we should address both.)
          Hide
          jvs John Sichi added a comment -

          Just hit a case where TYPE_NAME in the COLUMNS table (limit 128 characters) was waaay too short. Is there a good reason the type name is part of the primary key for the COLUMNS table?

          Show
          jvs John Sichi added a comment - Just hit a case where TYPE_NAME in the COLUMNS table (limit 128 characters) was waaay too short. Is there a good reason the type name is part of the primary key for the COLUMNS table?
          Hide
          jvs John Sichi added a comment -

          See HIVE-1632 for an ftype use case.

          Show
          jvs John Sichi added a comment - See HIVE-1632 for an ftype use case.
          Hide
          jvs John Sichi added a comment -

          I think we should go ahead with changing everything. Even ftype could get large for complex nested types, right Ashish?

          But I really think if we're going to do this, we should go to LOB unless I'm wrong about it being possible to do in-place.

          Side note: I recently found out that in InnoDB (at least the version in use at Facebook), the first 512 bytes of each LOB is stored inline in the row, and there's an overall inline row limit of 8K, or something like that. So a table with a lot of LOB columns can still hit a limit, which is annoying.

          Show
          jvs John Sichi added a comment - I think we should go ahead with changing everything. Even ftype could get large for complex nested types, right Ashish? But I really think if we're going to do this, we should go to LOB unless I'm wrong about it being possible to do in-place. Side note: I recently found out that in InnoDB (at least the version in use at Facebook), the first 512 bytes of each LOB is stored inline in the row, and there's an overall inline row limit of 8K, or something like that. So a table with a lot of LOB columns can still hit a limit, which is annoying.
          Hide
          cwsteinbach Carl Steinbach added a comment -

          MetaStore upgrade scripts are covered in HIVE-1427.

          @ John & Ashish: Please tell me how you want me to change this patch. I'm willing to revert these limits to their original values. Are there any properties besides ftype which you want me to revert? Should I revert everything except SERDEPROPERTIES.PARAM_VALUE?

          Show
          cwsteinbach Carl Steinbach added a comment - MetaStore upgrade scripts are covered in HIVE-1427 . @ John & Ashish: Please tell me how you want me to change this patch. I'm willing to revert these limits to their original values. Are there any properties besides ftype which you want me to revert? Should I revert everything except SERDEPROPERTIES.PARAM_VALUE?
          Hide
          cwsteinbach Carl Steinbach added a comment -

          Also why do we make everything 4000 bytes - I presume things like ftype will never hit that limit.

          Currently the ORM is the de facto enforcement mechanism for string length limitations. I think this is a bad approach since 1) users can work around it by manually altering the underlying tables, and 2) the limits are stated in terms of bytes so the actual length restriction in terms of number of characters will depend on the character set of the underlying DB. In light of this I bumped every size limit to 4000 bytes, and also because I did not want to try to predict which property length limit someone would next bump into. I'm willing to revert these limits to their original values. Are there any properties besides ftype which you want me to revert? Should I revert everything except SERDEPROPERTIES.PARAM_VALUE?

          Also changes to upgrade SQL should also be a part of the patch, no? Where are the scripts for the view change located?

          I'll update the patch with the necessary scripts. Should these go in bin/ or somewhere under metastore/ ?

          @John: Yes, I think this falls under the responsibility of the release manager. I will take care of it.

          I think the current approach of using the ORM as the de facto enforcement mechanism for checking

          Show
          cwsteinbach Carl Steinbach added a comment - Also why do we make everything 4000 bytes - I presume things like ftype will never hit that limit. Currently the ORM is the de facto enforcement mechanism for string length limitations. I think this is a bad approach since 1) users can work around it by manually altering the underlying tables, and 2) the limits are stated in terms of bytes so the actual length restriction in terms of number of characters will depend on the character set of the underlying DB. In light of this I bumped every size limit to 4000 bytes, and also because I did not want to try to predict which property length limit someone would next bump into. I'm willing to revert these limits to their original values. Are there any properties besides ftype which you want me to revert? Should I revert everything except SERDEPROPERTIES.PARAM_VALUE? Also changes to upgrade SQL should also be a part of the patch, no? Where are the scripts for the view change located? I'll update the patch with the necessary scripts. Should these go in bin/ or somewhere under metastore/ ? @John: Yes, I think this falls under the responsibility of the release manager. I will take care of it. I think the current approach of using the ORM as the de facto enforcement mechanism for checking
          Hide
          jvs John Sichi added a comment -

          Currently the view scripts are only in the wiki:

          http://wiki.apache.org/hadoop/Hive/ViewDev#Metastore_Upgrades

          Per discussion with Ashish, we should open a separate JIRA issue for (at a minimum) packaging up example MySQL migration scripts (cumulative across all schema changes from 0.5 to 0.6) and explaining what to do with them in the release notes. Carl, do you want to take that on as part of release mgmt?

          Show
          jvs John Sichi added a comment - Currently the view scripts are only in the wiki: http://wiki.apache.org/hadoop/Hive/ViewDev#Metastore_Upgrades Per discussion with Ashish, we should open a separate JIRA issue for (at a minimum) packaging up example MySQL migration scripts (cumulative across all schema changes from 0.5 to 0.6) and explaining what to do with them in the release notes. Carl, do you want to take that on as part of release mgmt?
          Hide
          athusoo Ashish Thusoo added a comment -

          Also why do we make everything 4000 bytes - I presume things like ftype will never hit that limit.

          Also changes to upgrade SQL should also be a part of the patch, no? Where are the scripts for the view change located?

          Show
          athusoo Ashish Thusoo added a comment - Also why do we make everything 4000 bytes - I presume things like ftype will never hit that limit. Also changes to upgrade SQL should also be a part of the patch, no? Where are the scripts for the view change located?
          Hide
          jvs John Sichi added a comment -

          I'm not sure we need to use the deprecation approach. In Java land, it's all just String regardless of the underlying character precision in the DB.

          For existing metastores, people are already going to need to run upgrade SQL commands against their metastore DB's when upgrading to 0.6 because of the new support for views. We can just add on to those scripts.

          Show
          jvs John Sichi added a comment - I'm not sure we need to use the deprecation approach. In Java land, it's all just String regardless of the underlying character precision in the DB. For existing metastores, people are already going to need to run upgrade SQL commands against their metastore DB's when upgrading to 0.6 because of the new support for views. We can just add on to those scripts.
          Hide
          cwsteinbach Carl Steinbach added a comment -

          @Prasad: It's possible that people who ran into problems were before were using a version of MySQL older than 5.0.3. These versions supported a 255 byte max length for VARCHARs. It's also possible that older versions of the package.jdo mapping contained more indexes, in which case the 767 byte limit holds. Also, UTF encoding should not make a difference since these are byte lengths, not character lengths.

          @John: I think using LOBs is the right approach, but perhaps we should handle that problem in a different ticket? I don't think we can just change the mapping to use LOB instead of VARCHAR, and will instead have to add a new LOB column, deprecate the old VARCHAR column, and create an accessor that is capable of using either column.

          Show
          cwsteinbach Carl Steinbach added a comment - @Prasad: It's possible that people who ran into problems were before were using a version of MySQL older than 5.0.3. These versions supported a 255 byte max length for VARCHARs. It's also possible that older versions of the package.jdo mapping contained more indexes, in which case the 767 byte limit holds. Also, UTF encoding should not make a difference since these are byte lengths, not character lengths. @John: I think using LOBs is the right approach, but perhaps we should handle that problem in a different ticket? I don't think we can just change the mapping to use LOB instead of VARCHAR, and will instead have to add a new LOB column, deprecate the old VARCHAR column, and create an accessor that is capable of using either column.
          Hide
          cwsteinbach Carl Steinbach added a comment -

          HIVE-1364.2.patch.txt:

          • Change PARTITIONS.PART_NAME max length back to 767
          Show
          cwsteinbach Carl Steinbach added a comment - HIVE-1364 .2.patch.txt: Change PARTITIONS.PART_NAME max length back to 767
          Hide
          jvs John Sichi added a comment -

          Also, PART_NAME in table PARTITIONS needs to remain as is, since it is covered by an index.

          Show
          jvs John Sichi added a comment - Also, PART_NAME in table PARTITIONS needs to remain as is, since it is covered by an index.
          Hide
          jvs John Sichi added a comment -

          Couldn't we just use a LOB? So far so good with views (except for the Oracle mapping which we still need to address to get it to use CLOB instead of LONG VARCHAR).

          Show
          jvs John Sichi added a comment - Couldn't we just use a LOB? So far so good with views (except for the Oracle mapping which we still need to address to get it to use CLOB instead of LONG VARCHAR).
          Hide
          prasadc Prasad Chakka added a comment -

          it used to be much higher in the beginning but quite a few users reported problems on some mysql dbs. 767 seemed to work most dbs. before committing this can someone test this on some different dbs (with and without UTF encoding)?

          Show
          prasadc Prasad Chakka added a comment - it used to be much higher in the beginning but quite a few users reported problems on some mysql dbs. 767 seemed to work most dbs. before committing this can someone test this on some different dbs (with and without UTF encoding)?
          Hide
          cwsteinbach Carl Steinbach added a comment -

          The max length of a VARCHAR in Oracle (9i/10g/11g) is 4000 bytes, so
          I suppose it makes sense to use 4000 as the default size for string properties
          that we don't need to index.

          Show
          cwsteinbach Carl Steinbach added a comment - The max length of a VARCHAR in Oracle (9i/10g/11g) is 4000 bytes, so I suppose it makes sense to use 4000 as the default size for string properties that we don't need to index.

            People

            • Assignee:
              cwsteinbach Carl Steinbach
              Reporter:
              cwsteinbach Carl Steinbach
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development