Details

      Description

      Similar to HIVE-4248, Parquet tries to write very large "row groups". This causes Hive to run out of memory during dynamic partition inserts, when a reducer may have many Parquet files open at a given time.

      As such, we should implement a memory manager which ensures that we don't run out of memory due to writing too many row groups within a single JVM.
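
To make the failure mode concrete, here is a rough worked example; the 128 MB figure matches the usual `parquet.block.size` default in parquet-mr, and the writer count is purely illustrative:

```java
public class RowGroupMemoryDemo {
    // Worst-case bytes buffered when every open writer fills a full row group.
    static long worstCaseBytes(long rowGroupBytes, int openWriters) {
        return rowGroupBytes * openWriters;
    }

    public static void main(String[] args) {
        long rowGroupBytes = 128L * 1024 * 1024; // common parquet.block.size default
        int openWriters = 50;                    // one writer per dynamic partition in a reducer
        long demand = worstCaseBytes(rowGroupBytes, openWriters);
        System.out.println(demand + " bytes buffered in the worst case");
        // 50 writers x 128 MB = 6400 MB of buffered row groups, far beyond a
        // typical reducer heap, which is why a shared memory manager must cap
        // the per-writer buffer sizes.
    }
}
```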

      1. HIVE-7685.1.patch
        1 kB
        Dong Chen
      2. HIVE-7685.patch
        2 kB
        Dong Chen
      3. HIVE-7685.1.patch.ready
        3 kB
        Dong Chen
      4. HIVE-7685.patch.ready
        3 kB
        Dong Chen

        Issue Links

          Activity

          sladymon Shannon Ladymon added a comment -

          HIVE-11598 addresses the general issue of adding more Parquet documentation to the Hive wiki.

          brocknoland Brock Noland added a comment -

          But should it be documented in Hive's wiki even though it's a Parquet parameter, since it's in HiveConf.java?

          Yes, this was implemented specifically for Hive users who cannot easily control the number of partitions being written so I think it makes sense to doc in the hive-parquet docs...

          leftylev Lefty Leverenz added a comment -

          Doc note: This adds parquet.memory.pool.ratio to HiveConf.java.

          This config parameter is defined in Parquet, so it does not start with 'hive.'

          But should it be documented in Hive's wiki even though it's a Parquet parameter, since it's in HiveConf.java?
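
For reference, the parameter can be set like any other session property. A hedged example follows; the 0.5 value is illustrative, and the 0.95 default is parquet-mr's reported default for this ratio, not something taken from this patch:

```sql
-- Let Parquet writers collectively use at most half of the JVM heap.
-- The ratio is a fraction of total heap; parquet-mr's default is 0.95.
SET parquet.memory.pool.ratio = 0.5;
```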

          brocknoland Brock Noland added a comment -

          Thank you Dong Chen! I have committed this to trunk!

          brocknoland Brock Noland added a comment -

          +1

          dongc Dong Chen added a comment -

          After verification, the value is correctly passed down.

          hiveqa Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12689425/HIVE-7685.1.patch

          ERROR: -1 due to 2 failed/errored test(s), 6723 tests executed
          Failed tests:

          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_optimize_nullscan
          org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_list_bucket_dml_10
          

          Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2217/testReport
          Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2217/console
          Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2217/

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests exited with: TestsFailedException: 2 tests failed
          

          This message is automatically generated.

          ATTACHMENT ID: 12689425 - PreCommit-HIVE-TRUNK-Build

          dongc Dong Chen added a comment -

          Brock, thanks for your quick feedback!
          Yes, it is passed down, and I verified it in the Hive + Parquet integration environment:
          1. checked the value in the log;
          2. inserted about 2 GB of data into 5 partitions. It works fine with this change and OOMs without it.

          Since the check was done several days ago, I will double-check it today and report the result.

          brocknoland Brock Noland added a comment -

          Thank you Dong Chen! I have not checked; are we sure that the values in HiveConf are correctly passed down to the Parquet writer?

          dongc Dong Chen added a comment -

          Hi Brock Noland,

          Now that PARQUET-108 is resolved, I think the attached patch HIVE-7685.1.patch should be ok for Hive to use the Parquet memory manager. Could you please help review it?

          This patch adds one parameter to HiveConf; its name does not start with 'hive.' since it is actually defined in the Parquet project.

          hiveqa Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12685972/HIVE-7685.patch

          ERROR: -1 due to 2 failed/errored test(s), 6699 tests executed
          Failed tests:

          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_optimize_nullscan
          org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_ql_rewrite_gbtoidx_cbo_1
          

          Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2006/testReport
          Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2006/console
          Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2006/

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests exited with: TestsFailedException: 2 tests failed
          

          This message is automatically generated.

          ATTACHMENT ID: 12685972 - PreCommit-HIVE-TRUNK-Build

          dongc Dong Chen added a comment -

          Updated the config parameter name in HiveConf; it is defined in Parquet.
          This patch is ready and will be verified once PARQUET-108 is resolved.

          dongc Dong Chen added a comment -

          Thanks for the reminder.
          Since this patch cannot build right now (it depends on PARQUET-108 being resolved), I will rename it to trigger the test later.

          Ferd Ferdinand Xu added a comment -

          Hi,
          I am afraid the file name extension ".ready" will not trigger the Hive QA CI test. Better to change it to "*.patch".

          dongc Dong Chen added a comment -

          This patch adds a hook in Hive to use the Parquet memory manager from PARQUET-108.

          When PARQUET-108 gets committed to trunk and packaged in Maven (1.6.0 or 1.6.0rc3), this patch should work. I will track it then.

          dongc Dong Chen added a comment -

          Sure, I will take PARQUET-108 and put the manager in Parquet.

          brocknoland Brock Noland added a comment - edited

          Hi Dong,

          Ok, thank you for the investigation. I think we can either put the Parquet memory manager in Parquet or add APIs to expose the information required to implement the memory manager in Hive. Either approach is fine by me; we can take this work up in PARQUET-108.

          Brock

          dongc Dong Chen added a comment -

          Hi Brock,

          I think a brief design for this memory manager is:
          Every new writer registers itself with the manager, which has an overall view of all the writers. When a condition is met (such as every 1000 rows), it notifies the writers to check memory usage and flush if necessary.

          However, a Parquet-specific problem is: Hive only has a wrapper around ParquetRecordWriter, and ParquetRecordWriter itself wraps the real writer (InternalParquetRecordWriter) in the Parquet project. Since measuring the dynamic buffer size and flushing are private to the real writer, I think we also have to add code to InternalParquetRecordWriter to implement the memory manager functionality.

          It seems that changing only Hive code cannot fix this JIRA.
          I am not sure whether we should move this problem to the Parquet project and fix it there, given that it is generic and not Hive-specific.

          Any other ideas?

          Best Regards,
          Dong
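
A minimal sketch of the register-and-rebalance design described above. All names (`MemoryManager`, `ManagedWriter`, `setRowGroupTarget`) are hypothetical, not the actual Parquet or Hive API, and the sketch models one plausible strategy (shrinking each writer's row-group target) rather than the flush-on-notify variant also mentioned:

```java
import java.util.ArrayList;
import java.util.List;

public class MemoryManagerSketch {
    // Hypothetical writer interface: the manager only needs a way to adjust
    // each writer's target row-group size.
    interface ManagedWriter {
        void setRowGroupTarget(long bytes);
    }

    static class Writer implements ManagedWriter {
        long target;
        public void setRowGroupTarget(long bytes) { target = bytes; }
    }

    static class MemoryManager {
        private final long poolBytes;            // e.g. heap size * pool ratio
        private final long configuredRowGroup;   // user-configured row-group size
        private final List<ManagedWriter> writers = new ArrayList<>();

        MemoryManager(long poolBytes, long configuredRowGroup) {
            this.poolBytes = poolBytes;
            this.configuredRowGroup = configuredRowGroup;
        }

        // Every new writer registers itself; the manager rebalances so the
        // sum of all row-group targets never exceeds the pool.
        void register(ManagedWriter w) {
            writers.add(w);
            rebalance();
        }

        // Equal share of the pool per writer, capped at the configured size.
        private void rebalance() {
            long share = Math.min(configuredRowGroup, poolBytes / writers.size());
            for (ManagedWriter w : writers) {
                w.setRowGroupTarget(share);
            }
        }
    }

    public static void main(String[] args) {
        MemoryManager mm = new MemoryManager(1024, 512);
        Writer a = new Writer();
        mm.register(a);               // one writer: capped at the configured 512
        Writer b = new Writer();
        Writer c = new Writer();
        mm.register(b);
        mm.register(c);               // three writers: 1024 / 3 = 341 each
        System.out.println(a.target + " " + b.target + " " + c.target);
    }
}
```

Registering on writer creation keeps the manager's view complete without a separate scan, which is why every new writer calls in before writing its first row.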


            People

            • Assignee: dongc Dong Chen
            • Reporter: brocknoland Brock Noland
            • Votes: 0
            • Watchers: 12
