Hive / HIVE-4963

Support in memory PTF partitions

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.0
    • Component/s: PTF-Windowing
    • Labels:

      Description

      PTF partitions take the defensive approach of assuming that partitions will not fit in memory. As a result, there is significant deserialization overhead when accessing elements.

      Allow the user to specify that there is enough memory to hold partitions through a 'hive.ptf.partition.fits.in.mem' option.

      Savings depend on partition size and, in the case of windowing, on the number of UDAFs and the window ranges. For example, for the following (admittedly extreme) case the PTFOperator execution time went from 39 seconds to 8 seconds.

      select t, s, i, b, f, d,
      min(t) over(partition by 1 rows between unbounded preceding and current row), 
      min(s) over(partition by 1 rows between unbounded preceding and current row), 
      min(i) over(partition by 1 rows between unbounded preceding and current row), 
      min(b) over(partition by 1 rows between unbounded preceding and current row) 
      from over10k
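
      To make the window in the query above concrete: "partition by 1" puts every row into a single partition, and "rows between unbounded preceding and current row" makes each min() a running minimum over the rows seen so far. The sketch below is only an illustration of that semantics, not Hive code. The class and method names are hypothetical, it assumes non-null integer values, and it takes the row order as given (the query has no ORDER BY, so the actual order is not specified).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hive: shows what a running-minimum window computes.
class RunningMin {
    // For each position i, returns the minimum of values[0..i], i.e. the result of
    // min(x) over (rows between unbounded preceding and current row)
    // for one partition in the given row order. Assumes no null values.
    static List<Integer> runningMin(List<Integer> values) {
        List<Integer> out = new ArrayList<>(values.size());
        Integer best = null;
        for (Integer v : values) {
            if (best == null || v < best) {
                best = v; // a new smallest value has been seen
            }
            out.add(best);
        }
        return out;
    }
}
```

      Every output row needs access to the rows before it in the partition, which is why the cost of reading partition elements matters so much for this query.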
      
      1. HIVE-4963.D11955.1.patch
        28 kB
        Phabricator
      2. HIVE-4963.D12279.1.patch
        68 kB
        Phabricator
      3. HIVE-4963.D12279.2.patch
        68 kB
        Phabricator
      4. HIVE-4963.D12279.3.patch
        80 kB
        Phabricator
      5. PTFRowContainer.patch
        19 kB
        Harish Butani

          Activity

          Lefty Leverenz added a comment -

          Could someone either document this on the Wiki or explain it to me?

          The wiki doesn't have a section about PTFs yet, and the description of hive.join.cache.size hasn't been changed since Hive 0.5.0: "How many rows in the joining tables (except the streaming table) should be cached in memory."

          So I'm adding a TODOC12 label. What should the wiki say?

          Ashutosh Chauhan added a comment -

          This issue has been fixed and released as part of 0.12 release. If you find further issues, please create a new jira and link it to this one.

          Harish Butani added a comment -

          Sorry, I forgot to respond.
          The original plan was to have the user give a hint on whether partitions fit in memory. This would help reduce serialization/deserialization cost when partitions fit in memory. But based on discussions with Ashutosh, we decided to move to using RowContainers for holding rows in a Partition; this way we share the same code as Joins and get the functionality and performance benefits of using RowContainers. PTFPartitions are now controlled by ConfVars.HIVEJOINCACHESIZE; use of ConfVars.HIVE_PTF_PARTITION_PERSISTENT_SIZE has been removed.
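
          As a rough illustration of a container whose in-memory capacity is set as a number of rows (the way hive.join.cache.size is described earlier in this thread), here is a minimal sketch. It is not Hive's RowContainer or PTFRowContainer: the class name, the use of String rows, the temp-file spill and Java serialization are all assumptions made for the example, and the real classes may spill in a different way.

```java
import java.io.Closeable;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Hive's RowContainer.
class SpillingRowBuffer implements Closeable {
    private final int cacheSize; // max rows kept in memory (a row count, not bytes)
    private final List<String> inMemory = new ArrayList<>();
    private File spillFile;
    private ObjectOutputStream spillOut;
    private int spilledRows = 0;

    SpillingRowBuffer(int cacheSize) {
        this.cacheSize = cacheSize;
    }

    // Keeps the row in memory while there is room; otherwise serializes it to disk.
    void add(String row) throws IOException {
        if (inMemory.size() < cacheSize) {
            inMemory.add(row);
            return;
        }
        if (spillOut == null) {
            spillFile = File.createTempFile("rows", ".spill");
            spillFile.deleteOnExit();
            spillOut = new ObjectOutputStream(new FileOutputStream(spillFile));
        }
        spillOut.writeObject(row);
        spilledRows++;
    }

    int spilledRows() {
        return spilledRows;
    }

    // Returns all rows in insertion order: in-memory rows first, then spilled rows,
    // which have to be deserialized again to be read.
    List<String> readAll() throws IOException, ClassNotFoundException {
        List<String> all = new ArrayList<>(inMemory);
        if (spillOut != null) {
            spillOut.flush();
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(spillFile))) {
                for (int i = 0; i < spilledRows; i++) {
                    all.add((String) in.readObject());
                }
            }
        }
        return all;
    }

    @Override
    public void close() throws IOException {
        if (spillOut != null) {
            spillOut.close();
        }
        if (spillFile != null) {
            spillFile.delete();
        }
    }
}
```

          In this sketch a capacity of 0 sends every row to disk, while a capacity at least as large as the partition keeps everything in memory and avoids the serialization round trip entirely.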

          Lars Francke added a comment -

          Could someone either document this on the Wiki or explain it to me? The proposed configuration parameter hive.ptf.partition.fits.in.mem does not seem to be added by this patch. Instead it uses hive.join.cache.size, correct? What are the semantics of this?

          Hudson added a comment -

          ABORTED: Integrated in Hive-trunk-hadoop2 #380 (See https://builds.apache.org/job/Hive-trunk-hadoop2/380/)
          HIVE-4963 : Support in memory PTF partitions (Harish Butani via Ashutosh Chauhan) (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1517236)

          • /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPersistence.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/RowContainer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/PTFTranslator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDesc.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDeserializer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLeadLag.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/NPath.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionEvaluator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionResolver.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/WindowingTableFunction.java
          • /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestPTFRowContainer.java
          • /hive/trunk/ql/src/test/queries/clientpositive/ptf_reuse_memstore.q
          • /hive/trunk/ql/src/test/queries/clientpositive/windowing_adjust_rowcontainer_sz.q
          • /hive/trunk/ql/src/test/results/clientpositive/ptf_reuse_memstore.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/windowing_adjust_rowcontainer_sz.q.out
          Hudson added a comment -

          FAILURE: Integrated in Hive-trunk-h0.21 #2288 (See https://builds.apache.org/job/Hive-trunk-h0.21/2288/)
          HIVE-4963 : Support in memory PTF partitions (Harish Butani via Ashutosh Chauhan) (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1517236)

          • /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPersistence.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/RowContainer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/PTFTranslator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDesc.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDeserializer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLeadLag.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/NPath.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionEvaluator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionResolver.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/WindowingTableFunction.java
          • /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestPTFRowContainer.java
          • /hive/trunk/ql/src/test/queries/clientpositive/ptf_reuse_memstore.q
          • /hive/trunk/ql/src/test/queries/clientpositive/windowing_adjust_rowcontainer_sz.q
          • /hive/trunk/ql/src/test/results/clientpositive/ptf_reuse_memstore.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/windowing_adjust_rowcontainer_sz.q.out
          Hudson added a comment -

          FAILURE: Integrated in Hive-trunk-hadoop1-ptest #137 (See https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/137/)
          HIVE-4963 : Support in memory PTF partitions (Harish Butani via Ashutosh Chauhan) (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1517236)

          • /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPersistence.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/RowContainer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/PTFTranslator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDesc.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDeserializer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLeadLag.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/NPath.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionEvaluator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionResolver.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/WindowingTableFunction.java
          • /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestPTFRowContainer.java
          • /hive/trunk/ql/src/test/queries/clientpositive/ptf_reuse_memstore.q
          • /hive/trunk/ql/src/test/queries/clientpositive/windowing_adjust_rowcontainer_sz.q
          • /hive/trunk/ql/src/test/results/clientpositive/ptf_reuse_memstore.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/windowing_adjust_rowcontainer_sz.q.out
          Hudson added a comment -

          FAILURE: Integrated in Hive-trunk-hadoop2-ptest #69 (See https://builds.apache.org/job/Hive-trunk-hadoop2-ptest/69/)
          HIVE-4963 : Support in memory PTF partitions (Harish Butani via Ashutosh Chauhan) (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1517236)

          • /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFOperator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPersistence.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/RowContainer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/PTFTranslator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDesc.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDeserializer.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLeadLag.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/NPath.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionEvaluator.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionResolver.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/WindowingTableFunction.java
          • /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestPTFRowContainer.java
          • /hive/trunk/ql/src/test/queries/clientpositive/ptf_reuse_memstore.q
          • /hive/trunk/ql/src/test/queries/clientpositive/windowing_adjust_rowcontainer_sz.q
          • /hive/trunk/ql/src/test/results/clientpositive/ptf_reuse_memstore.q.out
          • /hive/trunk/ql/src/test/results/clientpositive/windowing_adjust_rowcontainer_sz.q.out
          Ashutosh Chauhan added a comment -

          Committed to trunk. Thanks, Harish!

          Phabricator added a comment -

          ashutoshc has accepted the revision "HIVE-4963 [jira] Support in memory PTF partitions".

          +1

          REVISION DETAIL
          https://reviews.facebook.net/D12279

          BRANCH
          HIVE-4963-2

          ARCANIST PROJECT
          hive

          To: JIRA, ashutoshc, hbutani

          Phabricator added a comment -

          hbutani updated the revision "HIVE-4963 [jira] Support in memory PTF partitions".

          • Merge branch 'trunk' into HIVE-4963-2
          • changes based on review.
          • fix lint issues

          Reviewers: JIRA, ashutoshc

          REVISION DETAIL
          https://reviews.facebook.net/D12279

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D12279?vs=38391&id=38745#toc

          AFFECTED FILES
          ql/src/java/org/apache/hadoop/hive/ql/exec/PTFOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPersistence.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/RowContainer.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/PTFTranslator.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDeserializer.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLeadLag.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/NPath.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionEvaluator.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionResolver.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/WindowingTableFunction.java
          ql/src/test/org/apache/hadoop/hive/ql/exec/persistence/TestPTFRowContainer.java
          ql/src/test/queries/clientpositive/windowing_adjust_rowcontainer_sz.q
          ql/src/test/results/clientpositive/windowing_adjust_rowcontainer_sz.q.out

          To: JIRA, ashutoshc, hbutani

          Edward Capriolo added a comment -

          "Yes, this is much more work to do. More importantly, it's not PTF specific either; it's in existing code which Harish has chosen to reuse. I don't think it's fair to hold on to this patch for this. It can be done in a follow-up."

          Agreed. If this extends an existing component that already does it this way, changing both is out of scope.

          Ashutosh Chauhan added a comment -

          Harish, can you also get rid of the config variables in HiveConf that were about the size of the persistence byte list? Those will become irrelevant after this patch.
          Also, do you think we can word the title of this jira better, so it helps folks understand this work?

          Ashutosh Chauhan added a comment -

          Yes, this is much more work to do. More importantly, it's not PTF specific either; it's in existing code which Harish has chosen to reuse. I don't think it's fair to hold on to this patch for this. It can be done in a follow-up.

          Edward Capriolo added a comment -

          "ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:89 This I think you need to do because current RowContainers can only hold crisp java objects. Seems like we can improve this by writing RowContainer which can hold writables, thus avoiding unnecessary deserialization and mem-cpy here. Something worth exploring as follow-up issue."

          Is it much more work to do this now? There are already a number of PTF to-be-cleaned-ups and I would hate to add more.

          Phabricator added a comment -

          ashutoshc has commented on the revision "HIVE-4963 [jira] Support in memory PTF partitions".

          Seems like there are more opportunities to make this efficient, but those can be dug into later. This patch is a step in the right direction by reusing existing infra. Any improvements we now make may benefit other spilling operators like join too. Really makes me happy : )
          Apart from the code comments, I would also ask you to add a testcase which sets the config value (cachesize) to zero, so that it spills for every record and exercises all these new code paths.

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:89 This I think you need to do because current RowContainers can only hold crisp java objects. Seems like we can improve this by writing RowContainer which can hold writables, thus avoiding unnecessary deserialization and mem-cpy here. Something worth exploring as follow-up issue.
          ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:57 This config should really govern how much memory we are willing to allocate (in bytes), not the number of rows, but that's a topic for another jira since you are reusing existing code.
          ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:148 This sanity check is in a tight loop. Ideally we should not have such checks in an inner loop. But let's leave it here till we get more confidence in the code. It would be good to add a note about what the assumption will be if we are to get rid of this check in future.
          ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:137 Instead of try-catch-rethrow, shall we just add throws to the method signature? It makes the code more readable and arguably faster.
          ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java:160 Similar comment about try-catch-rethrow.
          ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java:80 Awesome comments!
          ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java:94 If I get this right, this function will again do serialization before spilling, so in case of memory pressure, we are doing a round trip of ser-deser without performing useful work. This ties back to my earlier comment on eager deserialization.
          This whole mechanism is worth exploring later.
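
          The try-catch-rethrow remark in the inline comments can be illustrated with a small sketch. Everything here is hypothetical (the exception type, the accessor and both method names); it is not the code in PTFPartition.java, only a contrast between the two styles being discussed.

```java
import java.util.List;

// Hypothetical example, not taken from the patch.
class RethrowVsThrows {
    // Stand-in for a project-specific checked exception.
    static class RowAccessException extends Exception {
        RowAccessException(String message) {
            super(message);
        }

        RowAccessException(Throwable cause) {
            super(cause);
        }
    }

    // Low-level accessor that can fail with the checked exception.
    static String fetch(List<String> rows, int i) throws RowAccessException {
        if (i < 0 || i >= rows.size()) {
            throw new RowAccessException("no row at index " + i);
        }
        return rows.get(i);
    }

    // Style 1: catch and rethrow. The wrapper adds a layer to the exception chain
    // without adding any new information.
    static String getAtWrapped(List<String> rows, int i) throws RowAccessException {
        try {
            return fetch(rows, i);
        } catch (RowAccessException e) {
            throw new RowAccessException(e);
        }
    }

    // Style 2: declare the exception and let it propagate unchanged.
    static String getAtDeclared(List<String> rows, int i) throws RowAccessException {
        return fetch(rows, i);
    }
}
```

          Both methods return the same value on success; on failure the second one surfaces the original exception directly, with less code in the method body.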

          REVISION DETAIL
          https://reviews.facebook.net/D12279

          To: JIRA, ashutoshc, hbutani

          Edward Capriolo added a comment -

          I have a couple small comments.

          I do not think we need the variable sz. Can't we determine the size from the collection? There are also a couple of places where we are using ArrayList as the declared type on the left side.

          Phabricator added a comment -

          hbutani updated the revision "HIVE-4963 [jira] Support in memory PTF partitions".

          • Merge remote-tracking branch 'origin' into HIVE-4963-2
          • update RowContainer based on template parameter change from Row to ROW

          Reviewers: JIRA, ashutoshc

          REVISION DETAIL
          https://reviews.facebook.net/D12279

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D12279?vs=37983&id=38391#toc

          AFFECTED FILES
          ql/src/java/org/apache/hadoop/hive/ql/exec/PTFOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPersistence.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/RowContainer.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/PTFTranslator.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDeserializer.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionEvaluator.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionResolver.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/WindowingTableFunction.java

          To: JIRA, ashutoshc, hbutani

          Harish Butani added a comment -

          No, XMLEncoder doesn't honor the transient qualifier.
          http://www.oracle.com/technetwork/java/persistence4-140124.html#transient
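          The behaviour described here can be sketched as follows. This is only an illustration, not the PTFUtils code from the patch: the bean `Desc`, its properties, and the helper `markTransient` are hypothetical names. The sketch assumes the standard java.beans mechanism of setting the "transient" attribute on a PropertyDescriptor, which XMLEncoder consults when deciding which properties to write.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.beans.XMLEncoder;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class TransientDemo {
    // Hypothetical bean standing in for a plan descriptor such as PTFDesc.
    public static class Desc {
        private String name = "default";
        private transient String scratch = "default"; // keyword alone is not honored by XMLEncoder
        private String cfg = "default";               // will be marked transient via BeanInfo

        public String getName() { return name; }
        public void setName(String v) { name = v; }
        public String getScratch() { return scratch; }
        public void setScratch(String v) { scratch = v; }
        public String getCfg() { return cfg; }
        public void setCfg(String v) { cfg = v; }
    }

    // Hypothetical helper: flag one bean property so that XMLEncoder skips it.
    static void markTransient(Class<?> beanClass, String property) {
        try {
            BeanInfo info = Introspector.getBeanInfo(beanClass);
            for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
                if (pd.getName().equals(property)) {
                    pd.setValue("transient", Boolean.TRUE);
                    return;
                }
            }
            throw new IllegalArgumentException("no such property: " + property);
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Serialize a bean to its XML form and return it as a string.
    static String encode(Object bean) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder enc = new XMLEncoder(out)) {
            enc.writeObject(bean);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

          In this sketch the `scratch` property is still written even though its field carries the transient keyword, because XMLEncoder works from the getter/setter pairs rather than from the fields; only the property flagged through the BeanInfo is left out.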

          Edward Capriolo added a comment -

          Why can't we mark the fields as transient? Do they need to be serialized in other contexts? If they need to be serialized sometimes and not others, maybe what we need is two different fields?

          Harish Butani added a comment -

          This is to get around the issue of XMLEncoder trying to serialize all fields with accessors.

          Edward Capriolo added a comment -

          Can you please describe why these calls are needed?

            PTFUtils.makeTransient(PTFDesc.class, "llInfo");
            PTFUtils.makeTransient(PTFDesc.class, "cfg");
          

          This looks like a code-smell. Is there any other way of handling this?

          Phabricator added a comment -

          hbutani requested code review of "HIVE-4963 [jira] Support in memory PTF partitions".

          Reviewers: JIRA, ashutoshc

          fix lint issues

          PTF partitions apply the defensive mode of assuming that partitions will not fit in memory. Because of this there is a significant deserialization overhead when accessing elements.

          Allow the user to specify that there is enough memory to hold partitions through a 'hive.ptf.partition.fits.in.mem' option.

          Savings depend on partition size and, in the case of windowing, on the number of UDAFs and the window ranges. For example, in the following (admittedly extreme) case the PTFOperator exec time went from 39 secs to 8 secs.

          select t, s, i, b, f, d,
          min(t) over(partition by 1 rows between unbounded preceding and current row),
          min(s) over(partition by 1 rows between unbounded preceding and current row),
          min(i) over(partition by 1 rows between unbounded preceding and current row),
          min(b) over(partition by 1 rows between unbounded preceding and current row)
          from over10k

          TEST PLAN
          EMPTY

          REVISION DETAIL
          https://reviews.facebook.net/D12279

          AFFECTED FILES
          ql/src/java/org/apache/hadoop/hive/ql/exec/PTFOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPersistence.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/RowContainer.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/PTFTranslator.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDeserializer.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionEvaluator.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionResolver.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/WindowingTableFunction.java


          To: JIRA, ashutoshc, hbutani

          Ashutosh Chauhan added a comment -

          Thanks for the explanation. Sounds good. Let's proceed with this.

          Harish Butani added a comment -
          • PTFRecordWriter is needed to provide access to the underlying SequenceFile.Writer, so that at the time of writing to the container we can record the locations in the file where the individual blocks start.
          • PTFHiveSequenceFileOutputFormat is there so that the getHiveRecordWriter call returns the PTFRecordWriter.
          • PTFSequenceFileRecordReader allows the PTFRowContainer to seek to the startOffset of a block. So a getAt request that needs to fetch data first figures out the split to read and then seeks to the startOffset, from where the RecordReader should start.
          • PTFSequenceFileInputFormat is needed to return the PTFSequenceFileRecordReader in the getRecordReader call.
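
          The offset bookkeeping described in these bullets can be sketched roughly as below. This is a simplified, self-contained illustration and not the PTFRowContainer code: the class `BlockIndexedRows` is hypothetical, a byte array stands in for the SequenceFile, rows are plain strings, and a fixed number of rows per block is assumed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class BlockIndexedRows {
    private final int rowsPerBlock;
    // Stands in for the spill file; a real container would write to disk.
    private final ByteArrayOutputStream file = new ByteArrayOutputStream();
    private final DataOutputStream writer = new DataOutputStream(file);
    // Byte offset at which each block starts, recorded at write time.
    private final List<Integer> blockStarts = new ArrayList<>();
    private int rowCount = 0;

    public BlockIndexedRows(int rowsPerBlock) {
        this.rowsPerBlock = rowsPerBlock;
    }

    public void add(String row) {
        try {
            if (rowCount % rowsPerBlock == 0) {
                blockStarts.add(writer.size()); // a new block begins here
            }
            writer.writeUTF(row);
            rowCount++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Random access: find the block, seek to its start offset, read forward.
    public String getAt(int i) {
        if (i < 0 || i >= rowCount) {
            throw new IndexOutOfBoundsException("row " + i);
        }
        int start = blockStarts.get(i / rowsPerBlock);
        byte[] bytes = file.toByteArray();
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes, start, bytes.length - start))) {
            String row = null;
            for (int k = 0; k <= i % rowsPerBlock; k++) {
                row = in.readUTF();
            }
            return row;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

          The point of the index is that a getAt call only deserializes rows from the start of one block up to the requested row, rather than scanning the file from the beginning.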
          Ashutosh Chauhan added a comment -

          Thanks a lot Harish Butani for digging into this. Much appreciated. I think this is the right direction to go. We should eventually get rid of ByteBasedList and friends and use this approach.

          One implementation question I have is why you needed to have PTFRecordWriter, PTFOutputFormat, PTFInputFormat etc. It seems they don't have any special logic. What's the reason we need those and can't simply use HiveSequenceFileOutputFormat and friends?

          Harish Butani added a comment -

          Ashutosh Chauhan, I have attached a patch with PTFRowContainer, which extends RowContainer. PTFRowContainer is different because it needs to provide random access to rows. PTFRowContainer would replace classes in PTFPersistence: ByteBasedList, PartitionedByteBasedList... PTFRowContainer does utilize a lot of the code from RowContainer; another advantage is that all data is in one SeqFile. Can you please take a look to see whether this approach is acceptable? I will work on connecting PTFPartition to PTFRowContainer.

          Ashutosh Chauhan added a comment -

          I would also suggest taking a look at how the Join Operator handles this. It has the same problem to solve and solves it in nearly the same fashion (at least conceptually). Instead of building an alternative infra for spilling to disk under memory load, it will be better to reuse those classes and mechanisms, if possible.

          Harish Butani added a comment -

          We already do this. The rows are accumulated in a ByteBasedList; when it fills up it is spilled to disk and a new ByteBasedList is added. So if fewer than 32 MB are needed (or whatever limit is set by the user), there is no I/O.
          The saving here comes from not holding the objects in a serialized form. Currently every field access goes through deserialization. InMemoryPartition was going to be the case where the user guarantees that there is enough memory, so we just hold the deserialized objects. I am working on a caching wrapper on the PTFPartition which would hold onto deserialized objects but be backed by the serialized bytes in case we run out of memory.

          But yes, it would be nice to merge these two concepts into one. There is an overhead in caching over InMemoryPartition: at least an extra serialization, potentially more in both time and space. But the overhead may not matter that much. Give me a couple of days to work through this.

          Ashutosh Chauhan added a comment -

          I have a question: while rows are accumulating we serialize and store them in PersistenceByteList (PBL), and once they cross the limit (32 MB) we spill the list to disk. Now, by adding this new config, we assume that since the accumulated data will fit into memory we don't need PBL, and we create a new type of PTFPartition. So what we are saving is the serialization into and deserialization out of this list. Is that correct? If so, I think a better way might be to not write the first 32 MB into PBL and just keep them in memory; once they cross the limit, serialize them at that time and dump them to disk.
          I don't like this new config knob, since the user has no way of knowing when to turn the flag on; it depends on both the query and the data. If we can get rid of this knob and do this smartly, that will be really cool.
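
          A rough sketch of the alternative suggested here, keeping deserialized rows in memory and only serializing them once a limit is crossed, might look like the following. It is an illustration under simplifying assumptions, not Hive code: the class `SpillingRowBuffer` is hypothetical, the limit is counted in rows rather than bytes, and rows are plain strings written to a temporary file.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class SpillingRowBuffer {
    private final int maxInMemoryRows;
    private final List<String> inMemory = new ArrayList<>(); // held deserialized
    private File spillFile;                                   // created lazily
    private int spilledRows = 0;

    public SpillingRowBuffer(int maxInMemoryRows) {
        this.maxInMemoryRows = maxInMemoryRows;
    }

    public void add(String row) {
        inMemory.add(row);
        if (inMemory.size() > maxInMemoryRows) {
            spill(); // serialization cost is paid only once the limit is crossed
        }
    }

    // Append all in-memory rows to the spill file and clear the buffer.
    private void spill() {
        try {
            if (spillFile == null) {
                spillFile = File.createTempFile("rows", ".spill");
                spillFile.deleteOnExit();
            }
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(spillFile, true)))) {
                for (String row : inMemory) {
                    out.writeUTF(row);
                }
            }
            spilledRows += inMemory.size();
            inMemory.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public int spilledRows() {
        return spilledRows;
    }

    // All rows in insertion order: spilled rows first, then the in-memory tail.
    public List<String> rows() {
        List<String> all = new ArrayList<>();
        if (spilledRows > 0) {
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(spillFile)))) {
                for (int i = 0; i < spilledRows; i++) {
                    all.add(in.readUTF());
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        all.addAll(inMemory);
        return all;
    }
}
```

          With a limit of zero every added row is spilled immediately, which gives a cheap way to exercise the spill path in a test.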

          Phabricator added a comment -

          hbutani requested code review of "HIVE-4963 [jira] Support in memory PTF partitions".

          Reviewers: JIRA, ashutoshc

          fix lint issues

          PTF partitions apply the defensive mode of assuming that partitions will not fit in memory. Because of this there is a significant deserialization overhead when accessing elements.

          Allow the user to specify that there is enough memory to hold partitions through a 'hive.ptf.partition.fits.in.mem' option.

          Savings depend on partition size and, in the case of windowing, on the number of UDAFs and the window ranges. For example, in the following (admittedly extreme) case the PTFOperator exec time went from 39 secs to 8 secs.

          select t, s, i, b, f, d,
          min(t) over(partition by 1 rows between unbounded preceding and current row),
          min(s) over(partition by 1 rows between unbounded preceding and current row),
          min(i) over(partition by 1 rows between unbounded preceding and current row),
          min(b) over(partition by 1 rows between unbounded preceding and current row)
          from over10k

          TEST PLAN
          add a new test with option set to true

          REVISION DETAIL
          https://reviews.facebook.net/D11955

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/PTFOperator.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/PTFTranslator.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDesc.java
          ql/src/java/org/apache/hadoop/hive/ql/plan/PTFDeserializer.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionEvaluator.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/TableFunctionResolver.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/WindowingTableFunction.java
          ql/src/test/queries/clientpositive/windowing_inmempart.q
          ql/src/test/results/clientpositive/windowing_inmempart.q.out


          To: JIRA, ashutoshc, hbutani


            People

            • Assignee:
              Harish Butani
              Reporter:
              Harish Butani