HIVE-4639

Add has null flag to ORC internal index

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: File Formats
    • Labels:
      None
    • Release Note:
      Support for hasNull flag in ORC row group index.

      Description

      Adding a flag to each index entry that records whether the column had any null values in that entry's 10k rows would enable more predicate pushdown.
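
To illustrate the idea, here is a minimal sketch of how a reader could use such a flag to skip row groups for a `col IS NULL` predicate. The class and method names are hypothetical and are not taken from the patch or from Hive's reader code; the flags are passed in as a plain array rather than read from an ORC index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not Hive's actual ORC reader code.
class HasNullPruning {
    /**
     * Returns the indexes of the row groups that must be read for "col IS NULL".
     * hasNull[i] is the flag recorded for row group i (up to 10k rows each).
     */
    static List<Integer> rowGroupsToRead(boolean[] hasNull) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < hasNull.length; i++) {
            if (hasNull[i]) {
                // A null may be present, so this row group cannot be skipped.
                selected.add(i);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true, false, false, true};
        System.out.println(rowGroupsToRead(flags)); // prints [1, 4]
    }
}
```

Row groups whose flag is false are never touched, which is where the reduction in bytes read would come from.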

      1. HIVE-4639.3.patch
        284 kB
        Prasanth Jayachandran
      2. HIVE-4639.2.patch
        273 kB
        Prasanth Jayachandran
      3. HIVE-4639.1.patch
        147 kB
        Prasanth Jayachandran

          Activity

          Prasanth Jayachandran added a comment -

          Good catch, Lefty Leverenz! Updated the docs.

          Lefty Leverenz added a comment -

          Doc note: Prasanth Jayachandran documented this in the ORC wiki (ORC – Column Statistics).

          But it says the hasNull flag is added in 1.2.0 – shouldn't that be 1.1.0, since this jira's fix version is 0.15?

          Lefty Leverenz added a comment -

          Thanks Gopal V. I assume that means no documentation is needed, since this is internal and backward-compatible.

          Gopal V added a comment -

          For the sake of documentation: this does not change the ORC format version (i.e., ORC files with hasNull flags can be read by hive-14).

          Lefty Leverenz: FYI.

          Prasanth Jayachandran added a comment -

          Committed to trunk. Thanks Gopal V for the review and test run!

          Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12691023/HIVE-4639.3.patch

          ERROR: -1 due to 2 failed/errored test(s), 6747 tests executed
          Failed tests:

          org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver_hbase_joins
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_optimize_nullscan
          

          Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2311/testReport
          Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2311/console
          Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2311/

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests exited with: TestsFailedException: 2 tests failed
          

          This message is automatically generated.

          ATTACHMENT ID: 12691023 - PreCommit-HIVE-TRUNK-Build

          Prasanth Jayachandran added a comment -

          I missed a few test failure diffs in the previous patch. Added them in this patch.

          Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12690690/HIVE-4639.2.patch

          ERROR: -1 due to 8 failed/errored test(s), 6747 tests executed
          Failed tests:

          org.apache.hadoop.hive.ql.io.orc.TestOrcNullOptimization.testColumnsWithNullAndCompression
          org.apache.hadoop.hive.ql.io.orc.TestOrcNullOptimization.testMultiStripeWithNull
          org.apache.hadoop.hive.ql.io.orc.TestOrcNullOptimization.testMultiStripeWithoutNull
          org.apache.hadoop.hive.ql.io.orc.TestOrcSerDeStats.testOrcSerDeStatsComplex
          org.apache.hadoop.hive.ql.io.orc.TestOrcSerDeStats.testOrcSerDeStatsComplexOldFormat
          org.apache.hadoop.hive.ql.io.orc.TestOrcSerDeStats.testSerdeStatsOldFormat
          org.apache.hadoop.hive.ql.io.orc.TestOrcSerDeStats.testStringAndBinaryStatistics
          org.apache.hive.hcatalog.streaming.TestStreaming.testEndpointConnection
          

          Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2296/testReport
          Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2296/console
          Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2296/

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests exited with: TestsFailedException: 8 tests failed
          

          This message is automatically generated.

          ATTACHMENT ID: 12690690 - PreCommit-HIVE-TRUNK-Build

          Gopal V added a comment -

          Added this patch to my daily TPC-H 1 TB ETL and reloaded lineitem with the new format.

          Testing: select * from lineitem where l_shipdate is null;

          Before: 66.728 seconds (208774320430 bytes read)
          After: 7.87 seconds (539046900 bytes read)

          LGTM - +1.
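
For reference, the ratios implied by the figures above can be worked out with a short sketch. The class and method names are illustrative only; the inputs are the before/after numbers quoted in this comment.

```java
// Hypothetical helper for working out the improvement ratios quoted above.
class PushdownSpeedup {
    /** Ratio of the "before" measurement to the "after" measurement. */
    static double ratio(double before, double after) {
        return before / after;
    }

    public static void main(String[] args) {
        double timeRatio = ratio(66.728, 7.87);                  // seconds
        double bytesRatio = ratio(208774320430.0, 539046900.0);  // bytes read
        System.out.println("time ratio:  " + timeRatio);
        System.out.println("bytes ratio: " + bytesRatio);
    }
}
```

That works out to roughly 8.5 times less wall-clock time and roughly 387 times fewer bytes read for this one query.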

          Prasanth Jayachandran added a comment -

          Fixes test failures. All of them are file size diffs.

          Prasanth Jayachandran added a comment -

          As Gopal mentioned, we can infer the other stats from the existing information:
          all_nulls -> min = null
          no_nulls -> hasNull = false
          some_nulls -> hasNull = true, min != null
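
The mapping above can be sketched as a small function. This is only an illustration: the enum, the method name, and the use of a plain `Object` for the recorded minimum are assumptions, not the patch's actual types. Note that `min == null` is treated as all_nulls here, which, as discussed elsewhere in this thread, is also how a row group with no values is encoded.

```java
// Hypothetical sketch: the enum and method names are illustrative, not from the patch.
class NullStateInference {
    enum NullState { ALL_NULLS, SOME_NULLS, NO_NULLS }

    /**
     * Derives the null state of a row group from the hasNull flag and the
     * recorded minimum (null when no non-null value was seen).
     */
    static NullState infer(boolean hasNull, Object min) {
        if (min == null) {
            return NullState.ALL_NULLS;   // all_nulls -> min = null
        }
        // A non-null min exists, so the flag alone separates the remaining two states.
        return hasNull ? NullState.SOME_NULLS : NullState.NO_NULLS;
    }

    public static void main(String[] args) {
        System.out.println(infer(true, null));   // ALL_NULLS
        System.out.println(infer(true, 42L));    // SOME_NULLS
        System.out.println(infer(false, 42L));   // NO_NULLS
    }
}
```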

          Gopal V added a comment -

          Yes, we have that granularity locked up in two states (as a tri-state, now - all_nulls, some_nulls, no_nulls).

          We actually have all_nulls/no_values encoded as "min=null/max=null". This patch is the "some_nulls/no_nulls" boolean on top of that - though, that information is in somewhat non-obvious detail.

          Another thought: since we already have a whole long IS_PRESENT stream, I suspect storing the actual NULL count would be helpful if we need a heuristic for IS_NULL row-level predicate evaluation on wide de-normalized tables (i.e., read the filter column first and then avoid creating large vector batches for the rest).

          Owen O'Malley added a comment -

          You should encode four values:
          no_values, all_nulls, some_nulls, no_nulls

          This will allow you to support a richer set of sargs.
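
One way to picture why four values help is the sketch below, which maps each state to an answer for `IS NULL` and `IS NOT NULL`. Everything here is an assumption made for illustration: the enums are hypothetical and do not mirror Hive's search-argument classes, and treating no_values as "never matches" is this sketch's own choice.

```java
// Hypothetical sketch: names are illustrative and do not mirror Hive's sarg classes.
class FourStateNulls {
    enum NullState { NO_VALUES, ALL_NULLS, SOME_NULLS, NO_NULLS }
    enum Answer { ALWAYS, NEVER, MAYBE }

    /** Does "col IS NULL" hold for the rows of a row group in this state? */
    static Answer isNull(NullState s) {
        switch (s) {
            case ALL_NULLS:  return Answer.ALWAYS;
            case SOME_NULLS: return Answer.MAYBE;
            default:         return Answer.NEVER; // NO_NULLS, or NO_VALUES (no rows to match)
        }
    }

    /** Does "col IS NOT NULL" hold for the rows of a row group in this state? */
    static Answer isNotNull(NullState s) {
        switch (s) {
            case NO_NULLS:   return Answer.ALWAYS;
            case SOME_NULLS: return Answer.MAYBE;
            default:         return Answer.NEVER; // ALL_NULLS, or NO_VALUES
        }
    }

    /** A row group can be skipped only when no row can satisfy the predicate. */
    static boolean canSkip(Answer a) {
        return a == Answer.NEVER;
    }

    public static void main(String[] args) {
        for (NullState s : NullState.values()) {
            System.out.println(s + ": IS NULL=" + isNull(s) + ", IS NOT NULL=" + isNotNull(s));
        }
    }
}
```

With only three states, no_values and all_nulls collapse into one, so a reader could not tell an empty row group from one that is entirely null.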

          Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12690444/HIVE-4639.1.patch

          ERROR: -1 due to 32 failed/errored test(s), 6731 tests executed
          Failed tests:

          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_alter_merge_orc
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_alter_merge_stats_orc
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_part
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_table
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_opt_vectorization
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_optimization2
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_extrapolate_part_stats_full
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_extrapolate_part_stats_partial
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_analyze
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_predicate_pushdown
          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorized_ptf
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_alter_merge_orc
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_alter_merge_stats_orc
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_dynpart_sort_opt_vectorization
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_dynpart_sort_optimization2
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_optimize_nullscan
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_orc_analyze
          org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vectorized_ptf
          org.apache.hadoop.hive.ql.io.orc.TestInputOutputFormat.testCombinationInputFormatWithAcid
          org.apache.hadoop.hive.ql.io.orc.TestOrcFile.test1[0]
          org.apache.hadoop.hive.ql.io.orc.TestOrcFile.test1[1]
          org.apache.hadoop.hive.ql.io.orc.TestOrcFile.testReadFormat_0_11[0]
          org.apache.hadoop.hive.ql.io.orc.TestOrcFile.testReadFormat_0_11[1]
          org.apache.hadoop.hive.ql.io.orc.TestOrcFile.testStringAndBinaryStatistics[0]
          org.apache.hadoop.hive.ql.io.orc.TestOrcFile.testStringAndBinaryStatistics[1]
          org.apache.hadoop.hive.ql.io.orc.TestOrcNullOptimization.testColumnsWithNullAndCompression
          org.apache.hadoop.hive.ql.io.orc.TestOrcNullOptimization.testMultiStripeWithNull
          org.apache.hadoop.hive.ql.io.orc.TestOrcNullOptimization.testMultiStripeWithoutNull
          org.apache.hadoop.hive.ql.io.orc.TestOrcSerDeStats.testOrcSerDeStatsComplex
          org.apache.hadoop.hive.ql.io.orc.TestOrcSerDeStats.testOrcSerDeStatsComplexOldFormat
          org.apache.hadoop.hive.ql.io.orc.TestOrcSerDeStats.testSerdeStatsOldFormat
          org.apache.hadoop.hive.ql.io.orc.TestOrcSerDeStats.testStringAndBinaryStatistics
          

          Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2274/testReport
          Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2274/console
          Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2274/

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests exited with: TestsFailedException: 32 tests failed
          

          This message is automatically generated.

          ATTACHMENT ID: 12690444 - PreCommit-HIVE-TRUNK-Build

          Prasanth Jayachandran added a comment -

          Owen O'Malley, are you working on this issue? If not, I can take it over.


            People

            • Assignee:
              Prasanth Jayachandran
              Reporter:
              Owen O'Malley