ORC-397: ORC should allow selectively disabling dictionary-encoding on specified columns

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.3, 1.6.0
    • Component/s: None
    • Labels: None

    Description

      Just as ORC allows the choice of columns to enable bloom-filters on, it would be nice to have a way to specify which columns DICTIONARY_V2 encoding should be disabled on.

      Currently, the choice of dictionary-encoding depends on the results of sampling the first row-stride within a stripe. If the user knows that a column's cardinality is bound to prevent an effective dictionary, she might choose to simply disable it on just that column, and avoid the cost of sampling in the first row-stride.
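The decision described above can be sketched roughly as follows. This is an illustration only, not the ORC writer's code: the function name, the `direct_columns` set, the treatment of an empty sample, and the 0.8 distinct-ratio threshold are all assumptions made for the example.

```python
from typing import Iterable, Set

# Hypothetical threshold: keep the dictionary only when the fraction of
# distinct values in the sample is at or below this value.
DISTINCT_RATIO_THRESHOLD = 0.8


def choose_encoding(column: str,
                    first_stride_values: Iterable[str],
                    direct_columns: Set[str],
                    threshold: float = DISTINCT_RATIO_THRESHOLD) -> str:
    """Pick an encoding name for one column within one stripe."""
    # Columns the user listed skip the sampling step entirely.
    if column in direct_columns:
        return "DIRECT_V2"
    values = list(first_stride_values)
    if not values:
        # Nothing to sample; falling back to direct is an arbitrary choice here.
        return "DIRECT_V2"
    distinct_ratio = len(set(values)) / len(values)
    return "DICTIONARY_V2" if distinct_ratio <= threshold else "DIRECT_V2"
```

With this shape, a column named in `direct_columns` never has its first row-stride inspected, which is the saving the request is after.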

      Attachments

        1. HIVE-18608.1-branch-2.2.patch
          10 kB
          Mithun Radhakrishnan
        2. HIVE-18608.2-branch-2.2.patch
          10 kB
          Mithun Radhakrishnan

          Activity

            mithun Mithun Radhakrishnan added a comment - edited

            I've attached an initial implementation, where dictionary encoding might be disabled via a table-property ('orc.skip.dictionary.for.columns').

            Note: I've only added support for top-level columns. Specifying this on a STRUCT will disable dictionary encoding for the entire sub-tree (i.e. all members of the STRUCT, recursively).

            It might be good to support selection at an arbitrary depth, at a later date.
            E.g. myInfoArray._elem_.emailBody.
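To illustrate the top-level-only behaviour described in this comment, here is a minimal sketch. The `Column` class and both functions are hypothetical and are not taken from the attached patch; they only show how naming a STRUCT would cover its whole sub-tree while a nested name on its own would match nothing.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Column:
    """A node in a hypothetical schema tree (not an ORC class)."""
    name: str
    children: List["Column"] = field(default_factory=list)


def parse_column_list(prop_value: str) -> Set[str]:
    """Split a comma-separated property value into trimmed column names."""
    return {name.strip() for name in prop_value.split(",") if name.strip()}


def direct_encoded_paths(top_level: List[Column], prop_value: str) -> List[str]:
    """List the dotted path of every column whose dictionary is disabled."""
    selected = parse_column_list(prop_value)
    result: List[str] = []

    def visit(col: Column, path: str) -> None:
        # A selected column disables the dictionary for itself and, recursively,
        # for every member below it.
        result.append(path)
        for child in col.children:
            visit(child, f"{path}.{child.name}")

    for col in top_level:
        # Only top-level names are matched against the property value.
        if col.name in selected:
            visit(col, col.name)
    return result
```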

            mithun Mithun Radhakrishnan added a comment -

            (Submitting for tests.)

            omalley Owen O'Malley added a comment -

            I'd suggest making the property:
            orc.column.encoding.direct=col10,col20

            omalley Owen O'Malley added a comment - edited

            I've just opened a jira and a pull request that is useful to this and other changes that need to specify column names.

            https://issues.apache.org/jira/browse/ORC-308

            It allows you to specify subfields by name, such as your example: myInfoArray._elem.emailBody.
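As a rough picture of what selecting a subfield by name involves, the sketch below walks a dotted path through a nested schema. It is not the ORC-308 implementation: the dict-based schema and the `resolve` function are assumptions for the example, with list elements stored under the `_elem` name used in the comment above.

```python
from typing import Dict, Optional, Union

# Hypothetical nested schema: a dict maps field names to sub-schemas and a
# string is a leaf type. A list's element type is stored under "_elem".
Schema = Union[str, Dict[str, "Schema"]]


def resolve(schema: Schema, dotted_path: str) -> Optional[Schema]:
    """Follow a dotted path through the schema; return None if any step fails."""
    node = schema
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node
```

A path that skips a level, such as `myInfoArray.emailBody`, resolves to `None` in this sketch rather than being matched loosely.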

            mithun Mithun Radhakrishnan added a comment -

            Hey, owen.omalley. I've renamed that property, as you've suggested.

            Also, thanks for ORC-308. :]

            hiveqa Hive QA added a comment -

            Here are the results of testing the latest attachment:
            https://issues.apache.org/jira/secure/attachment/12913421/HIVE-18608.2-branch-2.2.patch

            SUCCESS: +1 due to 1 test(s) being added or modified.

            ERROR: -1 due to 59 failed/errored test(s), 9944 tests executed
            Failed tests:

            TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=244)
            TestJdbcDriver2 - did not produce a TEST-*.xml file (likely timed out) (batchId=225)
            TestMiniLlapLocalCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=167)
            	[acid_globallimit.q,alter_merge_2_orc.q]
            TestMiniSparkOnYarnCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=173)
            	[infer_bucket_sort_reducers_power_two.q,list_bucket_dml_10.q,orc_merge9.q,orc_merge6.q,leftsemijoin_mr.q,bucket6.q,bucketmapjoin7.q,uber_reduce.q,empty_dir_in_table.q,vector_outer_join3.q,index_bitmap_auto.q,vector_outer_join2.q,vector_outer_join1.q,orc_merge1.q,orc_merge_diff_fs.q,load_hdfs_file_with_space_in_the_name.q,scriptfile1_win.q,quotedid_smb.q,truncate_column_buckets.q,orc_merge3.q]
            TestMiniSparkOnYarnCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=174)
            	[infer_bucket_sort_num_buckets.q,gen_udf_example_add10.q,insert_overwrite_directory2.q,orc_merge5.q,bucketmapjoin6.q,import_exported_table.q,vector_outer_join0.q,orc_merge4.q,temp_table_external.q,orc_merge_incompat1.q,root_dir_external_table.q,constprog_semijoin.q,auto_sortmerge_join_16.q,schemeAuthority.q,index_bitmap3.q,external_table_with_space_in_location_path.q,parallel_orderby.q,infer_bucket_sort_map_operators.q,bucketizedhiveinputformat.q,remote_script.q]
            TestMiniSparkOnYarnCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=175)
            	[scriptfile1.q,vector_outer_join5.q,file_with_header_footer.q,bucket4.q,input16_cc.q,bucket5.q,infer_bucket_sort_merge.q,constprog_partitioner.q,orc_merge2.q,reduce_deduplicate.q,schemeAuthority2.q,load_fs2.q,orc_merge8.q,orc_merge_incompat2.q,infer_bucket_sort_bucketed_table.q,vector_outer_join4.q,disable_merge_for_bucketing.q,vector_inner_join.q,orc_merge7.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=118)
            	[bucketmapjoin4.q,bucket_map_join_spark4.q,union21.q,groupby2_noskew.q,timestamp_2.q,date_join1.q,mergejoins.q,smb_mapjoin_11.q,auto_sortmerge_join_3.q,mapjoin_test_outer.q,vectorization_9.q,merge2.q,groupby6_noskew.q,auto_join_without_localtask.q,multi_join_union.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=119)
            	[join_cond_pushdown_unqual4.q,union_remove_7.q,join13.q,join_vc.q,groupby_cube1.q,bucket_map_join_spark2.q,sample3.q,smb_mapjoin_19.q,stats16.q,union23.q,union.q,union31.q,cbo_udf_udaf.q,ptf_decimal.q,bucketmapjoin2.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=120)
            	[parallel_join1.q,union27.q,union12.q,groupby7_map_multi_single_reducer.q,varchar_join1.q,join7.q,join_reorder4.q,skewjoinopt2.q,bucketsortoptimize_insert_2.q,smb_mapjoin_17.q,script_env_var1.q,groupby7_map.q,groupby3.q,bucketsortoptimize_insert_8.q,union20.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=121)
            	[ptf_general_queries.q,auto_join_reordering_values.q,sample2.q,join1.q,decimal_join.q,mapjoin_subquery2.q,join32_lessSize.q,mapjoin1.q,order2.q,skewjoinopt18.q,union_remove_18.q,join25.q,groupby9.q,bucketsortoptimize_insert_6.q,ctas.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=122)
            	[groupby_map_ppr.q,nullgroup4_multi_distinct.q,join_rc.q,union14.q,smb_mapjoin_12.q,vector_cast_constant.q,union_remove_4.q,auto_join11.q,load_dyn_part7.q,udaf_collect_set.q,vectorization_12.q,groupby_sort_skew_1.q,groupby_sort_skew_1_23.q,smb_mapjoin_25.q,skewjoinopt12.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=123)
            	[skewjoinopt15.q,auto_join18.q,list_bucket_dml_2.q,input1_limit.q,load_dyn_part3.q,union_remove_14.q,auto_sortmerge_join_14.q,auto_sortmerge_join_15.q,union10.q,bucket_map_join_tez2.q,groupby5_map_skew.q,join_reorder.q,sample1.q,bucketmapjoin8.q,union34.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=124)
            	[avro_joins.q,skewjoinopt16.q,auto_join14.q,vectorization_14.q,auto_join26.q,stats1.q,cbo_stats.q,auto_sortmerge_join_6.q,union22.q,union_remove_24.q,union_view.q,smb_mapjoin_22.q,stats15.q,ptf_matchpath.q,transform_ppr1.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=125)
            	[limit_pushdown2.q,skewjoin_noskew.q,leftsemijoin_mr.q,bucket3.q,skewjoinopt13.q,bucketmapjoin9.q,auto_join15.q,ptf.q,join22.q,vectorized_nested_mapjoin.q,sample4.q,union18.q,multi_insert_gby.q,join33.q,join_cond_pushdown_unqual2.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=126)
            	[vector_decimal_aggregate.q,ppd_join3.q,auto_join23.q,join10.q,union_remove_11.q,union_ppr.q,join32.q,groupby_multi_single_reducer2.q,input18.q,stats3.q,cbo_simple_select.q,parquet_join.q,join26.q,groupby1.q,join_reorder2.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=127)
            	[skewjoinopt19.q,order.q,join_merge_multi_expressions.q,skewjoinopt10.q,insert_into1.q,vectorized_math_funcs.q,vectorization_4.q,vectorization_2.q,skewjoinopt6.q,union_remove_19.q,decimal_1_1.q,join14.q,outer_join_ppr.q,rcfile_bigdata.q,load_dyn_part10.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=128)
            	[skewjoinopt3.q,smb_mapjoin_4.q,timestamp_comparison.q,union_remove_10.q,mapreduce2.q,bucketmapjoin_negative.q,udf_in_file.q,union5.q,auto_join12.q,skewjoin.q,vector_left_outer_join.q,semijoin.q,skewjoinopt9.q,smb_mapjoin_3.q,stats10.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=129)
            	[bucketsortoptimize_insert_4.q,multi_insert_mixed.q,vectorization_10.q,auto_join18_multi_distinct.q,join_cond_pushdown_3.q,custom_input_output_format.q,skewjoinopt5.q,vectorization_part_project.q,vector_count_distinct.q,skewjoinopt4.q,count.q,parallel.q,union33.q,union_lateralview.q,nullgroup4.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=130)
            	[skewjoin_union_remove_2.q,avro_decimal_native.q,skewjoinopt8.q,bucketmapjoin_negative3.q,stats6.q,groupby2_map.q,stats_only_null.q,insert_into3.q,join18_multi_distinct.q,vectorization_6.q,cross_join.q,stats9.q,auto_join7.q,timestamp_1.q,join24.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=131)
            	[auto_join30.q,timestamp_null.q,union32.q,join16.q,groupby_ppr.q,bucketmapjoin7.q,smb_mapjoin_18.q,join19.q,vector_varchar_4.q,union6.q,cbo_subq_in.q,vectorization_part.q,sample8.q,vectorized_timestamp_funcs.q,join_star.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=132)
            	[union_remove_1.q,ppd_outer_join2.q,date_udf.q,groupby1_noskew.q,join20.q,smb_mapjoin_13.q,groupby_rollup1.q,temp_table_gb1.q,vector_string_concat.q,smb_mapjoin_6.q,metadata_only_queries.q,auto_sortmerge_join_12.q,groupby_bigdata.q,groupby3_map_multi_distinct.q,innerjoin.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=133)
            	[groupby_grouping_id2.q,input17.q,bucketmapjoin12.q,ppd_gby_join.q,auto_join10.q,ptf_rcfile.q,vector_elt.q,multi_insert.q,ppd_join5.q,ppd_join.q,join_filters_overlap.q,join_cond_pushdown_1.q,timestamp_3.q,load_dyn_part6.q,stats_noscan_2.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=134)
            	[tez_joins_explain.q,vectorized_rcfile_columnar.q,transform2.q,cbo_semijoin.q,bucketmapjoin13.q,union_remove_6_subq.q,groupby2_map_multi_distinct.q,load_dyn_part9.q,multi_insert_gby2.q,vectorization_11.q,groupby_position.q,avro_compression_enabled_native.q,smb_mapjoin_8.q,join21.q,auto_join16.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=135)
            	[enforce_order.q,smb_mapjoin_21.q,load_dyn_part15.q,udf_min.q,groupby_resolution.q,mapjoin_memcheck.q,subquery_exists.q,groupby5.q,join27.q,alter_merge_stats_orc.q,union_remove_2.q,vector_orderby_5.q,groupby6_map_skew.q,join12.q,union9.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=136)
            	[vectorization_16.q,join_casesensitive.q,transform_ppr2.q,join23.q,groupby7_map_skew.q,ppd_join2.q,ppd_outer_join5.q,create_merge_compressed.q,louter_join_ppr.q,sample9.q,smb_mapjoin_16.q,vectorization_not.q,having.q,ppd_outer_join1.q,union_remove_12.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=137)
            	[bucketmapjoin3.q,load_dyn_part5.q,union_date.q,cbo_gby.q,auto_join31.q,auto_sortmerge_join_1.q,join_cond_pushdown_unqual1.q,ppd_outer_join3.q,bucket_map_join_spark3.q,union28.q,statsfs.q,escape_sortby1.q,leftsemijoin.q,union_remove_6.q,join29.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=138)
            	[escape_distributeby1.q,join9.q,groupby2.q,groupby4_map.q,udf_max.q,vectorization_pushdown.q,cbo_gby_empty.q,join_cond_pushdown_unqual3.q,vectorization_short_regress.q,join8.q,sample10.q,cross_product_check_1.q,auto_join_stats.q,input_part2.q,groupby_multi_single_reducer3.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=139)
            	[groupby_map_ppr_multi_distinct.q,vectorization_13.q,mapjoin_mapjoin.q,union2.q,join41.q,groupby8_map.q,cbo_subq_not_in.q,identity_project_remove_skip.q,stats5.q,groupby8_map_skew.q,nullgroup2.q,mapjoin_subquery.q,bucket2.q,smb_mapjoin_1.q,union_remove_8.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=140)
            	[join39.q,bucketsortoptimize_insert_7.q,vector_distinct_2.q,bucketmapjoin10.q,join11.q,union13.q,auto_sortmerge_join_16.q,windowing.q,union_remove_3.q,skewjoinopt7.q,stats7.q,annotate_stats_join.q,multi_insert_lateral_view.q,ptf_streaming.q,join_1to1.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=141)
            	[timestamp_lazy.q,union29.q,runtime_skewjoin_mapjoin_spark.q,auto_join22.q,union8.q,groupby5_map.q,dynamic_rdd_cache.q,auto_join29.q,groupby6.q,merge1.q,mapjoin_distinct.q,vector_decimal_mapjoin.q,sample5.q,multi_insert_move_tasks_share_dependencies.q,join_array.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=142)
            	[load_dyn_part2.q,smb_mapjoin_7.q,vectorization_5.q,smb_mapjoin_2.q,ppd_join_filter.q,column_access_stats.q,stats0.q,vector_between_in.q,vectorized_string_funcs.q,bucket_map_join_2.q,groupby4_map_skew.q,groupby_ppr_multi_distinct.q,temp_table_join1.q,vectorized_case.q,stats_noscan_1.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=143)
            	[groupby4_noskew.q,groupby3_map_skew.q,join_cond_pushdown_2.q,union19.q,union24.q,union_remove_5.q,groupby7_noskew_multi_single_reducer.q,vectorization_1.q,index_auto_self_join.q,auto_smb_mapjoin_14.q,script_env_var2.q,pcr.q,auto_join_filters.q,join0.q,join37.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=144)
            	[stats12.q,groupby4.q,union_top_level.q,stats2.q,groupby10.q,mapjoin_filter_on_outerjoin.q,auto_sortmerge_join_4.q,limit_partition_metadataonly.q,load_dyn_part4.q,union3.q,groupby_multi_single_reducer.q,smb_mapjoin_14.q,groupby3_noskew_multi_distinct.q,stats18.q,union_remove_21.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=145)
            	[auto_sortmerge_join_13.q,join4.q,join35.q,udf_percentile.q,join_reorder3.q,subquery_in.q,auto_join19.q,stats14.q,vectorization_15.q,union7.q,vectorization_nested_udf.q,vector_groupby_3.q,vectorized_ptf.q,auto_join2.q,groupby1_map_skew.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=146)
            	[groupby3_map.q,union26.q,mapreduce1.q,mapjoin_addjar.q,bucket_map_join_spark1.q,udf_example_add.q,multi_insert_with_join.q,sample7.q,auto_join_nulls.q,ppd_outer_join4.q,load_dyn_part8.q,alter_merge_orc.q,sample6.q,bucket_map_join_1.q,auto_sortmerge_join_9.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=147)
            	[groupby_complex_types.q,multigroupby_singlemr.q,union11.q,groupby7.q,join5.q,bucketmapjoin_negative2.q,vectorization_div0.q,union_script.q,add_part_multiple.q,limit_pushdown.q,union_remove_17.q,uniquejoin.q,metadata_only_queries_with_filters.q,union25.q,load_dyn_part13.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=148)
            	[table_access_keys_stats.q,bucketmapjoin11.q,auto_join4.q,mapjoin_decimal.q,join34.q,nullgroup.q,mergejoins_mixed.q,sort.q,stats8.q,auto_join28.q,join17.q,union17.q,skewjoinopt11.q,groupby1_map.q,load_dyn_part11.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=149)
            	[ptf_seqfile.q,union_remove_23.q,parallel_join0.q,union_remove_9.q,join_nullsafe.q,skewjoinopt14.q,vectorized_mapjoin.q,union4.q,auto_join5.q,vectorized_shufflejoin.q,smb_mapjoin_20.q,groupby8_noskew.q,auto_sortmerge_join_10.q,groupby11.q,union_remove_16.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=150)
            	[smb_mapjoin_15.q,script_pipe.q,auto_join24.q,filter_join_breaktask.q,bucket4.q,ppd_multi_insert.q,skewjoinopt20.q,join_thrift.q,multi_insert_gby3.q,groupby8.q,join_map_ppr.q,auto_sortmerge_join_8.q,escape_clusterby1.q,groupby_multi_insert_common_distinct.q,join6.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=151)
            	[ppd_transform.q,auto_join9.q,auto_join1.q,vector_data_types.q,input13.q,input14.q,input12.q,union_remove_22.q,vectorization_3.q,groupby1_map_nomap.q,cbo_union.q,disable_merge_for_bucketing.q,reduce_deduplicate_exclude_join.q,filter_join_breaktask2.q,join30.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=152)
            	[router_join_ppr.q,auto_join13.q,union30.q,vector_mapjoin_reduce.q,ptf_register_tblfn.q,join_merging.q,union_date_trim.q,groupby3_noskew.q,optimize_nullscan.q,join3.q,join38.q,skewjoinopt1.q,join_alt_syntax.q,groupby_sort_1_23.q,timestamp_udf.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=153)
            	[groupby6_map.q,stats13.q,groupby2_noskew_multi_distinct.q,load_dyn_part12.q,join15.q,auto_join17.q,join_hive_626.q,tez_join_tests.q,auto_join21.q,join_view.q,join_cond_pushdown_4.q,vectorization_0.q,union_null.q,auto_join3.q,vectorization_decimal_date.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=154)
            	[union_remove_15.q,bucket_map_join_tez1.q,scriptfile1.q,groupby7_noskew.q,bucketmapjoin1.q,subquery_multiinsert.q,auto_join8.q,auto_join6.q,groupby2_map_skew.q,lateral_view_explode2.q,join28.q,load_dyn_part1.q,skewjoinopt17.q,union_remove_20.q,bucketmapjoin5.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=155)
            	[join2.q,join36.q,avro_joins_native.q,join18.q,smb_mapjoin_10.q,temp_table.q,union_remove_13.q,auto_sortmerge_join_5.q,groupby5_noskew.q,auto_join0.q,vectorization_17.q,auto_join_stats2.q,skewjoin_union_remove_1.q,union16.q,join_literals.q]
            TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=156)
            	[auto_sortmerge_join_7.q,auto_join20.q,smb_mapjoin_5.q,vector_char_4.q,cross_product_check_2.q,union15.q,union_remove_25.q,insert_into2.q,join31.q,auto_join27.q,escape_orderby1.q,cbo_limit.q,stats_partscan_1_23.q,groupby_complex_types_multi_single_reducer.q,load_dyn_part14.q]
            TestSparkNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=242)
            org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_globallimit] (batchId=27)
            org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[avrocountemptytbl] (batchId=74)
            org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[columnStatsUpdateForStatsOptimizer_1] (batchId=31)
            org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[selectindate] (batchId=57)
            org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[union_fast_stats] (batchId=47)
            org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_number_compare_projection] (batchId=10)
            org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_5] (batchId=94)
            org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3] (batchId=109)
            org.apache.hadoop.hive.ql.TestTxnCommands2.testNonAcidToAcidConversion02 (batchId=268)
            org.apache.hive.beeline.TestBeeLineWithArgs.testQueryProgressParallel (batchId=222)
            org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJarWithoutAddDriverClazz[0] (batchId=181)
            org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[0] (batchId=181)
            org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[1] (batchId=181)
            

            Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/9547/testReport
            Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/9547/console
            Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-9547/

            Messages:

            Executing org.apache.hive.ptest.execution.TestCheckPhase
            Executing org.apache.hive.ptest.execution.PrepPhase
            Executing org.apache.hive.ptest.execution.YetusPhase
            Executing org.apache.hive.ptest.execution.ExecutionPhase
            Executing org.apache.hive.ptest.execution.ReportingPhase
            Tests exited with: TestsFailedException: 59 tests failed
            

            This message is automatically generated.

            ATTACHMENT ID: 12913421 - PreCommit-HIVE-Build

[join2.q,join36.q,avro_joins_native.q,join18.q,smb_mapjoin_10.q,temp_table.q,union_remove_13.q,auto_sortmerge_join_5.q,groupby5_noskew.q,auto_join0.q,vectorization_17.q,auto_join_stats2.q,skewjoin_union_remove_1.q,union16.q,join_literals.q] TestSparkCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=156) [auto_sortmerge_join_7.q,auto_join20.q,smb_mapjoin_5.q,vector_char_4.q,cross_product_check_2.q,union15.q,union_remove_25.q,insert_into2.q,join31.q,auto_join27.q,escape_orderby1.q,cbo_limit.q,stats_partscan_1_23.q,groupby_complex_types_multi_single_reducer.q,load_dyn_part14.q] TestSparkNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=242) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_globallimit] (batchId=27) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[avrocountemptytbl] (batchId=74) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[columnStatsUpdateForStatsOptimizer_1] (batchId=31) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[selectindate] (batchId=57) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[union_fast_stats] (batchId=47) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_number_compare_projection] (batchId=10) org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_5] (batchId=94) org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3] (batchId=109) org.apache.hadoop.hive.ql.TestTxnCommands2.testNonAcidToAcidConversion02 (batchId=268) org.apache.hive.beeline.TestBeeLineWithArgs.testQueryProgressParallel (batchId=222) org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJarWithoutAddDriverClazz[0] (batchId=181) org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[0] (batchId=181) org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[1] (batchId=181) Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/9547/testReport Console output: 
https://builds.apache.org/job/PreCommit-HIVE-Build/9547/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-9547/ Messages: Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.YetusPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 59 tests failed This message is automatically generated. ATTACHMENT ID: 12913421 - PreCommit-HIVE-Build
            githubbot ASF GitHub Bot added a comment -

            GitHub user omalley opened a pull request:

            https://github.com/apache/orc/pull/304

            ORC-397. Allow selective disabling of dictionary encoding.

            I've forward-ported the patch from Mithun.

            You can merge this pull request into a Git repository by running:

            $ git pull https://github.com/omalley/orc orc-397

            Alternatively you can review and apply these changes as the patch at:

            https://github.com/apache/orc/pull/304.patch

            To close this pull request, make a commit to your master/trunk branch
            with (at least) the following in the commit message:

            This closes #304


            commit b00a52e668efa9fb06821d7da52706ace81e31c0
            Author: Owen O'Malley <omalley@...>
            Date: 2018-08-28T23:12:05Z

            ORC-397. Allow selective disabling of dictionary encoding.


            githubbot ASF GitHub Bot added a comment -

            Github user omalley commented on the issue:

            https://github.com/apache/orc/pull/304

            Oh, rather than fix WriterImplV2 I simplified it to remove the duplicated code.

            githubbot ASF GitHub Bot added a comment -

            Github user omalley commented on the issue:

            https://github.com/apache/orc/pull/304

            I pushed an update that simplified the test case a bit.

            githubbot ASF GitHub Bot added a comment -

            Github user wgtmac commented on a diff in the pull request:

            https://github.com/apache/orc/pull/304#discussion_r213542675

            — Diff: java/core/src/test/org/apache/orc/TestStringDictionary.java —
            @@ -409,4 +411,77 @@ public void testTooManyDistinctV11AlwaysDictionary() throws Exception {

            }

            + /**
            + * Test that dictionaries can be disabled, per column. In this test, we want to disable DICTIONARY_V2 for the
            + * `longString` column (presumably for a low hit-ratio), while preserving DICTIONARY_V2 for `shortString`.
            + * @throws Exception on unexpected failure
            + */
            + @Test
            + public void testDisableDictionaryForSpecificColumn() throws Exception {
            + final String SHORT_STRING_VALUE = "foo";
            + final String LONG_STRING_VALUE = "BAAAAAAAAR!!";
            +
            + TypeDescription schema =
            + TypeDescription.fromString("struct<shortString:string,longString:string>");
            +
            + Writer writer = OrcFile.createWriter(
            + testFilePath,
            + OrcFile.writerOptions(conf).setSchema(schema)
            + .compress(CompressionKind.NONE)
            + .bufferSize(10000)
            + .directEncodingColumns("longString"));
            — End diff –

            Would it be better to support specifying the columns that use dictionary encoding?

            githubbot ASF GitHub Bot added a comment -

            Github user omalley commented on a diff in the pull request:

            https://github.com/apache/orc/pull/304#discussion_r213566036

            — Diff: java/core/src/test/org/apache/orc/TestStringDictionary.java —
            @@ -409,4 +411,77 @@ public void testTooManyDistinctV11AlwaysDictionary() throws Exception {

            }

            + /**
            + * Test that dictionaries can be disabled, per column. In this test, we want to disable DICTIONARY_V2 for the
            + * `longString` column (presumably for a low hit-ratio), while preserving DICTIONARY_V2 for `shortString`.
            + * @throws Exception on unexpected failure
            + */
            + @Test
            + public void testDisableDictionaryForSpecificColumn() throws Exception {
            + final String SHORT_STRING_VALUE = "foo";
            + final String LONG_STRING_VALUE = "BAAAAAAAAR!!";
            +
            + TypeDescription schema =
            + TypeDescription.fromString("struct<shortString:string,longString:string>");
            +
            + Writer writer = OrcFile.createWriter(
            + testFilePath,
            + OrcFile.writerOptions(conf).setSchema(schema)
            + .compress(CompressionKind.NONE)
            + .bufferSize(10000)
            + .directEncodingColumns("longString"));
            — End diff –

            We shouldn't change the default behavior of deciding between direct and dictionary encoding based on the data. Because the parameter lists the columns that are forced to be direct, its default is the empty string, which is straightforward.

            Does that make sense?
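
            The point about the empty-string default can be illustrated with a small sketch. It is not ORC's implementation: the class `DirectEncodingColumns`, its methods, and the comma-separated list format are assumptions made only for illustration. It shows why an opt-out list leaves the data-driven choice untouched when the option is unset: an empty value selects no columns, so every column still goes through the usual direct/dictionary decision.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

/**
 * Hypothetical helper, not part of ORC: models an option that names the
 * columns forced to direct encoding (assumed here to be comma-separated).
 */
final class DirectEncodingColumns {
  private DirectEncodingColumns() {}

  /** Parses the option value; null or empty means "force nothing". */
  static Set<String> parse(String value) {
    if (value == null || value.trim().isEmpty()) {
      return new LinkedHashSet<>();
    }
    return Arrays.stream(value.split(","))
        .map(String::trim)
        .filter(name -> !name.isEmpty())
        .collect(Collectors.toCollection(LinkedHashSet::new));
  }

  /** True when the column is listed, i.e. the dictionary attempt is skipped. */
  static boolean isForcedDirect(Set<String> forced, String column) {
    return forced.contains(column);
  }

  public static void main(String[] args) {
    Set<String> forced = parse("longString");
    System.out.println(isForcedDirect(forced, "longString"));
    System.out.println(isForcedDirect(forced, "shortString"));
    // With the default (empty) value no column is forced to direct encoding.
    System.out.println(parse("").isEmpty());
  }
}
```

            With an opt-in list of dictionary columns instead, an empty default would have to mean either "no dictionaries anywhere" or "fall back to sampling", which is the ambiguity the opt-out form avoids.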

            githubbot ASF GitHub Bot added a comment -

            Github user wgtmac commented on a diff in the pull request:

            https://github.com/apache/orc/pull/304#discussion_r213744153

            — Diff: java/core/src/test/org/apache/orc/TestStringDictionary.java —
            @@ -409,4 +411,77 @@ public void testTooManyDistinctV11AlwaysDictionary() throws Exception {

            }

            + /**
            + * Test that dictionaries can be disabled, per column. In this test, we want to disable DICTIONARY_V2 for the
            + * `longString` column (presumably for a low hit-ratio), while preserving DICTIONARY_V2 for `shortString`.
            + * @throws Exception on unexpected failure
            + */
            + @Test
            + public void testDisableDictionaryForSpecificColumn() throws Exception {
            + final String SHORT_STRING_VALUE = "foo";
            + final String LONG_STRING_VALUE = "BAAAAAAAAR!!";
            +
            + TypeDescription schema =
            + TypeDescription.fromString("struct<shortString:string,longString:string>");
            +
            + Writer writer = OrcFile.createWriter(
            + testFilePath,
            + OrcFile.writerOptions(conf).setSchema(schema)
            + .compress(CompressionKind.NONE)
            + .bufferSize(10000)
            + .directEncodingColumns("longString"));
            — End diff –

            That makes sense. I will also port the current dictionary encoding to the C++ writer shortly.
            BTW, we plan to do some testing of a global dictionary that is shared by all stripes in a file. Can we come up with a design for it in ORC V2? I can propose a prototype after gathering some experimental results.

            githubbot ASF GitHub Bot added a comment -

            Github user asfgit closed the pull request at:

            https://github.com/apache/orc/pull/304

            omalley Owen O'Malley added a comment -

            Released in ORC 1.5.3

            leftyl Lefty Leverenz added a comment -

            Has orc.column.encoding.direct been documented in the wiki?


            People

              mithun Mithun Radhakrishnan
              mithun Mithun Radhakrishnan