Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-14970

repeated insert into is broken for buckets (incorrect results for tablesample, BucketingSortingReduceSinkOptimizer)

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • None
    • None
    • None
    • None

    Description

      Running on a regular CLI driver

      CREATE TABLE src_bucket(key STRING, value STRING) CLUSTERED BY (key) SORTED BY (key) INTO 2 BUCKETS;
      
      insert into table src_bucket select key,value from srcpart limit 10;
      dfs -ls ${hiveconf:hive.metastore.warehouse.dir}/src_bucket/;
      select *, INPUT__FILE__NAME from src_bucket;
      select * from src_bucket tablesample (bucket 1 out of 2) s;
      select * from src_bucket tablesample (bucket 2 out of 2) s;
      
      insert into table src_bucket select key,value from srcpart limit 10;
      dfs -ls ${hiveconf:hive.metastore.warehouse.dir}/src_bucket/;
      select *, INPUT__FILE__NAME from src_bucket;
      select * from src_bucket tablesample (bucket 1 out of 2) s;
      select * from src_bucket tablesample (bucket 2 out of 2) s;
      

      Results in the following (with masking disabled and grepping away the noise).
      Looks like bucket mapping completely breaks due to extra files, which may have implications for all the optimizations that depend on them.
      This should work or at least fail if this is not supported.

      PREHOOK: query: CREATE TABLE src_bucket(key STRING, value STRING) CLUSTERED BY (key) SORTED BY (key) INTO 2 BUCKETS
      PREHOOK: query: insert into table src_bucket select key,value from srcpart limit 10
      Found 2 items
      -rwxr-xr-x   1 sergey staff         46 2016-10-14 16:09 pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000000_0
      -rwxr-xr-x   1 sergey staff         68 2016-10-14 16:09 pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0
      PREHOOK: query: select *, INPUT__FILE__NAME from src_bucket
      165	val_165	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000000_0
      255	val_255	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000000_0
      484	val_484	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000000_0
      86	val_86	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000000_0
      238	val_238	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0
      27	val_27	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0
      278	val_278	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0
      311	val_311	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0
      409	val_409	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0
      98	val_98	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0
      PREHOOK: query: select * from src_bucket tablesample (bucket 1 out of 2) s
      165	val_165
      255	val_255
      484	val_484
      86	val_86
      PREHOOK: query: select * from src_bucket tablesample (bucket 2 out of 2) s
      238	val_238
      27	val_27
      278	val_278
      311	val_311
      409	val_409
      98	val_98
      

      So far so good.

      PREHOOK: query: insert into table src_bucket select key,value from srcpart limit 10
      Found 4 items
      -rwxr-xr-x   1 sergey staff         46 2016-10-14 16:09 pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000000_0
      -rwxr-xr-x   1 sergey staff         46 2016-10-14 16:09 pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000000_0_copy_1
      -rwxr-xr-x   1 sergey staff         68 2016-10-14 16:09 pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0
      -rwxr-xr-x   1 sergey staff         68 2016-10-14 16:09 pfile:///Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0_copy_1
      PREHOOK: query: select *, INPUT__FILE__NAME from src_bucket
      165	val_165	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000000_0
      255	val_255	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000000_0
      484	val_484	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000000_0
      86	val_86	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000000_0
      165	val_165	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000000_0_copy_1
      255	val_255	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000000_0_copy_1
      484	val_484	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000000_0_copy_1
      86	val_86	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000000_0_copy_1
      238	val_238	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0
      27	val_27	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0
      278	val_278	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0
      311	val_311	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0
      409	val_409	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0
      98	val_98	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0
      238	val_238	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0_copy_1
      27	val_27	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0_copy_1
      278	val_278	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0_copy_1
      311	val_311	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0_copy_1
      409	val_409	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0_copy_1
      98	val_98	pfile:/Users/sergey/git/hive/itests/qtest/target/warehouse/src_bucket/000001_0_copy_1
      PREHOOK: query: select * from src_bucket tablesample (bucket 1 out of 2) s
      165	val_165
      255	val_255
      484	val_484
      86	val_86
      PREHOOK: query: select * from src_bucket tablesample (bucket 2 out of 2) s
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            sershe Sergey Shelukhin
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: