Details
- Type: Bug
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
- Affects Version/s: 0.14.0
- None
Description
create table T(a int, b int) clustered by (a) into 2 buckets stored as orc TBLPROPERTIES('transactional'='false')
insert into T(a,b) values(1,2)
insert into T(a,b) values(1,3)
alter table T SET TBLPROPERTIES ('transactional'='true')
We should now have bucket files 000001_0 and 000001_0_copy_1, but OrcRawRecordMerger.OriginalReaderPair.next() doesn't know that copy_N files can exist and numbers the rows of each file in a bucket starting from 0, thus generating duplicate ROW__IDs.
select ROW__ID, INPUT__FILE__NAME, a, b from T
produces
{"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0,1,2 {"transactionid\":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0_copy_1,1,3
[~owen.omalley], do you have any thoughts on a good way to handle this?
The attached patch has a few changes to make the ACID code even recognize copy_N files, but this is just a prerequisite. The new UT demonstrates the issue.
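For illustration only (this is not the actual patch, and it does not use Hive's internal APIs), here is a minimal standalone Java sketch of the idea a fix has to implement: rowids within a bucket must keep increasing across the base bucket file and its copy_N files instead of restarting at 0 in each file. The class name, file names and row counts below are hypothetical.
{noformat}
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch only: rowids must be assigned contiguously per bucket, across
 * 000001_0, 000001_0_copy_1, ... The real logic belongs in
 * OrcRawRecordMerger; everything here is illustrative.
 */
public class CopyFileRowIdSketch {

  /** Starting rowid for each file, numbering rows contiguously within the bucket. */
  static Map<String, Long> startingRowIds(List<String> filesInBucketOrder,
                                          Map<String, Long> rowCounts) {
    Map<String, Long> offsets = new LinkedHashMap<>();
    long nextRowId = 0;                  // keeps growing across copy_N files
    for (String file : filesInBucketOrder) {
      offsets.put(file, nextRowId);      // first rowid used for rows of this file
      nextRowId += rowCounts.get(file);  // this file occupies [offset, offset + rowCount)
    }
    return offsets;
  }

  public static void main(String[] args) {
    // Hypothetical bucket 1 with one row per file, mirroring the example above.
    List<String> files = List.of("000001_0", "000001_0_copy_1");
    Map<String, Long> counts = Map.of("000001_0", 1L, "000001_0_copy_1", 1L);
    // Prints 000001_0 -> 0 and 000001_0_copy_1 -> 1 instead of two rowid 0s.
    startingRowIds(files, counts).forEach((f, off) -> System.out.println(f + " -> " + off));
  }
}
{noformat}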
Furthermore,
alter table T compact 'major'
select ROW__ID, INPUT__FILE__NAME, a, b from T order by b
produces
{"transactionid":0,"bucketid":1,"rowid":0} file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands....warehouse/nonacidorctbl/base_-9223372036854775808/bucket_00001 1 2
HIVE-16177.04.patch includes TestTxnCommands.testNonAcidToAcidConversion0(), which demonstrates this.
This is because the compactor doesn't handle copy_N files either (it simply skips them).
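As an aside (not part of the original report), why such files get skipped can be pictured with a small standalone Java example: a file-name pattern that only recognizes plain bucket files never matches a copy_N name, so a reader or compactor built on such a pattern silently ignores those files. The regexes here are illustrative and are not the ones used by Hive's AcidUtils.
{noformat}
import java.util.List;
import java.util.regex.Pattern;

/** Sketch only: plain-bucket matching vs. matching that also accepts copy_N files. */
public class CopyFileMatchingSketch {
  // Matches only names like 000001_0.
  private static final Pattern PLAIN_BUCKET = Pattern.compile("^\\d{6}_\\d+$");
  // Also matches names like 000001_0_copy_1.
  private static final Pattern BUCKET_OR_COPY = Pattern.compile("^\\d{6}_\\d+(_copy_\\d+)?$");

  public static void main(String[] args) {
    for (String f : List.of("000001_0", "000001_0_copy_1")) {
      System.out.printf("%s plain=%b withCopy=%b%n", f,
          PLAIN_BUCKET.matcher(f).matches(), BUCKET_OR_COPY.matcher(f).matches());
    }
    // 000001_0 matches both patterns; 000001_0_copy_1 matches only the second,
    // so code that relies on the first pattern never even sees it.
  }
}
{noformat}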
Attachments
Issue Links
- blocks
-
HIVE-17069 Refactor OrcRawRecrodMerger.ReaderPair
- Closed
- is related to
-
HIVE-17526 Disable conversion to ACID if table has _copy_N files on branch-1
- Resolved
-
HIVE-16732 Transactional tables should block LOAD DATA
- Closed
- relates to
-
HIVE-15899 Make CTAS with acid target table and insert into acid_tbl select ... union all ... work
- Closed
-
HIVE-12724 ACID: Major compaction fails to include the original bucket files into MR job
- Closed
-
HIVE-13961 ACID: Major compaction fails to include the original bucket files if there's no delta directory
- Closed
-
HIVE-14366 Conversion of a Non-ACID table to an ACID table produces non-unique primary keys
- Closed
-
HIVE-11525 Bucket pruning
- Closed
- requires
-
HIVE-16964 _orc_acid_version file is missing
- Closed
- links to
Activity
Field | Original Value | New Value |
---|---|---|
Attachment | HIVE-16177.01.patch [ 12857391 ] |
Description |
insert into T(a,b) values(1,2) insert into T(a,b) values(1,3) //we should now have bucket files 000001_0 and 000001_0_copy_1 but OrcRawRecordMerger.OriginalReaderPair.next() doesn't know that there can be copy_N files and numbers rows in each bucket from 0 thus generating duplicate IDs [~owen.omalley], do you have any thoughts on a good way to handle this? attached patch has a few changes to make Acid even recognize copy_N but this is just a pre-requisite. The new UT demonstrates the issue. |
{noformat} create table T(a int, b int) clustered by (a) into 2 buckets stored as orc TBLPROPERTIES('transactional'='false') insert into T(a,b) values(1,2) insert into T(a,b) values(1,3) alter table T SET TBLPROPERTIES ('transactional'='true') {noformat} //we should now have bucket files 000001_0 and 000001_0_copy_1 but OrcRawRecordMerger.OriginalReaderPair.next() doesn't know that there can be copy_N files and numbers rows in each bucket from 0 thus generating duplicate IDs {noformat} select ROW__ID, INPUT__FILE__NAME, a, b from T {noformat} produces {noformat} {"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0,1,2 {"transactionid\":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0_copy_1,1,3 {noformat} [~owen.omalley], do you have any thoughts on a good way to handle this? attached patch has a few changes to make Acid even recognize copy_N but this is just a pre-requisite. The new UT demonstrates the issue. |
Attachment | HIVE-16177.02.patch [ 12857407 ] |
Attachment | HIVE-16177.02.patch [ 12857410 ] |
Attachment | |
Assignee | Eugene Koifman [ ekoifman ] |
Link | This issue relates to |
Link | This issue relates to |
Link | This issue relates to |
Priority | Critical [ 2 ] | Blocker [ 1 ] |
Affects Version/s | 0.14.0 [ 12326450 ] |
Description |
{noformat} create table T(a int, b int) clustered by (a) into 2 buckets stored as orc TBLPROPERTIES('transactional'='false') insert into T(a,b) values(1,2) insert into T(a,b) values(1,3) alter table T SET TBLPROPERTIES ('transactional'='true') {noformat} //we should now have bucket files 000001_0 and 000001_0_copy_1 but OrcRawRecordMerger.OriginalReaderPair.next() doesn't know that there can be copy_N files and numbers rows in each bucket from 0 thus generating duplicate IDs {noformat} select ROW__ID, INPUT__FILE__NAME, a, b from T {noformat} produces {noformat} {"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0,1,2 {"transactionid\":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0_copy_1,1,3 {noformat} [~owen.omalley], do you have any thoughts on a good way to handle this? attached patch has a few changes to make Acid even recognize copy_N but this is just a pre-requisite. The new UT demonstrates the issue. |
{noformat} create table T(a int, b int) clustered by (a) into 2 buckets stored as orc TBLPROPERTIES('transactional'='false') insert into T(a,b) values(1,2) insert into T(a,b) values(1,3) alter table T SET TBLPROPERTIES ('transactional'='true') {noformat} //we should now have bucket files 000001_0 and 000001_0_copy_1 but OrcRawRecordMerger.OriginalReaderPair.next() doesn't know that there can be copy_N files and numbers rows in each bucket from 0 thus generating duplicate IDs {noformat} select ROW__ID, INPUT__FILE__NAME, a, b from T {noformat} produces {noformat} {"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0,1,2 {"transactionid\":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0_copy_1,1,3 {noformat} [~owen.omalley], do you have any thoughts on a good way to handle this? attached patch has a few changes to make Acid even recognize copy_N but this is just a pre-requisite. The new UT demonstrates the issue. Futhermore, {noformat} alter table T compact 'major' select ROW__ID, INPUT__FILE__NAME, a, b from T order by b {noformat} produces {noformat} {"transactionid":0,"bucketid":1,"rowid":0} file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands....warehouse/nonacidorctbl/base_-9223372036854775808/bucket_00001 1 2 {noformat} This is because compactor doesn't handle copy_N files either (skips them) |
Attachment | HIVE-16177.04.patch [ 12858916 ] |
Link | This issue is broken by |
Link | This issue is broken by |
Link | This issue relates to |
Link | This issue relates to |
Attachment | HIVE-16177.07.patch [ 12868598 ] |
Status | Open [ 1 ] | Patch Available [ 10002 ] |
Attachment | HIVE-16177.08.patch [ 12868654 ] |
Attachment | HIVE-16177.09.patch [ 12868978 ] |
Attachment | HIVE-16177.10.patch [ 12869080 ] |
Attachment | HIVE-16177.11.patch [ 12869126 ] |
Attachment | HIVE-16177.14.patch [ 12869322 ] |
Link | This issue is related to |
Attachment | HIVE-16177.15.patch [ 12869348 ] |
Remote Link | This issue links to "Review Board (Web Link)" [ 83283 ] |
Attachment | HIVE-16177.16.patch [ 12874345 ] |
Link | This issue requires |
Link | This issue blocks |
Attachment | HIVE-16177.17.patch [ 12876519 ] |
Attachment | |
Attachment | HIVE-16177.17.patch [ 12876521 ] |
Attachment | HIVE-16177.18.patch [ 12876537 ] |
Attachment | HIVE-16177.18-branch-2.patch [ 12876662 ] |
Attachment | HIVE-16177.19-branch-2.patch [ 12876695 ] |
Attachment | HIVE-16177.20-branch-2.patch [ 12876712 ] |
Fix Version/s | 3.0.0 [ 12340268 ] | |
Fix Version/s | 2.4.0 [ 12340338 ] | |
Resolution | Fixed [ 1 ] | |
Status | Patch Available [ 10002 ] | Resolved [ 5 ] |
Link | This issue is related to |
Status | Resolved [ 5 ] | Closed [ 6 ] |
Workflow | no-reopen-closed, patch-avail [ 13265153 ] | Hive - no-reopen-closed, patch-avail [ 14131622 ] |
Bucket handling in Hive in general is completely screwed up, and inconsistent in different places (e.g. sampling and, IIRC, some other code would just take files in order, regardless of names, even when there are fewer or more files than expected).
Maybe there needs to be some work to enforce it better via some central utility or manager class that would get all the files for a bucket and validate buckets more strictly.
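As a rough sketch of that idea (not Hive code; the class name and pattern are hypothetical), such a utility could group every data file by the bucket id parsed from its name, accept copy_N files, and warn about buckets with no files or names it does not recognize:
{noformat}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: a central place that collects and validates bucket files. */
public class BucketLayoutSketch {
  // Accepts 000001_0 as well as 000001_0_copy_1 style names.
  private static final Pattern BUCKET_FILE = Pattern.compile("^(\\d{6})_\\d+(_copy_\\d+)?$");

  static Map<Integer, List<String>> groupByBucket(List<String> fileNames, int numBuckets) {
    Map<Integer, List<String>> byBucket = new TreeMap<>();
    for (String name : fileNames) {
      Matcher m = BUCKET_FILE.matcher(name);
      if (!m.matches()) {
        throw new IllegalStateException("Unrecognized data file name: " + name);
      }
      int bucket = Integer.parseInt(m.group(1));             // bucket id from the prefix
      byBucket.computeIfAbsent(bucket, k -> new ArrayList<>()).add(name);
    }
    for (int b = 0; b < numBuckets; b++) {                   // strict validation step
      if (!byBucket.containsKey(b)) {
        System.out.println("warning: no file found for bucket " + b);
      }
    }
    return byBucket;
  }

  public static void main(String[] args) {
    // Hypothetical listing for a table declared with 2 buckets.
    System.out.println(groupByBucket(List.of("000001_0", "000001_0_copy_1"), 2));
    // Prints a warning for bucket 0 and groups both files under bucket 1.
  }
}
{noformat}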