Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
1.4.6
-
None
-
None
-
None
-
Patch
Description
This is issue can be found in any kind of RMDBS, because the root cause is not on RMDBS. Steps to reproduce this issue:
1. create a mysql table: create table ora_test (id varchar(32) primary key not null);
2. insert 5 rows:
insert into ora_test values ('08125FC4C8FDA064E053C0A8028DA064');
insert into ora_test values ('4FFE68419D3502E2E0537F000001F3E8');
insert into ora_test values ('4FFF9CF5861E003EE0537F0000017FF7');
insert into ora_test values ('56DAC2D0F14901B0E0537F000001D3FA');
insert into ora_test values ('4 ABC');
3. import it to hive with sqoop import -m 32. (m=189 is also ok)。 Then you will get 7 rows in hive. Check screenshot-1.png
part-32 is duplicated with part-26.
so I print their split boundary values in unicode and plain text, check screenshot-2.png for part-26, screenshot-3.png for part-32.
According to boundary values, we can know that part-26 has no problem while part-32 is wrong, because '\u4\ud836' is larger than ‘4F', so part-32 should have no records.
So '?' in plain text of part-32 is suspicious, does its unicode is still '\ud836' when query on RMDBS?
So I do next test, check screenshot-4.png. Two different unicode characters are mapped to a same character in utf-8.
This caused the duplication.
How is happens?
1. split boundary values are unicode
2. when the import MR start to run, it read split boundary values to Text type. Text always use utf-8, so some characters are wrong, like above case.
My solution is convert sqoop generated split boundary values to utf-8 first, and resort them.
Attachments
Attachments
Issue Links
- relates to
-
SQOOP-2664 Duplicate records found when split-by column is of type char(n)
- Open