[SQOOP-3263] Duplicate rows found when split-by column is of textual type due to different charset difference of sqoop and hadoop - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.4.6
Fix Version/s: None
Component/s: None
Labels:
None

Flags:

Patch

Description

This is issue can be found in any kind of RMDBS, because the root cause is not on RMDBS. Steps to reproduce this issue:
1. create a mysql table: create table ora_test (id varchar(32) primary key not null);
2. insert 5 rows:
insert into ora_test values ('08125FC4C8FDA064E053C0A8028DA064');
insert into ora_test values ('4FFE68419D3502E2E0537F000001F3E8');
insert into ora_test values ('4FFF9CF5861E003EE0537F0000017FF7');
insert into ora_test values ('56DAC2D0F14901B0E0537F000001D3FA');
insert into ora_test values ('4 ABC');
3. import it to hive with sqoop import -m 32. (m=189 is also ok)。 Then you will get 7 rows in hive. Check screenshot-1.png
part-32 is duplicated with part-26.

so I print their split boundary values in unicode and plain text, check screenshot-2.png for part-26, screenshot-3.png for part-32.
According to boundary values, we can know that part-26 has no problem while part-32 is wrong, because '\u4\ud836' is larger than ‘4F', so part-32 should have no records.

So '?' in plain text of part-32 is suspicious, does its unicode is still '\ud836' when query on RMDBS？
So I do next test, check screenshot-4.png. Two different unicode characters are mapped to a same character in utf-8.
This caused the duplication.

How is happens?
1. split boundary values are unicode
2. when the import MR start to run, it read split boundary values to Text type. Text always use utf-8, so some characters are wrong, like above case.

My solution is convert sqoop generated split boundary values to utf-8 first, and resort them.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

screenshot-1.png
26/Nov/17 15:02
78 kB
Yulei Yang
screenshot-2.png
26/Nov/17 15:03
32 kB
Yulei Yang
screenshot-3.png
26/Nov/17 15:03
30 kB
Yulei Yang
screenshot-4.png
26/Nov/17 15:45
112 kB
Yulei Yang
sqoop-3263.patch
26/Nov/17 15:55
1 kB
Yulei Yang

Issue Links

relates to

SQOOP-2664 Duplicate records found when split-by column is of type char(n)

Open

Activity

People

Assignee:: Unassigned

Reporter:: Yulei Yang

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 26/Nov/17 14:56

Updated:: 26/Nov/17 16:11