  1. Sqoop (Retired)
  2. SQOOP-3263

Duplicate rows when the split-by column is textual, due to charset differences between Sqoop and Hadoop


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4.6
    • None
    • None
    • None
    • Patch

    Description

      This issue can occur with any RDBMS, because the root cause is not in the RDBMS itself. Steps to reproduce:
      1. create a mysql table: create table ora_test (id varchar(32) primary key not null);
      2. insert 5 rows:
      insert into ora_test values ('08125FC4C8FDA064E053C0A8028DA064');
      insert into ora_test values ('4FFE68419D3502E2E0537F000001F3E8');
      insert into ora_test values ('4FFF9CF5861E003EE0537F0000017FF7');
      insert into ora_test values ('56DAC2D0F14901B0E0537F000001D3FA');
      insert into ora_test values ('4 ABC');
      3. import it into Hive with sqoop import -m 32 (m=189 also reproduces it). You will then get 7 rows in Hive instead of 5. See screenshot-1.png.
      part-32 duplicates rows from part-26.

      I printed the split boundary values of both parts as Unicode and as plain text; see screenshot-2.png for part-26 and screenshot-3.png for part-32.
      According to the boundary values, part-26 is fine while part-32 is wrong: '\u4\ud836' is greater than '4F', so part-32 should contain no records.

      The '?' in the plain text of part-32 is therefore suspicious: is its Unicode value still '\ud836' when the query runs on the RDBMS?
      I ran one more test; see screenshot-4.png. Two different Unicode characters are mapped to the same character in UTF-8.
      This is what causes the duplication.

      How does this happen?
      1. The split boundary values are generated as Unicode strings.
      2. When the import MR job starts, it reads the split boundary values into the Text type. Text always uses UTF-8, so some characters are changed, as in the case above.
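
The mechanism above can be illustrated with plain JDK calls. This is only a sketch, not Sqoop or Hadoop code: it assumes that the problematic boundary is '4' followed by the unpaired surrogate U+D836, and that storing a string in Text behaves like the JDK's UTF-8 encoder, which replaces characters it cannot encode with '?'. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SurrogateCollisionDemo {
    // Encode to UTF-8 and decode again, as happens when a Java String
    // is stored in a UTF-8 based container and read back.
    public static String roundTripUtf8(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumption: two boundaries that differ only in an unpaired surrogate.
        String a = "4" + (char) 0xD836;
        String b = "4" + (char) 0xD837;

        // Different as Java (UTF-16) strings ...
        System.out.println(a.equals(b)); // false

        // ... but identical once encoded to UTF-8, because an unpaired
        // surrogate cannot be encoded and is replaced by '?'.
        System.out.println(Arrays.equals(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8))); // true
        System.out.println(roundTripUtf8(a)); // 4?

        // The ordering relative to '4F' flips as well: the original value
        // sorts after "4F", the converted value sorts before it.
        System.out.println(a.compareTo("4F") > 0);                // true
        System.out.println(roundTripUtf8(a).compareTo("4F") < 0); // true
    }
}
```

The last two lines show why a split that should be empty can end up selecting rows: after conversion its lower bound falls below '4F'.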

      My solution is to convert the Sqoop-generated split boundary values to UTF-8 first, and then re-sort them.
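
A minimal sketch of that idea follows. It is not the content of sqoop-3263.patch: the class `BoundaryNormalizer` and its methods are hypothetical, and sorting by Java's natural String order is an assumption (the database collation may order values differently).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class BoundaryNormalizer {
    // Convert a boundary to the form it will have after a UTF-8 round trip.
    public static String toUtf8Form(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // Convert every boundary first, then re-sort, so that consecutive
    // boundaries are still in ascending order after the conversion.
    public static List<String> normalizeAndSort(List<String> boundaries) {
        List<String> result = new ArrayList<>();
        for (String b : boundaries) {
            result.add(toUtf8Form(b));
        }
        Collections.sort(result); // assumption: natural String ordering
        return result;
    }
}
```

For example, boundaries generated in the order "4", "4F", '4' + U+D836, "5" become "4", "4?", "4F", "5", so no split starts above the value at which it ends.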

      Attachments

        1. sqoop-3263.patch
          1 kB
          Yulei Yang
        2. screenshot-4.png
          112 kB
          Yulei Yang
        3. screenshot-3.png
          30 kB
          Yulei Yang
        4. screenshot-2.png
          32 kB
          Yulei Yang
        5. screenshot-1.png
          78 kB
          Yulei Yang

            People

              Assignee: Unassigned
              Reporter: Yulei Yang (noatime)