Uploaded image for project: 'Sqoop'
  1. Sqoop
  2. SQOOP-3240

Hive table Row count mismatch(more records in hadoop/hive than oracle) when importing data from oracle using sqoop import with direct mode

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.4.6
    • Fix Version/s: None
    • Component/s: connectors/oracle
    • Labels:
      None
    • Environment:

      Sqoop 1.4.6-cdh5.8.2
      Oracle 11i
      oracle table Sqoop import with direct mode

      Description

      Hive/Hadoop row count was not matching on sqoop import of oracle table with direct mode, it's perfectly matching without direct mode.

      See the below examples.

      1. Row Count is varying based on fetch-size 10000 and direct mode.
      2. Row count is not consistent, sometimes it is 630869 and sometimes 639924 etc.,
      3. We are pulling data with direct mode for 1 year with monthly partition, data mismatch happening for couple of months not for all months.
      4. Hive/Hadoop table Count looks good, If we import the data from oracle without direct mode.

      Actual row count in oracle table for year 2015 and month 12 is : 609924, we are getting more records on sqoop import with direct mode : 630869. hive table count is matching with oracle count on sqoop import without direct mode : 609924

      1. sqoop import without direct mode, Row Count : 609924

      sqoop import -Dmapredue.job.queuename=queue -Doraoop.import.partitions=oracle_table_partition_201512 --connect jdbc:oracle:thin:@//xxxxx.yyyyy.com:2408/zzzz --username user_id -P --table oracle_db.oracle_table --split-by split_column --hive-import --hive-table hive_db.hive_table --where "TO_DATE(oracle_timestamp_column) BETWEEN TO_DATE('12/01/2015','mm/dd/yyyy') AND TO_DATE('12/31/2015','mm/dd/yyyy')" --fields-terminated-by '\001' --lines-terminated-by '\n' --as-textfile -m 4

      2. sqoop import with direct mode, Row Count : 630869

      sqoop import -Dmapredue.job.queuename=queue -Doraoop.import.partitions=oracle_table_partition_201512 --connect jdbc:oracle:thin:@//xxxxx.yyyyy.com:2408/zzzz --username user_id -P --table oracle_db.oracle_table --split-by split_column --hive-import --hive-table hive_db.hive_table --where "TO_DATE(oracle_timestamp_column) BETWEEN TO_DATE('12/01/2015','mm/dd/yyyy') AND TO_DATE('12/31/2015','mm/dd/yyyy')" --fields-terminated-by '\001' --lines-terminated-by '\n' --as-textfile -m 4 --direct

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              infosuresh2k Suresh Venkatesan
            • Votes:
              2 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated: