Sqoop / SQOOP-1302

Doesn't run the mapper for remaining splits, when split-by ROWNUM

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4.3
    • Fix Version/s: 1.4.5
    • Component/s: build
    • Labels:
    • Environment:

      CDH 4.6.0-1

      Description

      When importing a table from an Oracle database into HDFS with Sqoop, I am using ROWNUM as the split-by column, since the table does not have a primary key. With the default 4 mappers I get exactly 1/4 of the data; the remaining 3/4 is never imported. Sqoop splits the range correctly into 4 pieces, but only one mapper actually reads any data.
      Eg: the table has 28 records in total, but only 8 records end up in HDFS.

      HDFS OUTPUT:

      part-m-00000 544 bytes
      part-m-00001 0 bytes
      part-m-00002 0 bytes
      part-m-00003 0 bytes

      LOG:
      Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
      Please set $HCAT_HOME to the root of your HCatalog installation.
      Warning: /hdp2014poc/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
      Please set $ACCUMULO_HOME to the root of your Accumulo installation.
      14/04/02 12:38:22 INFO sqoop.Sqoop: Running Sqoop version: 1.4.3-cdh4.6.0
      14/04/02 12:38:22 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
      14/04/02 12:38:22 INFO manager.SqlManager: Using default fetchSize of 1000
      14/04/02 12:38:22 INFO tool.CodeGenTool: Beginning code generation
      14/04/02 12:38:23 INFO manager.OracleManager: Time zone has been set to GMT
      14/04/02 12:38:23 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM BILL_FEE_CDE_DIM t WHERE 1=0
      14/04/02 12:38:23 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /hdp2014poc/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/bin/../lib/hadoop-0.20-mapreduce
      Note: /tmp/sqoop-dmadmin/compile/6a8ac204650278b4235d199bb6059358/BILL_FEE_CDE_DIM.java uses or overrides a deprecated API.
      Note: Recompile with -Xlint:deprecation for details.
      14/04/02 12:38:24 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-dmadmin/compile/6a8ac204650278b4235d199bb6059358/BILL_FEE_CDE_DIM.jar
      14/04/02 12:38:24 INFO mapreduce.ImportJobBase: Beginning import of BILL_FEE_CDE_DIM
      14/04/02 12:38:25 INFO manager.OracleManager: Time zone has been set to GMT
      14/04/02 12:38:25 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
      14/04/02 12:38:26 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ROWNUM), MAX(ROWNUM) FROM BILL_FEE_CDE_DIM
      14/04/02 12:38:26 INFO mapred.JobClient: Running job: job_201403281357_0126
      14/04/02 12:38:27 INFO mapred.JobClient: map 0% reduce 0%
      14/04/02 12:38:38 INFO mapred.JobClient: map 50% reduce 0%
      14/04/02 12:38:39 INFO mapred.JobClient: map 75% reduce 0%
      14/04/02 12:38:47 INFO mapred.JobClient: map 100% reduce 0%
      14/04/02 12:38:49 INFO mapred.JobClient: Job complete: job_201403281357_0126
      14/04/02 12:38:49 INFO mapred.JobClient: Counters: 23
      14/04/02 12:38:49 INFO mapred.JobClient: File System Counters
      14/04/02 12:38:49 INFO mapred.JobClient: FILE: Number of bytes read=0
      14/04/02 12:38:49 INFO mapred.JobClient: FILE: Number of bytes written=765188
      14/04/02 12:38:49 INFO mapred.JobClient: FILE: Number of read operations=0
      14/04/02 12:38:49 INFO mapred.JobClient: FILE: Number of large read operations=0
      14/04/02 12:38:49 INFO mapred.JobClient: FILE: Number of write operations=0
      14/04/02 12:38:49 INFO mapred.JobClient: HDFS: Number of bytes read=435
      14/04/02 12:38:49 INFO mapred.JobClient: HDFS: Number of bytes written=544
      14/04/02 12:38:49 INFO mapred.JobClient: HDFS: Number of read operations=4
      14/04/02 12:38:49 INFO mapred.JobClient: HDFS: Number of large read operations=0
      14/04/02 12:38:49 INFO mapred.JobClient: HDFS: Number of write operations=4
      14/04/02 12:38:49 INFO mapred.JobClient: Job Counters
      14/04/02 12:38:49 INFO mapred.JobClient: Launched map tasks=4
      14/04/02 12:38:49 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=34570
      14/04/02 12:38:49 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
      14/04/02 12:38:49 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
      14/04/02 12:38:49 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
      14/04/02 12:38:49 INFO mapred.JobClient: Map-Reduce Framework
      14/04/02 12:38:49 INFO mapred.JobClient: Map input records=8
      14/04/02 12:38:49 INFO mapred.JobClient: Map output records=8
      14/04/02 12:38:49 INFO mapred.JobClient: Input split bytes=435
      14/04/02 12:38:49 INFO mapred.JobClient: Spilled Records=0
      14/04/02 12:38:49 INFO mapred.JobClient: CPU time spent (ms)=8030
      14/04/02 12:38:49 INFO mapred.JobClient: Physical memory (bytes) snapshot=1160265728
      14/04/02 12:38:49 INFO mapred.JobClient: Virtual memory (bytes) snapshot=6454796288
      14/04/02 12:38:49 INFO mapred.JobClient: Total committed heap usage (bytes)=3032481792
      14/04/02 12:38:49 INFO mapreduce.ImportJobBase: Transferred 544 bytes in 24.8174 seconds (21.9201 bytes/sec)
      14/04/02 12:38:49 INFO mapreduce.ImportJobBase: Retrieved 8 records.

        Activity

        venkatnrangan Venkat Ranganathan added a comment -

        ROWNUM should not be used as a split by column.

        To understand why, run the following query against any Oracle table (it does not matter how many rows the table has):

        select * from <table> where rownum between 20 and 30;

        The query returns no rows. ROWNUM is a pseudo-column: a row in an Oracle table does not carry a stored row number, so there is no way to ask for, say, row 15 of a table. ROWNUM is assigned to rows incrementally as the result set is returned, and that assignment rule is what causes this behavior.
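        The effect can be sketched in Python. This is a simulation of Oracle's ROWNUM semantics, not Sqoop's actual code, and the split boundaries below are illustrative, not the exact ones Sqoop computed for this job:

```python
def select_with_rownum(rows, lo, hi):
    """Simulate 'SELECT * FROM t WHERE ROWNUM BETWEEN lo AND hi' under
    Oracle semantics: ROWNUM is handed out incrementally, and only
    advances when the predicate accepts the candidate row."""
    out = []
    rownum = 1  # next ROWNUM to assign
    for row in rows:
        # The predicate is evaluated against the *candidate* ROWNUM.
        if lo <= rownum <= hi:
            out.append(row)
            rownum += 1  # ROWNUM advances only for accepted rows
    return out

table = list(range(28))  # 28 records, as in the report

# Sqoop-style splits over MIN(ROWNUM)=1 .. MAX(ROWNUM)=28 with 4 mappers
# (illustrative boundaries):
for lo, hi in [(1, 7), (8, 14), (15, 21), (22, 28)]:
    print((lo, hi), "->", len(select_with_rownum(table, lo, hi)), "rows")

# Only the split containing ROWNUM = 1 can ever match: a split starting
# at, say, 8 requires the first accepted row to have ROWNUM >= 8, but the
# first accepted row would be assigned ROWNUM = 1, so the predicate never
# passes and the split returns 0 rows.
```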

        Please use another column from your table as the split-by column. Note that the split-by column need not be unique.
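        A hedged sketch of the suggested fix: import using a real column for --split-by. The column name FEE_CDE, connection string, and username below are assumptions for illustration, not values taken from this report; only the table name appears in the log above.

```shell
# Hypothetical invocation: split on an ordinary (not necessarily unique)
# column instead of the ROWNUM pseudo-column.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --table BILL_FEE_CDE_DIM \
  --split-by FEE_CDE \
  -m 4
```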

        venkatnrangan Venkat Ranganathan added a comment - - edited

        I will add a warning explaining to users that this is not the right use of ROWNUM in Oracle.

        praveenhm praveen m added a comment -

        Hi Venkat,

        Thanks for the clarification.

        Praveen


          People

          • Assignee:
            venkatnrangan Venkat Ranganathan
            Reporter:
            praveenhm praveen m
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated: 396h
              Remaining: 396h
              Logged: Not Specified

                Development