  Sqoop / SQOOP-2904

Oraoop does not distribute data evenly among mappers


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4.6
    • Fix Version/s: None
    • Component/s: connectors/oracle
    • Labels:
      None
    • Environment:

      RedHat 6.7

      Description

      When executing the Sqoop command below, with the --direct option, to import data from Oracle:

      sqoop import -Doracle.row.fetch.size=20000 -Doraoop.timestamp.string=false --connect jdbc:oracle:thin:@xxx.xx.xx.xxxx -m 50 --direct --username xxx --password xxxx --table my_table_name --fetch-size 20000 --target-dir /data/temp

      The stdout message shows:

      16/04/08 10:39:06 INFO oracle.OraOopDataDrivenDBInputFormat: The table being imported by sqoop has 138310664 blocks that have been divided into 101 chunks which will be processed in 50 splits. The chunks will be allocated to the splits using the method : ROUNDROBIN
      16/04/08 10:39:07 INFO mapreduce.JobSubmitter: number of splits:50

      Thus 49 mappers will each work on 2 chunks, while 1 mapper will work on 3 chunks. Because that single mapper receives 50% more data than the rest, it takes 50% longer to finish.
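The imbalance follows directly from round-robin allocation: 101 chunks spread over 50 splits must leave exactly one split with an extra chunk. A minimal standalone sketch (the class and method names here are illustrative, not Sqoop source):

```java
// Illustrative round-robin allocation of chunks to splits (not Sqoop code).
public class RoundRobinDemo {

    // Assign 'chunks' data chunks to 'splits' splits in round-robin order
    // and return how many chunks each split received.
    public static int[] allocate(int chunks, int splits) {
        int[] counts = new int[splits];
        for (int c = 0; c < chunks; c++) {
            counts[c % splits]++; // round-robin assignment
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = allocate(101, 50);
        int min = Integer.MAX_VALUE, max = 0;
        for (int n : counts) {
            min = Math.min(min, n);
            max = Math.max(max, n);
        }
        // 49 splits receive 2 chunks each; one split receives 3.
        System.out.println("min=" + min + " max=" + max); // prints min=2 max=3
    }
}
```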

      First, OraOopUtilities.java has a method getNumberOfDataChunksPerOracleDataFile:

      public static int getNumberOfDataChunksPerOracleDataFile(
          int desiredNumberOfMappers, org.apache.hadoop.conf.Configuration conf) {

        final String MAPPER_MULTIPLIER = "oraoop.datachunk.mapper.multiplier";
        final String RESULT_INCREMENT = "oraoop.datachunk.result.increment";

        int numberToMultiplyMappersBy = conf.getInt(MAPPER_MULTIPLIER, 2);
        int numberToIncrementResultBy = conf.getInt(RESULT_INCREMENT, 1);

        // The number of chunks generated will not be a multiple of the number of
        // splits, to ensure that each split doesn't always get data from the
        // start of each data-file...
        int numberOfDataChunksPerOracleDataFile =
            (desiredNumberOfMappers * numberToMultiplyMappersBy)
                + numberToIncrementResultBy;

        return numberOfDataChunksPerOracleDataFile;
      }

      So it looks like this was designed on purpose, so that each split will not always get data from the start of each data file.

      I thought I could simply set the property oraoop.datachunk.result.increment=0 to solve the issue, but after testing it does not change the behavior. I then dug deeper and found that this method is not actually called anywhere in Sqoop. Instead, OraOopDataDrivenDBInputFormat (in its getSplits method) implements similar logic again, this time with hard-coded values:

      int desiredNumberOfMappers = getDesiredNumberOfMappers(jobContext);

      ...

      // The number of chunks generated will not be a multiple of the number
      // of splits, to ensure that each split doesn't always get data from
      // the start of each data-file...
      int numberOfChunksPerOracleDataFile = (desiredNumberOfMappers * 2) + 1;

      Thus there is no way to change this behavior other than fixing the code.

      The proposed fixes are:

      1. Because the number of chunks is 2 * the number of mappers + 1, data is distributed unevenly across mappers, prolonging the whole Sqoop job by 50%. IMHO, the benefit of ensuring that each split doesn't always get data from the start of each data-file is insignificant compared to the drawback of uneven data distribution.
      2. The getSplits method in OraOopDataDrivenDBInputFormat.java should call OraOopUtilities.getNumberOfDataChunksPerOracleDataFile so that this behavior can be controlled through the oraoop.datachunk.mapper.multiplier and oraoop.datachunk.result.increment options.
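The effect of the second fix can be shown with the formula alone. The helper below is a standalone reimplementation of the chunk-count arithmetic from the snippets above (the class name ChunkCountDemo is hypothetical; the real utility reads the two values from the Hadoop Configuration):

```java
// Standalone illustration of the configurable chunk-count formula
// (mappers * multiplier + increment); defaults in Sqoop are 2 and 1.
public class ChunkCountDemo {

    public static int chunksPerDataFile(int mappers, int multiplier, int increment) {
        return mappers * multiplier + increment;
    }

    public static void main(String[] args) {
        // Current hard-coded behavior: 50 mappers -> 101 chunks, which
        // cannot divide evenly over 50 splits.
        System.out.println(chunksPerDataFile(50, 2, 1)); // prints 101

        // With oraoop.datachunk.result.increment=0 honored: 100 chunks,
        // an exact multiple of 50, so every mapper gets 2 chunks.
        System.out.println(chunksPerDataFile(50, 2, 0)); // prints 100
    }
}
```

If getSplits delegated to the utility method, setting the increment to 0 would make the chunk count an exact multiple of the mapper count, restoring an even distribution.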


            People

            • Assignee: Unassigned
            • Reporter: Lan Jiang
            • Votes: 0
            • Watchers: 1
