Sqoop
  1. Sqoop
  2. SQOOP-1306

Allow Sqoop to move files from different FileSystems on incremental import

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4.4
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Sqoop does not allow moving files between different FileSystems when executing an import (--append or --incremental).

      When importing with a local temp-dir and an S3 target-dir, the operation completes, but no file is created in the S3 bucket and this warning is raised:
      WARN - Cannot append files to target dir; no such directory: sqoop/15151724000000436_31417_localhost.localdomain<tablename>

      Looking into the source, I found that AppendUtils.java assumes that tempDir and targetDir are on the same FileSystem.

      1. scenarios.png
        59 kB
        Rodrigo Matihara

        Activity

        Jarek Jarcec Cecho added a comment -

        "Move" is only a metadata operation and hence it's indeed required that they are in the same filesystem. Otherwise you would need copy or distcp to make the "move". This limitation is imposed by HDFS, not by Sqoop.

        Rodrigo Matihara added a comment -

        Hi Jarek,
        I know that Sqoop's default is to import into HDFS, but by setting target-dir to my s3n URI I was able to import data from MySQL into Amazon S3.
        Using S3 instead of HDFS, I ran into this problem, because I would like to create the temp files locally and, when the process finishes, send them to S3.
        I need to move the files from the local FileSystem to the Amazon S3 FileSystem.

        Jarek Jarcec Cecho added a comment -

        Yup, I hear you Rodrigo Matihara. I'm trying to express that what you are trying to achieve is currently outside of Sqoop's scope. You would need to run distcp from within Sqoop to "move" files across filesystems.

        Rodrigo Matihara added a comment -

        Jarek, I understand your point, but I think distcp is not what I need.
        I created a diagram (attached) with two scenarios that should make clear what I'm after.
        The first scenario works perfectly today: I can import data from MySQL to HDFS/S3, creating temp files on the target-dir.
        The second scenario is what I need: import data from MySQL to HDFS/S3, creating temp files locally and writing the final files to HDFS/S3.

        Rodrigo Matihara added a comment - edited

        I had to code this for my project, so in the append() method of the AppendUtils class I did this:

        // Create a FileSystem for each dir
        FileSystem tempDirFs = tempDir.getFileSystem(options.getConf());
        FileSystem userDestDirFs = userDestDir.getFileSystem(options.getConf());
        
        // If both dirs are on the same file system, use the default moveFiles
        if (userDestDirFs.getClass().equals(tempDirFs.getClass())) {
            moveFiles(tempDirFs, tempDir, userDestDir, nextPartition);
        } else {
            // Move files between different file systems
            moveFiles(tempDirFs, tempDir, userDestDirFs, userDestDir, nextPartition);
        }
        

        The second moveFiles overload creates the files in target-dir using FSDataInputStream and FSDataOutputStream, and deletes the temp files when it finishes.
        I don't know if this is the best way to do it, but it's working now.
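        For what it's worth, the copy-then-delete fallback described above can be sketched with plain java.nio.file. This is only an illustration of the idea, not Sqoop code; the class name CrossFsMove and its method are made up for the example:

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CrossFsMove {
    // Try a cheap rename first; if the two paths live on different
    // file stores the atomic move fails, so fall back to streaming the
    // bytes across and then deleting the source (copy + delete).
    public static void move(Path src, Path dst) throws IOException {
        try {
            Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
            Files.delete(src);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("crossfs");
        Path src = Files.write(tmp.resolve("part-00000"), "id,value\n1,a\n".getBytes());
        Path dst = tmp.resolve("moved-part-00000");
        move(src, dst);
        System.out.println(Files.exists(dst) && !Files.exists(src)); // prints "true"
    }
}
```

        The same pattern applies with Hadoop's FileSystem API, where the rename would be replaced by streaming between the two FileSystem instances, as the moveFiles overload above does.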

        Venkat Ranganathan added a comment -

        Rodrigo Matihara I concur with Jarek Jarcec Cecho. This seems to be something you would need to set up as a workflow, maybe an Oozie workflow with a sqoop and a distcp action, or a java action, to do what you are trying to do.
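        As a rough sketch of that kind of workflow (the node names, schema versions, staging path and S3 URI below are placeholder assumptions, not tested configuration), the two actions could be chained like this:

```xml
<workflow-app name="sqoop-then-distcp" xmlns="uri:oozie:workflow:0.4">
  <start to="sqoop-import"/>
  <action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- import into a staging dir on HDFS first -->
      <command>import --connect jdbc:mysql://host:3306/db --table t --target-dir /staging/t</command>
    </sqoop>
    <ok to="copy-to-s3"/>
    <error to="fail"/>
  </action>
  <action name="copy-to-s3">
    <distcp xmlns="uri:oozie:distcp-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- then copy the staged files to the S3 target -->
      <arg>/staging/t</arg>
      <arg>s3n://bucket/sqoop</arg>
    </distcp>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
```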

        Rodrigo Matihara added a comment - edited

        Good idea Venkat Ranganathan, I will try Oozie too.
        But I still think it would be a good improvement for Sqoop to allow different file systems for the temp dir and the target dir.

        Suppose I run this:

        sqoop import --connect jdbc:mysql://xxxxx:3306/sqoop --username root --password root --table yyyyy --target-dir s3n://<bucket-name>/sqoop --incremental append --check-column id --last-value 120

        I saw issue https://issues.apache.org/jira/browse/SQOOP-1303, which allows a different file system on target-dir. So this command does not work on sqoop-1.4.4 but will work on sqoop-1.4.5. That's great, because Sqoop will allow different FileSystems for the target-dir.

        But if I call this:

        sqoop import --connect jdbc:mysql://xxxxx:3306/sqoop --username root --password root --table yyyyy --target-dir s3n://<bucket-name>/sqoop --incremental append --check-column id --last-value 120

        and supposing that sqoop.test.import.rootDir=file:/tmp/sqoop/
        I will have problems, because Sqoop will try to rename temp files that exist under /tmp/sqoop/... to s3n://<bucket-name>/sqoop/...

        I'm proposing that when temp-dir and target-dir are on different FileSystems, Sqoop really moves the files from temp to target.
        I have coded this on a Sqoop version that I'm studying, and I could send the class for you guys to check.
        I'm sorry about my English; I think it is not helping our discussion, but my intention is just to help the community ;P

        Henrique Andrade added a comment -

        Jarek and Venkat,

        I think the functionality of Sqoop is pretty clear, and Rodrigo is not talking about adding new functionality or changing current behavior. The main point here is cost. Right now the temp files are written to S3, which means we pay to upload those files over the Internet connection and incur several unnecessary read and write operations on S3. What Rodrigo is proposing is to keep the temp files generated by Sqoop in the local temp directory and then, at the end of the process, move the final file to S3, as is done right now.

        The change Rodrigo made just adds the option to define either a local temp dir or an S3 temp dir. If the user defines an S3 temp dir, the behavior is the same as the current code; if the user defines a local temp dir, the temp files are placed in the local temp dir and moved to S3 at the end.

        Makes sense?


          People

          • Assignee: Unassigned
          • Reporter: Rodrigo Matihara
          • Votes: 0
          • Watchers: 4

            Dates

            • Created:
              Updated:
