Sqoop
  1. Sqoop
  2. SQOOP-390

PostgreSQL connector for direct export with pg_bulkload

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.3
    • Component/s: None
    • Labels:
      None

      Description

      Features

      • Fast data export into PostgreSQL database with pg_bulkload.
      • User can get benefit of functionality of pg_bulkload such as
        • fast export bypassing shared bufferes and WAL,
        • removing invalid data which cause parse error,
        • ETL feature with filter functions.

      Implementation

      PGBulkloadExportMapper

      At first, each map tasks create their own staging table with names based on task attempt id because pg_bulkload holds table level lock.
      Arguments of pg_bulkload command can be passed via configuration object.
      The Mapper export data by invoking pg_bulkload command as external process.
      Commnad execution is done in the same way as DirectPostgresqlManager.

      PGBulkloadExportReducer

      Reducer migrates data from staging tables into destination table.
      Reducer gets the names of staging tables as map output values.
      In order to do migration in a transaction, the number of reducers must be 1. (It is set by ConnectionManager internally).
      Migration is done in same way as Sqoop defalt connectors using "INSERT INTO dst ( SELECT * FROM src )".
      In the cleanup method, staging tables are dropped.
      If exception is raised in the reducer, garbage staging tables are left.
      User can delete them by executing this connector with --clear-staging-table option.

      Requirements

      • pg_bulkload must be installed on DB server and all slave nodes.
      • PostgreSQL JDBC is also required on client node (same as PostgresqlManager).
      • Superuser role of PostgreSQL database is required for execution of pg_bulkload.

      Usage

      Currently there is no Factory class.
      Specify connection manager class name with --connection-manager option to use.

      $ sqoop export --connect jdbc:postgresql://localhost:5432/test \
                     --connection-manager com.cloudera.sqoop.manager.PGBulkloadManager \
                     --table test --username postgres --export-dir=/test -m 1
      

      You can also specify pg_bulkload configuration with Hadoop configuration properties.

      $ sqoop export \
          -Dpgbulkload.bin="/usr/local/bin/pg_bulkload" \
          -Dpgbulkload.input.field.delim=$'\t' \
          -Dpgbulkload.check.constraints="YES" \
          -Dpgbulkload.parse.errors="INFINITE" \
          -Dpgbulkload.duplicate.errors="INFINITE" \
          --connect jdbc:postgresql://localhost:5432/test \
          --connection-manager com.cloudera.sqoop.manager.PGBulkloadManager \
          --table test --username postgres --export-dir=/test -m 1
      

      Test

      There is test class named PGBulkloadManagerManualTest extending TestExport.
      This test expects that

      • The major version of PostgreSQL is 9.0,
      • The version of pg_bulkload is 3.1.1,
      • PostgreSQL server is running on localhost:5432 of testing node,
      • there is database named sqooptest,
      • super user role named sqooptest exists,
      • no password is set for the role, or .pgpass file is created.

      Tests can be invoked as below.

      # ant -Dtestcase=PGBulkloadManagerManualTest test
      

      Test on CentOS 6

      Install JDK and Ant:

      # rpm -ivh jdk-6u29-linux-amd64.rpm
      # yum install ant-junit
      

      Setup for PostgreSQL:

      # wget http://ftp.postgresql.org/pub/source/v9.0.7/postgresql-9.0.7.tar.bz2
      # tar jxf postgresql-9.0.7.tar.bz2
      # cd postgresql-9.0.7
      # ./configure --prefix=/usr/local
      # make
      # make install
      # useradd postgres
      # mkdir /var/pgdata
      # chown postgres:postgres /var/pgdata
      # chmod 700 /var/pgdata
      # sudo -u postgres /usr/local/bin/initdb  -D /var/pgdata -E utf-8
      # sudo -u postgres /usr/local/bin/pg_ctl start -D /var/pgdata -l /var/pgdata/postgresql.log
      # createuser -U postgres -s sqooptest
      # createdb -U sqooptest sqooptest
      

      Setup for pg_bulkload and PostgreSQL JDBC Driver:

      # cd ..
      # wget http://pgfoundry.org/frs/download.php/3176/pg_bulkload-3.1.1.tar.gz
      # tar zxf pg_bulkload-3.1.1.tar.gz
      # mv pg_bulkload-3.1.1 postgresql-9.0.7/contrib/
      # cd postgresql-9.0.7/contrib/pg_bulkload-3.1.1
      # make
      # make install
      # psql -U sqooptest -f /usr/local/share/postgresql/contrib/pg_bulkload.sql sqooptest
      # ln -s /usr/local/bin/pg_bulkload /usr/bin/
      

      Setup for PostgreSQL JDBC Driver:

      # wget http://jdbc.postgresql.org/download/postgresql-9.0-802.jdbc4.jar
      # cp postgresql-9.0-802.jdbc4.jar /usr/local/src/sqoop-trunk/lib/
      

      run test.

      # cd /usr/local/src/sqoop-trunk
      # ant -Dtestcase=PGBulkloadManagerManualTest test
      
      1. pgbulkload-connector-r6.patch
        58 kB
        Masatake Iwasaki
      2. SQOOP-390-1.patch
        37 kB
        Masatake Iwasaki

        Issue Links

          Activity

          Hide
          Masatake Iwasaki added a comment - - edited

          I would like review and suggestion for this connector.
          Thanks in advance.

          Show
          Masatake Iwasaki added a comment - - edited I would like review and suggestion for this connector. Thanks in advance.
          Hide
          Jarek Jarcec Cecho added a comment -

          Hi Masatake,
          thank you very much for your patch. We're currently using review board application running on https://reviews.apache.org/ for patch reviewing. would you be so kind and upload it there for the review, please?

          Thank you,
          Jarcec

          Show
          Jarek Jarcec Cecho added a comment - Hi Masatake, thank you very much for your patch. We're currently using review board application running on https://reviews.apache.org/ for patch reviewing. would you be so kind and upload it there for the review, please? Thank you, Jarcec
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/2724/
          -----------------------------------------------------------

          Review request for Sqoop.

          Summary
          -------

          Patch for SQOOP-390
          https://issues.apache.org/jira/browse/SQOOP-390

          This addresses bug SQOOP-390.
          https://issues.apache.org/jira/browse/SQOOP-390

          Diffs


          /src/java/com/cloudera/sqoop/manager/PGBulkloadManager.java PRE-CREATION
          /src/java/com/cloudera/sqoop/mapreduce/AutoProgressReducer.java PRE-CREATION
          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportJob.java PRE-CREATION
          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportMapper.java PRE-CREATION
          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportReducer.java PRE-CREATION
          /src/test/com/cloudera/sqoop/manager/PGBulkloadManagerTest.java PRE-CREATION

          Diff: https://reviews.apache.org/r/2724/diff

          Testing
          -------

          Thanks,

          Masatake

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/2724/ ----------------------------------------------------------- Review request for Sqoop. Summary ------- Patch for SQOOP-390 https://issues.apache.org/jira/browse/SQOOP-390 This addresses bug SQOOP-390 . https://issues.apache.org/jira/browse/SQOOP-390 Diffs /src/java/com/cloudera/sqoop/manager/PGBulkloadManager.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/AutoProgressReducer.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportJob.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportMapper.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportReducer.java PRE-CREATION /src/test/com/cloudera/sqoop/manager/PGBulkloadManagerTest.java PRE-CREATION Diff: https://reviews.apache.org/r/2724/diff Testing ------- Thanks, Masatake
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/2724/
          -----------------------------------------------------------

          (Updated 2012-01-13 05:23:36.248155)

          Review request for Sqoop.

          Summary
          -------

          Patch for SQOOP-390
          https://issues.apache.org/jira/browse/SQOOP-390

          This addresses bug SQOOP-390.
          https://issues.apache.org/jira/browse/SQOOP-390

          Diffs


          /src/java/com/cloudera/sqoop/manager/PGBulkloadManager.java PRE-CREATION
          /src/java/com/cloudera/sqoop/mapreduce/AutoProgressReducer.java PRE-CREATION
          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportJob.java PRE-CREATION
          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportMapper.java PRE-CREATION
          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportReducer.java PRE-CREATION
          /src/test/com/cloudera/sqoop/manager/PGBulkloadManagerTest.java PRE-CREATION

          Diff: https://reviews.apache.org/r/2724/diff

          Testing (updated)
          -------

          This patch include the test class PGBulkloadManagerTest.
          I've tested "ant test" and passed.

          Thanks,

          Masatake

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/2724/ ----------------------------------------------------------- (Updated 2012-01-13 05:23:36.248155) Review request for Sqoop. Summary ------- Patch for SQOOP-390 https://issues.apache.org/jira/browse/SQOOP-390 This addresses bug SQOOP-390 . https://issues.apache.org/jira/browse/SQOOP-390 Diffs /src/java/com/cloudera/sqoop/manager/PGBulkloadManager.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/AutoProgressReducer.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportJob.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportMapper.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportReducer.java PRE-CREATION /src/test/com/cloudera/sqoop/manager/PGBulkloadManagerTest.java PRE-CREATION Diff: https://reviews.apache.org/r/2724/diff Testing (updated) ------- This patch include the test class PGBulkloadManagerTest. I've tested "ant test" and passed. Thanks, Masatake
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/2724/#review6510
          -----------------------------------------------------------

          Hello Masatake, I have a quick question about your patch.

          Since you're adding new tests to test/com/cloudera/sqoop/manager, "ant test -Dthirdparty=true" will fail unless the test requirements are not installed (PostgreSql 9.1, pg_bulkload 3.1, etc).

          I think that you can do either:

          1) Not run your tests with "ant test -Dthirdparty=true".

          or

          2) Update the wiki page and user doc for setting up development env.

          But since your tests are for a specific connector, I think that #1 makes more sense. Please let me know what you think.

          Thanks,
          Cheolsoo

          • Cheolsoo

          On 2012-01-13 05:23:36, Masatake Iwasaki wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/2724/

          -----------------------------------------------------------

          (Updated 2012-01-13 05:23:36)

          Review request for Sqoop.

          Summary

          -------

          Patch for SQOOP-390

          https://issues.apache.org/jira/browse/SQOOP-390

          This addresses bug SQOOP-390.

          https://issues.apache.org/jira/browse/SQOOP-390

          Diffs

          -----

          /src/java/com/cloudera/sqoop/manager/PGBulkloadManager.java PRE-CREATION

          /src/java/com/cloudera/sqoop/mapreduce/AutoProgressReducer.java PRE-CREATION

          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportJob.java PRE-CREATION

          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportMapper.java PRE-CREATION

          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportReducer.java PRE-CREATION

          /src/test/com/cloudera/sqoop/manager/PGBulkloadManagerTest.java PRE-CREATION

          Diff: https://reviews.apache.org/r/2724/diff

          Testing

          -------

          This patch include the test class PGBulkloadManagerTest.

          I've tested "ant test" and passed.

          Thanks,

          Masatake

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/2724/#review6510 ----------------------------------------------------------- Hello Masatake, I have a quick question about your patch. Since you're adding new tests to test/com/cloudera/sqoop/manager, "ant test -Dthirdparty=true" will fail unless the test requirements are not installed (PostgreSql 9.1, pg_bulkload 3.1, etc). I think that you can do either: 1) Not run your tests with "ant test -Dthirdparty=true". or 2) Update the wiki page and user doc for setting up development env. But since your tests are for a specific connector, I think that #1 makes more sense. Please let me know what you think. Thanks, Cheolsoo Cheolsoo On 2012-01-13 05:23:36, Masatake Iwasaki wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/2724/ ----------------------------------------------------------- (Updated 2012-01-13 05:23:36) Review request for Sqoop. Summary ------- Patch for SQOOP-390 https://issues.apache.org/jira/browse/SQOOP-390 This addresses bug SQOOP-390 . https://issues.apache.org/jira/browse/SQOOP-390 Diffs ----- /src/java/com/cloudera/sqoop/manager/PGBulkloadManager.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/AutoProgressReducer.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportJob.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportMapper.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportReducer.java PRE-CREATION /src/test/com/cloudera/sqoop/manager/PGBulkloadManagerTest.java PRE-CREATION Diff: https://reviews.apache.org/r/2724/diff Testing ------- This patch include the test class PGBulkloadManagerTest. I've tested "ant test" and passed. Thanks, Masatake
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2012-03-29 05:41:04, Cheolsoo Park wrote:

          > Hello Masatake, I have a quick question about your patch.

          >

          > Since you're adding new tests to test/com/cloudera/sqoop/manager, "ant test -Dthirdparty=true" will fail unless the test requirements are not installed (PostgreSql 9.1, pg_bulkload 3.1, etc).

          >

          > I think that you can do either:

          >

          > 1) Not run your tests with "ant test -Dthirdparty=true".

          >

          > or

          >

          > 2) Update the wiki page and user doc for setting up development env.

          >

          > But since your tests are for a specific connector, I think that #1 makes more sense. Please let me know what you think.

          >

          > Thanks,

          > Cheolsoo

          Hello Cheolsoo,
          I think it is not necessary to run the test in -Dthirdparty=true case.
          Renaming the test case class to *ManualTest is reasonable way to do so?

          Because there is already descrption for setting up PostgreSQL, updating user doc for developper is also possible.
          A few addition make it.

          • Masatake

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/2724/#review6510
          -----------------------------------------------------------

          On 2012-01-13 05:23:36, Masatake Iwasaki wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/2724/

          -----------------------------------------------------------

          (Updated 2012-01-13 05:23:36)

          Review request for Sqoop.

          Summary

          -------

          Patch for SQOOP-390

          https://issues.apache.org/jira/browse/SQOOP-390

          This addresses bug SQOOP-390.

          https://issues.apache.org/jira/browse/SQOOP-390

          Diffs

          -----

          /src/java/com/cloudera/sqoop/manager/PGBulkloadManager.java PRE-CREATION

          /src/java/com/cloudera/sqoop/mapreduce/AutoProgressReducer.java PRE-CREATION

          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportJob.java PRE-CREATION

          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportMapper.java PRE-CREATION

          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportReducer.java PRE-CREATION

          /src/test/com/cloudera/sqoop/manager/PGBulkloadManagerTest.java PRE-CREATION

          Diff: https://reviews.apache.org/r/2724/diff

          Testing

          -------

          This patch include the test class PGBulkloadManagerTest.

          I've tested "ant test" and passed.

          Thanks,

          Masatake

          Show
          jiraposter@reviews.apache.org added a comment - On 2012-03-29 05:41:04, Cheolsoo Park wrote: > Hello Masatake, I have a quick question about your patch. > > Since you're adding new tests to test/com/cloudera/sqoop/manager, "ant test -Dthirdparty=true" will fail unless the test requirements are not installed (PostgreSql 9.1, pg_bulkload 3.1, etc). > > I think that you can do either: > > 1) Not run your tests with "ant test -Dthirdparty=true". > > or > > 2) Update the wiki page and user doc for setting up development env. > > But since your tests are for a specific connector, I think that #1 makes more sense. Please let me know what you think. > > Thanks, > Cheolsoo Hello Cheolsoo, I think it is not necessary to run the test in -Dthirdparty=true case. Renaming the test case class to *ManualTest is reasonable way to do so? Because there is already descrption for setting up PostgreSQL, updating user doc for developper is also possible. A few addition make it. Masatake ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/2724/#review6510 ----------------------------------------------------------- On 2012-01-13 05:23:36, Masatake Iwasaki wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/2724/ ----------------------------------------------------------- (Updated 2012-01-13 05:23:36) Review request for Sqoop. Summary ------- Patch for SQOOP-390 https://issues.apache.org/jira/browse/SQOOP-390 This addresses bug SQOOP-390 . https://issues.apache.org/jira/browse/SQOOP-390 Diffs ----- /src/java/com/cloudera/sqoop/manager/PGBulkloadManager.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/AutoProgressReducer.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportJob.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportMapper.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportReducer.java PRE-CREATION /src/test/com/cloudera/sqoop/manager/PGBulkloadManagerTest.java PRE-CREATION Diff: https://reviews.apache.org/r/2724/diff Testing ------- This patch include the test class PGBulkloadManagerTest. I've tested "ant test" and passed. Thanks, Masatake
          Hide
          jiraposter@reviews.apache.org added a comment -

          On 2012-03-29 05:41:04, Cheolsoo Park wrote:

          > Hello Masatake, I have a quick question about your patch.

          >

          > Since you're adding new tests to test/com/cloudera/sqoop/manager, "ant test -Dthirdparty=true" will fail unless the test requirements are not installed (PostgreSql 9.1, pg_bulkload 3.1, etc).

          >

          > I think that you can do either:

          >

          > 1) Not run your tests with "ant test -Dthirdparty=true".

          >

          > or

          >

          > 2) Update the wiki page and user doc for setting up development env.

          >

          > But since your tests are for a specific connector, I think that #1 makes more sense. Please let me know what you think.

          >

          > Thanks,

          > Cheolsoo

          Masatake Iwasaki wrote:

          Hello Cheolsoo,

          I think it is not necessary to run the test in -Dthirdparty=true case.

          Renaming the test case class to *ManualTest is reasonable way to do so?

          Because there is already descrption for setting up PostgreSQL, updating user doc for developper is also possible.

          A few addition make it.

          Hi Masatake,

          Indeed, renaming the test case to *ManualTest does the trick. Would you mind making that change to your patch?

          I am no committer so my review doesn't really matter. But I am going to try out your patch out of curiosity.

          Thanks,
          Cheolsoo

          • Cheolsoo

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/2724/#review6510
          -----------------------------------------------------------

          On 2012-01-13 05:23:36, Masatake Iwasaki wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/2724/

          -----------------------------------------------------------

          (Updated 2012-01-13 05:23:36)

          Review request for Sqoop.

          Summary

          -------

          Patch for SQOOP-390

          https://issues.apache.org/jira/browse/SQOOP-390

          This addresses bug SQOOP-390.

          https://issues.apache.org/jira/browse/SQOOP-390

          Diffs

          -----

          /src/java/com/cloudera/sqoop/manager/PGBulkloadManager.java PRE-CREATION

          /src/java/com/cloudera/sqoop/mapreduce/AutoProgressReducer.java PRE-CREATION

          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportJob.java PRE-CREATION

          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportMapper.java PRE-CREATION

          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportReducer.java PRE-CREATION

          /src/test/com/cloudera/sqoop/manager/PGBulkloadManagerTest.java PRE-CREATION

          Diff: https://reviews.apache.org/r/2724/diff

          Testing

          -------

          This patch include the test class PGBulkloadManagerTest.

          I've tested "ant test" and passed.

          Thanks,

          Masatake

          Show
          jiraposter@reviews.apache.org added a comment - On 2012-03-29 05:41:04, Cheolsoo Park wrote: > Hello Masatake, I have a quick question about your patch. > > Since you're adding new tests to test/com/cloudera/sqoop/manager, "ant test -Dthirdparty=true" will fail unless the test requirements are not installed (PostgreSql 9.1, pg_bulkload 3.1, etc). > > I think that you can do either: > > 1) Not run your tests with "ant test -Dthirdparty=true". > > or > > 2) Update the wiki page and user doc for setting up development env. > > But since your tests are for a specific connector, I think that #1 makes more sense. Please let me know what you think. > > Thanks, > Cheolsoo Masatake Iwasaki wrote: Hello Cheolsoo, I think it is not necessary to run the test in -Dthirdparty=true case. Renaming the test case class to *ManualTest is reasonable way to do so? Because there is already descrption for setting up PostgreSQL, updating user doc for developper is also possible. A few addition make it. Hi Masatake, Indeed, renaming the test case to *ManualTest does the trick. Would you mind making that change to your patch? I am no committer so my review doesn't really matter. But I am going to try out your patch out of curiosity. Thanks, Cheolsoo Cheolsoo ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/2724/#review6510 ----------------------------------------------------------- On 2012-01-13 05:23:36, Masatake Iwasaki wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/2724/ ----------------------------------------------------------- (Updated 2012-01-13 05:23:36) Review request for Sqoop. Summary ------- Patch for SQOOP-390 https://issues.apache.org/jira/browse/SQOOP-390 This addresses bug SQOOP-390 . https://issues.apache.org/jira/browse/SQOOP-390 Diffs ----- /src/java/com/cloudera/sqoop/manager/PGBulkloadManager.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/AutoProgressReducer.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportJob.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportMapper.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportReducer.java PRE-CREATION /src/test/com/cloudera/sqoop/manager/PGBulkloadManagerTest.java PRE-CREATION Diff: https://reviews.apache.org/r/2724/diff Testing ------- This patch include the test class PGBulkloadManagerTest. I've tested "ant test" and passed. Thanks, Masatake
          Hide
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/2724/
          -----------------------------------------------------------

          (Updated 2012-04-02 02:30:05.378389)

          Review request for Sqoop.

          Changes
          -------

          renamed test case class.

          Summary
          -------

          Patch for SQOOP-390
          https://issues.apache.org/jira/browse/SQOOP-390

          This addresses bug SQOOP-390.
          https://issues.apache.org/jira/browse/SQOOP-390

          Diffs (updated)


          /src/java/com/cloudera/sqoop/manager/PGBulkloadManager.java PRE-CREATION
          /src/java/com/cloudera/sqoop/mapreduce/AutoProgressReducer.java PRE-CREATION
          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportJob.java PRE-CREATION
          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportMapper.java PRE-CREATION
          /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportReducer.java PRE-CREATION
          /src/test/com/cloudera/sqoop/manager/PGBulkloadManagerManualTest.java PRE-CREATION

          Diff: https://reviews.apache.org/r/2724/diff

          Testing
          -------

          This patch include the test class PGBulkloadManagerTest.
          I've tested "ant test" and passed.

          Thanks,

          Masatake

          Show
          jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/2724/ ----------------------------------------------------------- (Updated 2012-04-02 02:30:05.378389) Review request for Sqoop. Changes ------- renamed test case class. Summary ------- Patch for SQOOP-390 https://issues.apache.org/jira/browse/SQOOP-390 This addresses bug SQOOP-390 . https://issues.apache.org/jira/browse/SQOOP-390 Diffs (updated) /src/java/com/cloudera/sqoop/manager/PGBulkloadManager.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/AutoProgressReducer.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportJob.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportMapper.java PRE-CREATION /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportReducer.java PRE-CREATION /src/test/com/cloudera/sqoop/manager/PGBulkloadManagerManualTest.java PRE-CREATION Diff: https://reviews.apache.org/r/2724/diff Testing ------- This patch include the test class PGBulkloadManagerTest. I've tested "ant test" and passed. Thanks, Masatake
          Hide
          Venkatesh Seetharam added a comment -

          pg_bulkload must be installed on DB server and all slave nodes.

          Is there a way to use DistributedCache to distribute the pg_bulkload tar file to all the slave nodes instead of requiring an install?

          Show
          Venkatesh Seetharam added a comment - pg_bulkload must be installed on DB server and all slave nodes. Is there a way to use DistributedCache to distribute the pg_bulkload tar file to all the slave nodes instead of requiring an install?
          Hide
          Jarek Jarcec Cecho added a comment -

          I'm afraid that you have to install the pg_bulkload manually on all nodes at the moment. There isn't possibility to distribute pg_bulkload with the job. But I believe it's interesting idea for sqoop 2 (and for direct connectors in general).

          Jarcec

          Show
          Jarek Jarcec Cecho added a comment - I'm afraid that you have to install the pg_bulkload manually on all nodes at the moment. There isn't possibility to distribute pg_bulkload with the job. But I believe it's interesting idea for sqoop 2 (and for direct connectors in general). Jarcec
          Hide
          Abhijeet Gaikwad added a comment -

          If this helps, giving the file path with '-files' option on Sqoop command-line copies the file into Distributed Cache. This file can be accessed by each mapper.

          Show
          Abhijeet Gaikwad added a comment - If this helps, giving the file path with '-files' option on Sqoop command-line copies the file into Distributed Cache. This file can be accessed by each mapper.
          Hide
          Masatake Iwasaki added a comment -

          Attached a reviewed patch. Thanks for your reviews, Jarek.

          Show
          Masatake Iwasaki added a comment - Attached a reviewed patch. Thanks for your reviews, Jarek.
          Hide
          Jarek Jarcec Cecho added a comment -

          Committed revision 1374923.

          Thanks for your contribution Masatake!

          Jarcec

          Show
          Jarek Jarcec Cecho added a comment - Committed revision 1374923. Thanks for your contribution Masatake! Jarcec
          Hide
          Hudson added a comment -

          Integrated in Sqoop-ant-jdk-1.6-hadoop23 #141 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop23/141/)
          SQOOP-390. PostgreSQL connector for direct export with pg_bulkload.

          (Masatake Iwasaki via Jarek Jarcec Cecho) (Revision 1374923)

          Result = SUCCESS
          jarcec :
          Files :

          • /sqoop/trunk/src/docs/user/SqoopUserGuide.txt
          • /sqoop/trunk/src/docs/user/connectors.txt
          • /sqoop/trunk/src/java/org/apache/sqoop/manager/DirectPostgresqlManager.java
          • /sqoop/trunk/src/java/org/apache/sqoop/manager/PGBulkloadManager.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/AutoProgressMapper.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/AutoProgressReducer.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportJob.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportMapper.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportReducer.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/ProgressThread.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/SqoopReducer.java
          • /sqoop/trunk/src/java/org/apache/sqoop/util/PostgreSQLUtils.java
          • /sqoop/trunk/src/test/com/cloudera/sqoop/manager/PGBulkloadManagerManualTest.java
          Show
          Hudson added a comment - Integrated in Sqoop-ant-jdk-1.6-hadoop23 #141 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop23/141/ ) SQOOP-390 . PostgreSQL connector for direct export with pg_bulkload. (Masatake Iwasaki via Jarek Jarcec Cecho) (Revision 1374923) Result = SUCCESS jarcec : Files : /sqoop/trunk/src/docs/user/SqoopUserGuide.txt /sqoop/trunk/src/docs/user/connectors.txt /sqoop/trunk/src/java/org/apache/sqoop/manager/DirectPostgresqlManager.java /sqoop/trunk/src/java/org/apache/sqoop/manager/PGBulkloadManager.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/AutoProgressMapper.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/AutoProgressReducer.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportJob.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportMapper.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportReducer.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/ProgressThread.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/SqoopReducer.java /sqoop/trunk/src/java/org/apache/sqoop/util/PostgreSQLUtils.java /sqoop/trunk/src/test/com/cloudera/sqoop/manager/PGBulkloadManagerManualTest.java
          Hide
          Hudson added a comment -

          Integrated in Sqoop-ant-jdk-1.6-hadoop200 #2 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop200/2/)
          SQOOP-390. PostgreSQL connector for direct export with pg_bulkload.

          (Masatake Iwasaki via Jarek Jarcec Cecho) (Revision 1374923)

          Result = SUCCESS
          jarcec :
          Files :

          • /sqoop/trunk/src/docs/user/SqoopUserGuide.txt
          • /sqoop/trunk/src/docs/user/connectors.txt
          • /sqoop/trunk/src/java/org/apache/sqoop/manager/DirectPostgresqlManager.java
          • /sqoop/trunk/src/java/org/apache/sqoop/manager/PGBulkloadManager.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/AutoProgressMapper.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/AutoProgressReducer.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportJob.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportMapper.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportReducer.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/ProgressThread.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/SqoopReducer.java
          • /sqoop/trunk/src/java/org/apache/sqoop/util/PostgreSQLUtils.java
          • /sqoop/trunk/src/test/com/cloudera/sqoop/manager/PGBulkloadManagerManualTest.java
          Show
          Hudson added a comment - Integrated in Sqoop-ant-jdk-1.6-hadoop200 #2 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop200/2/ ) SQOOP-390 . PostgreSQL connector for direct export with pg_bulkload. (Masatake Iwasaki via Jarek Jarcec Cecho) (Revision 1374923) Result = SUCCESS jarcec : Files : /sqoop/trunk/src/docs/user/SqoopUserGuide.txt /sqoop/trunk/src/docs/user/connectors.txt /sqoop/trunk/src/java/org/apache/sqoop/manager/DirectPostgresqlManager.java /sqoop/trunk/src/java/org/apache/sqoop/manager/PGBulkloadManager.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/AutoProgressMapper.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/AutoProgressReducer.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportJob.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportMapper.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportReducer.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/ProgressThread.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/SqoopReducer.java /sqoop/trunk/src/java/org/apache/sqoop/util/PostgreSQLUtils.java /sqoop/trunk/src/test/com/cloudera/sqoop/manager/PGBulkloadManagerManualTest.java
          Hide
          Hudson added a comment -

          Integrated in Sqoop-ant-jdk-1.6-hadoop20 #11 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop20/11/)
          SQOOP-390. PostgreSQL connector for direct export with pg_bulkload.

          (Masatake Iwasaki via Jarek Jarcec Cecho) (Revision 1374923)

          Result = SUCCESS
          jarcec :
          Files :

          • /sqoop/trunk/src/docs/user/SqoopUserGuide.txt
          • /sqoop/trunk/src/docs/user/connectors.txt
          • /sqoop/trunk/src/java/org/apache/sqoop/manager/DirectPostgresqlManager.java
          • /sqoop/trunk/src/java/org/apache/sqoop/manager/PGBulkloadManager.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/AutoProgressMapper.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/AutoProgressReducer.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportJob.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportMapper.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportReducer.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/ProgressThread.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/SqoopReducer.java
          • /sqoop/trunk/src/java/org/apache/sqoop/util/PostgreSQLUtils.java
          • /sqoop/trunk/src/test/com/cloudera/sqoop/manager/PGBulkloadManagerManualTest.java
          Show
          Hudson added a comment - Integrated in Sqoop-ant-jdk-1.6-hadoop20 #11 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop20/11/ ) SQOOP-390 . PostgreSQL connector for direct export with pg_bulkload. (Masatake Iwasaki via Jarek Jarcec Cecho) (Revision 1374923) Result = SUCCESS jarcec : Files : /sqoop/trunk/src/docs/user/SqoopUserGuide.txt /sqoop/trunk/src/docs/user/connectors.txt /sqoop/trunk/src/java/org/apache/sqoop/manager/DirectPostgresqlManager.java /sqoop/trunk/src/java/org/apache/sqoop/manager/PGBulkloadManager.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/AutoProgressMapper.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/AutoProgressReducer.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportJob.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportMapper.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportReducer.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/ProgressThread.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/SqoopReducer.java /sqoop/trunk/src/java/org/apache/sqoop/util/PostgreSQLUtils.java /sqoop/trunk/src/test/com/cloudera/sqoop/manager/PGBulkloadManagerManualTest.java
          Hide
          Hudson added a comment -

          Integrated in Sqoop-ant-jdk-1.6-hadoop100 #11 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop100/11/)
          SQOOP-390. PostgreSQL connector for direct export with pg_bulkload.

          (Masatake Iwasaki via Jarek Jarcec Cecho) (Revision 1374923)

          Result = SUCCESS
          jarcec :
          Files :

          • /sqoop/trunk/src/docs/user/SqoopUserGuide.txt
          • /sqoop/trunk/src/docs/user/connectors.txt
          • /sqoop/trunk/src/java/org/apache/sqoop/manager/DirectPostgresqlManager.java
          • /sqoop/trunk/src/java/org/apache/sqoop/manager/PGBulkloadManager.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/AutoProgressMapper.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/AutoProgressReducer.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportJob.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportMapper.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportReducer.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/ProgressThread.java
          • /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/SqoopReducer.java
          • /sqoop/trunk/src/java/org/apache/sqoop/util/PostgreSQLUtils.java
          • /sqoop/trunk/src/test/com/cloudera/sqoop/manager/PGBulkloadManagerManualTest.java
          Show
          Hudson added a comment - Integrated in Sqoop-ant-jdk-1.6-hadoop100 #11 (See https://builds.apache.org/job/Sqoop-ant-jdk-1.6-hadoop100/11/ ) SQOOP-390 . PostgreSQL connector for direct export with pg_bulkload. (Masatake Iwasaki via Jarek Jarcec Cecho) (Revision 1374923) Result = SUCCESS jarcec : Files : /sqoop/trunk/src/docs/user/SqoopUserGuide.txt /sqoop/trunk/src/docs/user/connectors.txt /sqoop/trunk/src/java/org/apache/sqoop/manager/DirectPostgresqlManager.java /sqoop/trunk/src/java/org/apache/sqoop/manager/PGBulkloadManager.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/AutoProgressMapper.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/AutoProgressReducer.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportJob.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportMapper.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/PGBulkloadExportReducer.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/ProgressThread.java /sqoop/trunk/src/java/org/apache/sqoop/mapreduce/SqoopReducer.java /sqoop/trunk/src/java/org/apache/sqoop/util/PostgreSQLUtils.java /sqoop/trunk/src/test/com/cloudera/sqoop/manager/PGBulkloadManagerManualTest.java
          Hide
          Ruslan Al-Fakikh added a comment -

          Hey guys. I am wondering why bulk insert with COPY command was not added to Sqoop in the first place:
          https://issues.cloudera.org/browse/SQOOP-184
          I understand that pg_bulkload can be faster, but this is a third-party tool, COPY is the official and recommended way of bulk inserts for Postgres:
          http://www.postgresql.org/docs/current/interactive/populate.html
          Maybe there were some drawbacks for COPY?

          Show
          Ruslan Al-Fakikh added a comment - Hey guys. I am wondering why bulk insert with COPY command was not added to Sqoop in the first place: https://issues.cloudera.org/browse/SQOOP-184 I understand that pg_bulkload can be faster, but this is a third-party tool, COPY is the official and recommended way of bulk inserts for Postgres: http://www.postgresql.org/docs/current/interactive/populate.html Maybe there were some drawbacks for COPY?
          Hide
          Jarek Jarcec Cecho added a comment -

          Hi Ruslan Al-Fakikh,
          Sqoop is a community driven project, so all features that goes in were at some point required by the community. Masatake Iwasaki was in need to have pg_bulkload connector, so he sacrificed his free unpaid time to wrote it. I would advise you to create a new JIRA asking for COPY based connector in case that you would prefer this approach.

          Jarcec

          Show
          Jarek Jarcec Cecho added a comment - Hi Ruslan Al-Fakikh , Sqoop is a community driven project, so all features that goes in were at some point required by the community. Masatake Iwasaki was in need to have pg_bulkload connector, so he sacrificed his free unpaid time to wrote it. I would advise you to create a new JIRA asking for COPY based connector in case that you would prefer this approach. Jarcec
          Hide
          Ruslan Al-Fakikh added a comment -

          Hi Jarek,
          Thanks for the reply. I understand your point. The jira for COPY is already there (see my previous comment).
          Honestly I was confused by the docs from the beginning and had to spend a day to understand all the options clearly, because I am not a huge Postgres expert. I suggest the docs should be changed this way:
          Here
          http://archive.cloudera.com/cdh4/cdh/4/sqoop/SqoopUserGuide.html#_pg_bulkload_connector
          we should mention the standard COPY method for those who are not Postgres experts, so that they can evaluate for themselves. From the start I thought it was just the standard bulk insert for Postgres.
          Here
          http://archive.cloudera.com/cdh4/cdh/4/sqoop/SqoopUserGuide.html#_supported_databases
          instead of specifying:
          Yes (import only)
          for Postgres --direct support, we should add something like this:
          Yes (export via a third-party tool)
          so that users won't think that bulk insert for Postgres is not present in Sqoop at all.

          Show
          Ruslan Al-Fakikh added a comment - Hi Jarek, Thanks for the reply. I understand your point. The jira for COPY is already there (see my previous comment). Honestly I was confused by the docs from the beginning and had to spend a day to understand all the options clearly, because I am not a huge Postgres expert. I suggest the docs should be changed this way: Here http://archive.cloudera.com/cdh4/cdh/4/sqoop/SqoopUserGuide.html#_pg_bulkload_connector we should mention the standard COPY method for those who are not Postgres experts, so that they can evaluate for themselves. From the start I thought it was just the standard bulk insert for Postgres. Here http://archive.cloudera.com/cdh4/cdh/4/sqoop/SqoopUserGuide.html#_supported_databases instead of specifying: Yes (import only) for Postgres --direct support, we should add something like this: Yes (export via a third-party tool) so that users won't think that bulk insert for Postgres is not present in Sqoop at all.
          Hide
          Jarek Jarcec Cecho added a comment -

          Hi Ruslan Al-Fakikh,
          thank you very much for your feedback. The JIRA on Cloudera site is no longer used, so if you're still interested in the functionality, you should go ahead and create the ticket here on Apache JIRA.

          I think that the information in docs should suggest that -direct option in case of PostgreSQL works only with import and not with export. In case that user is interested, then there is separate Connector that can utilize pg_bulkload utility. However as you've mentioned, pg_bulkload is not part of PostgreSQL and thus it's not activated by -direct parameter.

          Jarcec

          Show
          Jarek Jarcec Cecho added a comment - Hi Ruslan Al-Fakikh , thank you very much for your feedback. The JIRA on Cloudera site is no longer used, so if you're still interested in the functionality, you should go ahead and create the ticket here on Apache JIRA. I think that the information in docs should suggest that - direct option in case of PostgreSQL works only with import and not with export. In case that user is interested, then there is separate Connector that can utilize pg_bulkload utility. However as you've mentioned, pg_bulkload is not part of PostgreSQL and thus it's not activated by -direct parameter. Jarcec
          Hide
          Ruslan Al-Fakikh added a comment -

          Jarek, thanks for the clarification. I created the issue:
          https://issues.apache.org/jira/browse/SQOOP-999

          Ruslan

          Show
          Ruslan Al-Fakikh added a comment - Jarek, thanks for the clarification. I created the issue: https://issues.apache.org/jira/browse/SQOOP-999 Ruslan

            People

            • Assignee:
              Masatake Iwasaki
              Reporter:
              Masatake Iwasaki
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development