Sqoop (Retired) / SQOOP-2874

Highlight Sqoop import with --as-parquetfile use cases (Dataset name <NAME> is not alphanumeric (plus '_'))

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: docs
    • Labels: None

    Description

      Hello Sqoop Community,

      Would it be possible to request some documentation enhancements?

      The ask here is to proactively raise awareness and improve the user experience for a few specific use cases [1] in which Sqoop restricts the characters allowed in certain options when importing with --as-parquetfile.

      My understanding is Sqoop1 currently relies on Kite Datasets to write Parquet files. From the Kite documentation [3] we see that to ensure compatibility (with Hive, etc.), Kite imposes some restrictions on Names and Namespaces which bubble up in Sqoop.

      The following Sqoop use cases, when using import with --as-parquetfile, result in the error [2] below. Full test cases for each scenario are attached. If enhancing the Sqoop documentation for these use cases is an option, I am happy to provide proposed changes; let me know.

      [1] Use Cases:
      1. sqoop import --as-parquetfile + --target-dir /<path>/<rdbms_database>.<table>
      1.1. The '.' is not allowed
      2. sqoop import --as-parquetfile + --table <rdbms_database>.<table> + (no --target-dir)
      2.1. The '.' is not allowed; this is essentially the same as (1)
      3. sqoop import --as-parquetfile + --hive-import --table <hive_database>.<table>
      3.1. The proper usage is --hive-database together with --hive-table; however, with --as-textfile, --hive-table works with <hive_database>.<table>
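
To make use cases (1) and (2) easier to spot before a job is submitted, here is a minimal Python sketch of a pre-flight check. It assumes the last path component of --target-dir ends up as the Kite dataset name, which matches the error in [2] but is not confirmed from the Sqoop source. The pattern only mirrors the wording of the error message; it is not Kite's own code, and both function names are hypothetical.

```python
import posixpath
import re

# Assumption: mirrors the wording of the Kite error message
# ("not alphanumeric (plus '_')"). This is not Kite's implementation.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def dataset_name_from_target_dir(target_dir: str) -> str:
    """Return the last path component of an HDFS-style path.

    Assumption: this component is what becomes the dataset name.
    """
    return posixpath.basename(target_dir.rstrip("/"))


def is_safe_dataset_name(name: str) -> bool:
    """True if the name contains only letters, digits and underscores."""
    return _SAFE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    # Illustrative paths only; they are not taken from the attached test cases.
    for target in ("/data/DATABASE.TABLE", "/data/DATABASE_TABLE"):
        name = dataset_name_from_target_dir(target)
        print(target, name, is_safe_dataset_name(name))
```

A check like this would flag a target directory ending in DATABASE.TABLE, while the same path with an underscore in place of the '.' would pass.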

      [2] Kite Error:
      16/03/06 08:45:56 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.ValidationException: Dataset name DATABASE.TABLE is not alphanumeric (plus '_')
      org.kitesdk.data.ValidationException: Dataset name DATABASE.TABLE is not alphanumeric (plus '_')
      at org.kitesdk.data.ValidationException.check(ValidationException.java:55)
      at org.kitesdk.data.spi.Compatibility.checkDatasetName(Compatibility.java:105)
      at org.kitesdk.data.spi.Compatibility.check(Compatibility.java:68)
      at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.create(FileSystemMetadataProvider.java:209)
      at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.create(FileSystemDatasetRepository.java:137)
      at org.kitesdk.data.Datasets.create(Datasets.java:239)
      at org.kitesdk.data.Datasets.create(Datasets.java:307)
      at org.apache.sqoop.mapreduce.ParquetJob.createDataset(ParquetJob.java:141)
      at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:119)
      at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:130)
      at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:260)
      at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)
      at org.apache.sqoop.manager.OracleManager.importTable(OracleManager.java:444)
      at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
      at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
      at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
      at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
      at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
      at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
      at org.apache.sqoop.Sqoop.main(Sqoop.java:236)

      [3] Kite Documentation:
      http://kitesdk.org/docs/1.0.0/introduction-to-datasets.html
      Names and Namespaces
      URIs also define a name and namespace for your dataset. Kite uses these values when the underlying system has the same concept (for example, Hive). The name and namespace are typically the last two values in a URI. For example, if you create a dataset using the URI dataset:hive:fact_tables/ratings, Kite stores a Hive table ratings in the fact_tables Hive database. If you create a dataset using the URI dataset:hdfs:/user/cloudera/fact_tables/ratings, Kite stores an HDFS dataset named ratings in the fact_tables namespace. To ensure compatibility with Hive and other underlying systems, names and namespaces in URIs must be made of alphanumeric or underscore (_) characters and cannot start with a number.
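
The naming rule quoted above can be sketched in Python as follows. This is an illustration built only from the quoted paragraph (alphanumeric or underscore, not starting with a number, name and namespace taken as the last two URI values). It is not Kite's parser, and the helper names are hypothetical.

```python
import re

# Rule as stated in the quoted Kite documentation: alphanumeric or
# underscore characters, and the value cannot start with a number.
_VALID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def split_namespace_and_name(uri: str) -> tuple:
    """Return (namespace, name) as the last two values of a dataset URI.

    Simplification: splits on ':' and '/' only, which is enough for the
    two example URIs in the quoted documentation.
    """
    parts = [p for p in re.split(r"[:/]", uri) if p]
    if len(parts) < 2:
        raise ValueError("URI has fewer than two components: " + uri)
    return parts[-2], parts[-1]


def is_valid_identifier(value: str) -> bool:
    """True if the value satisfies the quoted naming rule."""
    return _VALID.fullmatch(value) is not None


def problems_in_uri(uri: str) -> list:
    """List the URI components that break the quoted naming rule."""
    namespace, name = split_namespace_and_name(uri)
    problems = []
    if not is_valid_identifier(namespace):
        problems.append("namespace: " + namespace)
    if not is_valid_identifier(name):
        problems.append("name: " + name)
    return problems
```

Applied to the two URIs from the documentation, both yield the namespace fact_tables and the name ratings with no problems reported, whereas a final component such as DATABASE.TABLE is reported as an invalid name.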

      Thanks, Markus

      Attachments

        1. Jira_SQOOP-2874_TestCases.txt
          9 kB
          Markus Kemper

          People

            Assignee: markuskemper@me.com Markus Kemper
            Reporter: markuskemper@me.com Markus Kemper
            Votes: 4
            Watchers: 4
