Sqoop (Retired) / SQOOP-3477

Sqoop-import creating .metadata and .signals folders during import


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4.7
    • Fix Version/s: 1.4.7
    • Component/s: sqoop2-client
    • Labels: None

    Description

      Unable to read files extracted by sqoop-import in Hive.

       

      Steps to reproduce -

      1. Import a table from a relational database (Oracle RDS) or any other RDBMS into S3 through sqoop import.

      Sqoop-import Sample -

      sqoop import --connect jdbc:oracle:thin:@//ishantest.xxx.us-west-2.rds.amazonaws.com:1521/ISHANORC --table PV_AGGREGATES --username admin --password xxx# -m 1 --delete-target-dir --target-dir s3n://xx-isgxaur-logs/sqoop_issue/pv_ag --as-parquetfile --compression-codec=snappy

      Once sqoop-import is done, the data in S3 will look like this -

      [hadoop@ip-xx-xx-xx-xx ~]$ aws s3 ls s3://xx-xx-xx/sqoop_issue/ishan_new/

      PRE .metadata/
      PRE .signals/
      2020-06-03 17:49:03 0 .metadata_$folder$
      2020-06-03 17:49:03 0 .signals_$folder$
      2020-06-03 17:49:03 417 f7e5670c-77ca-460b-80c5-e0449cb09dfe.parquet

       

      As we can see, sqoop-import is creating .metadata and .signals folders while extracting the data. This causes a failure when reading the data in Hive.
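
      (Not part of the original reproduction steps - added only for reference.) A quick way to confirm what was written into those hidden prefixes is to list and dump them with the AWS CLI; the bucket/prefix below is the one from the listing above, and descriptor.properties is the file named in the Hive error further down -

      # List the hidden objects created alongside the Parquet file
      aws s3 ls --recursive s3://xx-xx-xx/sqoop_issue/ishan_new/.metadata/
      aws s3 ls --recursive s3://xx-xx-xx/sqoop_issue/ishan_new/.signals/

      # Dump the descriptor file to stdout - it is a plain-text properties
      # file, not Parquet, which is why Hive later fails on it
      aws s3 cp s3://xx-xx-xx/sqoop_issue/ishan_new/.metadata/descriptor.properties -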

      2. Deploy an EMR 5.19 cluster or any virtual machine with Hive 2.3.3 and Sqoop 1.4.7 installed.

      3. Create an external table whose location is the path copied in step 1.

      Command to create the external table -

      CREATE EXTERNAL TABLE test (dt string,code string,views string) STORED as parquet location 's3://xxx-isgaur-logs/sqoop_issue/ishan_new';
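
      (Side note, not in the original steps - an assumed sanity check.) After creating the table, one can verify from the shell that its location points at the prefix from step 1 -

      # Assumes the table name "test" from the statement above
      hive -e "DESCRIBE FORMATTED test;" | grep -i location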

      4. Use Hive and run a "select * from table;" statement against the table. You will get the below error -

      Error message it throws - Failed with exception java.io.IOException:java.lang.RuntimeException: s3://cccxxxx-logs/sqoop_issue/ishan_new/.metadata/descriptor.properties is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [117, 101, 116, 10]
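
      (Decoding added for clarity; not part of the original error output.) The byte values in the message are ASCII - [80, 65, 82, 49] is the "PAR1" magic Hive expects at the tail of a Parquet file, while [117, 101, 116, 10] is "uet" plus a newline, i.e. the tail of a plain-text file. A quick shell check -

      printf '\x50\x41\x52\x31'; echo   # bytes 80 65 82 49    -> PAR1 (Parquet magic)
      printf '\x75\x65\x74\x0a'         # bytes 117 101 116 10 -> "uet" + newline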

      The above error occurs because Hive, while reading the data, does not expect the .metadata and .signals folders created by Sqoop, hence the failure.

      I did extensive analysis of this problem but did not find any solution. Please let me know if there is any workaround. How can one read the sqoop-import extracted data within Hive?
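
      (For reference only - an assumed, untested workaround, not a fix.) One option sometimes used in similar setups is to delete the hidden prefixes after the import finishes, so that the table location contains only Parquet data files -

      # Untested sketch: remove the metadata objects left behind by the import
      # (bucket/prefix taken from the listing above - adjust to your target-dir)
      aws s3 rm --recursive 's3://xx-xx-xx/sqoop_issue/ishan_new/.metadata/'
      aws s3 rm --recursive 's3://xx-xx-xx/sqoop_issue/ishan_new/.signals/'
      aws s3 rm 's3://xx-xx-xx/sqoop_issue/ishan_new/.metadata_$folder$'
      aws s3 rm 's3://xx-xx-xx/sqoop_issue/ishan_new/.signals_$folder$'

      # The select from step 4 should then see only the .parquet file
      hive -e 'select * from test limit 10;'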

          People

            Assignee: Unassigned
            Reporter: ISHAN gaur (ishangaur)