Spark / SPARK-2699 Improve compatibility with parquet file/table / SPARK-2700

Hidden files (such as .impala_insert_staging) should be filtered out by sqlContext.parquetFile

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.1
    • Fix Version/s: 1.1.0
    • Component/s: SQL
    • Labels: None

    Description

      When a table is created in Impala, a hidden folder named .impala_insert_staging is created inside the table's folder.

      If we load such a table with the Spark SQL API sqlContext.parquetFile, this hidden folder causes trouble: Spark tries to read Parquet metadata from the folder as if it were a data file, and you will see this exception:

      Caused by: java.io.IOException: Could not read footer for file FileStatus{path=hdfs://xxx:8020/user/hive/warehouse/parquet_strings/.impala_insert_staging; isDirectory=true; modification_time=1406333729252; access_time=0; owner=hdfs; group=hdfs; permission=rwxr-xr-x; isSymlink=false}
      ...
      ...
      Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is not a file: /user/hive/warehouse/parquet_strings/.impala_insert_staging
      

      The Impala side does not consider this their problem: https://issues.cloudera.org/browse/IMPALA-837 (IMPALA-837 Delete .impala_insert_staging directory after INSERT)

      So maybe we should filter out these hidden folders/files when reading Parquet tables.
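
A minimal sketch of the proposed filtering, in plain Python rather than Spark's own code. It assumes "hidden" means that the last path component starts with a dot. The helper names is_hidden and visible_children are hypothetical, and so are the part-00000.parquet and _metadata entries in the sample listing; only the .impala_insert_staging path comes from the stack trace above. This is not the actual patch, just an illustration of the idea.

```python
from pathlib import PurePosixPath


def is_hidden(path: str) -> bool:
    """Return True if the last component of the path starts with a dot."""
    return PurePosixPath(path).name.startswith(".")


def visible_children(paths):
    """Keep only the entries that should be considered when reading the table."""
    return [p for p in paths if not is_hidden(p)]


# Hypothetical listing of the table folder; only the first entry is from the report.
children = [
    "/user/hive/warehouse/parquet_strings/.impala_insert_staging",
    "/user/hive/warehouse/parquet_strings/part-00000.parquet",
    "/user/hive/warehouse/parquet_strings/_metadata",
]

for path in visible_children(children):
    print(path)
```

Names starting with an underscore are deliberately kept in this sketch, since Parquet summary files such as _metadata use that prefix; whether to skip them as well is a separate design choice.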

          People

            Assignee: Unassigned
            Reporter: Teng Qiu (chutium)