Description
When a table is created in Impala, a hidden folder named .impala_insert_staging is created inside the table's folder.
If we try to load such a table with the Spark SQL API sqlContext.parquetFile, this hidden folder causes trouble: Spark tries to read Parquet metadata (the footer) from the folder as if it were a data file, and fails with the following exception:
Caused by: java.io.IOException: Could not read footer for file FileStatus{path=hdfs://xxx:8020/user/hive/warehouse/parquet_strings/.impala_insert_staging; isDirectory=true; modification_time=1406333729252; access_time=0; owner=hdfs; group=hdfs; permission=rwxr-xr-x; isSymlink=false}
... ...
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is not a file: /user/hive/warehouse/parquet_strings/.impala_insert_staging
The Impala side does not consider this their problem: https://issues.cloudera.org/browse/IMPALA-837 (IMPALA-837 Delete .impala_insert_staging directory after INSERT)
So maybe Spark should filter out these hidden folders/files when reading Parquet tables.
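
A minimal sketch of the proposed filtering, in plain Python rather than Spark code. Everything here is an assumption for illustration: Entry is a hypothetical stand-in for a Hadoop FileStatus, is_hidden and footer_candidates are hypothetical helper names, and the part-00000.parquet and _metadata entries are made-up listing contents. Only names starting with "." are treated as hidden, on the assumption that Parquet summary files such as _metadata should still be read; the actual fix may choose a different rule.

```python
from typing import Iterable, List, NamedTuple


class Entry(NamedTuple):
    """Hypothetical stand-in for a Hadoop FileStatus: a path plus a directory flag."""
    path: str
    is_directory: bool


def is_hidden(path: str) -> bool:
    """Return True if the last path component starts with a dot."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    return name.startswith(".")


def footer_candidates(entries: Iterable[Entry]) -> List[str]:
    """Keep only regular, non-hidden files; these are the ones whose footers would be read."""
    return [
        e.path
        for e in entries
        if not e.is_directory and not is_hidden(e.path)
    ]


# Hypothetical directory listing for the table from the stack trace above.
listing = [
    Entry("/user/hive/warehouse/parquet_strings/.impala_insert_staging", True),
    Entry("/user/hive/warehouse/parquet_strings/part-00000.parquet", False),
    Entry("/user/hive/warehouse/parquet_strings/_metadata", False),
]

for path in footer_candidates(listing):
    print(path)
```

With a filter like this applied before footers are read, the .impala_insert_staging directory would never be passed to the Parquet footer reader, so the "Path is not a file" error would not be triggered by it.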