Details
Bug
Status: Resolved
Minor
Resolution: Cannot Reproduce
Impala 1.1
None
None
Impala 1.1.0 and CM 4.6.2.
Description
In the CM "Query Details" page, one of the fields is "File Formats". If I query a table created with STORED AS SEQFILE with the BZip2 compression codec, CM shows a line like:
File Formats: SEQUENCE_FILE/BZIP2
That seems intuitive. However, for other combinations of file format and compression codec, the "File Formats" value is blank or seems misleading.
select * from seqfile_snappy limit 5 -> file formats in CM is blank
select * from rcfile_snappy limit 5 -> file formats in CM is blank
select count(*) from seqfile_deflate -> file formats in CM = SEQUENCE_FILE/DEFAULT
select count(*) from rcfile_deflate -> file formats in CM = RC_FILE/DEFAULT (Is DEFAULT a typo for DEFLATE? It appears for both SEQFILE and RCFILE tables.)
select count(*) from parquet_snappy -> file formats in CM = PARQUET/NONE
I also see PARQUET/NONE for a Parquet table compressed with GZip.
I also see PARQUET/NONE for a Parquet table where the Impala data directory contains data files compressed with different codecs. I understand that CM can in some cases display multiple values in the "File Formats" field, and that is what I'd expect here, just as I'd expect multiple values for a join of tables with different file formats, or for a query against a partitioned table whose partitions have different file formats.
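To make that expectation concrete, here is a minimal Python sketch of the value I'd expect to see. It is purely illustrative and is not CM's or Impala's actual code: the function name `expected_file_formats`, the per-file list of (format, codec) pairs, and the comma separator between multiple values are all my assumptions.

```python
def expected_file_formats(scanned_files):
    """Build the "File Formats" string I'd expect for a query.

    scanned_files: iterable of (file_format, codec) pairs, one per data
    file read by the query (hypothetical input, for illustration only).
    """
    labels = []
    for file_format, codec in scanned_files:
        label = f"{file_format}/{codec}"
        # List each distinct format/codec combination once, in first-seen order.
        if label not in labels:
            labels.append(label)
    # Assumption: multiple values are joined with ", ".
    return ", ".join(labels)


# A Parquet directory holding files written with two different codecs.
mixed = [("PARQUET", "SNAPPY"), ("PARQUET", "GZIP"), ("PARQUET", "SNAPPY")]
print(expected_file_formats(mixed))
```

For the mixed-codec Parquet directory, this yields two entries (PARQUET/SNAPPY and PARQUET/GZIP) instead of the single PARQUET/NONE that CM shows.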
I did not have an LZO-compressed text table, so I didn't check if that case would produce TEXT/LZO as expected.
I did not have an Avro table, so I didn't check those combinations.
I did not check Avro, SEQFILE, or RCFILE with data files from more than one compression codec in the same directory.
Other than the above cases, I think I checked every combination of file format and codec, and the only issues I saw were those I listed.
I can provide the impala-shell PROFILE output or the CM profile text if desired.