Details
- Improvement
- Status: Resolved
- Minor
- Resolution: Won't Fix
- Impala 2.2.4
- None
- None
Description
Currently, if you put a Parquet file into a text table, queries can experience long pauses and throw many conversion errors. A SELECT COUNT query will still return a number, even though it is wildly inaccurate.
Could the scanner recognize a magic number in the file header for Parquet, Avro, SequenceFile, and/or RCFile and either fail the query or skip the file if its format does not match that of the table or partition?
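To make the request concrete, here is a minimal sketch of such a header check. It is not Impala code: the function names, the format labels, and the "ok"/"mismatch" return values are hypothetical, and the magic-byte table is an assumption that may be incomplete (for example, older RCFile files may start with a different header).

```python
from typing import Optional

# Leading magic bytes per format. This table is an assumption for
# illustration and may not cover every version of every format.
MAGIC_NUMBERS = {
    "PARQUET": b"PAR1",
    "AVRO": b"Obj\x01",
    "SEQUENCE_FILE": b"SEQ",
    "RC_FILE": b"RCF",
}


def detect_format(header: bytes) -> Optional[str]:
    """Return the format whose magic bytes start the header, or None."""
    for name, magic in MAGIC_NUMBERS.items():
        if header.startswith(magic):
            return name
    return None


def check_file(header: bytes, table_format: str) -> str:
    """Return 'mismatch' if the header identifies a different format
    than the table declares, otherwise 'ok' (hypothetical labels)."""
    detected = detect_format(header)
    if detected is not None and detected != table_format:
        return "mismatch"
    return "ok"


if __name__ == "__main__":
    # A Parquet file dropped into a text table is flagged.
    print(check_file(b"PAR1\x15\x04", "TEXT"))
    # Ordinary delimited text passes through.
    print(check_file(b"1,alice,42\n", "TEXT"))
```

One caveat with this approach: a legitimate text file could happen to begin with the bytes "SEQ" or "PAR1", so a real check would probably need to be a warning or an opt-in behavior rather than a hard failure.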
Now that I think about it, this might involve a little complexity if node X read the first block of the file containing the magic number, while node Y read the second block. Would the coordinator node have some way to detect the mismatch and communicate to all the other nodes to skip such-and-such a file or block? Perhaps this technique would be easiest to implement for Parquet files, with only a single block to consider.
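One possible way around the coordination problem, sketched below under stated assumptions, is for every node to read the first few bytes of the file itself before scanning its assigned range, so that each node reaches the same decision without any message from the coordinator. The function name and parameters are hypothetical, the check covers only the Parquet magic, and this is not how Impala's scanners are known to work.

```python
import io
from typing import BinaryIO

PARQUET_MAGIC = b"PAR1"


def range_should_be_skipped(
    f: BinaryIO, range_offset: int, table_is_parquet: bool
) -> bool:
    """Decide locally whether to skip a scan range (hypothetical helper).

    Every reader looks at offset 0 of the file, regardless of where its
    own range starts, so all nodes agree without extra communication.
    """
    f.seek(0)
    header = f.read(len(PARQUET_MAGIC))
    f.seek(range_offset)  # reposition at the start of the assigned range
    looks_like_parquet = header == PARQUET_MAGIC
    # Skip when the file's apparent format disagrees with the table's.
    return looks_like_parquet != table_is_parquet


if __name__ == "__main__":
    parquet_like = io.BytesIO(b"PAR1" + b"\x00" * 100 + b"PAR1")
    # A node assigned the range starting at byte 50 still detects the mismatch.
    print(range_should_be_skipped(parquet_like, 50, table_is_parquet=False))
```

The trade-off is one extra small read per scan range, which may be a remote read when the first block is stored on another node.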
Issue Links
- is related to IMPALA-4753 Table created like parquet file shows wrong row count (Resolved)