  1. IMPALA
  2. IMPALA-2209

More user-friendly behavior for mismatched table format and file format

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: Impala 2.2.4
    • Fix Version/s: None
    • Component/s: Backend
    • Labels: None

    Description

      Currently, if you put a Parquet file into a text table, queries can experience long pauses and throw many conversion errors. A SELECT COUNT query even returns a number, although a wildly inaccurate one.

      Could the scanner recognize a magic number in the file header for Parquet, Avro, SequenceFile, and/or RCFile, and either fail the query or skip the file if its format does not match that of the table or partition?
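
      To illustrate the kind of check being requested, here is a minimal sketch in Python. It is not Impala code: the names `MAGIC_PREFIXES`, `sniff_format`, and `check_file`, and the format labels, are hypothetical. It assumes the commonly documented leading bytes for each format ("PAR1" for Parquet, "Obj" plus a 0x01 byte for Avro container files, "SEQ" for SequenceFile, "RCF" for newer RCFiles). A text file could legitimately start with the same bytes, so a real check would probably need to be more careful.

```python
from typing import Optional

# Leading magic bytes per format (assumed from the formats' public specs).
# Parquet also repeats "PAR1" at the very end of the file.
MAGIC_PREFIXES = {
    "PARQUET": b"PAR1",
    "AVRO": b"Obj\x01",
    "SEQUENCEFILE": b"SEQ",  # followed by a version byte
    "RCFILE": b"RCF",        # newer RCFiles; older ones may carry a SEQ header
}


def sniff_format(header: bytes) -> Optional[str]:
    """Return the format whose magic prefix matches the header, or None."""
    for name, magic in MAGIC_PREFIXES.items():
        if header.startswith(magic):
            return name
    return None


def check_file(header: bytes, table_format: str) -> str:
    """Return 'ok' or 'mismatch' for a file's first bytes and the table format."""
    detected = sniff_format(header)
    if table_format == "TEXT":
        # A text table should not contain a file with a binary-format magic.
        return "mismatch" if detected is not None else "ok"
    return "ok" if detected == table_format else "mismatch"
```

      On a mismatch, the caller could then either fail the query or skip the file, which are the two behaviors proposed above.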

      Now that I think about it, this might involve a little complexity if node X read the first block of the file containing the magic number, while node Y read the second block. Would the coordinator node have some way to detect the mismatch and communicate to all the other nodes to skip such-and-such a file or block? Perhaps this technique would be easiest to implement for Parquet files, with only a single block to consider.
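
      The block-splitting concern can be made concrete with a small sketch. This is only an illustration under assumed names (`ScanRange`, `header_owner`, and `nodes_without_header` are hypothetical, not Impala structures): given how a file is split into scan ranges, it shows which node is able to see the header magic and which nodes would have to be told about a mismatch by someone else.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ScanRange:
    node: str    # node assigned to read this range
    offset: int  # byte offset of the range within the file
    length: int  # number of bytes in the range


def header_owner(ranges: List[ScanRange], magic_len: int = 4) -> Optional[str]:
    """Return the node whose range starts at offset 0 and covers the magic bytes."""
    for r in ranges:
        if r.offset == 0 and r.length >= magic_len:
            return r.node
    return None


def nodes_without_header(ranges: List[ScanRange], magic_len: int = 4) -> List[str]:
    """Return nodes that read part of the file but cannot see the header magic."""
    owner = header_owner(ranges, magic_len)
    return sorted({r.node for r in ranges if r.node != owner})


# Two-block file: node X reads the first block, node Y the second.
ranges = [ScanRange("X", 0, 128), ScanRange("Y", 128, 128)]
owner = header_owner(ranges)           # only this node can detect the mismatch
blind = nodes_without_header(ranges)   # these nodes would need to be notified
```

      With a single scan range per file, as suggested for Parquet, the list of nodes needing notification is empty, which is why that case looks simplest.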

            People

              Assignee: Unassigned
              Reporter: John Russell (jrussell)