  Apache Drill / DRILL-5358

Error if Parquet file changes during query

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.9.0
    • Fix Version/s: None
    • Component/s: Metadata, Storage - Parquet
    • Labels:
      None

      Description

      We have a scenario where we generate our own Parquet files
      every X seconds.
      The files are laid out in a date-based directory structure, and only the file for the current day is ever updated.

      The process is as follows:

      1. Generate the Parquet file in a temp directory
      2. When generation finishes, mv the file into a Drill workspace (data/2017/03/10/data.parquet, ..)
      3. Restart the process
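
A minimal sketch of the generate-then-move loop above, for anyone trying to reproduce it. The directory names, the `.part` suffix, and the `publish` helper are hypothetical, and plain bytes stand in for real Parquet output. It assumes the temp directory and the workspace sit on the same filesystem, so that the move is a single rename.

```python
import os
import tempfile


def publish(temp_path, workspace_path):
    """Move a finished file into the workspace with a single rename."""
    os.makedirs(os.path.dirname(workspace_path), exist_ok=True)
    # os.replace overwrites the destination; on POSIX the rename is atomic
    # when both paths are on the same filesystem (an assumption here).
    os.replace(temp_path, workspace_path)


def demo():
    """Run two generation cycles and return the published file size after each."""
    root = tempfile.mkdtemp()
    temp_dir = os.path.join(root, "tmp")
    os.makedirs(temp_dir)
    # Hypothetical workspace layout mirroring data/2017/03/10/data.parquet
    target = os.path.join(root, "data", "2017", "03", "10", "data.parquet")

    sizes = []
    for payload in (b"first generation", b"second, longer generation"):
        temp_path = os.path.join(temp_dir, "data.parquet.part")
        with open(temp_path, "wb") as f:
            f.write(payload)  # step 1: generate in the temp directory
        publish(temp_path, target)  # step 2: move into the workspace
        sizes.append(os.path.getsize(target))  # step 3: loop restarts
    return sizes
```

Each cycle replaces the file at the same path with one of a different length, which is the condition under which the error shows up.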

      We have noticed that if the file is moved in after a query has started running,
      the query throws an error saying that the Parquet magic number is incorrect.
      This is due to the file length being cached and reused, so what seems to happen is:

      1. Drill plans the query
      2. The file is changed under Drill's feet
      3. Drill executes the query and tries to read at an incorrect offset in the changed file
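
To illustrate the suspected sequence, here is a small simulation. It is not Drill's code: `fake_parquet` and `read_tail_magic` are hypothetical helpers that only mimic the outer Parquet layout (a leading `PAR1` magic, the data, the footer, a 4-byte little-endian footer length, and a trailing `PAR1` magic). It assumes the reader locates the trailing magic from a file length captured at planning time.

```python
import struct

MAGIC = b"PAR1"


def fake_parquet(body: bytes, footer: bytes) -> bytes:
    """Build bytes with Parquet's outer layout: magic, data, footer, footer length, magic."""
    return MAGIC + body + footer + struct.pack("<I", len(footer)) + MAGIC


def read_tail_magic(data: bytes, assumed_length: int) -> bytes:
    """Return the 4 bytes a reader would treat as the trailing magic for a given file length."""
    return data[assumed_length - 4:assumed_length]


# 1. Planning: the length of the current file is recorded.
old_file = fake_parquet(b"a" * 100, b"old-footer")
cached_length = len(old_file)

# 2. The file is replaced with a longer one before execution.
new_file = fake_parquet(b"b" * 250, b"new-footer-with-more-row-groups")

# 3. Execution: reading the new file with the stale length lands inside the data.
stale = read_tail_magic(new_file, cached_length)
fresh = read_tail_magic(new_file, len(new_file))

print(stale == MAGIC)  # False: the stale offset does not point at the magic
print(fresh == MAGIC)  # True: the current length does
```

With the stale length the reader sees data bytes where it expects `PAR1`, which would match the "magic number is incorrect" error.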

      Is there any way to fix or avoid this scenario?
      Another side effect of constantly generating a new file is that the metadata cache is discarded for the whole workspace even though only one file has changed.
      Is there a way to avoid that?

            People

            • Assignee: Unassigned
            • Reporter: tobad357 Tobias