We have a scenario where we generate our own Parquet files every X seconds.
The files are organized in a date-based directory structure, and only the file for today gets updated.
The process is as follows:
1. Generate the Parquet file in a temp directory.
2. When generation finishes, mv the file into a Drill workspace (e.g. data/2017/03/10/data.parquet, ...).
3. Restart the process.
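The steps above can be sketched roughly like this (paths and file contents are illustrative placeholders, not our real pipeline):

```shell
#!/bin/sh
set -e

# Illustrative locations; the real workspace path comes from the
# Drill storage plugin configuration.
TMP_DIR=$(mktemp -d)
WORKSPACE="$TMP_DIR/workspace/data/2017/03/10"
mkdir -p "$WORKSPACE"

# 1. Generate the file in a temp location first, so the workspace
#    never contains a half-written file.
echo "parquet-bytes" > "$TMP_DIR/data.parquet.tmp"

# 2. Move it into the Drill workspace. Note mv is only an atomic
#    rename when source and destination are on the same filesystem;
#    across filesystems it degrades to a copy + delete.
mv "$TMP_DIR/data.parquet.tmp" "$WORKSPACE/data.parquet"

cat "$WORKSPACE/data.parquet"
```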
We have noticed that if the file is moved in while a query is already running,
Drill throws an error saying the Parquet magic number is incorrect.
This appears to be because the file length is cached and reused, so what seems to happen is:
1. Drill plans the query.
2. The file changes under Drill's feet.
3. Drill executes the query and tries to read at a now-incorrect offset in the changed file.
Is there any way to fix or avoid this scenario?
Another side effect of constantly regenerating a file is that the metadata cache gets discarded for the whole workspace, even though only one file changed.
Is there a way to avoid that?