As John noted, the easiest way to envision INSERT INTO for big data would be to add new files to an existing directory.
Fortunately for Drill, we can learn from existing solutions such as Hive and Impala. We should certainly learn from the strengths of those solutions while trying to avoid their pitfalls.
At the data level, two variations are possible:
- For file formats that allow it (CSV, TSV, and JSON, for example), Drill might append to the file. (But note that HDFS does not allow append operations, though MFS and Linux do.)
- For file formats that do not allow appending (e.g. Parquet), adding new files to partition directories is a solution.
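To make the two data-level variants concrete, here is a minimal Python sketch (file names, the CSV-for-Parquet stand-in, and the part-file numbering scheme are all invented for illustration — this is not Drill code):

```python
import csv
import os
import tempfile

def append_rows(path, rows):
    """Variant 1: append records to a format that permits it (here, CSV).
    Note what this sidesteps: readers may see a torn last record, and two
    concurrent appenders can interleave rows."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(rows)

def add_partition_file(partition_dir, rows):
    """Variant 2: for formats that cannot be appended in place (e.g. Parquet),
    add a whole new file to the partition directory. Writing to a temp name
    and renaming keeps half-written files out of readers' view, though
    numbering the part files is itself a (racy) coordination problem."""
    os.makedirs(partition_dir, exist_ok=True)
    n = sum(1 for name in os.listdir(partition_dir) if name.endswith(".csv"))
    fd, tmp = tempfile.mkstemp(dir=partition_dir, suffix=".tmp")
    with os.fdopen(fd, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    final = os.path.join(partition_dir, "part-%05d.csv" % n)
    os.rename(tmp, final)
    return final
```

Even this toy version surfaces the concurrency questions discussed below: the append has no atomicity guarantee, and the part-file counter races if two writers insert at once.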
Drill is schema-on-read, which means that Drill, in general, does not know the schema of data until it "looks." For a CSV file, this means we don't know the file schema ahead of time. (Although, for CSV files with headers, Drill could read the header to approximate the schema.) So it may fall to the user to ensure that appended data fits the existing schema. Since Drill (but not its clients) handles schema change, Drill can deal with (some) files whose schema changes in the middle of the file, though Tableau and similar xDBC tools cannot.
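The header-based approximation amounts to something like the following sketch: treat the column names of a headed CSV file as its "schema" and refuse appends that don't match. The function names are hypothetical, and this is the user-level discipline described above, not something Drill enforces:

```python
import csv

def read_header(path):
    """Approximate the "schema" of a headed CSV file as its column names."""
    with open(path, newline="") as f:
        return next(csv.reader(f))

def check_append_compatible(path, new_header):
    """Refuse an append whose columns do not match the existing header."""
    existing = read_header(path)
    if existing != new_header:
        raise ValueError("schema mismatch: %s vs %s" % (existing, new_header))
```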
The difficulties emerge, however, at the physical level: dealing with concurrency in a distributed, multi-user system. A key contribution of databases is that they have kernels that deal with this physical layer (and do such a great job that most of us never have to think about the issues.)
- When appending to a file, what do readers see? Do readers see a possibly incomplete last record? Do they see an atomic update? (All the inserts or none?)
- What is the guarantee of consistency? What is the expected behavior if a Drillbit crashes during an INSERT? OK to leave a file half written, or must inserts be atomic?
- What happens if two queries try to insert into the same file? How are inserts synchronized? (Or, does the last update win? Or, can files simply be overwritten and it is up to the user to coordinate writes?)
- If the insert involves metadata (Parquet metadata, say) how are the metadata updates synchronized with the INSERTs to provide a consistent view to readers?
- If an INSERT creates multiple files (as part of a partition directory), what are the semantics for readers?
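One classic answer to the atomic-visibility questions above (used in various forms by Hive and others) is to stage all of an INSERT's files in a scratch directory and rename the directory into place in one step; rename is atomic on most local file systems, and the HDFS rename RPC is atomic at the NameNode. A hedged sketch (names are illustrative):

```python
import os
import shutil
import tempfile

def commit_partition(table_dir, partition_name, write_files):
    """All-or-nothing visibility for a multi-file INSERT: write every part
    file into a hidden staging directory, then rename the directory into the
    table in a single step. Readers see either no partition or all its files,
    never a partial set. On failure, the staging directory is discarded."""
    staging = tempfile.mkdtemp(prefix="." + partition_name + "-", dir=table_dir)
    try:
        write_files(staging)            # caller writes part files into staging
        final = os.path.join(table_dir, partition_name)
        os.rename(staging, final)       # the single visible step
        return final
    except Exception:
        shutil.rmtree(staging, ignore_errors=True)
        raise
```

This handles crash-safety and reader consistency for one writer, but it does not by itself answer the two-concurrent-writers question: if both try to commit the same partition name, the second rename fails, and some coordination layer still has to decide what happens next.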
These issues are not at all at the logical SQL level; they are at the physical level: dealing with the reality of a concurrent system.
The problems are made more complex because file systems do not provide the semantics needed for DB-level synchronization. This is why the world needs Oracle (and MySQL and Postgres): one cannot easily build DB semantics just using file system mechanisms. (Though, many have tried.)
- The traditional solution is to use locking, transactions and the like to synchronize work. These are (likely) not available to a query engine such as Drill.
- Leverage a meta-store (such as Hive) to handle the synchronization.
- Leverage ZK to handle synchronization of just the insert/read operations (e.g. place a marker in ZK to "lock" a file for updates.)
- Design a log-based approach, based on multi-version concurrency control (MVCC) that allows writers to create new versions of files while readers read old versions. The log file could be the coordination mechanism. An update is not visible until the required entries appear in the log. There is a large literature about using logs for synchronization.
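A toy sketch of the log-based idea (the log file name and line format are invented for illustration): a writer builds a new snapshot off to the side and appends one commit record to a log; readers only ever follow the last committed entry, so an in-flight INSERT is invisible until its log line lands.

```python
import os

LOG = "_commit_log"   # hypothetical log file inside the table directory

def commit(table_dir, snapshot_files):
    """Writer: append one line naming the files of the new snapshot version.
    The O_APPEND write is the single step that makes the version visible.
    (A real design must also handle torn writes from a mid-append crash.)"""
    line = ",".join(snapshot_files) + "\n"
    fd = os.open(os.path.join(table_dir, LOG),
                 os.O_WRONLY | os.O_CREAT | os.O_APPEND)
    try:
        os.write(fd, line.encode())
    finally:
        os.close(fd)

def current_snapshot(table_dir):
    """Reader: the last log line is the current version; files not named
    there (old versions, in-flight writes) are simply ignored."""
    path = os.path.join(table_dir, LOG)
    if not os.path.exists(path):
        return []
    with open(path) as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    return lines[-1].split(",") if lines else []
```

This is essentially the approach later popularized by table formats such as Delta Lake and Iceberg: the log, not the directory listing, defines what the table contains.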
All this said, it is worth considering another proposition: that Drill is a query engine, not a DB. In this view, Drill should do nothing that requires DB-like concurrency semantics; Drill should be purely read-only. Accordingly, Drill should not attempt to add INSERT or UPDATE semantics: that is the proper domain of other tools. (With the logical consequence that Drill should not be used for ETL.) It is worth debating whether that is a viable approach, or whether users pretty much need INSERT INTO semantics (with all the resulting complexity).
Let the brainstorming begin!