Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: Future
    • Component/s: None
    • Labels:
      None

      Description

      Umbrella JIRA to track the "Insert into table" feature. More details regarding the scope, design etc will follow as things start to materialize.

        Activity

        John Omernik added a comment -

        This would be a great feature to see in Drill as an "ETL Simplifier"

        Right now, there is a workaround, you can use CTAS to create tables of files.

        If you have a table, say "MYDATA", with directories for days below it (2017-08-10, 2017-08-11, etc.), what I typically do is

        CREATE TABLE `MYDATA/.2017-08-11` as Select ...

        When the query is done, I use the filesystem (or hadoop) mv command to move those created files to their final destination.

        I don't like this method because it involves orchestration outside of Drill; if Drill had a way to mimic this in an INSERT INTO statement, it would really make it easy on folks.

        Happy to answer questions or have more discussion here!
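The stage-then-move workaround above can be sketched as a small script. This is a minimal Python sketch of the pattern only (the function and file names are mine, not Drill's): write into a hidden dot-prefixed staging directory so running queries don't see partial output, then atomically rename into the final partition directory.

```python
import os


def stage_then_publish(data: bytes, table_dir: str, partition: str, filename: str) -> str:
    """Write data into a hidden staging directory, then atomically move it
    into the final partition directory -- mirroring the CTAS-then-mv trick."""
    # Dot prefix keeps the staging directory out of Drill's directory scans.
    staging = os.path.join(table_dir, "." + partition)
    final = os.path.join(table_dir, partition)
    os.makedirs(staging, exist_ok=True)
    os.makedirs(final, exist_ok=True)

    staged_path = os.path.join(staging, filename)
    with open(staged_path, "wb") as f:
        f.write(data)

    # os.replace is atomic on POSIX within one filesystem, so readers never
    # observe a half-written file in the final directory.
    final_path = os.path.join(final, filename)
    os.replace(staged_path, final_path)
    return final_path
```

The atomic rename is what the hadoop/filesystem `mv` step buys you today; a built-in INSERT INTO would have to preserve that property.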

        Paul Rogers added a comment - edited

        As John noted, the easiest way to envision INSERT INTO for big data would be to add new files to an existing directory.

        Fortunately for Drill, we can learn from existing solutions such as Hive and Impala. We should certainly learn from the strengths of those solutions, while trying to avoid their pitfalls.

        At the data level, two variations are possible:

        • For file formats that allow it, Drill might append to the file. (CSV, TSV, JSON for example.) (But, note that HDFS does not allow append operations though MFS and Linux do.)
        • For file formats that do not allow appending (e.g. Parquet), then adding files to partition directories is a solution.
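Both data-level variations can be shown in one toy writer. This is an illustrative Python sketch under my own naming (not Drill code): line-delimited JSON is appended in place, while a non-appendable format gets a fresh uniquely named file dropped into the directory, as one would for Parquet.

```python
import json
import os
import uuid


def insert_rows(table_dir: str, rows: list, fmt: str) -> str:
    """Insert rows using one of the two data-level strategies:
    append to an existing file (appendable formats) or add a new file."""
    os.makedirs(table_dir, exist_ok=True)
    if fmt == "json":
        # Appendable format: one record per line, appended in place.
        path = os.path.join(table_dir, "data.json")
        with open(path, "a") as f:
            for r in rows:
                f.write(json.dumps(r) + "\n")
        return path
    # Non-appendable format (e.g. Parquet): write a brand-new part file.
    path = os.path.join(table_dir, "part-%s.%s" % (uuid.uuid4().hex, fmt))
    with open(path, "w") as f:
        f.write(repr(rows))  # stand-in for a real Parquet writer
    return path
```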

        Drill is schema-on-read, which means that Drill, in general, does not know the schema of data until it "looks." For a CSV file, this means we don't know the file schema. (Although, for CSV files with headers, Drill could read the header to approximate the schema.) So it may fall to the user to append only data that fits the existing schema. Since Drill (but not its clients) handles schema change, Drill can deal with (some) files that change schemas in the middle of the file, though Tableau and similar xDBC tools cannot.
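The header-based approximation mentioned above is about all that is recoverable from a headered CSV: column names, with every column defaulting to text. A minimal sketch (function name and the VARCHAR default are my assumptions, not Drill's actual behavior):

```python
import csv
import io


def approximate_schema(csv_text: str) -> list:
    """Schema-on-read for a headered CSV: take column names from the
    first row; types are unknowable from the header, so default to text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return [(name.strip(), "VARCHAR") for name in header]
```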

        The difficulties emerge, however, at the physical level: dealing with concurrency in a distributed, multi-user system. A key contribution of databases is that they have kernels that deal with this physical layer (and do such a great job that most of us never have to think about the issues.)

        • When appending to a file, what do readers see? Do readers see a possibly incomplete last record? Do they see an atomic update? (All the inserts or none?)
        • What is the guarantee of consistency? What is the expected behavior if a Drillbit crashes during an INSERT? OK to leave a file half written, or must inserts be atomic?
        • What happens if two queries try to insert into the same file? How are inserts synchronized? (Or, does the last update win? Or, can files simply be overwritten and it is up to the user to coordinate writes?)
        • If the insert involves metadata (Parquet metadata, say) how are the metadata updates synchronized with the INSERTs to provide a consistent view to readers?
        • If an INSERT creates multiple files (as part of a partition directory), what are the semantics for readers?

        These issues are not at all at the logical SQL level; they are at the physical level: dealing with the reality of a concurrent system.

        The problems are made more complex because file systems do not provide the semantics needed for DB-level synchronization. This is why the world needs Oracle (and MySQL and Postgres): one cannot easily build DB semantics just using file system mechanisms. (Though, many have tried.)

        Possible approaches:

        • The traditional solution is to use locking, transactions and the like to synchronize work. These are (likely) not available to a query engine such as Drill.
        • Leverage a meta-store (such as Hive) to handle the synchronization.
        • Leverage ZK to handle synchronization of just the insert/read operations (e.g. place a marker in ZK to "lock" a file for updates.)
        • Design a log-based approach, based on multi-version concurrency control (MVCC) that allows writers to create new versions of files while readers read old versions. The log file could be the coordination mechanism. An update is not visible until the required entries appear in the log. There is a large literature about using logs for synchronization.
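To make the log-based option concrete, here is a toy Python sketch of the visibility rule only (class and file names are hypothetical): writers create new part files freely, but readers consult the log rather than listing the directory, so an insert becomes visible only when its single appended commit record lands in the log.

```python
import json
import os


class CommitLog:
    """Toy log-based visibility: new files are invisible to readers until
    a commit record for them is appended to the log."""

    def __init__(self, table_dir: str):
        self.table_dir = table_dir
        self.log_path = os.path.join(table_dir, "_commit.log")
        os.makedirs(table_dir, exist_ok=True)

    def commit(self, filenames: list) -> None:
        # Appending this one line is the atomicity point of the whole insert.
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"add": filenames}) + "\n")

    def visible_files(self) -> list:
        """Readers scan the log, not the directory."""
        if not os.path.exists(self.log_path):
            return []
        files = []
        with open(self.log_path) as log:
            for line in log:
                files.extend(json.loads(line)["add"])
        return files
```

A real design would also need log compaction, delete records, and a story for concurrent log appends, which is exactly where the literature on log-structured coordination comes in.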

        All this said, it is worth considering another proposition: that Drill is a query engine, not a DB. In this view, Drill should do nothing that requires DB-like concurrency semantics; Drill should be purely read-only. Accordingly, Drill should not attempt to add INSERT or UPDATE semantics: that is the proper domain of other tools. (With the logical consequence that Drill should not be used for ETL.) It is worth debating if that is a viable approach, or if users pretty much need the INSERT INTO semantics (with all the resulting complexity.)

        Let the brainstorming begin!

        Keys Botzum added a comment -

        I defer to others on how Drill could support implementing the full semantics of concurrent updates to existing files - a challenging problem indeed!

        I do want to suggest/remind that Drill already has the ability to access a number of databases that already support concurrent updates quite well - HBase and MapR-DB for example. There is also work related to several other NoSQL databases and more traditional relational databases.

        Perhaps it would be useful to start with supporting insert and update to databases that already support those concepts inherently and then later consider updates to files?

        John Omernik added a comment -

        Keys Botzum I agree wholeheartedly with your statement except in implementation...

        That is, I think we should 100% be doing inserts into HBase/MapR-DB; many folks would benefit from this. And in talking with other Drill members, this may be another area where it would be good to discuss how Drill can define an INSERT interface for storage plugins. Beyond just HBase, what about MongoDB? Instead of handling all insert functionality in each plugin itself, we could have an interface that storage plugins extend if they want to support inserts. This would make it easier to implement on backends that support it easily.
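The opt-in interface idea could look something like the following. This is a hypothetical Python sketch of the shape of such an API (Drill's actual plugin SPI is Java, and all names here are mine): plugins that can accept writes implement the interface, and the planner checks for it before allowing INSERT INTO against the plugin's tables.

```python
from abc import ABC, abstractmethod


class InsertSupport(ABC):
    """Hypothetical opt-in interface for storage plugins that accept writes.
    A planner would check isinstance(plugin, InsertSupport) before
    permitting INSERT INTO against the plugin's tables."""

    @abstractmethod
    def insert_batch(self, table: str, rows: list) -> int:
        """Write a batch of rows to the table; return the count inserted."""


class MockKVPlugin(InsertSupport):
    """Stand-in for an HBase/MapR-DB style key-value plugin (illustrative)."""

    def __init__(self):
        self.tables = {}

    def insert_batch(self, table: str, rows: list) -> int:
        self.tables.setdefault(table, []).extend(rows)
        return len(rows)
```

The point of the design is that per-backend complexity stays inside the plugin, while the engine only needs one capability check and one batch-write entry point.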

        Where I disagree, though, is that I think filesystem inserts, especially as they relate to Parquet files, are a must-have in the MVP for Drill inserts. Drill is often compared to Hive on Spark, Hive on Tez, and Impala, and all three of those support some variant of INSERT into a Parquet table... this is one area that can be a deal breaker for those looking to stack up Drill against other SQL-on-Hadoop products. Thus, starting with the MapR-DB/HBase work, while it may be easier, should at a minimum be done in parallel with the Parquet inserts so we can achieve some feature parity.


          People

          • Assignee:
            Unassigned
            Reporter:
            Mehant Baid
          • Votes:
            5
            Watchers:
            7
