Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-2141

Implement query rewrite based on sharding configuration

    XMLWordPrintableJSON

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Based on topology changes, it should be possible to dynamically update a sharding configuration for calcite. The effect of such a configuration is, that a query involving sharded tables is rewritten to a more optimal form possibly targetting mutliple different datasources.

      This is an interesting building block for distributed databases but also for applications since it enables the implementation of a static sharding scheme. Doing the shard rewriting on a client is also a lot better when using a distributed database as that eliminates the need for a coordinator node through which data is tunneled.

      Also see https://github.com/shardingjdbc/sharding-jdbc for an existing implementation.

      So imagine a topology with a master node and 2 worker nodes, one having shards with even numbers and the other one having shards with odd numbers. Table "A" is sharded by the column "tenant_id" into e.g. 30 shards. So the sharding configuration for table "A" would contain the information "worker 1 has shards 1,3,5,..." and "worker 2 has shards 0,2,4,...". It also specifies that the sharding strategy should use a hash function shard = tenant_id % 30.

      When an application sends a query like e.g. select * from A where tenant_id = 1, normally the master/coordinator would do the shard rewriting, but doing this already at the client can eliminate the master as bottleneck for many scenarios. It is clear that the query can be completely fulfilled by worker 1 since it owns shard 1. The query rewriting therefore simply pushes the query to that worker. Note that there might be cases where a shard is replicated to other workers so it might be beneficial to make it configurable whether or when replicas should be used for querying.

      A query like select * from A where tenant_id in(1,2) could be transformed to select * from worker1.A where tenant_id = 1 union all select * from worker2.A where tenant_id = 2. One optimization could be to target a single worker if it contains at least a replica of all required shards, but that would need to be configurable again since replicas might lag behind.

       

      DML statements obviously should be handled as well but at first, I would simply forbid the use of multiple workers within one transaction. Supporting multiple workers in a transaction will essentially require a 2PC and I'm not sure it's always a good idea to let an application be the coordinator for such a transaction. There should definitely be an option to let a master/coordinator node of a distributed database handle the details of the 2PC by configuring that DML statements should always be pushed to the master/coordinator node.

      The sharding-jdbc implementation only offers BASE guarantees. I guess there are cases where this makes sense so making the transaction handling pluggable to allow other strategies would definitely be nice as well.

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              christian.beikov Christian Beikov
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated: