Currently, when distributed work is run using the Kudu client, the driver/coordinator client needs to open the table to request its current metadata and locations before it can distribute the work to tasks/executors on remote nodes. When reading data, ScanTokens are often used to distribute the work; when writing data, perhaps only the table name is required.
The problem is that each parallel task then also needs to open the table to request its metadata. Using Spark as an example, this happens when deserializing the scan tokens in KuduRDD (here) or when writing rows using the KuduContext (here). The result is a large burst of metadata requests hitting the leader Kudu master all at once. Because the leader master is a single server and the requests can't be served by the follower masters, this effectively limits the number of parallel tasks that can run in a large Kudu deployment. Even if the follower masters could service the requests, scalability would still be limited in very large clusters, given that most deployments have only 3-5 masters.
Adding a metadata token, similar to a scan token, would allow the single driver to fetch all of the metadata required by the parallel tasks in one place. The tokens could then be serialized and passed to each task in the same fashion as scan tokens.
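The flow above can be sketched as follows. This is a minimal, hypothetical illustration, not the Kudu client API: `MetadataToken`, `build_token`, `open_table_from_token`, and the fake master client are all invented names, and `pickle` stands in for whatever serialization format (e.g. protobuf, as scan tokens use) the real implementation would choose.

```python
import pickle  # stand-in for the client's real serialization format


class MetadataToken:
    """Hypothetical token carrying everything a task needs to operate on a
    table without contacting the master: name, schema, tablet locations."""

    def __init__(self, table_name, schema, tablet_locations):
        self.table_name = table_name
        self.schema = schema
        self.tablet_locations = tablet_locations

    def serialize(self):
        return pickle.dumps(self)

    @staticmethod
    def deserialize(data):
        return pickle.loads(data)


# Driver side: a single metadata request to the master, then fan out.
def build_token(master_client, table_name):
    meta = master_client.open_table(table_name)  # one master round trip
    return MetadataToken(table_name, meta["schema"], meta["locations"]).serialize()


# Task side: rebuild the table state from the token, with no master request.
def open_table_from_token(token_bytes):
    return MetadataToken.deserialize(token_bytes)


# Fake master client for illustration only (the real client talks RPC).
class FakeMasterClient:
    def open_table(self, name):
        return {"schema": ["key", "value"], "locations": ["tserver-1", "tserver-2"]}


token_bytes = build_token(FakeMasterClient(), "metrics")
task_table = open_table_from_token(token_bytes)
```

The key property is that the master is contacted once, on the driver, regardless of how many tasks later deserialize the token.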
Of course, in the pessimistic case, something may change between the generation of the token and the start of the task. In that case the task would need to send a request to the master for updated metadata. However, that scenario should be rare and is unlikely to result in all of the requests arriving at the same time.
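The fallback path might look like the sketch below: the task optimistically uses the token's metadata and refreshes from the master only when a server rejects the request as stale. All names here (`StaleMetadataError`, `run_task`, the fake clients) are hypothetical and exist only to illustrate the retry shape, not any real Kudu API.

```python
class StaleMetadataError(Exception):
    """Signals that a server rejected a request because the task's cached
    metadata/locations are out of date (the rare, pessimistic case)."""


def run_task(token_meta, tserver_client, master_client):
    """Optimistically use the token's metadata; fall back to a single
    master refresh only if a server reports the metadata as stale."""
    try:
        return tserver_client.scan(token_meta)
    except StaleMetadataError:
        # Rare path: refetch current metadata, then retry once.
        fresh_meta = master_client.open_table(token_meta["table"])
        return tserver_client.scan(fresh_meta)


# Fakes for illustration: the first scan attempt fails as stale,
# the retry with refreshed metadata succeeds.
class FakeTabletServer:
    def __init__(self):
        self.calls = 0

    def scan(self, meta):
        self.calls += 1
        if meta.get("stale"):
            raise StaleMetadataError()
        return ["row1", "row2"]


class FakeMaster:
    def open_table(self, name):
        return {"table": name, "stale": False}


tserver = FakeTabletServer()
rows = run_task({"table": "metrics", "stale": True}, tserver, FakeMaster())
```

Because only the tasks that actually hit stale metadata contact the master, and they do so at different points in their execution, the refresh requests are spread out rather than arriving as a single burst.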