Copied from https://github.com/apache/arrow/issues/4318 (it's easier to read there; sorry, I hate Jira formatting)
When using the plasma API to create/seal an object for a table with a custom object-ID, it would help to have a convenience API for getting the size of the table.
The notes below might help to illustrate the request:
The use case is a workflow something like this:
- generate a pandas DataFrame `df`
- save the `df` to parquet, using pyarrow.parquet, with a unique parquet path
- (this process will not save directly to plasma)
- get the data from plasma or load it into plasma from the parquet file
- use the unique parquet path to generate a unique object-ID
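The last step above can be sketched with nothing more than the standard library: plasma object-IDs are 20 bytes, and a SHA-1 digest happens to be exactly 20 bytes, so hashing the unique parquet path gives a deterministic ID. This is a minimal sketch under that assumption; the helper name is my own, not part of any API:

```python
import hashlib


def object_id_bytes_for_path(parquet_path: str) -> bytes:
    """Derive a deterministic 20-byte ID from a unique parquet path.

    A SHA-1 digest is exactly 20 bytes, the same length plasma expects
    for an object-ID, so the digest can be passed straight to
    ``pyarrow.plasma.ObjectID(...)``.
    """
    return hashlib.sha1(parquet_path.encode("utf-8")).digest()


# The same path always yields the same ID, so re-loading the same
# parquet file targets the same plasma object instead of a new one.
id_a = object_id_bytes_for_path("/data/2019/sales.parquet")
id_b = object_id_bytes_for_path("/data/2019/sales.parquet")
```

With the legacy `pyarrow.plasma` module, this would be wrapped as `plasma.ObjectID(object_id_bytes_for_path(path))` before calling `client.contains` or `client.create_and_seal`.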
- `plasma_client.put` for the same data-table is not idempotent: it generates a new, random object-ID on every call rather than deriving one from a hash of the data payload, so every put saves a new object. Could it use a data hash for idempotent puts?
- In process-B, when the data is not already in plasma, it reads the parquet file into a pyarrow.Table and then needs both an object-ID and the table size to call plasma `client.create_and_seal`, but the table size is not easy to get (this might be related to github issue #2707 (#3444)). Ideally, `client.create_and_seal` would accept responsibility for computing the size of the object to be created when given a pyarrow data object like a table.
- when the plasma store does not have the object, the client could apply a default timeout rather than hang indefinitely; also, returning a sentinel object that is not easily checked with `isinstance` is a bit clumsy, and an exception-handling pattern could be better (or something like the requests 404 patterns and options?)