Apache Arrow, ARROW-5324

[Plasma] API requests



    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 3.0.0
    • Component/s: C++ - Plasma
    • Labels:


      Copied from https://github.com/apache/arrow/issues/4318 (it's easier to read there; sorry, I hate Jira formatting)

      Related to https://issues.apache.org/jira/browse/ARROW-3444 

      While working with the plasma API to create/seal an object for a table, using a custom object-ID, it would help to have a convenience API to get the size of the table.

      The following code might help to illustrate the request and notes below:

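          import pyarrow.parquet as pq
          import pyarrow.plasma as plasma

          # `size`, `parquet_path` and `plasma_path` come from the enclosing function.
          if not parquet_path:
              parquet_path = f"./data/dataset_{size}.parquet"
          if not plasma_path:
              plasma_path = f"./data/dataset_{size}.plasma"
          try:
              plasma_client = plasma.connect(plasma_path)
          except Exception:
              plasma_client = None  # no plasma store available
          if plasma_client:
              # Custom object-ID built from the first 20 characters of the path.
              table_id = plasma.ObjectID(bytes(parquet_path[:20], encoding='utf8'))
              try:
                  table = plasma_client.get(table_id, timeout_ms=4000)
                  # A miss returns a sentinel class; a real table has no __name__.
                  if getattr(table, '__name__', None) == 'ObjectNotAvailable':
                      raise ValueError('Failed to get plasma object')
              except ValueError:
                  table = pq.read_table(parquet_path, use_threads=True)
                  # Requested behavior: pass the table and let plasma size the object.
                  plasma_client.create_and_seal(table_id, table)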


      The use case is a workflow something like this:

      • process-A
        • generate a pandas DataFrame `df`
        • save the `df` to parquet, using pyarrow.parquet, with a unique parquet path
        • (this process will not save directly to plasma)
      • process-B
        • get the data from plasma or load it into plasma from the parquet file
        • use the unique parquet path to generate a unique object-ID
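One possible way to turn the unique parquet path into an object-ID is sketched below. It is an assumption, not something the issue proposes: the helper name `object_id_bytes` is hypothetical, and it hashes the path with SHA-1 because a SHA-1 digest is exactly 20 bytes, the length a plasma object-ID requires. Slicing the path with `parquet_path[:20]` gives the same ID for any two paths that share their first 20 characters and fails for paths shorter than 20 bytes; hashing the whole path avoids both problems.

```python
import hashlib


def object_id_bytes(parquet_path: str) -> bytes:
    """Return a deterministic 20-byte ID for a parquet path (hypothetical helper)."""
    # A SHA-1 digest is always 20 bytes, whatever the length of the path.
    return hashlib.sha1(parquet_path.encode("utf8")).digest()


# Two paths that are identical in their first 20 characters.
path_a = "./data/dataset_1000000.parquet"
path_b = "./data/dataset_1000001.parquet"

id_a = object_id_bytes(path_a)
id_b = object_id_bytes(path_b)

print(path_a[:20] == path_b[:20])  # the sliced prefixes collide
print(len(id_a), id_a == id_b)     # the hashed IDs do not
```

With a real client, the returned bytes would be wrapped as `plasma.ObjectID(object_id_bytes(parquet_path))`.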


      • `plasma_client.put` is not idempotent for the same data-table: it generates unique object-ID values that are not based on any hash of the data payload, so every put saves a new object-ID. Could it use a data hash for idempotent puts? For example, each put below returns a different object-ID, and the table itself is not hashable:

          In : plasma_client.put(table)
          In : plasma_client.put(table)
          In : plasma_client.put(table)
          In : hash(table)
          TypeError: unhashable type: 'pyarrow.lib.Table'
      • In process-B, when the data is not already in plasma, it reads the parquet file into a pyarrow.Table and then needs an object-ID and the table size to call plasma `client.create_and_seal`, but it is not easy to get the table size. This might be related to github issue #2707 (#3444). Ideally, `client.create_and_seal` would take responsibility for the size of the object to be created when given a pyarrow data object such as a table.
      • When the plasma store does not have the object, `get` could apply a default timeout rather than hang indefinitely. It is also a bit clumsy that it returns an object that is not easily checked with `isinstance`; an exception-handling pattern might be better (or something like the requests 404 patterns and options?).
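The first and last notes above can be illustrated with a small sketch. It is not the plasma API: `FakeStore` is an in-memory stand-in whose method names only mirror the client calls used in this issue, and `put_once`, `get_or_raise` and `ObjectNotFoundError` are hypothetical names. It assumes the payload is already serialized to bytes, derives the object-ID from a SHA-1 hash of that payload so that repeated puts are idempotent, and converts the "not available" sentinel into an exception.

```python
import hashlib


class ObjectNotFoundError(LookupError):
    """Hypothetical exception for an object missing from the store."""


class ObjectNotAvailable:
    """Stand-in for the sentinel class returned when a get finds nothing."""


class FakeStore:
    """In-memory stand-in for a plasma client, for illustration only."""

    def __init__(self):
        self._objects = {}

    def contains(self, object_id):
        return object_id in self._objects

    def create_and_seal(self, object_id, data):
        self._objects[object_id] = data

    def get(self, object_id, timeout_ms=-1):
        # The timeout is ignored here; a miss returns the sentinel class.
        return self._objects.get(object_id, ObjectNotAvailable)


def put_once(store, payload: bytes) -> bytes:
    """Store payload under a content-derived ID; equal payloads share one ID."""
    object_id = hashlib.sha1(payload).digest()  # always 20 bytes
    if not store.contains(object_id):
        store.create_and_seal(object_id, payload)
    return object_id


def get_or_raise(store, object_id, timeout_ms=4000):
    """Return the stored object, or raise instead of returning the sentinel."""
    result = store.get(object_id, timeout_ms=timeout_ms)
    if getattr(result, "__name__", None) == "ObjectNotAvailable":
        raise ObjectNotFoundError(object_id)
    return result


store = FakeStore()
first = put_once(store, b"serialized table bytes")
second = put_once(store, b"serialized table bytes")  # no new object is created
payload = get_or_raise(store, first)
```

A caller can then use an ordinary `try`/`except ObjectNotFoundError` block in place of inspecting the returned object. Hashing the full payload costs a pass over the data, so for large tables an ID derived from the source path may be the cheaper choice.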




            • Assignee:
              dazza Darren Weber


              • Created: