Details
Type: Story
Status: Closed
Priority: Major
Resolution: Implemented
Description
Build a query interface over many text-based lineage traces from various ML workloads. A lineage trace is a serialized DAG of operations without any control flow. The task is to deserialize the traces into an in-memory format and answer queries about the workload characteristics.
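As a rough illustration of the deserialization step, the sketch below parses a trace into operator records. The line format `id;opcode;input_ids;exec_ms` and the names `Op` and `parse_trace` are assumptions made up for this example; the real lineage trace format is not specified in this ticket and will differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Op:
    op_id: int
    opcode: str
    inputs: List[int]   # IDs of the operators whose outputs this one consumes
    exec_ms: float      # hypothetical per-operator execution time


def parse_trace(text: str) -> Dict[int, Op]:
    """Parse a trace in the assumed format 'id;opcode;input_ids;exec_ms'."""
    ops: Dict[int, Op] = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        op_id, opcode, inputs, exec_ms = line.split(";")
        # An empty input field marks a leaf operator (e.g. a read).
        input_ids = [int(i) for i in inputs.split(",") if i]
        ops[int(op_id)] = Op(int(op_id), opcode, input_ids, float(exec_ms))
    return ops


# Hypothetical four-operator trace: two reads feed a convolution, then a ReLU.
TRACE = """
1;read;;0.5
2;read;;0.4
3;conv2d;1,2;25.0
4;relu;3;1.2
"""
ops = parse_trace(TRACE)
```

Because a trace has no control flow, each line maps to exactly one DAG node, and the input IDs are enough to recover the edges.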
The in-memory format could be tabular or semi-structured. The internal representation should preserve both the structure of the DAGs and the operators' properties, so that queries over either can be answered. One possible way would be to represent each DAG by multiple tables: one for each operator with the corresponding attributes, and one for preserving the structure, with attributes including the output nodes for each operator. These tables can be joined by IDs. Existing libraries (e.g. pandas) can be used to define the query interface.
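A minimal sketch of the tabular option with pandas follows. For brevity it uses a single operator table rather than one table per operator type, and all column names (`dag_id`, `op_id`, `opcode`, `exec_ms`, `src_op`, `dst_op`) are illustrative assumptions, not a schema decided in this ticket.

```python
import pandas as pd

# Operator table: one row per operator with its attributes.
ops = pd.DataFrame({
    "dag_id": ["d1"] * 4,
    "op_id": [1, 2, 3, 4],
    "opcode": ["read", "read", "conv2d", "relu"],
    "exec_ms": [0.5, 0.4, 25.0, 1.2],
})

# Structure table: one row per edge, from producer (src_op) to consumer (dst_op).
edges = pd.DataFrame({
    "dag_id": ["d1"] * 3,
    "src_op": [1, 2, 3],
    "dst_op": [3, 3, 4],
})

# Join the two tables by ID to attach the producer's attributes to each edge.
joined = edges.merge(
    ops, left_on=["dag_id", "src_op"], right_on=["dag_id", "op_id"]
)

# Example structural question: which operators consume a convolution's output?
consumers_of_conv = joined.loc[joined["opcode"] == "conv2d", "dst_op"].tolist()
```

Keeping `dag_id` in both tables lets many DAGs share the same two tables, which makes cross-DAG queries plain filters and group-bys.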
Example queries include:
Find all DAGs with a convolution operation that takes more than 20ms to execute.
Compare the total number of operations between two DAGs.
Group DAGs by the type of non-linear operation used and calculate the average execution time for each group.
Compare the memory usage of the matrix multiplication between two DAGs.
Find similar DAGs on different datasets.
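The first three example queries could be expressed over an operator table roughly as sketched below. The data, the column names, and the `NONLINEAR` set are hypothetical, and "average execution time for each group" is interpreted here as the mean total execution time of the DAGs in the group, which is only one possible reading. A memory-usage comparison would follow the same pattern if a memory column were recorded.

```python
import pandas as pd

# Hypothetical operator table covering two DAGs.
ops = pd.DataFrame(
    [
        ("d1", 1, "conv2d", 25.0),
        ("d1", 2, "relu", 1.0),
        ("d1", 3, "matmul", 4.0),
        ("d2", 1, "conv2d", 12.0),
        ("d2", 2, "sigmoid", 3.0),
    ],
    columns=["dag_id", "op_id", "opcode", "exec_ms"],
)

# Query 1: DAGs with a convolution that takes more than 20 ms.
slow_conv = ops[(ops["opcode"] == "conv2d") & (ops["exec_ms"] > 20)]
slow_conv_dags = sorted(slow_conv["dag_id"].unique())

# Query 2: total number of operations per DAG, for side-by-side comparison.
op_counts = ops.groupby("dag_id").size()

# Query 3: group DAGs by the non-linear operation they use and average
# the total DAG execution time within each group.
NONLINEAR = {"relu", "sigmoid", "tanh"}  # assumed set of non-linear opcodes
dag_time = ops.groupby("dag_id")["exec_ms"].sum().rename("total_ms")
nonlin = ops[ops["opcode"].isin(NONLINEAR)][["dag_id", "opcode"]].drop_duplicates()
avg_by_nonlin = (
    nonlin.join(dag_time, on="dag_id").groupby("opcode")["total_ms"].mean()
)
```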
Attachments
1. Collect lineage traces of ML pipelines | Closed | Unassigned
2. Deserialize the lineage traces | Closed | Unassigned
3. Design and implement the internal data representation | Closed | Unassigned