[SYSTEMDS-3430] Query interface over lineage traces - ASF JIRA

Details

Type: Story
Status: Closed
Priority: Major
Resolution: Implemented
Affects Version/s: None
Fix Version/s: SystemDS 3.3
Component/s: None
Labels:
- StudentProject

Description

Build a query interface over many text-based lineage traces collected from various ML workloads. A lineage trace is a serialized DAG of operations that contains no control flow. The task is to deserialize the traces into an in-memory format and answer queries about workload characteristics.
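As a rough illustration of the deserialization step, the sketch below parses a made-up, simplified trace format into an in-memory DAG. The line format (`id;opcode;input_ids`), the `Node` class, and the `parse_trace` and `count_ops` helpers are all assumptions for illustration; they do not reflect the actual SystemDS lineage serialization.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One operator in a lineage DAG (hypothetical structure)."""
    node_id: int
    opcode: str
    inputs: list = field(default_factory=list)  # IDs of the input nodes


def parse_trace(text):
    """Parse the assumed 'id;opcode;in1,in2' line format into {id: Node}."""
    dag = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        node_id, opcode, inputs = line.split(";")
        input_ids = [int(i) for i in inputs.split(",") if i]
        dag[int(node_id)] = Node(int(node_id), opcode, input_ids)
    return dag


def count_ops(dag):
    """Count how often each opcode occurs in a DAG."""
    counts = {}
    for node in dag.values():
        counts[node.opcode] = counts.get(node.opcode, 0) + 1
    return counts


# Made-up example trace: two reads feed a matrix multiply, followed by a ReLU.
TRACE = """
1;read;
2;read;
3;matmult;1,2
4;relu;3
"""

dag = parse_trace(TRACE)
op_counts = count_ops(dag)
```

Keeping the input IDs on each node preserves the DAG structure, so the same dictionary can later be flattened into tables or traversed directly.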

The in-memory format could be tabular or semi-structured. The internal representation should preserve both the structure of the DAGs and the operators' properties, so that a wide range of queries can be answered. One possible approach is to represent each DAG by multiple tables: one for each operator with its corresponding attributes, and one that preserves the structure, with attributes including the output nodes of each operator. These tables can be joined by IDs. Existing libraries (e.g., Pandas) can be used to define the query interface.
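A minimal sketch of the tabular idea, using Pandas: it simplifies the layout described above to a single operator table plus an edge table, joined by IDs. The column names (`dag_id`, `op_id`, `opcode`, `exec_ms`, `src_id`, `dst_id`) and all values are made up for illustration.

```python
import pandas as pd

# Operator table: one row per operator with its attributes (values are made up).
operators = pd.DataFrame(
    [
        {"dag_id": "A", "op_id": 1, "opcode": "read", "exec_ms": 2.0},
        {"dag_id": "A", "op_id": 2, "opcode": "conv2d", "exec_ms": 25.0},
        {"dag_id": "A", "op_id": 3, "opcode": "relu", "exec_ms": 1.5},
    ]
)

# Structure table: one row per edge, from a producing to a consuming operator.
edges = pd.DataFrame(
    [
        {"dag_id": "A", "src_id": 1, "dst_id": 2},
        {"dag_id": "A", "src_id": 2, "dst_id": 3},
    ]
)

# Join the edge table with the operator table twice, by IDs, to see which
# opcode feeds which.
producers = operators.rename(
    columns={"op_id": "src_id", "opcode": "src_opcode"}
)[["dag_id", "src_id", "src_opcode"]]
consumers = operators.rename(
    columns={"op_id": "dst_id", "opcode": "dst_opcode"}
)[["dag_id", "dst_id", "dst_opcode"]]

edge_view = (
    edges.merge(producers, on=["dag_id", "src_id"])
    .merge(consumers, on=["dag_id", "dst_id"])
    .sort_values(["dag_id", "src_id"])
    .reset_index(drop=True)
)
```

Storing edges separately keeps the operator table flat, while the joins recover the DAG structure whenever a query needs it.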

Example queries include:

- Find all DAGs with a convolution operation that takes more than 20ms to execute.
- Compare the total number of operations between two DAGs.
- Group DAGs by the type of non-linear operation used and calculate the average execution time for each group.
- Compare the memory usage of the matrix multiplication between two DAGs.
- Find similar DAGs on different datasets.
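The first three example queries could be expressed over an operator table roughly as follows. This is only a sketch: the table layout, the opcode names, the `NONLINEAR` set, and all timings are assumptions, and "average execution time" is interpreted here as the mean total execution time of the DAGs in each group.

```python
import pandas as pd

# Made-up operator table covering two DAGs.
ops = pd.DataFrame(
    [
        {"dag_id": "A", "opcode": "conv2d", "exec_ms": 25.0},
        {"dag_id": "A", "opcode": "relu", "exec_ms": 1.0},
        {"dag_id": "B", "opcode": "conv2d", "exec_ms": 12.0},
        {"dag_id": "B", "opcode": "sigmoid", "exec_ms": 3.0},
        {"dag_id": "B", "opcode": "matmult", "exec_ms": 8.0},
    ]
)

# Query 1: DAGs with a convolution that takes more than 20 ms.
slow_conv = ops[(ops["opcode"] == "conv2d") & (ops["exec_ms"] > 20)]
slow_conv_dags = sorted(slow_conv["dag_id"].unique())

# Query 2: total number of operations per DAG, for side-by-side comparison.
op_totals = ops.groupby("dag_id").size()

# Query 3: group DAGs by the non-linear operation they use and average the
# total execution time of the DAGs in each group.
NONLINEAR = {"relu", "sigmoid", "tanh"}  # assumed set of non-linear opcodes
dag_time = ops.groupby("dag_id")["exec_ms"].sum().rename("total_ms")
nonlinear = ops[ops["opcode"].isin(NONLINEAR)][["dag_id", "opcode"]].drop_duplicates()
avg_by_nonlinear = (
    nonlinear.join(dag_time, on="dag_id").groupby("opcode")["total_ms"].mean()
)
```

The memory-usage and similarity queries would need additional attributes (for example, per-operator memory estimates or dataset identifiers) that this toy table does not include.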

Sub-Tasks

1.	Collect lineage traces of ML pipelines	Closed	Unassigned
2.	Deserialize the lineage traces	Closed	Unassigned
3.	Design and implement the internal data representation	Closed	Unassigned

People

Assignee: Unassigned

Reporter: Arnab Phani

Votes: 0

Watchers: 1

Dates

Created: 26/Aug/22 17:36

Updated: 02/May/24 07:37

Resolved: 02/May/24 07:37