Details
- Type: Task
- Status: Resolved
- Priority: Major
- Resolution: Fixed

Description
Executing workflows and pipelines currently lacks transparency and consistency.
We need to create a new API to solve this. Here are a few things we need to do:
- Create a read-only copy of the metadata, alongside the execution information, whenever we execute a workflow or pipeline
- Allow execution information to be captured both locally and remotely, for all types of engines
- Execution information should include the ability to capture full transform output (for unit testing), on-demand profiling of fields, and sampling of rows (first, last, random, sniff, ...)
- The location execution information is sent to needs to be configurable, and it should be possible to send to more than one location. The logical solution here is to implement a new type of metadata.
- Sending execution information should be done using an API implemented with plugins (a new Execution Logging plugin type). This will allow anyone to implement a new back-end.
- Reading execution information should be done using the same plugin-based API (see the interface sketch after this list)
- We should have a new perspective in the Hop GUI to interact with the various execution information locations and their information
- We should have a new command-line tool to interact with execution information locations and their information
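To make the plugin idea concrete, here is a minimal sketch of what such an Execution Logging plugin interface could look like. Everything below (IExecutionLoggingPlugin and its nested types) is a hypothetical shape for illustration, not the actual Hop API; the new metadata type would then point pipelines and workflows at one or more configured locations backed by implementations of this interface.

```java
import java.util.List;
import java.util.Map;

/**
 * Hypothetical Execution Logging plugin interface: one implementation per
 * back-end (files, a database, a remote service, ...). Both the sending and
 * the reading side go through the same interface, so the GUI perspective and
 * the command-line tool can query any configured location.
 */
public interface IExecutionLoggingPlugin {

  /** Read-only registration payload captured at the start of an execution. */
  record ExecutionRegistration(
      String executionId,            // unique ID of the pipeline or workflow execution
      String name,                   // name of the executable
      String metadataCopy,           // serialized read-only copy of the metadata
      Map<String, String> variables  // variables at execution time
  ) {}

  /** Status snapshot sent after initialization, periodically, and at the end. */
  record ExecutionStatus(
      String executionId,
      String state,        // e.g. RUNNING, FINISHED, STOPPED
      String loggingText,  // incremental logging text
      long rowsProcessed
  ) {}

  /** Send side: register a new execution at this location. */
  void registerExecution(ExecutionRegistration registration) throws Exception;

  /** Send side: push a status update for a registered execution. */
  void updateExecutionStatus(ExecutionStatus status) throws Exception;

  /** Read side: query executions, for the GUI perspective and the CLI tool. */
  List<ExecutionRegistration> findExecutions(String nameFilter) throws Exception;
}
```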
A few thoughts on the process of adding execution information to a location (a lifecycle sketch follows this list):
- Register a new execution at a location (start of execution):
  - Add a full read-only copy of the executable, metadata, variables, ...
  - Register under the unique ID of the pipeline or workflow
  - Run configuration
  - Execution configuration (logging level, parameters, ...)
  - Project & environment information
  - Execution lineage information
  - Environment information (where, when, who, memory, disk, ...)
- After initialization: update the status
- Periodically (configurable delay and interval): update the status
- End of execution: update the status
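As a rough illustration of that lifecycle, a local engine could drive one configured location as shown below, reusing the hypothetical interface sketched earlier; the method name, the delay, and the interval values are arbitrary assumptions.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical driver showing the order of calls against one location. */
public class ExecutionLifecycleSketch {

  static void run(IExecutionLoggingPlugin location) throws Exception {
    String executionId = UUID.randomUUID().toString();

    // Start of execution: register with a read-only copy of the metadata,
    // along with variables, run configuration, lineage and environment info.
    location.registerExecution(new IExecutionLoggingPlugin.ExecutionRegistration(
        executionId, "my-pipeline", "{ ...serialized metadata... }",
        Map.of("HOP_RUN_ID", executionId)));

    // After initialization: first status update.
    location.updateExecutionStatus(new IExecutionLoggingPlugin.ExecutionStatus(
        executionId, "RUNNING", "Pipeline initialized", 0L));

    // Periodically, with a configurable delay and interval: incremental updates.
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      try {
        location.updateExecutionStatus(new IExecutionLoggingPlugin.ExecutionStatus(
            executionId, "RUNNING", "...incremental logging text...", 0L));
      } catch (Exception e) {
        // a real implementation would log the failure and keep running
      }
    }, 5, 10, TimeUnit.SECONDS);

    // End of execution: stop the periodic updates and send the final status.
    scheduler.shutdown();
    location.updateExecutionStatus(new IExecutionLoggingPlugin.ExecutionStatus(
        executionId, "FINISHED", "Pipeline finished", 1_000_000L));
  }
}
```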
The update of a status should be configurable using an API (a configuration sketch follows this list):
- Various types of data logging:
  - Sampling: sample N rows (first, last, random, none, ...)
  - Capture output: get all rows (for unit testing)
  - Profile: profile the data in the fields of the rows
- Incremental updates of the logging text
- Number of records processed, buffer sizes
- Environment information (where, when, who, memory, disk)
- ...
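To make the configurable part concrete, a status update request could carry data logging settings along these lines; the type, field, and enum names below are assumptions chosen to mirror the options in the list above.

```java
/** Hypothetical data logging settings carried by a status update request. */
public record DataLoggingConfig(
    SampleType sampleType,  // how to sample rows
    int sampleSize,         // N rows to keep when sampling
    boolean captureOutput,  // capture all output rows, e.g. for unit testing
    boolean profileFields   // profile the field data on demand
) {
  /** Row sampling strategies mirroring the list above. */
  public enum SampleType { NONE, FIRST, LAST, RANDOM }

  /** Example: sample the first 100 rows and profile the fields. */
  public static DataLoggingConfig example() {
    return new DataLoggingConfig(SampleType.FIRST, 100, false, true);
  }
}
```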