Details
- Type: Task
- Status: Resolved
- Priority: Major
- Resolution: Fixed

Description
Executing workflows and pipelines currently lacks transparency and consistency.
We need to create a new API to solve this. Here are a few things we need to do:
- Create a read-only copy of the metadata, alongside the execution information, whenever we execute a workflow or pipeline
- Allow execution information to be captured both locally and remotely, for all types of engines
- Execution information should include the ability to capture full transform output (for unit testing), on-demand profiling of fields, and sampling of rows (first, last, random, sniff, ...)
- The location execution information is sent to needs to be configurable, and it should be possible to send to more than one location. The logical solution here is to implement a new type of metadata.
- Sending execution information should be done using an API implemented with plugins (a new Execution Logging plugin type). This will allow anyone to implement a new back-end.
- Reading execution information should be done using the same plugin-based API (see the interface sketch after this list)
- We should have a new perspective in the Hop GUI to interact with the various execution information locations and their information
- We should have a new command-line tool to interact with execution information locations and their information
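To make the plugin idea concrete, here is a minimal sketch of what such an Execution Logging plugin interface could look like. Everything below (IExecutionLoggingPlugin and its nested types) is a hypothetical shape for illustration, not the actual Hop API; the new metadata type would then point pipelines and workflows at one or more configured locations backed by implementations of this interface.

```java
import java.util.List;
import java.util.Map;

/**
 * Hypothetical Execution Logging plugin interface: one implementation per
 * back-end (files, a database, a remote service, ...). Both the sending and
 * the reading side go through the same interface, so the GUI perspective and
 * the command-line tool can query any configured location.
 */
public interface IExecutionLoggingPlugin {

  /** Read-only registration payload captured at the start of an execution. */
  record ExecutionRegistration(
      String executionId,            // unique ID of the pipeline or workflow execution
      String name,                   // name of the executable
      String metadataCopy,           // serialized read-only copy of the metadata
      Map<String, String> variables  // variables at execution time
  ) {}

  /** Status snapshot sent after initialization, periodically, and at the end. */
  record ExecutionStatus(
      String executionId,
      String state,        // e.g. RUNNING, FINISHED, STOPPED
      String loggingText,  // incremental logging text
      long rowsProcessed
  ) {}

  /** Send side: register a new execution at this location. */
  void registerExecution(ExecutionRegistration registration) throws Exception;

  /** Send side: push a status update for a registered execution. */
  void updateExecutionStatus(ExecutionStatus status) throws Exception;

  /** Read side: query executions, for the GUI perspective and the CLI tool. */
  List<ExecutionRegistration> findExecutions(String nameFilter) throws Exception;
}
```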
A few thoughts on the process of adding execution information to a location (a lifecycle sketch follows this list):
- Register a new execution at a location (start of execution):
  - Add a full read-only copy of the executable, metadata, variables, ...
  - Register under the unique ID of the pipeline or workflow
  - Run configuration
  - Execution configuration (logging level, parameters, ...)
  - Project & environment information
  - Execution lineage information
  - Environment information (where, when, who, memory, disk, ...)
- After initialization: update the status
- Periodically (configurable delay and interval): update the status
- End of execution: update the status
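As a rough illustration of that lifecycle, a local engine could drive one configured location as shown below, reusing the hypothetical interface sketched earlier; the method name, the delay, and the interval values are arbitrary assumptions.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical driver showing the order of calls against one location. */
public class ExecutionLifecycleSketch {

  static void run(IExecutionLoggingPlugin location) throws Exception {
    String executionId = UUID.randomUUID().toString();

    // Start of execution: register with a read-only copy of the metadata,
    // along with variables, run configuration, lineage and environment info.
    location.registerExecution(new IExecutionLoggingPlugin.ExecutionRegistration(
        executionId, "my-pipeline", "{ ...serialized metadata... }",
        Map.of("HOP_RUN_ID", executionId)));

    // After initialization: first status update.
    location.updateExecutionStatus(new IExecutionLoggingPlugin.ExecutionStatus(
        executionId, "RUNNING", "Pipeline initialized", 0L));

    // Periodically, with a configurable delay and interval: incremental updates.
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      try {
        location.updateExecutionStatus(new IExecutionLoggingPlugin.ExecutionStatus(
            executionId, "RUNNING", "...incremental logging text...", 0L));
      } catch (Exception e) {
        // a real implementation would log the failure and keep running
      }
    }, 5, 10, TimeUnit.SECONDS);

    // End of execution: stop the periodic updates and send the final status.
    scheduler.shutdown();
    location.updateExecutionStatus(new IExecutionLoggingPlugin.ExecutionStatus(
        executionId, "FINISHED", "Pipeline finished", 1_000_000L));
  }
}
```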
The update of a status should be configurable using an API (a configuration sketch follows this list):
- Various types of data logging:
  - Sampling: sample N rows (first, last, random, none, ...)
  - Capture output: get all rows (for unit testing)
  - Profile: profile the data in the fields of the rows
- Incremental updates of the logging text
- Number of records processed, buffer sizes
- Environment information (where, when, who, memory, disk)
- ...
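To make the configurable part concrete, a status update request could carry data logging settings along these lines; the type, field, and enum names below are assumptions chosen to mirror the options in the list above.

```java
/** Hypothetical data logging settings carried by a status update request. */
public record DataLoggingConfig(
    SampleType sampleType,  // how to sample rows
    int sampleSize,         // N rows to keep when sampling
    boolean captureOutput,  // capture all output rows, e.g. for unit testing
    boolean profileFields   // profile the field data on demand
) {
  /** Row sampling strategies mirroring the list above. */
  public enum SampleType { NONE, FIRST, LAST, RANDOM }

  /** Example: sample the first 100 rows and profile the fields. */
  public static DataLoggingConfig example() {
    return new DataLoggingConfig(SampleType.FIRST, 100, false, true);
  }
}
```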