Apache Hop (Retired) / HOP-4024

Create a new execution information platform


Details

    Description

      There's a certain lack of transparency and consistency when we're executing workflows and pipelines.

      We need to create a new API to solve this.  Here are a few things we need to do:

      • Create a read-only copy of metadata when we execute a workflow or pipeline alongside execution information
      • Allow for the capturing of execution information locally and remotely for all types of engines
      • Execution information should include the ability to capture full transform output for unit testing, on-demand profiling of fields, and sampling of rows (first, last, random, sniff, ...)
      • The location to which execution information is sent needs to be configurable, and it should be possible to send to more than one location.  The logical solution here is to implement a new type of metadata.
      • Sending execution information should be done using an API which is implemented using plugins (Execution Logging plugin type).  This will allow anyone to implement a new back-end.
      • Reading execution information should be done using an API which is implemented using plugins (Execution Logging plugin type) 
      • We should have a new perspective in the Hop GUI to interact with the various execution information locations and their information.  
      • We should have a new command line tool to interact with execution information locations and their information.
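
The plugin-based sending and reading API, with more than one configurable location, could look roughly like the sketch below. This is only an illustration of the idea: `ExecutionInfoLocation`, `InMemoryLocation`, `FanOutSender` and all method names are hypothetical and do not come from the Hop code base, and the metadata copy is reduced to a plain string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical back-end contract (assumption, not the Hop API): one implementation per plugin. */
interface ExecutionInfoLocation {
  /** Store a read-only copy of the metadata under the execution's unique ID. */
  void register(String executionId, String metadataSnapshot);

  /** Record the latest status of an execution. */
  void updateStatus(String executionId, String status);

  /** Read the latest status back, or null if the execution is unknown. */
  String readStatus(String executionId);
}

/** Illustrative back-end that keeps everything in memory. */
class InMemoryLocation implements ExecutionInfoLocation {
  private final Map<String, String> snapshots = new ConcurrentHashMap<>();
  private final Map<String, String> statuses = new ConcurrentHashMap<>();

  @Override
  public void register(String executionId, String metadataSnapshot) {
    snapshots.put(executionId, metadataSnapshot);
    statuses.put(executionId, "REGISTERED");
  }

  @Override
  public void updateStatus(String executionId, String status) {
    statuses.put(executionId, status);
  }

  @Override
  public String readStatus(String executionId) {
    return statuses.get(executionId);
  }
}

/** Forwards every call to all configured locations. */
class FanOutSender {
  private final List<ExecutionInfoLocation> locations;

  FanOutSender(List<ExecutionInfoLocation> locations) {
    this.locations = new ArrayList<>(locations);
  }

  void register(String executionId, String metadataSnapshot) {
    for (ExecutionInfoLocation location : locations) {
      location.register(executionId, metadataSnapshot);
    }
  }

  void updateStatus(String executionId, String status) {
    for (ExecutionInfoLocation location : locations) {
      location.updateStatus(executionId, status);
    }
  }
}
```

Keeping the contract this small means a new back-end only has to implement three methods, and the GUI perspective and command line tool could both read through the same interface.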

       

      A few thoughts on the process of adding execution information to a location:

      • Register a new execution at a location (start of execution):
        • Add a full read-only copy of the executable, metadata, variables, ...
        • Register under the unique ID of the pipeline or workflow
        • Run configuration
        • Execution configuration (logging level, parameters, ...)
        • Project & environment information
        • Execution lineage information
        • Environment information (where, when, who, memory, disk, ...) 
      • After initialization: update status
      • Periodically (configurable delay and interval): update status
      • End of execution: update status
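
The lifecycle above (register, update after initialization, periodic updates, final update) can be sketched with a standard `ScheduledExecutorService`. The class `ExecutionLifecycle`, its method names and the status strings are assumptions made for this illustration only; a real implementation would send each update to a location instead of appending to a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the register / update / finish sequence. */
class ExecutionLifecycle {
  private final List<String> updates = new ArrayList<>();
  private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
  private boolean finished;

  /** Register the execution, report initialization, then schedule periodic updates. */
  synchronized void start(String executionId, long delayMs, long intervalMs) {
    updates.add("REGISTERED " + executionId);
    updates.add("INITIALIZED");
    // Delay and interval are the two configurable values mentioned in the list above.
    timer.scheduleAtFixedRate(this::periodicUpdate, delayMs, intervalMs, TimeUnit.MILLISECONDS);
  }

  /** Periodic status update; ignored once the execution has finished. */
  private synchronized void periodicUpdate() {
    if (!finished) {
      updates.add("RUNNING");
    }
  }

  /** Stop the periodic updates and record the final status. */
  synchronized void finish() {
    finished = true;
    timer.shutdownNow();
    updates.add("FINISHED");
  }

  /** Return a copy of all updates recorded so far. */
  synchronized List<String> updates() {
    return new ArrayList<>(updates);
  }
}
```

The `finished` flag is checked under the same lock that `finish()` holds, so a periodic update can never be recorded after the final status.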

       

      The update of a status should be configurable using an API:

      • Various types of data logging:
        • Sampling: Sample N rows (First, Last, Random, None, ...)
        • Capture output: Get all rows (for unit testing)
        • Profile: build a data profile of the fields in the rows
      • Incremental update of the logging text
      • Number of records processed, buffer sizes
      • Environment information (where, when, who, memory, disk)
      • ...
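
As an illustration of the sampling options (First, Last, Random, None), the sketch below keeps at most N rows per mode. `RowSampler` and its members are hypothetical names, rows are simplified to strings, and the Random mode uses reservoir sampling (Algorithm R), which is one reasonable choice rather than something this issue prescribes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical row sampler that keeps at most `size` rows of a stream. */
class RowSampler {
  enum Mode { FIRST, LAST, RANDOM, NONE }

  private final Mode mode;
  private final int size;
  private final Random random;
  private final List<String> sample = new ArrayList<>();
  private long seen; // total number of rows offered so far

  RowSampler(Mode mode, int size, long seed) {
    this.mode = mode;
    this.size = size;
    this.random = new Random(seed);
  }

  void add(String row) {
    seen++;
    if (mode == Mode.NONE || size <= 0) {
      return;
    }
    switch (mode) {
      case FIRST:
        // Keep only the first `size` rows.
        if (sample.size() < size) {
          sample.add(row);
        }
        break;
      case LAST:
        // Sliding window: drop the oldest row once the window is full.
        if (sample.size() == size) {
          sample.remove(0);
        }
        sample.add(row);
        break;
      case RANDOM:
        // Reservoir sampling: every row seen so far has an equal chance of being kept.
        if (sample.size() < size) {
          sample.add(row);
        } else {
          long slot = (long) (random.nextDouble() * seen);
          if (slot < size) {
            sample.set((int) slot, row);
          }
        }
        break;
      default:
        break;
    }
  }

  List<String> sample() {
    return new ArrayList<>(sample);
  }

  long rowsSeen() {
    return seen;
  }
}
```

All modes use bounded memory, so sampling can stay enabled on long-running pipelines; capturing full output for unit testing would need a separate, unbounded collector.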

       

       

            People

              mcasters Matt Casters
              mcasters Matt Casters
              Votes: 0
              Watchers: 3

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 10m