  Atlas / ATLAS-3006

Option to ignore/prune metadata for temporary/staging Hive tables

Details

    Description

      It is not uncommon for a Hive deployment to use a large number of staging/temporary tables, which are created periodically to load data into target tables and deleted after completion of data load. A large number of entities are created in Atlas for these staging/temporary tables (tables/columns/column-lineage).

      For staging tables, it is probably not useful to track details like columns and column-lineage in Atlas. Skipping these details can significantly reduce the time it takes to process notifications and improve overall performance. Only the minimum details of these staging tables need to be stored in Atlas, enough to capture data lineage from source to target table via all intermediate staging tables.

      Also, it would be helpful to ignore tables that are created and deleted during data loading, i.e. temporary tables.

      Configuration options should be provided to specify which tables are staging/temporary. In addition to supporting this in the Hive hook (to avoid generating large messages for staging/temporary tables), the Atlas server should also be updated so that this can be controlled further on the server side while processing notifications.
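
      As a rough illustration of the kind of configuration-driven check described above, the sketch below classifies a table by matching its qualified name against two lists of regular expressions: one for tables to ignore entirely (temporary) and one for tables whose details should be pruned (staging). The class name StagingTableFilter, the Action enum, and the example patterns are hypothetical and are not taken from the attached patches; the actual property names and matching rules in Atlas may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not Atlas code: decides how much metadata to keep
// for a Hive table based on configured name patterns.
class StagingTableFilter {
    // IGNORE: drop the table entirely (temporary tables).
    // PRUNE:  keep the table for lineage, drop columns and column-lineage (staging tables).
    // FULL:   keep all details.
    enum Action { IGNORE, PRUNE, FULL }

    private final List<Pattern> ignorePatterns = new ArrayList<>();
    private final List<Pattern> prunePatterns  = new ArrayList<>();

    // Both arguments are lists of regular expressions, e.g. read from a
    // configuration file (the configuration keys are left out on purpose).
    StagingTableFilter(List<String> ignoreRegexes, List<String> pruneRegexes) {
        for (String regex : ignoreRegexes) {
            ignorePatterns.add(Pattern.compile(regex));
        }
        for (String regex : pruneRegexes) {
            prunePatterns.add(Pattern.compile(regex));
        }
    }

    // Ignore patterns take precedence over prune patterns.
    Action classify(String qualifiedName) {
        if (matchesAny(ignorePatterns, qualifiedName)) {
            return Action.IGNORE;
        }
        if (matchesAny(prunePatterns, qualifiedName)) {
            return Action.PRUNE;
        }
        return Action.FULL;
    }

    // True if the whole name matches at least one pattern.
    private static boolean matchesAny(List<Pattern> patterns, String name) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(name).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

      The same check could run in both places the description mentions: in the Hive hook, to avoid building large messages, and again in the server while processing notifications. Giving ignore precedence over prune is a design choice of this sketch, not something the issue specifies.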

      Attachments

        1. ATLAS-3006-branch-0.8.patch
          56 kB
          Madhan Neethiraj
        2. ATLAS-3006.patch
          57 kB
          Madhan Neethiraj

            People

              madhan Madhan Neethiraj