ATLAS-2708: AWS S3 data lake typedefs for Atlas

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0, 2.0.0
    • Component/s: atlas-core
    • Labels: None

    Description

      Currently, the base types in Atlas do not include AWS data lake objects. It would be nice to add typedefs for AWS data lake objects (buckets and pseudo-directories) and for lineage processes that move data from another source (e.g., a Kafka topic) into the data lake. For example:

      • AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in an S3 bucket. For example, for an object with key “myWork/Development/Projects1.xls”, “myWork/Development” is the pseudo-directory. It supports:
        • Array of Avro schemas that are associated with the data in the pseudo-directory (based on the Avro schema extensions outlined in ATLAS-2694)
        • type of data it contains, e.g., Avro, JSON, unstructured
        • time of creation
      • AWSS3BucketLifeCycleRule type represents a rule that either transitions the data in a bucket to another storageClass after a specified time interval or expires it. For example, transition to GLACIER after 60 days, or expire (i.e., be deleted) after 90 days. It supports:
        • ruleType (e.g., transition or expiration)
        • time interval in days before the rule is executed
        • storageClass to which the data is transitioned (null if ruleType is expiration)
      • AWSTag type represents a tag/value pair created by the user and associated with an AWS object. It supports:
        • tag
        • value
      • AWSCloudWatchMetric type represents a storage or request metric that is monitored by AWS CloudWatch and can be configured for a bucket. It supports:
        • metricName, e.g., “AllRequests”, “GetRequests”, “TotalRequestLatency”, “BucketSizeBytes”
        • scope: null if the metric covers the entire bucket; otherwise, the prefixes/tags that filter or limit the monitoring of the metric
      • AWSS3Bucket type represents a bucket in an S3 instance. It supports:
        • Array of AWSS3PseudoDir objects that are associated with objects stored in the bucket
        • AWS region
        • isEncrypted (boolean)
        • encryptionType, e.g., AES-256
        • S3AccessPolicy, a JSON object expressing access policies, e.g., GetObject, PutObject
        • time of creation
        • Array of AWSS3BucketLifeCycleRule objects that are associated with the bucket
        • Array of AWSCloudWatchMetric objects that are associated with the bucket or its tags or prefixes
        • Array of AWSTag objects that are associated with the bucket
      • Generic dataset2Dataset process type represents the movement of data from one dataset to another. It supports:
        • array of transforms performed by the process
        • map of tag/value pairs representing configurationParameters of the process
        • inputs and outputs, which are arrays of dataset objects, e.g., a Kafka topic and an S3 pseudo-directory
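
      As a rough illustration of how the types listed above could be laid out in Atlas's typedef JSON format, here is a minimal Python sketch that builds such a document. It is not taken from the attached patches or JSON files. The attribute names (e.g., avroSchemas, days, createTime), the choice of struct versus entity types, and the "avro_schema" type name assumed for ATLAS-2694 are all assumptions made for illustration.

```python
import json


def attr(name, type_name, cardinality="SINGLE", optional=True):
    """Build one attributeDef entry (attribute names used below are illustrative)."""
    return {
        "name": name,
        "typeName": type_name,
        "cardinality": cardinality,
        "isOptional": optional,
        "isUnique": False,
        "isIndexable": False,
    }


# Small value-like types, modeled here as structs (an assumption).
struct_defs = [
    {"name": "AWSTag", "attributeDefs": [
        attr("tag", "string", optional=False),
        attr("value", "string"),
    ]},
    {"name": "AWSS3BucketLifeCycleRule", "attributeDefs": [
        attr("ruleType", "string", optional=False),  # transition or expiration
        attr("days", "int"),                         # interval before the rule runs
        attr("storageClass", "string"),              # unset when ruleType is expiration
    ]},
    {"name": "AWSCloudWatchMetric", "attributeDefs": [
        attr("metricName", "string", optional=False),
        attr("scope", "array<string>", "SET"),       # unset means the entire bucket
    ]},
]

# Datasets and the lineage process, modeled as entities.
entity_defs = [
    {"name": "AWSS3PseudoDir", "superTypes": ["DataSet"], "attributeDefs": [
        # "avro_schema" is an assumed name for the ATLAS-2694 schema type.
        attr("avroSchemas", "array<avro_schema>", "SET"),
        attr("dataType", "string"),                  # e.g., avro, json, unstructured
        attr("createTime", "date"),
    ]},
    {"name": "AWSS3Bucket", "superTypes": ["DataSet"], "attributeDefs": [
        attr("pseudoDirectories", "array<AWSS3PseudoDir>", "SET"),
        attr("region", "string"),
        attr("isEncrypted", "boolean"),
        attr("encryptionType", "string"),            # e.g., AES-256
        attr("s3AccessPolicy", "string"),            # JSON policy stored as text
        attr("createTime", "date"),
        attr("lifeCycleRules", "array<AWSS3BucketLifeCycleRule>", "SET"),
        attr("cloudWatchMetrics", "array<AWSCloudWatchMetric>", "SET"),
        attr("awsTags", "array<AWSTag>", "SET"),
    ]},
    # inputs and outputs are inherited from the base Process type.
    {"name": "dataset2Dataset", "superTypes": ["Process"], "attributeDefs": [
        attr("transforms", "array<string>", "LIST"),
        attr("configurationParameters", "map<string,string>"),
    ]},
]

typedefs = {
    "enumDefs": [],
    "structDefs": struct_defs,
    "classificationDefs": [],
    "entityDefs": entity_defs,
}

typedefs_json = json.dumps(typedefs, indent=2)
```

      Extending DataSet and Process in this sketch means the pseudo-directory and bucket can appear as the inputs and outputs of a lineage process without redefining those attributes.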

       
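
      The two worked examples in the description (the pseudo-directory of an object key, and a bucket that transitions to GLACIER after 60 days and expires after 90) can be checked with a small Python sketch. The function and class names here are hypothetical, as is the assumption that a new object starts in the STANDARD storage class and that a rule takes effect once the object's age reaches the rule's interval.

```python
from dataclasses import dataclass
from typing import List, Optional


def pseudo_dir(key: str) -> str:
    """Return the pseudo-directory prefix of an S3 object key ('' if there is none)."""
    prefix, _, _ = key.rpartition("/")
    return prefix


@dataclass
class LifeCycleRule:
    rule_type: str                       # "transition" or "expiration"
    days: int                            # interval before the rule is executed
    storage_class: Optional[str] = None  # None when rule_type is "expiration"


def storage_class_at(rules: List[LifeCycleRule], age_days: int,
                     initial: str = "STANDARD") -> Optional[str]:
    """Return the storage class of data at the given age, or None once it has expired."""
    current = initial
    for rule in sorted(rules, key=lambda r: r.days):
        if age_days < rule.days:
            break  # this rule and all later ones have not run yet
        if rule.rule_type == "expiration":
            return None
        current = rule.storage_class
    return current


rules = [
    LifeCycleRule("transition", 60, "GLACIER"),
    LifeCycleRule("expiration", 90),
]
```

      With these rules, storage_class_at(rules, 75) yields "GLACIER", and pseudo_dir("myWork/Development/Projects1.xls") yields "myWork/Development".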

      Attachments

        1. ATLAS-2708-2.patch
          18 kB
          Madhan Neethiraj
        2. ATLAS-2708.patch
          18 kB
          Madhan Neethiraj
        3. all_datalake_typedefs.json
          7 kB
          Barbara Eckman
        4. all_datalake_typedefs_v2.json
          13 kB
          Barbara Eckman
        5. all_AWS_common_typedefs.json
          1 kB
          Barbara Eckman
        6. all_AWS_common_typedefs_v2.json
          2 kB
          Barbara Eckman
        7. 3010-aws_model.json
          13 kB
          Madhan Neethiraj

            People

              Assignee: Barbara Eckman
              Reporter: Barbara Eckman
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: