  1. Apache Gobblin
  2. GOBBLIN-117

Remove current.jst from FsStateStore, add state-store retention to driver

Details

    Description

      • The `FsStateStore` creates and updates a `current.jst` file to track the most recent version of the job state
      • For AWS users, the state-store typically has to be put on S3
      • For overwrites of existing data, S3 only provides eventual consistency
      • This can cause problems, as Gobblin jobs may see an old version of the state-store

      A simple solution to this problem would be to:

      • Remove the concept of `current.jst` and just let each state-store entry be of the form `job_id.jst`
      • Gobblin then just does an `ls` on the state-store directory, sorts the contents by file name, and picks the most recent one
      • Listing the files and sorting the listing shouldn't take long, but just in case, the state-store retention job should be run as part of the Gobblin core job, either in the `ApplicationLauncher` or the `JobLauncher`
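
      The steps above can be sketched roughly as follows. This is only an illustration in Python on a local directory, not Gobblin's actual implementation: the helper names `latest_state_file` and `apply_retention` are hypothetical, and it assumes that job ids sort chronologically by file name (for example, because they end in a fixed-width timestamp).

```python
import os
import tempfile

STATE_SUFFIX = ".jst"


def latest_state_file(store_dir):
    """Return the name of the most recent state file, or None if there is none.

    Assumption: file names of the form job_id.jst sort chronologically,
    so the lexicographically greatest name is the most recent one.
    """
    names = [n for n in os.listdir(store_dir) if n.endswith(STATE_SUFFIX)]
    return max(names) if names else None


def apply_retention(store_dir, keep):
    """Delete all but the `keep` most recent state files; return deleted names."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    names = sorted(n for n in os.listdir(store_dir) if n.endswith(STATE_SUFFIX))
    stale = names[:-keep]  # everything except the last `keep` entries
    for name in stale:
        os.remove(os.path.join(store_dir, name))
    return stale


if __name__ == "__main__":
    # Hypothetical job ids, used only for this demonstration.
    demo_dir = tempfile.mkdtemp()
    for job_id in ("job_demo_1001", "job_demo_1002", "job_demo_1003"):
        open(os.path.join(demo_dir, job_id + STATE_SUFFIX), "w").close()
    print(latest_state_file(demo_dir))
    print(apply_retention(demo_dir, keep=2))
```

      Because each run writes a new file instead of overwriting `current.jst`, no reader depends on an overwrite becoming visible. Running retention in the same job keeps the listing short.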

      Github Url : https://github.com/linkedin/gobblin/issues/882
      Github Reporter : stakiar
      Github Created At : 2016-03-25T03:54:26Z
      Github Updated At : 2017-01-12T04:50:48Z

      Comments


      stakiar wrote on 2016-03-25T03:55:30Z : @zliu41 I believe we discussed this briefly while working on #741, any comments on the above approach?

      Github Url : https://github.com/linkedin/gobblin/issues/882#issuecomment-201126060


      zliu41 wrote on 2016-03-25T15:35:35Z : LGTM except that if a job has multiple datasets there will be multiple `current.jst`s so you'll need to find the most recent one for each dataset urn.

      Github Url : https://github.com/linkedin/gobblin/issues/882#issuecomment-201334649
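
      The per-dataset case raised in this comment could be handled along these lines. This is a hedged sketch only: it assumes a hypothetical naming scheme of `<dataset_urn>-<job_id>.jst`, that the job id itself contains no `-`, and that job ids sort chronologically. None of these details are confirmed by this issue.

```python
def latest_state_per_dataset(file_names, suffix=".jst", sep="-"):
    """Map each dataset URN to its most recent state file name.

    Assumption (hypothetical naming): files look like <urn><sep><job_id><suffix>,
    and job ids compare chronologically as plain strings.
    """
    latest = {}
    for name in file_names:
        if not name.endswith(suffix):
            continue  # not a state file
        stem = name[: -len(suffix)]
        urn, found, job_id = stem.rpartition(sep)
        if not found:
            continue  # no dataset URN prefix in this file name
        if urn not in latest or job_id > latest[urn][0]:
            latest[urn] = (job_id, name)
    return {urn: name for urn, (_, name) in latest.items()}
```

      The function takes a plain list of file names, so the same grouping applies to whatever listing the state-store returns.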


      jbaranick wrote on 2016-04-12T02:58:38Z : I've started working on this.

      Github Url : https://github.com/linkedin/gobblin/issues/882#issuecomment-208682515


      lakshmanantokbox wrote on 2016-04-22T01:23:37Z : If "consistent view" for EMRFS is turned on in EMR (https://blogs.aws.amazon.com/bigdata/post/Tx1WL4KR7SE37YY/Ensuring-Consistency-When-Using-Amazon-S3-and-Amazon-Elastic-MapReduce-for-ETL-W), this problem can be avoided.

      Github Url : https://github.com/linkedin/gobblin/issues/882#issuecomment-213198627


      jbaranick wrote on 2016-04-22T01:48:32Z : Correct, but for those who use Qubole, this is not the case.

      Github Url : https://github.com/linkedin/gobblin/issues/882#issuecomment-213207376

          People

            hutran Hung Tran
            stakiar Sahil Takiar