Apache Gobblin
GOBBLIN-147

source.filebased.fs.snapshot can cause excessive memory usage


Details

    Description

      Currently, `source.filebased.fs.snapshot` is duplicated across each workUnit. This can cause excessive memory usage when there are a large number of workUnits and the snapshot is large. When `source.filebased.fs.snapshot` is read back in, it is pulled from the first workUnit in the set of previous workUnits. It would be better to store `source.filebased.fs.snapshot` on the job state. I believe this can be done by storing it directly in `state` in the `FileBasedSource.getWorkunits` method and loading it from `state.getPreviousSourceState` in the same method. Storing the snapshot on the state will need to occur after the workunits are created, to ensure that it does not get copied to them. Does this seem correct?
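      The placement described above can be sketched with a toy model (these are simplified stand-in classes, not the real Gobblin `State`/`WorkUnit` APIs): because work unit creation copies every job-state property, the snapshot must be set on the job state only after the work units exist.

```java
import java.util.*;

// Toy model (NOT the real Gobblin classes) of the fix proposed above: attach
// the snapshot to the job state only after the work units are created, so the
// per-work-unit addAll copy never sees it.
class State {
    final Map<String, String> props = new HashMap<>();
    void setProp(String key, String value) { props.put(key, value); }
    void addAll(State other) { props.putAll(other.props); } // blanket copy
}

public class SnapshotPlacement {
    // Each work unit starts as a full copy of the job state's properties.
    static List<State> createWorkUnits(State jobState, int count) {
        List<State> workUnits = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            State workUnit = new State();
            workUnit.addAll(jobState);
            workUnits.add(workUnit);
        }
        return workUnits;
    }

    public static void main(String[] args) {
        State jobState = new State();
        // 1) Create the work units first ...
        List<State> workUnits = createWorkUnits(jobState, 200);
        // 2) ... then store the (potentially huge) snapshot on the job state,
        //    so only one copy exists instead of one per work unit.
        jobState.setProp("source.filebased.fs.snapshot", "file1,file2,file3");
        System.out.println(
            workUnits.get(0).props.containsKey("source.filebased.fs.snapshot")); // false
    }
}
```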

      Github Url : https://github.com/linkedin/gobblin/issues/623
      Github Reporter : jbaranick
      Github Created At : 2016-01-20T23:05:27Z
      Github Updated At : 2017-01-12T04:37:18Z

      Comments


      stakiar wrote on 2016-02-04T23:20:08Z : Yes, that sounds like the approach taken by `FileBasedSource`, and I agree it is not particularly efficient. Just curious, what is your use case for using `FileBasedSource`?

      Your approach sounds like it should work, at least from a brief look into the code that handles the serialization of the `SourceState`. Did this approach work for you?

      Github Url : https://github.com/linkedin/gobblin/issues/623#issuecomment-180098751


      jbaranick wrote on 2016-02-04T23:26:33Z : @sahilTakiar We are reading in a bunch of event log files and loading them into our datamart. For a 1-hour block of time there might be ~30k files, which get split across ~200 tasks. We are running under YARN, so we are especially sensitive to excess memory usage.

      I tried the method I outlined above, and it gets past the initial error. Then it fails later because the snapshot gets copied again elsewhere. We hacked it to get it to work, but I think it would be much better if the states had a parent-child relationship and looked up the hierarchy when getting a property. As it stands, there is a ton of property duplication due to the `State.addAll` methods.
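      The parent-child lookup suggested here can be sketched as follows (a hypothetical class, not part of Gobblin): a child state resolves missing keys through its parent instead of copying every property up front.

```java
import java.util.*;

// Hypothetical sketch of a hierarchical state: getProp falls back to the
// parent on a local miss, so a work-unit-level state can "see" job-level
// properties (like the snapshot) without ever duplicating them.
public class HierarchicalState {
    private final Map<String, String> props = new HashMap<>();
    private final HierarchicalState parent;

    public HierarchicalState(HierarchicalState parent) { this.parent = parent; }

    public void setProp(String key, String value) { props.put(key, value); }

    // On a local miss, walk up the hierarchy; nothing is ever copied.
    public String getProp(String key) {
        String value = props.get(key);
        return (value != null || parent == null) ? value : parent.getProp(key);
    }

    // Number of properties physically stored at this level.
    public int localSize() { return props.size(); }

    public static void main(String[] args) {
        HierarchicalState jobState = new HierarchicalState(null);
        jobState.setProp("source.filebased.fs.snapshot", "f1,f2,f3");
        HierarchicalState workUnit = new HierarchicalState(jobState);

        // The work unit resolves the snapshot without holding its own copy.
        System.out.println(workUnit.getProp("source.filebased.fs.snapshot")); // f1,f2,f3
        System.out.println(workUnit.localSize()); // 0: nothing duplicated
    }
}
```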

      Github Url : https://github.com/linkedin/gobblin/issues/623#issuecomment-180100828


      stakiar wrote on 2016-02-05T02:02:59Z : Yes, we have seen similar problems when running production jobs, where OOM exceptions get thrown because `WorkUnit`s or `TaskState`s are too big. You are correct: I believe the `SourceState` is added to each `WorkUnit` inside a Gobblin Task (e.g. a Map Task, or YARN container). We have some work in progress to deprecate using the `State` objects, as we have found them to cause a number of problems. We plan to replace them with `gobblin-config-management`, a config management library based on Typesafe's [config](https://github.com/typesafehub/config). But the migration won't be happening for a while.

      Once the files are copied into your datamart, can they safely be deleted? One alternative would be to just delete the files once they are copied, in which case there is no need to internally track `source.filebased.fs.snapshot`.

      If not, then perhaps we can change `FileBasedSource` to solve this problem in a smarter way. Some ideas:

      • If the source file system is immutable (e.g. HDFS), do a file listing, iterate through each FileStatus, and track the highest modified timestamp seen. This becomes the [watermark](https://github.com/linkedin/gobblin/wiki/State-Management-and-Watermarks) for the job.
      • At the beginning of each job, do an `ls` on the directory and store it somewhere; there is really no need to keep it in memory in the Gobblin job, as it is only needed at the beginning of each execution.
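      The first idea can be sketched like this (a simplified illustration using `java.nio` against a local directory; a real implementation would iterate Hadoop `FileStatus` objects, and the class and method names here are mine):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.stream.Stream;

// Sketch of the watermark idea: instead of carrying a full file snapshot in
// state, keep only the highest modification timestamp seen, and pull files
// modified after the previous run's watermark.
public class WatermarkScan {
    // Fills `toPull` with files newer than `previousWatermark` and returns
    // the new watermark to persist for the next run.
    public static long scan(Path dir, long previousWatermark, List<Path> toPull)
            throws IOException {
        long newWatermark = previousWatermark;
        try (Stream<Path> files = Files.list(dir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                long modified = Files.getLastModifiedTime(p).toMillis();
                if (modified > previousWatermark) {
                    toPull.add(p); // new (or re-touched) since the last run
                }
                newWatermark = Math.max(newWatermark, modified);
            }
        }
        return newWatermark;
    }
}
```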

      Github Url : https://github.com/linkedin/gobblin/issues/623#issuecomment-180155540


      jbaranick wrote on 2016-02-05T02:31:56Z : We have put a temporary hack in `State` to not copy this property around. It is unfortunate that the current state system duplicates so much data. When do you think the new config system will be done?

      Github Url : https://github.com/linkedin/gobblin/issues/623#issuecomment-180163481


      stakiar wrote on 2016-02-10T21:47:48Z : We don't have a definite timeline, but regardless I think storing `source.filebased.fs.snapshot` in memory is the wrong solution, even if that is the current approach in `FileBasedSource`. Even if we fix the duplicate `State` issue, it's entirely possible that at some point even one copy of `source.filebased.fs.snapshot` is too big to fit into memory. We should consider changing the logic in `FileBasedSource` to use one of the above approaches.

      Github Url : https://github.com/linkedin/gobblin/issues/623#issuecomment-182595934


      jbaranick wrote on 2016-02-10T22:34:24Z : Regarding the two ideas from above:
      1. There is an edge case where the modified timestamp of the file will not be granular enough. Additionally, if the timestamp changes we would reprocess the file. Lastly, this only works if the file is written in one shot to the filesystem (not if the file is created and then subsequently written to).
      2. Regardless of where we store the snapshot, we need to do a set difference when determining which files to pick up.
         1. One way of doing this would be to pull the snapshot into memory and diff it with the current `ls` results. This would have similar problems to today.
         2. Alternatively, we could keep the snapshot as a sorted list and perform a streaming set difference against the sorted results of the current `ls`. Unfortunately, the `ls` call doesn't guarantee sorted results, but we could use an external merge sort.
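      The streaming set difference described above can be sketched as a single merge-style pass over two sorted listings (hypothetical class and method names); neither listing ever needs to be materialized as an in-memory set:

```java
import java.util.*;

// Sketch of a streaming set difference: given the previous snapshot and the
// current listing, both sorted ascending, one merge-style pass yields the
// files present now but absent from the snapshot. Memory use is O(1) beyond
// the result, since each input is consumed as a stream.
public class StreamingDiff {
    // Returns entries of `current` that do not appear in `snapshot`.
    // Both iterators must yield strings in ascending order.
    public static List<String> newFiles(Iterator<String> snapshot,
                                        Iterator<String> current) {
        List<String> result = new ArrayList<>();
        String snap = snapshot.hasNext() ? snapshot.next() : null;
        while (current.hasNext()) {
            String cur = current.next();
            // Advance the snapshot cursor past entries smaller than `cur`.
            while (snap != null && snap.compareTo(cur) < 0) {
                snap = snapshot.hasNext() ? snapshot.next() : null;
            }
            if (snap == null || !snap.equals(cur)) {
                result.add(cur); // not in the previous snapshot -> pick it up
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> out = newFiles(
            List.of("a", "c").iterator(),
            List.of("a", "b", "c", "d").iterator());
        System.out.println(out); // [b, d]
    }
}
```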

      Github Url : https://github.com/linkedin/gobblin/issues/623#issuecomment-182609969


          People

            Assignee: Unassigned
            Reporter: Joel Baranick (jbaranick)
