PIG-176

pig creates many small files when it spills

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1.0
    • Component/s: None
    • Labels:
      None

      Description

      Currently, on spill Pig can generate millions of small (under 128K) files. This is partially due to PIG-170, but even with that patch, you can still try to spill small bags.

      The proposal is to not spill small files. Alan told me that the logic is already there but we just need to bump the size limit.
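The proposed fix, skipping the spill for bags below a size limit, can be sketched as below. This is a minimal illustration, not Pig's actual code: the class, method, and constant names are hypothetical, and the 128K figure comes from the report above.

```java
// Hypothetical sketch of the proposed check: skip spilling bags whose
// estimated in-memory size falls below a configurable threshold.
public class SpillThresholdSketch {
    // 128K, matching the "small file" size mentioned in the report
    public static final long SPILL_SIZE_THRESHOLD = 128 * 1024;

    // Only spill when the bag is big enough to be worth its own file
    public static boolean shouldSpill(long estimatedBagSize) {
        return estimatedBagSize >= SPILL_SIZE_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(shouldSpill(64 * 1024));   // 64K bag stays in memory
        System.out.println(shouldSpill(1024 * 1024)); // 1MB bag is spilled
    }
}
```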

      Attachments

      1. pig_176_smallbags_v1.patch (6 kB, Pi Song)
      2. pig176_v2.patch (9 kB, Pi Song)

        Activity

        Alan Gates added a comment -

        Patch checked in at revision 652906.

        Pi Song added a comment -

        Updated with the latest trunk + made use of the new configuration structure.

        Pi Song added a comment -

        OK, will do that.

        Alan Gates added a comment -

        Pi,

        Did you want to rework this patch now since PIG-111 is in and you can read the properties from pig's Properties object rather than System.getProperties()?

        Pi Song added a comment -

        This patch implements (1) a spill file size threshold and (2) my idea from the last comment.

        "spill.size.threshold" and "spill.gc.activation.size" are to be set as JVM parameters or in .pigrc in order to use this new feature. Default values are 0 and Long.MAX_VALUE respectively.

        There is a bit of a problem with (1): Bag.getMemorySize() sometimes doesn't return an accurate value, so even when the threshold is set, it's still possible that files smaller than the threshold are created.

        The configuration code is still messy in MapReduceLauncher. This needs a clean-up after the configuration patch gets in.
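Reading the two properties above with their stated defaults might look like the sketch below. The property names and defaults (0 and Long.MAX_VALUE) come from this comment; the class and helper method are illustrative, not Pig's actual code.

```java
import java.util.Properties;

// Minimal sketch: parse the two spill-related properties described in the
// comment, falling back to the stated defaults when a property is unset.
public class SpillConfigSketch {
    public static long longProperty(Properties props, String name, long defaultValue) {
        String value = props.getProperty(name);
        return value == null ? defaultValue : Long.parseLong(value);
    }

    public static void main(String[] args) {
        // Pre-PIG-111 these come from System.getProperties(); afterwards
        // they could be read from Pig's own Properties object instead.
        Properties props = System.getProperties();
        long sizeThreshold = longProperty(props, "spill.size.threshold", 0L);
        long gcActivationSize = longProperty(props, "spill.gc.activation.size", Long.MAX_VALUE);
        System.out.println("spill.size.threshold=" + sizeThreshold);
        System.out.println("spill.gc.activation.size=" + gcActivationSize);
    }
}
```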

        Pi Song added a comment -

        Based on the fact that we now spill big bags first, my observation is that there are still cases where a big container bag is spilled, so its mContent becomes empty, but most of its inner bags' WeakReferences haven't been cleaned up by GC yet. In such cases, if we haven't freed up enough memory, those inner bags will be spilled unnecessarily (even though all their contents were already spilled with the big bag). There are possibly two simple ways to solve this:

        1) In SpillableMemoryManager, try putting Thread.yield() in between each spill. This should allow some more time for GC to do more clean-up without degrading performance too much. However, if the main execution thread doesn't produce any bag (e.g. a map task where all keys and values are tuples and atomic data), this will give the main execution thread more time to use up memory more quickly.

        2) Check the size of the current spillable being spilled. If it is larger than a constant X, do a System.gc(). This is safer than (1), but because we explicitly call GC more often, it may have some impact on performance. However, given that spilling small files is much slower than doing System.gc(), this approach should generally give better performance.

        I don't really have a processing task that incurs that much spilling. Can anyone please try (2) out?
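Option (2) above can be sketched as follows. This is an illustration only: the constant name, the Spillable interface shape, and the surrounding class are assumptions, not the actual SpillableMemoryManager code.

```java
// Sketch of option (2): before spilling a large spillable, trigger a GC so
// that WeakReferences to inner bags already emptied by a big-bag spill get
// cleared, avoiding redundant small spill files.
public class GcBeforeSpillSketch {
    // Hypothetical "constant X" from the comment above
    public static final long GC_ACTIVATION_SIZE = 64L * 1024 * 1024;

    // Simplified stand-in for Pig's spillable contract
    public interface Spillable {
        long getMemorySize();
        void spill();
    }

    public static void spillWithGcHint(Spillable s) {
        if (s.getMemorySize() > GC_ACTIVATION_SIZE) {
            // Give GC a chance to clear stale WeakReferences first, so the
            // memory manager doesn't also spill already-empty inner bags.
            System.gc();
        }
        s.spill();
    }
}
```

The trade-off named in the comment applies: explicit System.gc() calls cost time, but less than writing many tiny spill files.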

        Olga Natkovich added a comment -

        Pi,

        Running faster is part of it. The other part is not filling up disks with tiny files, which causes disk fragmentation and also takes forever to clean up at the end of processing, though your suggestion of cleaning as we go might help that a bit.

        Pi Song added a comment -

        So let's say if the size is smaller than some threshold, we don't spill, right? This is very easy to fix, but we will be able to reclaim a bit less memory than before, causing some tasks to fail more often in exchange for other tasks running faster. Is this acceptable?

        Probably the best way to go is to make it configurable, but PIG-111 isn't in yet. Sighhh..... I want to have more time.


          People

          • Assignee: Alan Gates
          • Reporter: Olga Natkovich
          • Votes: 0
          • Watchers: 0
