Mahout / MAHOUT-633

Add SequenceFileIterable; put Iterable stuff in one place

    Details

      Description

      In another project I have a useful little class, SequenceFileIterable, which simplifies iterating over a sequence file. It's like FileLineIterable. I'd like to add it, then use it throughout the code. See patch, which for now merely has the proposed new classes.

      It also moves some other iterator-related classes that had ended up outside their rightful home, common.iterator.
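      For illustration, here is a minimal sketch of how such an iterable might be used. SequenceFileIterable is the class proposed in this issue; the exact constructor and the Pair-based element type are assumptions for this sketch, not the final API.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.Text;
      import org.apache.mahout.common.Pair;
      import org.apache.mahout.math.VectorWritable;

      public final class SequenceFileIterableSketch {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Path input = new Path(args[0]);
          // Hypothetical usage: iterate (key, value) pairs of one sequence file
          // the way FileLineIterable iterates lines of a text file.
          for (Pair<Text, VectorWritable> record
              : new SequenceFileIterable<Text, VectorWritable>(input, conf)) {
            Text key = record.getFirst();
            VectorWritable value = record.getSecond();
            System.out.println(key + " -> " + value.get().getNumNondefaultElements());
          }
        }
      }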

      1. MAHOUT-633.patch
        529 kB
        Sean Owen
      2. MAHOUT-633.patch
        450 kB
        Sean Owen
      3. MAHOUT-633.patch
        214 kB
        Sean Owen
      4. MAHOUT-633.patch
        29 kB
        Sean Owen

        Activity

        Hudson added a comment -

        Integrated in Mahout-Quality #708 (See https://hudson.apache.org/hudson/job/Mahout-Quality/708/)

        Dmitriy Lyubimov added a comment -

        Looks fine. Terrific effort.
        I guess I can wire up SequenceFileDirIterable as appropriate, later.

        Sean Owen added a comment -

        One last patch. This has everything we've talked about. It won't create new instances unless it's necessary. Should be a win-win change now.

        Dmitriy Lyubimov added a comment -

        Actually, the deep-copy iterator was using a copy constructor just because it was a hasty solution. It could've reused the same double[] (as in the 'writable' pattern). Still, this side-R-matrix info doesn't involve mass iteration, so it probably won't be much of an overhead (only ~1000 instances in the whole job, max).

        Dmitriy Lyubimov added a comment -

        Ok, I see just two functional changes:

        – replacing the deep-copy iterator for upper-triangular matrices with a copy-constructor iterator, which seems fine (although it uses reflection to invoke the copy constructor whereas the ad-hoc code uses direct invocation, which is technically slower, but who cares).

        – using SequenceFileIterator for reading the side-info R sequences in BtJob. Also seems fine from looking at it. This actually could use the multi-file glob iterator, provided there's a way to enforce order like I described.

        I think qtInput might also be replaced with the same iterator (second side feed).

        Am I missing any other functional changes? Thanks.

        Grant Ingersoll added a comment -

        You probably should take over the SequenceFileVectorIterable class I added in Utils.

        Dmitriy Lyubimov added a comment -

        In the tests I ran, not significant. But I was using a preprocessing handler on VectorWritable (the patch you guys were reluctant to accept) which did not create intermediate vector storage at all. All matrix elements were passed on the stack. Mahout's patch doesn't have this code, but I will be happy to put it in JIRA for discussion. Also, I did not run close to the memory limits in the tasks I ran with SSVD. I just don't have datasets that big.

        Actually, a correction: I do think I saw a difference in running time between code using the existing VectorWritable, which creates interim vector instances for each iteration, and code that just passes matrix elements on the stack. But like I said, I did not measure the difference, as that was not my goal. And it's not the evidence I'd like to appeal to anyway, since there wasn't much side info. But I have other situations (outside the Mahout realm, but also iterative batches with big side info) that I'd like to appeal to.

        Dmitriy Lyubimov added a comment - - edited

        I guess I need to clarify that I am not talking about a situation where something is bad when you have -Xmx6G and start using a lot of small references.

        I am talking about a situation where, before you start iterating, you have already preloaded 5G out of the 6G with "stuff", or something along those lines. A low-memory situation.

        Dmitriy Lyubimov added a comment -

        How much of that time was due to GC?

        In the tests I ran, not significant. But I was using a preprocessing handler on VectorWritable (the patch you guys were reluctant to accept) which did not create intermediate vector storage at all. All matrix elements were passed on the stack. Mahout's patch doesn't have this code, but I will be happy to put it in JIRA for discussion. Also, I did not run close to the memory limits in the tasks I ran with SSVD. I just don't have datasets that big.

        But I ran other code that did, and like I said, running-time losses were significant, up to an order of magnitude (in JVM 1.5).

        Do you have any evidence that simpler techniques that cause ephemeral garbage are increasing your memory pressure?

        I am not sure I understand this. If you are asking whether memory use per se is increased because you are using tons of short-lived references instead of one 'old' gen reference, no, I don't believe that effect would be significant enough to construe it as detrimental. It's just that with near-limit memory use you'd spend significantly more CPU on this; actually, surprisingly a lot of CPU on 64-bit systems with big heaps and a lot of references in them (>40-second global pauses, i.e. 4 times that on a per-CPU basis, per full GC run on 12Gb), that's all. It's only apparent in jobs that need a lot of side info to run.

        Ted Dunning added a comment -

        Not in SSVD; it packs parts of a massive-scale QR and stochastic projection into one map step, and I had it at 98.8% average CPU saturation.

        Dmitriy,

        How much of that time was due to GC?

        Do you have any evidence that simpler techniques that cause ephemeral garbage are increasing your memory pressure?

        Sean Owen added a comment -

        Sounds like we just need a new flag or something to select reuse of the key/value objects. Then I can go back and enable it where the code seemed to have been reusing them already. I can get on that along with ordering support.

        While it's sounding complex... I think it's really not much, compared to the amount of clean-up and code removal this is enabling. I quite like all this.

        Good luck reading the patch. Really, you want to look at new code in .common.iterator.sequencefile, and how it's used in your bits of code. The rest is probably not relevant.
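        As a rough sketch of the key/value reuse flag mentioned above (the class name, constructor, and field names below are assumptions for illustration, not the patch's actual API), here is one way a SequenceFile.Reader wrapper could let the caller choose between reused and fresh instances:

        import java.io.Closeable;
        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Writable;
        import org.apache.hadoop.util.ReflectionUtils;

        /** Minimal sketch of a reuse flag; not the committed Mahout API. */
        public final class ReusableSequenceFileReader implements Closeable {

          private final SequenceFile.Reader reader;
          private final Configuration conf;
          private final boolean reuseKeyValueInstances;
          private Writable key;
          private Writable value;

          public ReusableSequenceFileReader(Path path,
                                            boolean reuseKeyValueInstances,
                                            Configuration conf) throws IOException {
            this.reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
            this.conf = conf;
            this.reuseKeyValueInstances = reuseKeyValueInstances;
          }

          /** Returns true and populates key/value if another record was read. */
          public boolean next() throws IOException {
            if (!reuseKeyValueInstances || key == null) {
              // Fresh instances per call: safe if the caller keeps references around.
              key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
              value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            }
            // With reuse enabled, the same two instances are overwritten on every call.
            return reader.next(key, value);
          }

          public Writable key()   { return key; }
          public Writable value() { return value; }

          @Override
          public void close() throws IOException {
            reader.close();
          }
        }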

        Dmitriy Lyubimov added a comment -

        Oh yes, it's already split up that way and one delegates to the other. Ordering is easy to add. How about I add it to the patch and let you take a look at using it? I say that as I think it may be better if you have a look at what I changed directly to make sure it makes sense, and that's a small way of accomplishing this.

        Yes sir, I will try to do it in the evening (my daughter's going-to-sleep routine permitting).

        Dmitriy Lyubimov added a comment -

        And, in about half the cases, the caller is cloning the key and/or value because it wants to save a copy. So in some cases it's already making new objects.

        Yes, that's true. We can't prevent people from screwing it up. We can only give them a chance not to. And I want that chance for myself.

        I had in mind that this factor is probably dwarfed by I/O and the actual deserialization... right? I had the impression these Hadoop jobs were most certainly I/O bound, not memory/GC/CPU bound.

        Not in SSVD; it packs parts of a massive-scale QR and stochastic projection into one map step, and I had it at 98.8% average CPU saturation. Which basically told me I wasn't wasteful on I/O, which I tried pretty hard not to be. QR algorithms are quadratic, even though we reduce the scale of the problem. I am still a little bit wasteful on flops here, but it's not dramatic, and it's got to be enough for open source. So this near-limit-memory GC stuff will affect me very, very much (I built a series of jobs with similar dynamics before in Java 1.5; it was pretty bad, up to 50 times slower, until I employed the strategies I described above).

        Dmitriy Lyubimov added a comment - - edited

        I haven't looked through the patch yet. I will try to do it in the evening.

        But I think we are in agreement, except for the missing ordering support; I could add it later if you want.

        Sean Owen added a comment -

        Oh yes, it's already split up that way and one delegates to the other. Ordering is easy to add. How about I add it to the patch and let you take a look at using it? I say that as I think it may be better if you have a look at what I changed directly to make sure it makes sense, and that's a small way of accomplishing this.

        Dmitriy Lyubimov added a comment - - edited

        Yes, it is often the case, especially with linear-algebra side-info files, that you need to load those blocks in a specific order. In this case the order is determined by the task id. That's what my code is doing here.

        It's just that if you don't support some strategy for determining file order, then I can't use your multi-file glob support.

        anything involving SSVDSolver.partitionComparator for instance?

        Yes.


        Thought No. 2: I also had the thought that maybe it's worth factoring the problem into two pieces: (1) a sequence-file iterator, and (2) a multi-file glob iterator (supporting ordering as well) that delegates iteration to the single-file iterator. (That's usually how I've done that sort of thing.)

        Sean Owen added a comment -

        On sorting the order of input: Dmitriy, I think I kept my hands off the cases where it looked like it was not a simple matter of iterating through the results from globStatus() and listStatus(). The patch is, well, a mess to read. What in particular are you referring to, so I can double-check? Anything involving SSVDSolver.partitionComparator, for instance? I didn't touch those.

        We can definitely add sorting support.

        Sean Owen added a comment -

        On new Writables: I agree, I don't think it can be faster to allocate many Writables. This is really the big possible problem with what this patch does.

        I found that in maybe 60% of cases the key is discarded, so I wrote special support for that which does not make a new key object on each iteration.

        And, in about half the cases, the caller is cloning the key and/or value because it wants to save a copy. So in some cases it's already making new objects.

        I had in mind that this factor is probably dwarfed by I/O and the actual deserialization... right? I had the impression these Hadoop jobs were most certainly I/O bound, not memory/GC/CPU bound.

        There's a subtle argument for correctness too... it's all too easy to forget to clone a key or value and get bitten by a subtle bug. But this is more of a minor practical argument.

        And of course the whole point of the exercise is to standardize the code, since from touring the code it was clear there were many approaches (e.g. makeQualified() or not) and occasional correctness issues (e.g. no call to close()).

        That's my reasoning. Thoughts? I suppose my gut is that this isn't a big deal, but I don't have evidence.
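        As a hedged sketch of that key-discarding idea (the class below is illustrative only; names and behavior are assumptions, not the patch's classes): when the caller never sees the key, a single key instance can be allocated up front and overwritten on every read, while values stay fresh per record.

        import java.io.Closeable;
        import java.io.IOException;
        import java.util.Iterator;
        import java.util.NoSuchElementException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Writable;
        import org.apache.hadoop.util.ReflectionUtils;

        /** Values-only iteration sketch: the key Writable is allocated once and reused internally. */
        final class SequenceFileValueOnlySketch<V extends Writable> implements Iterator<V>, Closeable {

          private final SequenceFile.Reader reader;
          private final Configuration conf;
          private final Writable reusedKey; // never handed to the caller, so reuse is safe
          private V next;

          SequenceFileValueOnlySketch(Path path, Configuration conf) throws IOException {
            this.reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
            this.conf = conf;
            this.reusedKey = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            advance();
          }

          @SuppressWarnings("unchecked")
          private void advance() throws IOException {
            // A fresh value instance per record keeps callers safe if they store references.
            V value = (V) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            next = reader.next(reusedKey, value) ? value : null;
          }

          @Override
          public boolean hasNext() {
            return next != null;
          }

          @Override
          public V next() {
            if (next == null) {
              throw new NoSuchElementException();
            }
            V result = next;
            try {
              advance();
            } catch (IOException ioe) {
              throw new IllegalStateException(ioe);
            }
            return result;
          }

          @Override
          public void remove() {
            throw new UnsupportedOperationException();
          }

          @Override
          public void close() throws IOException {
            reader.close();
          }
        }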

        Sean Owen added a comment -

        Here comes a new version of the patch. Yes, it has glob and list support, ReflectionUtils, and it avoids creating extra key instances where they are not needed.

        This actually should be called the "Great Iterator Patch". I kept finding little things around Iterators, Iterables, and PathFilters that could be refactored and threw them into the mix. I really like the result, but it's a doozy of a patch. It deletes a net 500 lines of code, tests pass, it standardizes about 80% of the interaction with reading sequence files, and it fixes a few small bugs along the way (see what I did to StableFixedSizeSamplingIterator, for instance; it wasn't stable).

        Sebastian I think there are appropriate calls to makeQualified().

        Dmitriy let me now go read your comments, I had not actually seen them yet!

        Dmitriy Lyubimov added a comment -

        Here is how HBase is fighting this: https://issues.apache.org/jira/browse/HBASE-3455. Not quite the same as batch iteration, but basically the same thing I was talking about: cloning stuff into one "old" reference and reusing it is more "benign" than tons of short-lived references. It's not clear whether they got any significant performance boost, but they clearly experience dramatically fewer full GCs in 0.90.1. I think performance would be affected more in a CPU-bound batch, though, than in an HBase-type service.

        Dmitriy Lyubimov added a comment -

        P.S. My stochastic projection jobs with a modified VectorWritable, which doesn't form an actual vector at all but rather passes elements on the stack, also seem to run significantly faster just because of that, which is basically equivalent to case A vs. case B. I don't have an exact benchmark comparison on hand, since I never really wanted to compare them, but it's quite apparent even with -Xmx200M map tasks. So perhaps if we had actual real-life simulation benchmarks, we could see how "bad" or "good" it is. But my evidence so far has been pretty "bad", although I haven't bothered collecting it for some time now, so it relates to less modern JVMs.

        Dmitriy Lyubimov added a comment -


        I think that creating a new Writable each time is probably a good thing on the whole. I doubt seriously that it will be any slower as long as it avoids unnecessary copying of a large structure. If you avoid large copies then new allocation can actually be better than re-use since new allocation keeps the garbage reclamation work in the newspace collector which is considerably better than letting stuff get copied several times and possibly even tenured.

        So we are clear, we are talking about:

        A

        Writable w = ReflectionUtils.newInstance(...);
        for (int i = 0; i < 100000000; i++) {
          // ... do something with w ...
        }

        vs.

        B

        for (int i = 0; i < 100000000; i++) {
          Writable w = ReflectionUtils.newInstance(...);
          // ... do something with w ...
        }

        ... and you are essentially saying that B is faster than A. I am not a GC expert, so I reserve the right to be wrong, but I am dubious about it because of the benchmarks and simulations I have run on Java.

        1) It's one writable (a handful of references at most) vs. a young reference in every iteration. Sure, YGC and allocation are fast, but I am kind of dubious that it is faster than not doing anything at all. In fact, we want it to tenure and even go to the permanent pool so YGC 'forgets' about it and doesn't check it any more, but even if that doesn't happen we don't care much, since it's only one instance for the GC. The overhead, as far as I understand, in this case only happens during full GCs, which would be rarer than YGCs.

        2) What I usually witness is that the dynamics change quite a bit when you approach the memory limit. Full GCs happen more frequently in that situation (perhaps you can fight that by decreasing the young GC space, but I couldn't, and it's still a hassle to tune). Full GCs are also quite long-running. In fact, this problem is so bad that HBase folks told me they had cases where a full GC, even with 12G, caused pauses long enough to break the 40-second ZooKeeper session to a region node (causing node die-offs). So they in fact recommend longer ZK sessions for larger RAM, just because of GC!

        So either you don't approach the limit (i.e. don't use all the RAM you paid for), or you preallocate stuff and let it tenure without adding much new into the mix.

        In practice, what I found is that allocating new objects that don't leave young gen is indeed better in my benchmarks than yanking them from either a pessimistic or even an optimistic object pool (with optimistic pools, i.e. those based on compare-and-swap operations, surprisingly being only marginally better), but things start to change quite dramatically as soon as you try to fill, say, 5G out of 6 and factor in all the tenured and full-GC overhead.

        Actually, in practice, for real-time applications the best approach I found is to run Java processes at ~300M with the rest of the system dedicated to the I/O cache, which keeps my memory-mapped structures (btrees and such). Java heaps are only used to allocate a handful of long-lived and reused object trees to walk the memory-mapped structures. That combination seems to be unbeatable so far: the kernel manages the 'bulk' memory and you manage the TWAs. And even if the rest of the processing causes full GCs, they are unlikely to be catastrophic on a heap of that size. But that of course is not applicable to batches.

        In general, I would be very grateful if somebody could give me a hint on how to fight GC thrashing in near-full-RAM situations, but in any event I doubt that it would be by preferring code B over A.

        Dmitriy Lyubimov added a comment - - edited

        Seems good to me.
        I am not stuck on certain style issues; I just used Eclipse's Ctrl-Shift-F autoformat. So it's fine, it's just hard for me to tell 'good' from 'bad' sometimes. For example, I don't use a compound block

        { .. }

        if there's only one statement in there. It's just something I do. I thought it was also Sun's convention, but I don't have a strong opinion.

        I see that you don't support glob expressions for multiple files yet? Did you decide not to support multiple files, or is it TBD, or am I missing some extra helper functionality here?

        I see that you applied that to loading the (R_1, R_2...R_z). That seems technically correct to me the way it is done here. But the irony is that it couldn't use a parameterized glob expression, because the order of the side R files is relevant. So if you ever switch to multiple files via a Hadoop glob, that piece would not be covered. Perhaps there's room for a contract that supplies both a glob and a comparator for file selection, so that files are opened in a predefined order (like there in BtJob).

        perhaps a constructor version

         
        SequenceFileIterator(String (or Path) glob, Comparator<String or Path> openSequenceComparator)
        

        would solve that in one bite.
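        A rough sketch of that glob-plus-comparator idea (the class and method names here are assumptions for illustration, not the patch's API): list the files matching the glob, sort them with the caller-supplied comparator, and only then open per-file iterators in that order.

        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.Comparator;
        import java.util.List;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        /** Sketch: resolve a glob to an ordered list of paths using a caller-supplied comparator. */
        final class OrderedGlobSketch {

          static List<Path> orderedPaths(Path glob,
                                         Comparator<Path> openSequenceComparator,
                                         Configuration conf) throws IOException {
            FileSystem fs = glob.getFileSystem(conf);
            FileStatus[] statuses = fs.globStatus(glob);   // e.g. a pattern over the R files
            List<Path> paths = new ArrayList<Path>();
            if (statuses != null) {
              for (FileStatus status : statuses) {
                paths.add(status.getPath());
              }
            }
            // Enforce the caller's ordering, e.g. by the task id encoded in the file name,
            // before any per-file iterator is opened.
            Collections.sort(paths, openSequenceComparator);
            return paths;
          }
        }

        With such an ordered list in hand, a multi-file iterable could simply chain single-file iterators over the paths in that order.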

        There's a helper method in SSVDSolver that does the same (loads matrices into memory). That probably gets extensive use, mostly in tests I think, but it might be used for something else. So that might be refactored to use this as well, but I can attend to it myself later.

        I am kind of inclined to frown on not reusing Writables. That's what Hadoop does, and there's a rather good reason for it: GC thrashing. The effects on massive datasets are felt especially if the job is CPU-bound (which SSVD is). I actually see up to 20% improvement in running times when I equip VectorWritable with a preprocessing ability so that it doesn't create a vector for every iteration, just because of GC thrashing. The SSVD solver relies on that to load the side R sequences (currently all at once into RAM, which is currently a bottleneck for RAM scalability). That can be somewhat sizeable, so it had better be done as efficiently as possible. GC thrashing is probably going to be especially bad in this case, since we are trying to load a lot of stuff into memory with the same (tenured) lifespan, and GC will have especially bad overhead in young-gen space as we approach the RAM limits for the R data.

        The problem usually is that it is OK to approach the RAM limits when you don't need to relinquish objects. But if there is a lot of (small, dead) GC material floating around, full GCs (including memory defrags) will occur more often and take more time, to the degree that you'd eventually be busier doing GCs than anything else on 64-bit systems with large heaps (>2G -Xmx). This may last for quite a long time before you are bound to converge on OOM. But if your RAM-use prediction is tuned accurately, you'd never hit OOM; it'd just be painfully slow in Java (I have a lot of experience seeing that). The only remedy I know of is to preallocate the near-limit RAM you need in advance, in big chunks, and not to create any (if possible at all) 'dead small' GC stuff per iteration (IMO). That said, this purely comes from observing the effects; I am not actually very strong on the inner workings of GC.

        Ted Dunning added a comment -

        The one possible downside is that this implementation creates a new Writable for each read. This is mildly positive in that it avoids some common bugs in reading from sequence files wherein the caller doesn't realize it's storing a copy to a Writable that's changing. (The Mahout code is cloning/instantiating new ones in most cases anyways as it has to) It does mean more objects allocated though. While I think the overhead of that is minor, probably, compared to the I/O of the read itself, it's not obviously trivial.

        I think that creating a new Writable each time is probably a good thing on the whole. I doubt seriously that it will be any slower as long as it avoids unnecessary copying of a large structure. If you avoid large copies then new allocation can actually be better than re-use since new allocation keeps the garbage reclamation work in the newspace collector which is considerably better than letting stuff get copied several times and possibly even tenured.

        Sean Owen added a comment -

        I'm through core now, but no tests or other modules. It's already an epic patch, but I think it's quite good. It actually deletes 100 lines more of code than it adds, and reduces some duplication, and even does a bit better job at cleanup.

        I'm still working on this patch.

        The one possible downside is that this implementation creates a new Writable for each read. This is mildly positive in that it avoids some common bugs in reading from sequence files wherein the caller doesn't realize it's storing a copy to a Writable that's changing. (The Mahout code is cloning/instantiating new ones in most cases anyways as it has to) It does mean more objects allocated though. While I think the overhead of that is minor, probably, compared to the I/O of the read itself, it's not obviously trivial.

        I think this can be designed around in the wrapper – for instance often the key is not used so that can be embedded in the logic of the iterator to avoid allocations.

        The only gotchas to watch out for in this patch are, first, that change / fix? to VectorCache. And second, Dmitriy, when I ran into some of your latest commits I left several uses of SequenceFile.Reader as-is, since it needs to be done that way, but ended up accumulating a load of small style changes. That bit of the patch is noisy. (There was also a section where I think I elided use of a "closeables" object in the case where it seemed to be a no-op?)

        Thoughts?

        Sean Owen added a comment -

        Both good points!

        Dmitriy Lyubimov added a comment -

        It would be great if we also had an iterator that runs over all the parts* files in a directory, that would allow us to simplify a lot of code.

        Even better: Hadoop globs...

        Dmitriy Lyubimov added a comment - - edited

        If you want to be true to the Hadoop contract, you need to refactor the following to use Hadoop's ReflectionUtils and pass in the configuration. There are tons of Writables around that are also Configurable, including one of my Mahout branches that equips VectorWritable with additional capabilities and controls them by making it Configurable.

        private void instantiateKeyValue() throws IOException {
          try {
            key = keyClass.newInstance();
            if (noValue) {
              value = null;
            } else {
              value = valueClass.newInstance();
            }
          } catch (InstantiationException ie) {
            throw new IOException(ie);
          } catch (IllegalAccessException iae) {
            throw new IOException(iae);
          }
        }
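
        A sketch of that suggested refactoring (assuming the surrounding iterator keeps the same key, value, keyClass, valueClass, and noValue fields, plus a conf field holding the job Configuration): Hadoop's ReflectionUtils.newInstance() both instantiates the class and calls setConf() on anything Configurable, and the checked-exception handling above goes away with it.

        // Sketch only: field names are assumed to match the iterator class above.
        // Uses org.apache.hadoop.util.ReflectionUtils, which handles Configurable writables.
        private void instantiateKeyValue() {
          key = ReflectionUtils.newInstance(keyClass, conf);
          value = noValue ? null : ReflectionUtils.newInstance(valueClass, conf);
        }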
        
        Sebastian Schelter added a comment -

        Looks very useful to me!

        It would be great if we also had an iterator that runs over all the parts* files in a directory, that would allow us to simplify a lot of code.


          People

          • Assignee:
            Sean Owen
            Reporter:
            Sean Owen
          • Votes:
            0
            Watchers:
            1
