Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The LinkedList used by CachingTokenFilter is accessed using the get() method. Direct access on a LinkedList is slow and an Iterator should be used instead. For more than a handful of tokens, the difference in speed grows exponentially.

      1. CachingTokenFilter.patch
        0.7 kB
        Mark Miller
      2. CachingTokenFilterRev2.patch
        1 kB
        Mark Miller

        Issue Links

          Activity

          Hide
          Mark Miller added a comment -

          Actually, I think the starting size should probably be upped as well. I will do some tests.

          • Mark
          Show
          Mark Miller added a comment - Actually, I think the starting size should probably be upped as well. I will do some tests. Mark
          Hide
          Doron Cohen added a comment -

          Good catch, patch look good to me.
          It makes sense to modify anyhow, but just wondering, - what is the performance difference?

          While we're looking at this, I noticed three other uses of LinkedList that can be changed to ArrayList:

          • DocumentWriter
          • CompoundFileWriter
          • MultipleTermPositions
          Show
          Doron Cohen added a comment - Good catch, patch look good to me. It makes sense to modify anyhow, but just wondering, - what is the performance difference? While we're looking at this, I noticed three other uses of LinkedList that can be changed to ArrayList: DocumentWriter CompoundFileWriter MultipleTermPositions
          Hide
          Michael Busch added a comment -

          > While we're looking at this, I noticed three other uses of LinkedList
          > that can be changed to ArrayList:

          Well, an ArrayList is not always faster than a LinkedList, is it?
          A LinkedList should be faster for insertions and removals in the middle
          of the list compared to an ArrayList. So I think we have to understand
          how the lists are used in the different spots before we replace them.

          To this patch: I actually chose LinkedList here intentionally because
          random-access to the list is not needed here. We only iterate over the
          list and I expected the performance to be comparable to the ArrayList.
          But inconsequently I used the get() method instead of a list iterator,
          which is for sure slower compared to the ArrayList. I wonder how the
          LinkedList would perform if we changed this class to use an iterator?

          Thanks Mark your efforts here!

          Show
          Michael Busch added a comment - > While we're looking at this, I noticed three other uses of LinkedList > that can be changed to ArrayList: Well, an ArrayList is not always faster than a LinkedList, is it? A LinkedList should be faster for insertions and removals in the middle of the list compared to an ArrayList. So I think we have to understand how the lists are used in the different spots before we replace them. To this patch: I actually chose LinkedList here intentionally because random-access to the list is not needed here. We only iterate over the list and I expected the performance to be comparable to the ArrayList. But inconsequently I used the get() method instead of a list iterator, which is for sure slower compared to the ArrayList. I wonder how the LinkedList would perform if we changed this class to use an iterator? Thanks Mark your efforts here!
          Hide
          Mark Miller added a comment -

          My tests early must have gotten out of whack. I was measuring a much bigger difference than I see now.

          As a result, I started from scratch, carefully creating and lableing a new Lucene core jar for each case and averaging the performance over 15,000 calls creating and reading TokenStreams off the Reuters data.

          After very thorough testing (I was in quite a hurry this morning), I have come up with the following:

          LinkedList() using get, LinkedList() using iterator, and ArrayList() are practically identical in speed.

          ArrayList(30) gave a 47% increase in speed. Above 30-60 gave no more returns.

          This patch should not go through as is. What do you think given these results? I assumed that an ArrayList would be faster as all of the data is guaranteed contiguous, but it surprised me that the resizing was not enough to slow things down to LinkedList speed (unless you start with too low an initial size – default is 10).

          • Mark
          Show
          Mark Miller added a comment - My tests early must have gotten out of whack. I was measuring a much bigger difference than I see now. As a result, I started from scratch, carefully creating and lableing a new Lucene core jar for each case and averaging the performance over 15,000 calls creating and reading TokenStreams off the Reuters data. After very thorough testing (I was in quite a hurry this morning), I have come up with the following: LinkedList() using get, LinkedList() using iterator, and ArrayList() are practically identical in speed. ArrayList(30) gave a 47% increase in speed. Above 30-60 gave no more returns. This patch should not go through as is. What do you think given these results? I assumed that an ArrayList would be faster as all of the data is guaranteed contiguous, but it surprised me that the resizing was not enough to slow things down to LinkedList speed (unless you start with too low an initial size – default is 10). Mark
          Hide
          Doron Cohen added a comment -

          > Mark: I assumed that an AL would be faster as all of the data is guaranteed contiguous

          Only the pointers to the objects are contiguous, right? The tokens themselves are, well, where they are. But with LinkedList there are new objects created, containing the tokens and the pointers to the other list members. So it may be safe to say that if you can estimate the list size (avoiding array grow), AL is preferable if there's no add/remove not at the end.

          > Michael: (~) LL iterator comparable to AL

          That's a good point. I had the impression that AL is always simpler than LL and unless removing or adding not at the end, it is preferable. (that's why I excluded the NgramTokenFiltrers that use LL.removeFirst()). Now you're saying that with iteration (instead of direct access) LinkedList is supposed to be faster - could be, since then there's no need to grow the array. (however you have more "pointers").

          With this reasoning -

          • CompoundFileWriter - using iterator, no direct access.
          • MultipleTermPositions - same.
          • DocumentWRiter - same.
            So I am not so sure anymore about needing to change in these classes.

          ---------------

          In summary since we can't assume estimating the size in advance, I think the best change would be as Michael suggested to use Iterator in CachingTokenFilter.

          Show
          Doron Cohen added a comment - > Mark: I assumed that an AL would be faster as all of the data is guaranteed contiguous Only the pointers to the objects are contiguous, right? The tokens themselves are, well, where they are. But with LinkedList there are new objects created, containing the tokens and the pointers to the other list members. So it may be safe to say that if you can estimate the list size (avoiding array grow), AL is preferable if there's no add/remove not at the end. > Michael: (~) LL iterator comparable to AL That's a good point. I had the impression that AL is always simpler than LL and unless removing or adding not at the end, it is preferable. (that's why I excluded the NgramTokenFiltrers that use LL.removeFirst()). Now you're saying that with iteration (instead of direct access) LinkedList is supposed to be faster - could be, since then there's no need to grow the array. (however you have more "pointers"). With this reasoning - CompoundFileWriter - using iterator, no direct access. MultipleTermPositions - same. DocumentWRiter - same. So I am not so sure anymore about needing to change in these classes. --------------- In summary since we can't assume estimating the size in advance, I think the best change would be as Michael suggested to use Iterator in CachingTokenFilter.
          Hide
          Michael Busch added a comment -

          > This patch should not go through as is. What do you think given these
          > results? I assumed that an ArrayList would be faster as all of the data
          > is guaranteed contiguous, but it surprised me that the resizing was not
          > enough to slow things down to LinkedList speed (unless you start with
          > too low an initial size – default is 10).

          I think an ArrayList also has higher initialization costs. Your test
          actually tests the performance for a single document. It would be
          interesting to know how the different implementations perform when you
          run the tests with more than one document. I would think that LinkedList()
          is probably better if you have lots of very small documents, whereas
          ArrayList(30) is faster if you have bigger docs with lots of Tokens.

          Show
          Michael Busch added a comment - > This patch should not go through as is. What do you think given these > results? I assumed that an ArrayList would be faster as all of the data > is guaranteed contiguous, but it surprised me that the resizing was not > enough to slow things down to LinkedList speed (unless you start with > too low an initial size – default is 10). I think an ArrayList also has higher initialization costs. Your test actually tests the performance for a single document. It would be interesting to know how the different implementations perform when you run the tests with more than one document. I would think that LinkedList() is probably better if you have lots of very small documents, whereas ArrayList(30) is faster if you have bigger docs with lots of Tokens.
          Hide
          Mark Miller added a comment -

          The 15,000 calls are each on a separate document. The documents are reletivley small...newspapers articles from Reuters. Anything smaller would have to be very small.

          I have again carefully tested LinkedList get VS LinkedList iterator and the performance is identical as far as I can tell.

          I'll do more work to prove my case when I get a free moment, but just to be clear:

          I am using small documents (15,000 different varying sized docs), measuring the total time and dividing by 15,000. The results show a 43% improvement using ArrayList(30). I will run a test will even smaller docs when I get a chance. In my work with a new Span based Highlighter, I need this speed or my implementation is slower than the old Highlighter. With this boost, my Span based Highlighter is actually (very)slightly faster. If you decide to keep things as they are I will have to roll an alternate CachingTokenFilter for my Highlighter (no problem of course <g>).

          Perhaps it is best to just leave things as they are and if you need more performance on docs with more than a handful of tokens, make your own Caching Filter. If the common case is closer to docs the size of newspaper articles or larger, a 43% gain is hard to ignore.

          I will get back about the speed when using very short documents.

          >>Only the pointers to the objects are contiguous, right?
          One of these days I will actually make that transition from C++ to Java <g> I don't know where the speed is coming from then...but its a heck of a difference.

          • Mark
          Show
          Mark Miller added a comment - The 15,000 calls are each on a separate document. The documents are reletivley small...newspapers articles from Reuters. Anything smaller would have to be very small. I have again carefully tested LinkedList get VS LinkedList iterator and the performance is identical as far as I can tell. I'll do more work to prove my case when I get a free moment, but just to be clear: I am using small documents (15,000 different varying sized docs), measuring the total time and dividing by 15,000. The results show a 43% improvement using ArrayList(30). I will run a test will even smaller docs when I get a chance. In my work with a new Span based Highlighter, I need this speed or my implementation is slower than the old Highlighter. With this boost, my Span based Highlighter is actually (very)slightly faster. If you decide to keep things as they are I will have to roll an alternate CachingTokenFilter for my Highlighter (no problem of course <g>). Perhaps it is best to just leave things as they are and if you need more performance on docs with more than a handful of tokens, make your own Caching Filter. If the common case is closer to docs the size of newspaper articles or larger, a 43% gain is hard to ignore. I will get back about the speed when using very short documents. >>Only the pointers to the objects are contiguous, right? One of these days I will actually make that transition from C++ to Java <g> I don't know where the speed is coming from then...but its a heck of a difference. Mark
          Hide
          Michael Busch added a comment -

          > If you decide to keep things as they are I will have to roll an
          > alternate CachingTokenFilter for my Highlighter (no problem of
          > course <g>).

          I'm certainly up to make a change here if it improves performance
          significantly (and yes, 43% is significant). I just want to be
          sure that we don't optimize a special case. So yes, it would be
          great if you could vary the parameters in your tests.

          Btw. which JVM version are you using?

          Show
          Michael Busch added a comment - > If you decide to keep things as they are I will have to roll an > alternate CachingTokenFilter for my Highlighter (no problem of > course <g>). I'm certainly up to make a change here if it improves performance significantly (and yes, 43% is significant). I just want to be sure that we don't optimize a special case. So yes, it would be great if you could vary the parameters in your tests. Btw. which JVM version are you using?
          Hide
          Mark Miller added a comment -

          > So it may be safe to say that if you can estimate the list size (avoiding array grow), AL is preferable if there's no add/remove not at the end.

          In the CachingTokenFilter case I don't even believe it is really necessary to estimate the list size. Many of the documents I used had way more than 30 tokens, but initializing the Array larger gave no benefits. I believe this is because the ArrayList doubles each time it grows (not guaranteed, but how it is implemented), and so a small increase in size can dramatically lower the number of resizes needed even when the List must grow much bigger than the init size. 10 just doesn't cut it, but 30 works great. A LinkedList (iterator or get()) seems to perform no better than an ArrayList(10).

          • Mark
          Show
          Mark Miller added a comment - > So it may be safe to say that if you can estimate the list size (avoiding array grow), AL is preferable if there's no add/remove not at the end. In the CachingTokenFilter case I don't even believe it is really necessary to estimate the list size. Many of the documents I used had way more than 30 tokens, but initializing the Array larger gave no benefits. I believe this is because the ArrayList doubles each time it grows (not guaranteed, but how it is implemented), and so a small increase in size can dramatically lower the number of resizes needed even when the List must grow much bigger than the init size. 10 just doesn't cut it, but 30 works great. A LinkedList (iterator or get()) seems to perform no better than an ArrayList(10). Mark
          Hide
          Mark Miller added a comment -

          I am testing on Java 1.5

          I tested with LinkedList using get, LinkedList using iterator, ArrayList() (defaults to 10), ArrayList(16), ArrayList(30), ArrayList(60)

          For a handful of tokens, all methods are about the same speed. (3-10 tokens) LinkedList, ArrayList(16) and ArrayList are a smidgen faster than ArrayList(30-60) though.

          At 15-40 tokens all of the methods are still about the same speed, though ArrayList(30) may be a hair faster (than even ArrayList(10))

          At 50-300 tokens (weighted towards 50), you see a 47% increase in speed using ArrayList(16 or 30) and ArrayList(60), the other methods are about the same speed.

          At 150-900 tokens (weighted towards 150), you see a 100% increase in speed using ArrayList(30) – ArrayList(60) is no better,
          the other methods take twice as long (even ArrayList(10)).

          I used Reuters data, shortening it and stiching it together for the various sizes.

          The first test was the average speed of about 21,000 runs on 21,000 different docs.

          The second test was the average speed of about 21,000 runs on 21,000 different docs.

          The third test was the average speed of about 15,000 runs on 15,000 different docs.

          The fourth test was the average speed of about 5,000 runs on 5,000 different docs.

          I don't completely understand why, but ArrayList(30) or ArrayList(16) blow everything else out of the water. ArrayList(16) is a smidgen faster at very low
          token counts, while ArrayList(30) is a smidgen faster above 50 or so tokens. The differences are almost negligible though.

          ArrayList(30) or ArrayList(16) are approx the same speed as LinkedList (get and iterator are always approx the same speed)
          at low numbers and get better and better VERY quickly as the number of tokens goes up.

          I'd recommend ArrayList(16) as the best choice for this class. It is no worse than LinkedList on very small documents, and much better than LinkedList on small to large documents.

          A friend was recently telling me that ArrayList defaulted to 16, but it does not – it defaults to 10. He must have been confused and known that it should default to 16. 16 is a much better default number than 10 due to the exponential growth when doubling the array on resize. It may take a bit more memory, but it gets a LOT more speed.

          • Mark
          Show
          Mark Miller added a comment - I am testing on Java 1.5 I tested with LinkedList using get, LinkedList using iterator, ArrayList() (defaults to 10), ArrayList(16), ArrayList(30), ArrayList(60) For a handful of tokens, all methods are about the same speed. (3-10 tokens) LinkedList, ArrayList(16) and ArrayList are a smidgen faster than ArrayList(30-60) though. At 15-40 tokens all of the methods are still about the same speed, though ArrayList(30) may be a hair faster (than even ArrayList(10)) At 50-300 tokens (weighted towards 50), you see a 47% increase in speed using ArrayList(16 or 30) and ArrayList(60), the other methods are about the same speed. At 150-900 tokens (weighted towards 150), you see a 100% increase in speed using ArrayList(30) – ArrayList(60) is no better, the other methods take twice as long (even ArrayList(10)). I used Reuters data, shortening it and stiching it together for the various sizes. The first test was the average speed of about 21,000 runs on 21,000 different docs. The second test was the average speed of about 21,000 runs on 21,000 different docs. The third test was the average speed of about 15,000 runs on 15,000 different docs. The fourth test was the average speed of about 5,000 runs on 5,000 different docs. I don't completely understand why, but ArrayList(30) or ArrayList(16) blow everything else out of the water. ArrayList(16) is a smidgen faster at very low token counts, while ArrayList(30) is a smidgen faster above 50 or so tokens. The differences are almost negligible though. ArrayList(30) or ArrayList(16) are approx the same speed as LinkedList (get and iterator are always approx the same speed) at low numbers and get better and better VERY quickly as the number of tokens goes up. I'd recommend ArrayList(16) as the best choice for this class. It is no worse than LinkedList on very small documents, and much better than LinkedList on small to large documents. A friend was recently telling me that ArrayList defaulted to 16, but it does not – it defaults to 10. He must have been confused and known that it should default to 16. 16 is a much better default number than 10 due to the exponential growth when doubling the array on resize. It may take a bit more memory, but it gets a LOT more speed. Mark
          Hide
          Mark Miller added a comment -

          Well this is embarrassing. I messed up the implementation of the iterator approach. Egg on my face moment. At least now things make more sense. LinkedList using an iterator is as fast as ArrayList(16). It does not appear to be any faster, but it is more memory efficient. Michael was absolutely right and the mistake is the direct access over the iterator.

          Sorry about the major misdirection on this...youthful exuberance always gets the best of me...<g>

          I suggest the change is just to change from get() to Iterator usage.

          Show
          Mark Miller added a comment - Well this is embarrassing. I messed up the implementation of the iterator approach. Egg on my face moment. At least now things make more sense. LinkedList using an iterator is as fast as ArrayList(16). It does not appear to be any faster, but it is more memory efficient. Michael was absolutely right and the mistake is the direct access over the iterator. Sorry about the major misdirection on this...youthful exuberance always gets the best of me...<g> I suggest the change is just to change from get() to Iterator usage.
          Hide
          Mark Miller added a comment -

          I have the reset method check if the cache is null before creating a new Iterator because it would not throw an exception before if you called reset before calling next. I don't know how you feel about this.

          Show
          Mark Miller added a comment - I have the reset method check if the cache is null before creating a new Iterator because it would not throw an exception before if you called reset before calling next. I don't know how you feel about this.
          Hide
          Michael Busch added a comment -

          Mark,

          the new patch looks good to me! I'm going to commit this soon.

          And you certainly don't have to apologize - thanks to your efforts
          I feel that I understand these data structures better now...

          Thank you very much for investigating this so thoroughly!

          Show
          Michael Busch added a comment - Mark, the new patch looks good to me! I'm going to commit this soon. And you certainly don't have to apologize - thanks to your efforts I feel that I understand these data structures better now... Thank you very much for investigating this so thoroughly!
          Hide
          Michael Busch added a comment -

          Committed. Thanks, Mark.

          Show
          Michael Busch added a comment - Committed. Thanks, Mark.
          Hide
          David Smiley added a comment -

          In LUCENE-6033 I swapped LL for ArrayList purely for reduced memory usage (fewer objects).

          Show
          David Smiley added a comment - In LUCENE-6033 I swapped LL for ArrayList purely for reduced memory usage (fewer objects).

            People

            • Assignee:
              Unassigned
              Reporter:
              Mark Miller
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development