Details
Type: Bug
Status: Patch Available
Priority: Major
Resolution: Fixed
Affects Version/s: 1.4
Fix Version/s: None
Component/s: None
Environment: Operating System: Linux, Platform: Other
Bugzilla Id: 32847
Description
I'm re-opening a bug I logged previously. My previous bug report has
disappeared.
Issue: IndexWriter.addIndexes results in java.lang.OutOfMemoryError for large
merges.
Until now, I've been able to merge successfully only through repetition, i.e. I
keep repeating a merge until it succeeds. As my index size has grown, my
success rate has steadily declined. I've reached the point where merges now
fail 100% of the time. I can't merge.
My tests indicate the threshold is ~30GB on a P4 with an 800MB VM and 6 indexes.
I have repeated my tests on many different machines (so it is not machine
dependent). I have repeated my tests using both local and attached storage
devices (so it is not storage dependent).
For what it's worth, I believe the exception occurs entirely during the optimize
process, which is called implicitly after the merge. I say this because each
time the correct number of bytes appears to be written to the new index. Is it
possible to decouple the merge and optimize processes?
The code snippet follows. I can send you the class file and 120GB data set. Let
me know how you want it.
>>>>> code sample >>>>>
Directory[] sources = new Directory[paths.length];
...
Directory dest = FSDirectory.getDirectory(path, true);
IndexWriter writer = new IndexWriter(dest, new TermAnalyzer(StopWords.SEARCH_MAP), true);
writer.addIndexes(sources);
writer.close();
Attachments
- ASF.LICENSE.NOT.GRANTED--merger.patch (0.9 kB) - cutting@apache.org
Activity
Suggestion #1
>>>I fixed a bug that left a SegmentReader open in addIndexes(IndexReader[] readers)
>>>and it also left obsolete index files undeleted.
>>>But this could have hardly caused your memory problems.
I'm running the current version of Lucene.
Suggestion #2
>>>The call to writer.optimize() isn't necessary
The call was removed. Please read my comment about the failure happening after
the new index is written.
Suggestion #3
>>>please try with StandardAnalyzer to make sure the problem isn't in your
TermAnalyzer
With the StandardAnalyzer, the results are the same.
Question: Is it acceptable to merge using a different analyzer than was
used to index?
Suggestion #4
>>>please provide a test case. If you can provide a test case, please attach it
here.
OK. It's just these five lines of code plus a lot of data.
I've offered before to send the data, but it won't post as an email attachment.
>>>>> code
Directory[] sources = new Directory[paths.length];
...
Directory dest = FSDirectory.getDirectory(path, true);
IndexWriter writer = new IndexWriter(dest, new StandardAnalyzer(), true);
writer.addIndexes(sources);
writer.close();
You need to close the directory "dest" explicitly, not only the writer. Does
that make a difference?
I just executed the following test case in a 700M VM and had the same outcome.
See details below.
In this test case, the sum of the input directories was ~42G and the amount
written to the 'dest' directory was ~41G. This pattern is reliably repeatable.
That is, it looks like the merge works, but the failure happens at the very end.
If I open the 'dest' directory and call docCount() I get 0.
>>>>>>>>> revised code >>>>>>>>>>>
...
Directory dest = FSDirectory.getDirectory(destination, true);
IndexWriter writer = new IndexWriter(dest, new StandardAnalyzer(), true);
writer.addIndexes(sources);
log("here"); // never prints to screen
writer.close();
dest.close();
>>>>>>>>> inputs >>>>>>>>>>>
6.3G ./index0/index
5.9G ./index1/index
6.0G ./index2/index
5.2G ./index3/index
5.4G ./index4/index
3.8G ./index5/index
4.0G ./index6/index
5.3G ./index7/index
>>>>>>>>> output on screen >>>>>>>>>>>
>>> merging: ./index0/index
>>> merging: ./index1/index
>>> merging: ./index2/index
>>> merging: ./index3/index
>>> merging: ./index4/index
>>> merging: ./index5/index
>>> merging: ./index6/index
>>> merging: ./index7/index
Exception in thread "main" java.lang.OutOfMemoryError
>>>>>>>>> destination directory stats >>>>>>>>>>>
41G ./merged.0000/index
There are two addIndexes() methods, one for IndexReaders and one for
directories, does the problem occur with both? Does it also appear with
smaller indexes and a smaller JVM (I won't be able to reproduce problems if it
requires a 40 GB index)? Are you using Lucene 1.4.3? You will probably need to
change the Lucene code and add debug statements to see where the exception
occurs. Also, are all indexes in the same format, i.e. compound or
non-compound?
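For reference, the Directory-based form is what your snippets use; the IndexReader-based
form would look something like the sketch below (class name and destination path are
just placeholders, assuming the 1.4-era API):
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class ReaderMergeTest {
    public static void main(String[] paths) throws Exception {
        IndexReader[] readers = new IndexReader[paths.length];
        for (int i = 0; i < paths.length; i++)
            readers[i] = IndexReader.open(paths[i]);   // open each source index
        IndexWriter writer = new IndexWriter("dest", new StandardAnalyzer(), true);
        writer.addIndexes(readers);                    // the IndexReader[] variant
        writer.close();
        for (int i = 0; i < readers.length; i++)
            readers[i].close();                        // the readers are not closed by the writer
    }
}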
Results of my next test:
Same machine, same indexes, same size VM. If I reduce the sum of the inputs to
< 30G, the merge succeeds. If I add an additional index, bringing the total to
> 30G, the merge fails.
I will continue with remaining tests tomorrow.
>>>>>>>>> inputs >>>>>>>>>>>
5.8G ./index3
5.9G ./index4
4.3G ./index5
4.4G ./index6
5.8G ./index7
>>>>>>>>> output on screen >>>>>>>>>>>
>>> merging: /home/agense/raw/questions/index3/index
>>> merging: /home/agense/raw/questions/index4/index
>>> merging: /home/agense/raw/questions/index5/index
>>> merging: /home/agense/raw/questions/index6/index
>>> merging: /home/agense/raw/questions/index7/index
>>>>>>>>> destination directory stats >>>>>>>>>>>
26G ./merged.0000
Results for the remainder of my testing:
All my indexes use the compound file format.
Selecting random combinations of the test indexes, I am able to successfully
merge using both method signatures, as long as the sum of the inputs is less
than 30G.
For all attempts to merge all the indexes, using either method signature, I get
the out-of-memory condition.
What is the next step?
Have you tried running your app under a profiler? I suggest you try that and
see where the memory is being allocated. If there is a bug in Lucene, this may
help us narrow down the area we need to look at.
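If setting up a full profiler is a hassle, one low-effort option (assuming the hprof
agent bundled with the 1.4 JDK; adjust the classpath to match your setup) is something
like:
java -Xrunhprof:heap=sites,depth=10 -Xmx700M -classpath lucene.jar MergeTest index0 index1 index3 index4 index5
which should write a table of allocation sites to java.hprof.txt when the JVM exits.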
>>>Have you tried running your app under a profiler? ...
I started to spend time looking at various profilers, but concluded that all I
do is make a call to addIndexes inside a jar file. From there, it's all Lucene.
To profile, I'd have to find and build a profiler; rebuild Lucene with
profiling tags; understand the Lucene call stack and logic; make sense of the
output; etc. This is, in effect, debugging Lucene. My understanding is this is
the purview of the Lucene team.
>>>If there is a bug in Lucene...
What is it about my test results that would lead you to conclude something
other than a Lucene bug at this point? If there is more evidence I can provide,
please let me know what tests I can run. Can you replicate my results?
Thanks.
How many fields and how many documents are in your index?
Can you provide a stack trace from the OutOfMemoryError? This would be very
useful.
>>>How many fields and how many documents are in your index?
7 million documents with 100 fields each
>>>Can you provide a stack trace from the OutOfMemoryError?
The code I've been testing is inside a try/catch block that calls
printStackTrace(). There is no trace. In my experience the stack doesn't print
when an out-of-memory error is thrown (JVM 1.4.2).
Are the 100 fields all indexed, or are some only stored? If indexed, 100 is a
very large number of indexed fields. 7M documents with 100 indexed fields could
require 700MB when searching, since one byte per searched field per document of
RAM is used to cache the norms for each field. But that RAM is not required
when indexing. Merging an index, where you're having troubles, should not
require much RAM.
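To make the searching figure concrete (assuming all 100 fields are indexed and
searched): 7,000,000 documents x 100 fields x 1 byte per norm = 700,000,000 bytes,
i.e. roughly the entire 700MB heap.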
Since your problem only requires a 5-line program to demonstrate, and it only
requires the Lucene jar file, please create such a 5-line java program as a
single file that depends only on the Lucene jar and demonstrates the problem,
invoked with something like:
javac -classpath lucene.jar Test.java
java -classpath lucene.jar Test index1 index2 index3 ...
Test.java should look something like:
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class Test {
  public static void main(String[] paths) throws Exception {
    Directory[] sources = new Directory[paths.length];
    for (int i = 0; i < paths.length; i++)
      sources[i] = FSDirectory.getDirectory(paths[i], false); // open each existing source index
    IndexWriter writer =
      new IndexWriter("dest", new StandardAnalyzer(), true);
    writer.addIndexes(sources);
    writer.close();
  }
}
Once you have replicated the bug with such a program, please attach the program
to this bug report. This way we can be certain that there is nothing involved
but Lucene.
Note that this code does not explicitly try to catch exceptions but rather lets
the JVM print a final stack trace if it exits in an exception. That may work
better.
If the problem still appears and we still don't get a stack trace then we can
try putting in log statements in SegmentMerger.java.
Thanks for your patience.
>>>>>>>>>>>> Code >>>>>>>>>>>>
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class MergeTest
{
    public static void main(String[] paths) throws Exception
    {
        // body reconstructed from the earlier snippets; destination name is illustrative
        Directory[] sources = new Directory[paths.length];
        for (int i = 0; i < paths.length; i++)
            sources[i] = FSDirectory.getDirectory(paths[i], false);
        IndexWriter writer = new IndexWriter("dest", new StandardAnalyzer(), true);
        writer.addIndexes(sources);
        writer.close();
    }
}
>>>>>>>>>>>> Inputs >>>>>>>>>>>>
7.2G index0
7.0G index1
5.8G index3
5.8G index4
4.2G index5
5.3G index6
5.8G index7
6.0G index8
4.8G index9
52G .
>>>>>>>>>>>> Test Results >>>>>>>>>>>>
#1 java -Xms700M -Xmx700M MergeTest index0 index1
result -> 13G success
#2 java -Xms700M -Xmx700M MergeTest index0 index1 index3
result -> 18G success
#3 java -Xms700M -Xmx700M MergeTest index0 index1 index3 index4
result -> 24G success
#4 java -Xms700M -Xmx700M MergeTest index0 index1 index4 index5
result -> 22G success
#5 java -Xms700M -Xmx700M MergeTest index0 index1 index3 index4 index5
result -> 27G Exception in thread "main" java.lang.OutOfMemoryError (no
stack trace printed)
#6 java -Xms700M -Xmx700M MergeTest index0 index1 index3 index4 index5
result -> 27G Exception in thread "main" java.lang.OutOfMemoryError (no
stack trace printed)
>>>If indexed, 100 is a very large number of indexed fields.
Doug, how are multi-value fields treated in calculating total fields? If I add
a field called "link" 18 times, is this considered 18 or 1?
Thanks.
Can you please attach the output of 'ls -lt /home/dan/merged/' after it fails?
That may indicate where it is dying.
What does 'ulimit -c' print? If you're not getting a stack trace perhaps we can
get a core dump. One can get stack traces from java core dumps.
Also, why do you specify a minimum heap size with -Xms700M, rather than just
letting the heap grow to its maximum? I have had trouble before when specifying
-Xms and have never found it advantageous. Can you also please try once without
that option?
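For example, your test #5 invocation, just without the -Xms flag:
java -Xmx700M MergeTest index0 index1 index3 index4 index5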
Thanks again,
Doug
The number of fields that I'm referring to is the number of unique field names
that are ever added as indexed. So adding a field name multiple times to a
single document will not change things.
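To illustrate the counting (the field name and values below are hypothetical):
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Illustration only: "link" is added 18 times to one document, but it still
// counts as a single unique indexed field name for the accounting above.
public class FieldCountExample {
    public static void main(String[] args) {
        Document doc = new Document();
        for (int i = 0; i < 18; i++)
            doc.add(Field.Text("link", "http://example.com/page" + i));
    }
}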
Created an attachment (id=14156)
patch to reduce memory requirements of segment merger
Okay. I see the problem. You have over 159 indexed fields in over 5M documents,
which, if norms are cached at one byte per field per document, requires over
700MB (159 x 5,000,000 is roughly 800MB).
I've attached a patch which fixes segment merging to not use cached access to
the norms. Please try this and tell me how it works.
You will still have trouble searching this index in a 700MB JVM if you search
all of the fields.
Doug
I will try the patch and report back.
>>>You will still have trouble searching this index in a 700MB...
Yes. I'm testing a redesigned index now.
>>>since one byte per searched field per document of RAM is used to cache the
norms for each field.
How does one programmatically flush the cache? I've been looking for such a
method.
Created an attachment (id=14172)
SegmentMerger patch test results
The SegmentMerger patch is a success. Thanks for looking into this.
Your old bug didn't just disappear; it was closed because you didn't reply to
our suggestions. See the history here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=30421
If you can provide a test case, please attach it here.