[LUCENE-10518] FieldInfos consistency check can refuse to open Lucene 8 index - ASF JIRA

Details

Type: Bug
Status: Patch Available
Priority: Major
Resolution: Fixed
Affects Version/s: 8.10.1
Fix Version/s: 9.2, 9.1.1
Component/s: core/index
Labels: None
Lucene Fields: New

Description

A field-infos consistency check introduced in Lucene 9 (LUCENE-9334) can refuse to open a Lucene 8 index. Lucene 8 can create a partial FieldInfo when it hits a non-aborting exception (for example, a term that is too long) while processing the fields of a document. Lucene 9 does not have this problem because it processes fields in two phases, and the first phase processes only FieldInfos.

The issue can be reproduced with this snippet.

public void testWriteIndexOn8x() throws Exception {
  FieldType KeywordField = new FieldType();
  KeywordField.setTokenized(false);
  KeywordField.setOmitNorms(true);
  KeywordField.setIndexOptions(IndexOptions.DOCS);
  KeywordField.freeze();

  try (Directory dir = newDirectory()) {
    IndexWriterConfig config = new IndexWriterConfig();
    config.setCommitOnClose(false);
    config.setMergePolicy(NoMergePolicy.INSTANCE);
    try (IndexWriter writer = new IndexWriter(dir, config)) {

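      // first segment
      writer.addDocument(new Document()); // an empty doc
      Document d1 = new Document();
      // an oversized term: indexing it fails with a non-aborting exception
      byte[] chars = new byte[IndexWriter.MAX_STORED_STRING_LENGTH + 1];
      Arrays.fill(chars, (byte) 'a');
      d1.add(new Field("field", new BytesRef(chars), KeywordField));
      // on Lucene 8 this doc-values field is not reached once the field above fails,
      // so the FieldInfo for "field" is left without a doc-values type
      d1.add(new BinaryDocValuesField("field", new BytesRef(chars)));
      expectThrows(IllegalArgumentException.class, () -> writer.addDocument(d1));
      writer.flush();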

      // second segment
      Document d2 = new Document();
      d2.add(new Field("field", new BytesRef("hello world"), KeywordField));
      d2.add(new SortedDocValuesField("field", new BytesRef("hello world")));
      writer.addDocument(d2);
      writer.flush();
      writer.commit();

      // Check for doc values types consistency
      Map<String, DocValuesType> docValuesTypes = new HashMap<>();
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        for (LeafReaderContext leaf : reader.leaves()) {
          for (FieldInfo fi : leaf.reader().getFieldInfos()) {
            DocValuesType current = docValuesTypes.putIfAbsent(fi.name, fi.getDocValuesType());
            if (current != null && current != fi.getDocValuesType()) {
              fail("cannot change DocValues type from " + current + " to " + fi.getDocValuesType() + " for field \"" + fi.name + "\"");
            }
          }
        }
      }
    }
  }
}
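
The difference between one-phase and two-phase field processing can also be modelled without Lucene. The sketch below is a toy model, not Lucene's implementation: `FieldSchemaSketch`, `ToyField`, `TOY_MAX_TERM_LENGTH`, and the string-valued doc-values types are hypothetical names introduced for illustration. It assumes a segment's schema is just a map from field name to doc-values type, and that an over-long term fails only after its field has been registered.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Toy model, not Lucene code: a segment maps field name -> doc-values type. */
final class FieldSchemaSketch {

  /** Hypothetical toy limit standing in for a maximum term length. */
  static final int TOY_MAX_TERM_LENGTH = 8;

  /** Hypothetical stand-in for one field of a document. */
  static final class ToyField {
    final String name;
    final String docValuesType; // "NONE" for an indexed-only field
    final int termLength;       // 0 for a doc-values-only field

    ToyField(String name, String docValuesType, int termLength) {
      this.name = name;
      this.docValuesType = docValuesType;
      this.termLength = termLength;
    }
  }

  /** Records the field in the segment schema; a real type replaces "NONE". */
  static void register(Map<String, String> segment, ToyField f) {
    segment.merge(f.name, f.docValuesType, (old, cur) -> old.equals("NONE") ? cur : old);
  }

  /** Simulates indexing a term; an over-long term is a non-aborting failure. */
  static void index(ToyField f) {
    if (f.termLength > TOY_MAX_TERM_LENGTH) {
      throw new IllegalArgumentException("term too long in field \"" + f.name + "\"");
    }
  }

  /** One pass: schema and data are handled field by field. */
  static void addOnePhase(Map<String, String> segment, List<ToyField> doc) {
    for (ToyField f : doc) {
      register(segment, f);
      index(f); // a failure here leaves the remaining fields unregistered
    }
  }

  /** Two passes: the whole schema is registered before any data is indexed. */
  static void addTwoPhase(Map<String, String> segment, List<ToyField> doc) {
    for (ToyField f : doc) {
      register(segment, f);
    }
    for (ToyField f : doc) {
      index(f);
    }
  }

  /** Returns the first field whose doc-values type differs between segments. */
  static Optional<String> firstInconsistentField(List<Map<String, String>> segments) {
    Map<String, String> seen = new LinkedHashMap<>();
    for (Map<String, String> segment : segments) {
      for (Map.Entry<String, String> e : segment.entrySet()) {
        String previous = seen.putIfAbsent(e.getKey(), e.getValue());
        if (previous != null && !previous.equals(e.getValue())) {
          return Optional.of(e.getKey());
        }
      }
    }
    return Optional.empty();
  }

  /** Builds a first segment from a failing document, mimicking d1 in the test above. */
  static Map<String, String> firstSegment(boolean twoPhase) {
    List<ToyField> d1 = List.of(
        new ToyField("field", "NONE", TOY_MAX_TERM_LENGTH + 1), // oversized term
        new ToyField("field", "BINARY", 0));                    // doc-values field
    Map<String, String> segment = new LinkedHashMap<>();
    try {
      if (twoPhase) {
        addTwoPhase(segment, d1);
      } else {
        addOnePhase(segment, d1);
      }
    } catch (IllegalArgumentException expected) {
      // non-aborting: the document is dropped, but the schema entry stays
    }
    return segment;
  }
}
```

In this model, `firstSegment(false)` yields `{field=NONE}` while `firstSegment(true)` yields `{field=BINARY}`. Pairing the one-phase result with a second segment of `{field=SORTED}` makes `firstInconsistentField` report `field`, which mirrors the mismatch the test above looks for.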

I would like to propose the following:

1. Backport the two-phase field processing from Lucene 9 to Lucene 8. The patch should be small and contained.
2. Introduce an option in Lucene 9 to skip the field-infos consistency check (i.e., behave like Lucene 8 when the option is enabled).

/cc mayya and jpountz

Issue Links

links to

GitHub Pull Request #842

GitHub Pull Request #852

People

Assignee: Unassigned

Reporter: Nhat Nguyen

Votes: 1

Watchers: 5

Dates

Created: 16/Apr/22 14:56

Updated: 06/Oct/22 23:28

Resolved: 28/Apr/22 18:23

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1h 40m