Lucene - Core
  1. Lucene - Core
  2. LUCENE-4539

DocValues impls should read all headers up-front instead of per-directsource

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently, when DocValues opens, it just opens files. it doesnt read codec headers etc.

      Instead we read these every single time a directsource opens.

      I think it should work like PostingsReaders: e.g. the PackedInts impl would read its versioning info and codec headers and creating a new Direct impl should be a IndexInput.clone() + getDirectReaderNoHeader().

      Today its much more costly.

        Activity

        Hide
        Robert Muir added a comment -

        I fixed all the ones here except SortedSource to just read this stuff in up-front.

        Show
        Robert Muir added a comment - I fixed all the ones here except SortedSource to just read this stuff in up-front.
        Hide
        Adrien Grand added a comment -

        except SortedSource

        Perhaps we should do what the TODO comment in your patch suggests (the two seeks to read metadata) and upgrade the version number to write all metadata at the beginning of the stream?

        Semi-related discussion: the getDirectReaderNoHeader method should probably take an additional long (the start pointer) as an argument so that there is no need to seek to get an instance if the cursor is not at the right place.

        Show
        Adrien Grand added a comment - except SortedSource Perhaps we should do what the TODO comment in your patch suggests (the two seeks to read metadata) and upgrade the version number to write all metadata at the beginning of the stream? Semi-related discussion: the getDirectReaderNoHeader method should probably take an additional long (the start pointer) as an argument so that there is no need to seek to get an instance if the cursor is not at the right place.
        Hide
        Robert Muir added a comment -

        I think you are probably right... I just got the straightforward ones done.

        I wanted to suggest a possibility that maybe we add PackedInts.Header to encapsulate what it stores?

        Today we have:

        Reader getReader(DataInput in)
        Reader getReaderNoHeader(DataInput in, Format format, int version, int valueCount, int bitsPerValue)
        

        And if you want to read in the header yourself, you are doing:

        final int version = CodecUtil.checkHeader(in, CODEC_NAME, VERSION_START, VERSION_CURRENT);
        final int bitsPerValue = in.readVInt();
        assert bitsPerValue > 0 && bitsPerValue <= 64: "bitsPerValue=" + bitsPerValue;
        final int valueCount = in.readVInt();
        final Format format = Format.byId(in.readVInt());
        

        So my idea would just be something like:

        Reader getReader(DataInput in, Header header) {
          return getReader(in, header.format, header.version, header.valueCount, header.bitsPerValue);
        }
        

        and maybe

        Header readHeader(DataInput in)
        

        to encapsulate this stuff. we could still keep the "raw" versions around if we wanted too.

        I didnt want to overengineer anything, not sure if this would be useful outside of this particular issue.

        Show
        Robert Muir added a comment - I think you are probably right... I just got the straightforward ones done. I wanted to suggest a possibility that maybe we add PackedInts.Header to encapsulate what it stores? Today we have: Reader getReader(DataInput in) Reader getReaderNoHeader(DataInput in, Format format, int version, int valueCount, int bitsPerValue) And if you want to read in the header yourself, you are doing: final int version = CodecUtil.checkHeader(in, CODEC_NAME, VERSION_START, VERSION_CURRENT); final int bitsPerValue = in.readVInt(); assert bitsPerValue > 0 && bitsPerValue <= 64: "bitsPerValue=" + bitsPerValue; final int valueCount = in.readVInt(); final Format format = Format.byId(in.readVInt()); So my idea would just be something like: Reader getReader(DataInput in, Header header) { return getReader(in, header.format, header.version, header.valueCount, header.bitsPerValue); } and maybe Header readHeader(DataInput in) to encapsulate this stuff. we could still keep the "raw" versions around if we wanted too. I didnt want to overengineer anything, not sure if this would be useful outside of this particular issue.
        Hide
        Adrien Grand added a comment -

        Interesting, I had started to work on the exact same API for LUCENE-4536 (because I needed to read the header generated by getWriter to know how long a packed array would be depending on the version) but I finally decided to fix it another way.

        not sure if this would be useful outside of this particular issue

        Maybe not but I agree that it would help make the code cleaner, so +1!

        Related discussion: I think the header format is fine when packed ints are in their own file, but when packed ints are nested in another file, they should probably not declare a codec header: the PackedInts codec name check is redundant with the main codec name check. (I was actually thinking of deprecating get

        {Reader,DirectReader,ReaderIterator}

        to force callers to think about the way they should store the valueCount and bitsPerValue, and to discourage from using a standard header when packed ints are not in their own file to prevent from performing redundant checks.)

        Show
        Adrien Grand added a comment - Interesting, I had started to work on the exact same API for LUCENE-4536 (because I needed to read the header generated by getWriter to know how long a packed array would be depending on the version) but I finally decided to fix it another way. not sure if this would be useful outside of this particular issue Maybe not but I agree that it would help make the code cleaner, so +1! Related discussion: I think the header format is fine when packed ints are in their own file, but when packed ints are nested in another file, they should probably not declare a codec header: the PackedInts codec name check is redundant with the main codec name check. (I was actually thinking of deprecating get {Reader,DirectReader,ReaderIterator} to force callers to think about the way they should store the valueCount and bitsPerValue, and to discourage from using a standard header when packed ints are not in their own file to prevent from performing redundant checks.)
        Hide
        Robert Muir added a comment -

        When are packed ints in their own file today?

        Show
        Robert Muir added a comment - When are packed ints in their own file today?
        Hide
        Adrien Grand added a comment -

        Only in tests. This is why I think that writing a full header (including the "PackedInts" codec name) is useless most of time if not always.

        Show
        Adrien Grand added a comment - Only in tests. This is why I think that writing a full header (including the "PackedInts" codec name) is useless most of time if not always.
        Hide
        Robert Muir added a comment -

        I agree with you its bogus how it writes its header.

        But I see a downside (I hope we can come up with an idea to deal with it rather than keeping the header!)

        One advantage of PackedInts writing its versioning (like FSTs) is that lots of things nest them in their own file.

        The problem with these two things is that they are themselves changing and versioned: they arent like readVint()
        which is pretty much fixed in what it does.

        So having them write their own versions etc today to some extent makes back compat management of file formats easier:
        today its just DocValues and Term dictionaries using these things, tomorrow (4.1) its also the postings lists: documents,
        frequencies, and positions, and maybe in the future even stored fields (LUCENE-4527).

        Who is keeping up with all the places that must be managed when a packed ints version change needs to happen? Today
        the header encapsulates in one place: if i backwards break FSTs and it breaks a few suggester impls, i know anyone
        using those suggesters will get IndexFormatTooOldException without me doing anything. So thats very convenient.

        Show
        Robert Muir added a comment - I agree with you its bogus how it writes its header. But I see a downside (I hope we can come up with an idea to deal with it rather than keeping the header!) One advantage of PackedInts writing its versioning (like FSTs) is that lots of things nest them in their own file. The problem with these two things is that they are themselves changing and versioned: they arent like readVint() which is pretty much fixed in what it does. So having them write their own versions etc today to some extent makes back compat management of file formats easier: today its just DocValues and Term dictionaries using these things, tomorrow (4.1) its also the postings lists: documents, frequencies, and positions, and maybe in the future even stored fields ( LUCENE-4527 ). Who is keeping up with all the places that must be managed when a packed ints version change needs to happen? Today the header encapsulates in one place: if i backwards break FSTs and it breaks a few suggester impls, i know anyone using those suggesters will get IndexFormatTooOldException without me doing anything. So thats very convenient.
        Hide
        Adrien Grand added a comment -

        Who is keeping up with all the places that must be managed when a packed ints version change needs to happen?

        Sorry I was not clear: I didn't mean to remove the version number, just the codec name. I think the Lucene41 postings format is a good example: it never writes "PackedInts" in the stream, writes the PackedInts version at the beginning of the stream and may then serialize thousands arrays of 128 values with the number of bits per value as a byte in front of each of them.

        Show
        Adrien Grand added a comment - Who is keeping up with all the places that must be managed when a packed ints version change needs to happen? Sorry I was not clear: I didn't mean to remove the version number, just the codec name. I think the Lucene41 postings format is a good example: it never writes "PackedInts" in the stream, writes the PackedInts version at the beginning of the stream and may then serialize thousands arrays of 128 values with the number of bits per value as a byte in front of each of them.

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:

              Development