Lucene - Core / LUCENE-5188

Make CompressingStoredFieldsFormat more friendly to StoredFieldVisitors

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.5, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      CompressingStoredFieldsFormat currently decompresses a document's data in full before it consults the StoredFieldVisitor. This is wasteful when documents are big and only the first field of a document is of interest, so maybe we could decompress and consult the StoredFieldVisitor in a more streaming fashion.
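
To make the "streaming" idea concrete, here is a minimal sketch of a reader that asks a visitor about each field in stored order and stops as soon as the visitor says so. It is an illustration only: `FieldVisitor`, `FirstMatchVisitor` and `visit` are hypothetical stand-ins that use plain field names, whereas Lucene's real `StoredFieldVisitor` works on `FieldInfo` objects. Only the three-way YES/NO/STOP answer mirrors the real API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of visitor-driven stored field loading. */
final class StreamingVisitDemo {
    /** Mirrors the three answers a stored field visitor can give. */
    enum Status { YES, NO, STOP }

    /** Simplified stand-in for a stored field visitor (assumption: names instead of FieldInfo). */
    interface FieldVisitor {
        Status needsField(String fieldName);
        void stringField(String fieldName, String value);
    }

    /**
     * Walks the fields in stored order and returns how many were examined.
     * Each field is a {name, value} pair. Nothing after a STOP is touched,
     * which is what lets a streaming reader avoid decompressing the rest.
     */
    static int visit(List<String[]> fields, FieldVisitor visitor) {
        int examined = 0;
        for (String[] field : fields) {
            examined++;
            Status status = visitor.needsField(field[0]);
            if (status == Status.STOP) {
                break;
            }
            if (status == Status.YES) {
                visitor.stringField(field[0], field[1]);
            }
        }
        return examined;
    }

    /** Loads the first occurrence of one named field, then asks the reader to stop. */
    static final class FirstMatchVisitor implements FieldVisitor {
        private final String wanted;
        final List<String> values = new ArrayList<>();

        FirstMatchVisitor(String wanted) {
            this.wanted = wanted;
        }

        @Override
        public Status needsField(String fieldName) {
            if (!values.isEmpty()) {
                return Status.STOP; // already have what we came for
            }
            return fieldName.equals(wanted) ? Status.YES : Status.NO;
        }

        @Override
        public void stringField(String fieldName, String value) {
            values.add(value);
        }
    }
}
```

With a visitor like this, a document whose first field is the one of interest is abandoned on the second field, so a large trailing field never needs to be materialized.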

      1. LUCENE-5188.patch
        14 kB
        Adrien Grand

        Activity

        Adrien Grand added a comment -

        Here is a patch that slices large chunks (>= twice the configured chunk size) into several LZ4 blocks (of chunkSize bytes each). The LZ4 blocks will be decompressed as needed so that you don't end up decompressing everything if you only need the first field of your document.

        A nice side effect of this patch is that it also reduces memory pressure when working with big documents (LUCENE-4955): since big documents are sliced into fixed-size blocks, it is no longer necessary to allocate a byte[] the size of the whole document (potentially several MB) to decompress it.
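
The slicing idea can be sketched as follows. This is not the patch itself: `SlicedDocument` is a hypothetical class, it slices every document rather than only chunks of at least twice the chunk size, and it uses `java.util.zip` Deflater/Inflater as a stand-in because LZ4 is not part of the Java standard library. What it does show is that a prefix read inflates only the blocks covering that prefix, and that the scratch buffer stays block-sized.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical sketch: a document stored as independently compressed fixed-size blocks. */
final class SlicedDocument {
    private final List<byte[]> compressedBlocks = new ArrayList<>();
    private final int blockSize;
    private final int totalLength;
    private int blocksDecompressed = 0; // instrumentation for the demonstration only

    SlicedDocument(byte[] doc, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.blockSize = blockSize;
        this.totalLength = doc.length;
        // Each slice is compressed on its own so it can be decompressed on its own.
        for (int off = 0; off < doc.length; off += blockSize) {
            compressedBlocks.add(compress(doc, off, Math.min(blockSize, doc.length - off)));
        }
    }

    /** Returns the first {@code length} bytes, inflating only the blocks that cover them. */
    byte[] readPrefix(int length) {
        if (length < 0 || length > totalLength) {
            throw new IllegalArgumentException("length out of range: " + length);
        }
        byte[] out = new byte[length];
        byte[] block = new byte[blockSize]; // block-sized scratch, never document-sized
        int copied = 0;
        for (int i = 0; copied < length; i++) {
            int n = decompress(compressedBlocks.get(i), block);
            blocksDecompressed++;
            int toCopy = Math.min(n, length - copied);
            System.arraycopy(block, 0, out, copied, toCopy);
            copied += toCopy;
        }
        return out;
    }

    int blockCount() {
        return compressedBlocks.size();
    }

    int blocksDecompressed() {
        return blocksDecompressed;
    }

    private static byte[] compress(byte[] src, int off, int len) {
        Deflater deflater = new Deflater();
        deflater.setInput(src, off, len);
        deflater.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            bos.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return bos.toByteArray();
    }

    private static int decompress(byte[] compressed, byte[] dest) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            int n = 0;
            while (!inflater.finished() && n < dest.length) {
                n += inflater.inflate(dest, n, dest.length - n);
            }
            return n;
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
    }
}
```

For a caller that only needs a small prefix (for example the first field), `blocksDecompressed()` stays far below `blockCount()`, which is the saving described above.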

        Robert Muir added a comment -

        nice idea!

        Adrien Grand added a comment -

        I will commit later today if there is no objection.

        Simon Willnauer added a comment -

        Cool stuff, Adrien!
        One thing I wonder is whether we should use a specialized DataInput, maybe a SkippableDataInput, in that class to avoid the static method. That shared byte array worries me. Aside from this, I wonder whether, if we had this method in DataInput (or however we end up doing this), it would be possible to skip an entire decompression step when we know that the number of bytes to skip covers one or more decompression blocks. I have to admit I don't know exactly how this works or whether what I propose is possible, but it would help me better understand why we need to read and decompress all the data if we throw it away anyway.

        Adrien Grand added a comment -

        These bytes can be shared because they are write-only, kind of like /dev/null. Having this on DataInput so that an entire decompression step could be skipped would be nice, but unfortunately, with the current design, the field numbers are stored in the compressed stream, so you need to decompress anyway to know whether you should skip (StoredFieldVisitor decides whether to skip based on the FieldInfo, which my StoredFieldReader computes from the field number). But your idea is something I would like to explore for the next StoredFieldsFormat, along with preset dictionaries.
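
The "write-only, like /dev/null" argument can be illustrated with a small sketch. It is an assumption-laden stand-in rather than the code from the patch: `SkipSupport`, `SKIP_BUFFER` and `skipBytes` are hypothetical names, and a plain `java.io.InputStream` replaces Lucene's `DataInput`. The point is that the shared array is only ever a dump target, so concurrent callers overwriting each other's bytes cannot affect any result.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: skips bytes by reading them into a shared scratch buffer. */
final class SkipSupport {
    // Shared by all callers and threads. This is safe only because nothing ever
    // reads the contents back: the buffer is a sink, not a source.
    private static final byte[] SKIP_BUFFER = new byte[1024];

    private SkipSupport() {}

    /** Consumes exactly {@code numBytes} bytes from {@code in}, discarding them. */
    static void skipBytes(InputStream in, long numBytes) throws IOException {
        if (numBytes < 0) {
            throw new IllegalArgumentException("numBytes must be >= 0: " + numBytes);
        }
        long skipped = 0;
        while (skipped < numBytes) {
            int step = (int) Math.min(SKIP_BUFFER.length, numBytes - skipped);
            int read = in.read(SKIP_BUFFER, 0, step);
            if (read < 0) {
                throw new EOFException("stream ended after " + skipped + " of " + numBytes + " bytes");
            }
            skipped += read;
        }
    }
}
```

Note that when the input is a decompressing stream, skipping this way still pays for decompression of the skipped bytes. That matches the limitation described above: the field numbers live inside the compressed data, so the reader cannot know what to skip without decompressing first.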

        Simon Willnauer added a comment -

        Thanks, Adrien, for elaborating... progress over perfection, so let's move on here. +1 to commit.

        ASF subversion and git services added a comment -

        Commit 1520025 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1520025 ]

        LUCENE-5188: Make CompressingStoredFieldsFormat more friendly to StoredFieldVisitors.

        ASF subversion and git services added a comment -

        Commit 1520278 from Adrien Grand in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1520278 ]

        LUCENE-5188: Make CompressingStoredFieldsFormat more friendly to StoredFieldVisitors.

        Adrien Grand added a comment -

        4.5 release -> bulk close


          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0
            Watchers:
            5