Lucene - Core
  1. Lucene - Core
  2. LUCENE-5602

always verify term vectors on bulk merge

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.8, Trunk
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Similar to LUCENE-5580: its scary we just copy bytes and dont verify anything. we should at least verify in the bulk merge case.

      This is just a quick hack: it would be nice, though scary to remove the clone() of the index input...

      1. LUCENE-5602.patch
        4 kB
        Adrien Grand
      2. LUCENE-5602.patch
        3 kB
        Robert Muir

        Activity

        Hide
        Adrien Grand added a comment -

        Thanks Robert for adding the version check. I think we should indeed never copy raw bytes across versions.

        The clone is necessary because otherwise, the naive merge routine that is used eg. when chunks contain deleted documents is going to seek on the underlying IndexInput without notifying the checksuming wrapper about it.

        I tried to fix it by using a checksuming index input for this naive merge routine as well, but it doesn't work since for every document it needs to seek to the start offset of the block, which implies seeking backward if you need to merge two term vectors that are in the same block.

        I added a minor modification to Robert's patch in order to make sure that the index input of the vectors reader and the clone that is used for checksuming are moved in parallel. Otherwise if you stop using bulk merging after a few documents (eg. if deletions shift chunks in such a way that all of them need to be rebuilt), the checksum input might need to read the whole file when checking integrity.

        Show
        Adrien Grand added a comment - Thanks Robert for adding the version check. I think we should indeed never copy raw bytes across versions. The clone is necessary because otherwise, the naive merge routine that is used eg. when chunks contain deleted documents is going to seek on the underlying IndexInput without notifying the checksuming wrapper about it. I tried to fix it by using a checksuming index input for this naive merge routine as well, but it doesn't work since for every document it needs to seek to the start offset of the block, which implies seeking backward if you need to merge two term vectors that are in the same block. I added a minor modification to Robert's patch in order to make sure that the index input of the vectors reader and the clone that is used for checksuming are moved in parallel. Otherwise if you stop using bulk merging after a few documents (eg. if deletions shift chunks in such a way that all of them need to be rebuilt), the checksum input might need to read the whole file when checking integrity.
        Hide
        Robert Muir added a comment -

        Thanks: this is a good solution. the only other alternative would be to make the input sneakier about seek, and i think thats more dangerous...

        Show
        Robert Muir added a comment - Thanks: this is a good solution. the only other alternative would be to make the input sneakier about seek, and i think thats more dangerous...
        Hide
        Adrien Grand added a comment -

        That was my reasoning as well.

        Show
        Adrien Grand added a comment - That was my reasoning as well.
        Hide
        Robert Muir added a comment -

        I'm setting to 4.9/5.0. While not really a bug, it would be nice for 4.8, but I'll wait on Uwe for that...

        Show
        Robert Muir added a comment - I'm setting to 4.9/5.0. While not really a bug, it would be nice for 4.8, but I'll wait on Uwe for that...
        Hide
        ASF subversion and git services added a comment -

        Commit 1586710 from rmuir@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1586710 ]

        LUCENE-5602: always verify term vectors on bulk merge

        Show
        ASF subversion and git services added a comment - Commit 1586710 from rmuir@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1586710 ] LUCENE-5602 : always verify term vectors on bulk merge
        Hide
        ASF subversion and git services added a comment -

        Commit 1586712 from rmuir@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1586712 ]

        LUCENE-5602: always verify term vectors on bulk merge

        Show
        ASF subversion and git services added a comment - Commit 1586712 from rmuir@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1586712 ] LUCENE-5602 : always verify term vectors on bulk merge
        Hide
        Uwe Schindler added a comment -

        +1 to backport to 4.8

        Show
        Uwe Schindler added a comment - +1 to backport to 4.8
        Hide
        Robert Muir added a comment -

        reopen for 4.8 backport

        Show
        Robert Muir added a comment - reopen for 4.8 backport
        Hide
        Uwe Schindler added a comment -

        One reason for me that I want to have it for 4.8:

        • I don't want partly implemented stuff: Stored fields do checksums on merging, term vectors not. But term vectors are just another variant of stored fields and merging is done in the same way.
        • The infrastructure is there, its just a simple patch to actually implement that for bulk merging. No index format change, just doing the checks live while merging.
        Show
        Uwe Schindler added a comment - One reason for me that I want to have it for 4.8: I don't want partly implemented stuff: Stored fields do checksums on merging, term vectors not. But term vectors are just another variant of stored fields and merging is done in the same way. The infrastructure is there, its just a simple patch to actually implement that for bulk merging. No index format change, just doing the checks live while merging.
        Hide
        ASF subversion and git services added a comment -

        Commit 1587577 from rmuir@apache.org in branch 'dev/branches/lucene_solr_4_8'
        [ https://svn.apache.org/r1587577 ]

        LUCENE-5602: always verify term vectors on bulk merge

        Show
        ASF subversion and git services added a comment - Commit 1587577 from rmuir@apache.org in branch 'dev/branches/lucene_solr_4_8' [ https://svn.apache.org/r1587577 ] LUCENE-5602 : always verify term vectors on bulk merge
        Hide
        ASF subversion and git services added a comment -

        Commit 1587578 from rmuir@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1587578 ]

        LUCENE-5602: CHANGES shuffle

        Show
        ASF subversion and git services added a comment - Commit 1587578 from rmuir@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1587578 ] LUCENE-5602 : CHANGES shuffle
        Hide
        ASF subversion and git services added a comment -

        Commit 1587579 from rmuir@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1587579 ]

        LUCENE-5602: CHANGES shuffle

        Show
        ASF subversion and git services added a comment - Commit 1587579 from rmuir@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1587579 ] LUCENE-5602 : CHANGES shuffle
        Hide
        Uwe Schindler added a comment -

        Close issue after release of 4.8.0

        Show
        Uwe Schindler added a comment - Close issue after release of 4.8.0

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development