Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-690

LazyField use of IndexInput not thread safe

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: 2.1
    • Component/s: core/store
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      Hypothetical problem: IndexInput.clone() of an active IndexInput could result in a corrupt copy.
      LazyField clones the FieldsReader.fieldsStream, which could be in use via IndexReader.document()

      1. FieldsReader.patch
        3 kB
        Yonik Seeley
      2. TestThreadSafe.java
        5 kB
        Yonik Seeley

        Issue Links

          Activity

          Hide
          gsingers Grant Ingersoll added a comment -

          Not sure why this is different then anywhere else IndexInput.clone() is used, can you provide more details? Isn't the whole point of cloning to make it thread safe (which is why it is stored in the ThreadLocal)

          Show
          gsingers Grant Ingersoll added a comment - Not sure why this is different then anywhere else IndexInput.clone() is used, can you provide more details? Isn't the whole point of cloning to make it thread safe (which is why it is stored in the ThreadLocal)
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          It seems to me like if you clone an object in an unknown state, you get another object in an unknown state unless it's clone() is able to explicitly put it back into a known state.

          Let's take TermDocs as an example of why it's use is OK:
          SegmentTermDocs() always makes a clone of the SegmentReader's freqStream, and the SegmentReader itself never directly uses the freqStream (and hence freqStream is used for nothing other than cloneing).
          This analysis only takes into account SegmentTermDocs and SegmentReader... if there is another class that directly uses SegmentReader.freqStream, then all bets are off again.

          I don't know if all other clones() of IndexInput are safe, but it does seem like LazyField's use is unsafe.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - It seems to me like if you clone an object in an unknown state, you get another object in an unknown state unless it's clone() is able to explicitly put it back into a known state. Let's take TermDocs as an example of why it's use is OK: SegmentTermDocs() always makes a clone of the SegmentReader's freqStream, and the SegmentReader itself never directly uses the freqStream (and hence freqStream is used for nothing other than cloneing). This analysis only takes into account SegmentTermDocs and SegmentReader... if there is another class that directly uses SegmentReader.freqStream, then all bets are off again. I don't know if all other clones() of IndexInput are safe, but it does seem like LazyField's use is unsafe.
          Hide
          gsingers Grant Ingersoll added a comment -

          Defnitely an issue for a closed IndexInput, but that has been discussed elsewhere and there isn't really anything satisfying to say there other than it is undeclared what will happen (although my tests on local Directories indicate that it is "safe" in a simple environment).

          Otherwise, my understanding is that the seek() call before reading should put it into a known state, but I am guessing your not so sure on that point. So the question is seek() sufficient to put an IndexInput into a known safe state, right?

          If it doesn't, what would you suggest for a fix? I suppose maybe you could clone it as you create the LazyField (which would put it in right at the correct position and there would be no need to seek, right) This would save a little time at the tradeoff of memory, right?

          Show
          gsingers Grant Ingersoll added a comment - Defnitely an issue for a closed IndexInput, but that has been discussed elsewhere and there isn't really anything satisfying to say there other than it is undeclared what will happen (although my tests on local Directories indicate that it is "safe" in a simple environment). Otherwise, my understanding is that the seek() call before reading should put it into a known state, but I am guessing your not so sure on that point. So the question is seek() sufficient to put an IndexInput into a known safe state, right? If it doesn't, what would you suggest for a fix? I suppose maybe you could clone it as you create the LazyField (which would put it in right at the correct position and there would be no need to seek, right) This would save a little time at the tradeoff of memory, right?
          Hide
          gsingers Grant Ingersoll added a comment -

          Comment from Hoss, which I think should be on the issue:

          : Otherwise, my understanding is that the seek() call before reading
          : should put it into a known state, but I am guessing your not so sure on
          : that point. So the question is seek() sufficient to put an IndexInput
          : into a known safe state, right?

          isn't it more subtle then that? ... IndexInput.clone() states that the
          clone will be "positioned at the same point as the stream they were cloned
          from" ... that implies that the clone will be in a consistent usable state
          even with the client calling seek on the clone() .... is that
          invarient currently met by BufferedIndexInput in a multithreaded case?

          -Hoss

          Show
          gsingers Grant Ingersoll added a comment - Comment from Hoss, which I think should be on the issue: : Otherwise, my understanding is that the seek() call before reading : should put it into a known state, but I am guessing your not so sure on : that point. So the question is seek() sufficient to put an IndexInput : into a known safe state, right? isn't it more subtle then that? ... IndexInput.clone() states that the clone will be "positioned at the same point as the stream they were cloned from" ... that implies that the clone will be in a consistent usable state even with the client calling seek on the clone() .... is that invarient currently met by BufferedIndexInput in a multithreaded case? -Hoss
          Hide
          gsingers Grant Ingersoll added a comment -

          Possibly reading wrong, but it seems like that invariant is not met by BufferedIndexInput, right?

          If clone was called while the IndexInput was in the refill() operation before readInternal() was called but after the buffer was allocated (i.e. line 64) and then clone finished it would copy an empty buffer on line 103, right? This would be bad, b/c I don't see that it could recover from an empty buffer there when it thinks it has data.

          So, I guess that would argue that it is not safe and that we should aggressively clone (b/c we know it is in a good state) or that we should synchronize, right?

          Show
          gsingers Grant Ingersoll added a comment - Possibly reading wrong, but it seems like that invariant is not met by BufferedIndexInput, right? If clone was called while the IndexInput was in the refill() operation before readInternal() was called but after the buffer was allocated (i.e. line 64) and then clone finished it would copy an empty buffer on line 103, right? This would be bad, b/c I don't see that it could recover from an empty buffer there when it thinks it has data. So, I guess that would argue that it is not safe and that we should aggressively clone (b/c we know it is in a good state) or that we should synchronize, right?
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          : If clone was called while the IndexInput was in the refill() operation before readInternal() was called but after the buffer was allocated (i.e. line 64) and then clone finished it would copy an empty buffer on line 103, right?

          Correct. BufferedIndexInput also has multiple "pointers", and in the absense of synchronization, you can't guarantee the order that things will be flushed back to memory.

          : So, I guess that would argue that it is not safe and that we should aggressively clone (b/c we know it is in a good state) or that we should synchronize, right?

          Right, you either never use what you are cloning, or you synchronize.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - : If clone was called while the IndexInput was in the refill() operation before readInternal() was called but after the buffer was allocated (i.e. line 64) and then clone finished it would copy an empty buffer on line 103, right? Correct. BufferedIndexInput also has multiple "pointers", and in the absense of synchronization, you can't guarantee the order that things will be flushed back to memory. : So, I guess that would argue that it is not safe and that we should aggressively clone (b/c we know it is in a good state) or that we should synchronize, right? Right, you either never use what you are cloning, or you synchronize.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          On 10/19/06, Ning Li <ning.li.li@gmail.com> wrote:
          > What makes, for example, FSIndexInput and its clones, thread-safe is
          > the following.
          > That is, the method is synchronized on the file object.
          >
          > protected void readInternal(byte[] b, int offset, int len)
          > throws IOException {
          > synchronized (file) {

          I don't think that's sufficient in part because the IndexInput's state is manipulated outside that sync block. The sync block is to protect the file only, not the IndexInput, which isn't thread-safe (by design).

          Look at BufferedIndexInput.seek() and BufferedIndexInput.refill()

          Show
          yseeley@gmail.com Yonik Seeley added a comment - On 10/19/06, Ning Li <ning.li.li@gmail.com> wrote: > What makes, for example, FSIndexInput and its clones, thread-safe is > the following. > That is, the method is synchronized on the file object. > > protected void readInternal(byte[] b, int offset, int len) > throws IOException { > synchronized (file) { I don't think that's sufficient in part because the IndexInput's state is manipulated outside that sync block. The sync block is to protect the file only, not the IndexInput, which isn't thread-safe (by design). Look at BufferedIndexInput.seek() and BufferedIndexInput.refill()
          Hide
          gsingers Grant Ingersoll added a comment -

          Do you have a preference on aggressive cloning vs. synching? cloning is easier to implement, but then the Field is no longer as lazy as it could be, on the other hand synching on the fieldsStream in LazyField and doc() could be pretty expensive, too, right?

          Show
          gsingers Grant Ingersoll added a comment - Do you have a preference on aggressive cloning vs. synching? cloning is easier to implement, but then the Field is no longer as lazy as it could be, on the other hand synching on the fieldsStream in LazyField and doc() could be pretty expensive, too, right?
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Here's one possible patch... not sure if it's the best approach or not.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - Here's one possible patch... not sure if it's the best approach or not.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          > Do you have a preference on aggressive cloning vs. synching?

          Not too much... synching was presumably avoided for better concurrency. If we wanted to sync in LazyField, it would currently need to be on the SegmentReader object, since that's the monitor protecting the fieldStream.

          Could also perhaps lazily create the "cloneable stream" the first time a lazy field is created, since we are currently in a synchronized context when that happens.

          A different route would be to convert normal document() loading to also use thread-local fieldStreams for better concurrency. It would mean a thread-local lookup per document though, and I'm not sure what the expense of that would be.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - > Do you have a preference on aggressive cloning vs. synching? Not too much... synching was presumably avoided for better concurrency. If we wanted to sync in LazyField, it would currently need to be on the SegmentReader object, since that's the monitor protecting the fieldStream. Could also perhaps lazily create the "cloneable stream" the first time a lazy field is created, since we are currently in a synchronized context when that happens. A different route would be to convert normal document() loading to also use thread-local fieldStreams for better concurrency. It would mean a thread-local lookup per document though, and I'm not sure what the expense of that would be.
          Hide
          gsingers Grant Ingersoll added a comment -

          Patch seems reasonable, and your idea of waiting till we have a lazy field before cloning seems reasonable, as my guess is, at least for now, that Lazy Fields are not in widespread use just yet.

          Show
          gsingers Grant Ingersoll added a comment - Patch seems reasonable, and your idea of waiting till we have a lazy field before cloning seems reasonable, as my guess is, at least for now, that Lazy Fields are not in widespread use just yet.
          Hide
          paul.elschot@xs4all.nl Paul Elschot added a comment -

          I have not followed this, but some time ago I've had a problem with cloning and BufferedIndexInput, see LUCENE-430.
          Searching jira for BufferedIndexInput gives 5 hits currently.

          Regards,
          Paul Elschot

          Show
          paul.elschot@xs4all.nl Paul Elschot added a comment - I have not followed this, but some time ago I've had a problem with cloning and BufferedIndexInput, see LUCENE-430 . Searching jira for BufferedIndexInput gives 5 hits currently. Regards, Paul Elschot
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          OK, I committed this fix for now.
          For BufferedIndexInput, the buffer is allocated lazily, so the extra clone doesn't use much space.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - OK, I committed this fix for now. For BufferedIndexInput, the buffer is allocated lazily, so the extra clone doesn't use much space.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          OK, this problem is no longer hypothetical!
          I was able to produce a test that failed quickly most of the time w/o the patch.
          I've just committed it.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - OK, this problem is no longer hypothetical! I was able to produce a test that failed quickly most of the time w/o the patch. I've just committed it.
          Hide
          mikemccand Michael McCandless added a comment -

          Closing all issues that were resolved for 2.1.

          Show
          mikemccand Michael McCandless added a comment - Closing all issues that were resolved for 2.1.

            People

            • Assignee:
              Unassigned
              Reporter:
              yseeley@gmail.com Yonik Seeley
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development