Jackrabbit Content Repository
JCR-2576

DbInputStream does not support mark()/reset() when exhausted.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.1
    • Component/s: jackrabbit-core
    • Labels:
      None

      Description

      The DbDataStore implementation uses a DbInputStream to read binary properties from the database. When a new binary property is created, Jackrabbit attempts to index it. Tika's CharsetDetector is used in the process, which marks the input stream, reads the first 8000 bytes and then resets the stream.

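      This results in the stack trace shown at the end of this description if both of the following conditions hold, because the stream then reaches EOF before 8000 bytes have been read and the subsequent reset() fails: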

      • the property is larger than the minRecordLength configuration of the Datastore and
      • the property is smaller than 8000 bytes
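
The failure mode can be reproduced in isolation. The sketch below is not Jackrabbit or Tika code: `AutoClosingStream` is a hypothetical stand-in that closes its source at EOF and then refuses `reset()`, and `MarkResetProbe.probe` only imitates the mark, read-up-to-limit, reset sequence described above. The payload sizes are arbitrary.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in: releases its source once EOF is reached. */
class AutoClosingStream extends FilterInputStream {
    private boolean exhausted;

    AutoClosingStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        return read(one, 0, 1) < 0 ? -1 : one[0] & 0xff;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (exhausted) {
            return -1;
        }
        int n = in.read(buf, off, len);
        if (n < 0) {
            exhausted = true; // auto-close at EOF
            in.close();
        }
        return n;
    }

    @Override
    public synchronized void reset() throws IOException {
        if (exhausted) {
            throw new EOFException(); // the behaviour this issue reports
        }
        in.reset();
    }
}

class MarkResetProbe {
    /** Marks, reads up to limit bytes, resets. Returns "ok" or the exception's simple name. */
    static String probe(InputStream in, int limit) {
        try {
            in.mark(limit);
            byte[] buf = new byte[limit];
            int total = 0;
            while (total < limit) {
                int n = in.read(buf, total, limit - total);
                if (n < 0) {
                    break; // EOF before the limit was reached
                }
                total += n;
            }
            in.reset();
            return "ok";
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }

    static String probeAutoClosing(int payloadSize, int limit) {
        InputStream source = new ByteArrayInputStream(new byte[payloadSize]);
        return probe(new AutoClosingStream(source), limit);
    }
}
```

With a limit of 8000, `probeAutoClosing(5000, 8000)` hits EOF during the probe and fails on `reset()`, while `probeAutoClosing(10000, 8000)` never reaches EOF and succeeds.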

      The DbInputStream needs to have the following properties:
      1. lazy instantiation of the underlying stream
      2. auto-close underlying stream when EOF is reached
      3. fully support mark()/reset() even if the underlying stream is auto-closed due to 2.
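
As a rough illustration of how the three properties can coexist, here is a minimal sketch. It is not the attached patch and not the committed fix: `StreamOpener`, `LazyMarkableStream` and `LazyMarkableDemo` are hypothetical names, and the stream keeps its own copy of the bytes read since `mark()` so that `reset()` no longer depends on the underlying stream being open.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical factory for the underlying stream; invoked on the first read only. */
interface StreamOpener {
    InputStream open() throws IOException;
}

class LazyMarkableStream extends InputStream {
    private final StreamOpener opener;
    private InputStream source;          // null until the first read, and again after EOF
    private boolean finished;            // source exhausted or stream closed
    private byte[] replay = new byte[0]; // bytes read since the last mark()
    private int replayLen;
    private int replayPos;               // next buffered byte to serve
    private int readLimit;
    private boolean marked;

    LazyMarkableStream(StreamOpener opener) {
        this.opener = opener;
    }

    @Override
    public int read() throws IOException {
        if (replayPos < replayLen) {
            return replay[replayPos++] & 0xff; // replaying after reset()
        }
        if (finished) {
            return -1;
        }
        if (source == null) {
            source = opener.open();            // property 1: lazy instantiation
        }
        int b = source.read();
        if (b < 0) {                           // property 2: auto-close at EOF
            finished = true;
            source.close();
            source = null;
            return -1;
        }
        if (marked) {
            remember((byte) b);
        }
        return b;
    }

    private void remember(byte b) {
        if (replayLen >= readLimit) {          // read past the limit: the mark is void
            marked = false;
            replayLen = 0;
            replayPos = 0;
            return;
        }
        if (replayLen == replay.length) {
            replay = Arrays.copyOf(replay, Math.max(64, replayLen * 2));
        }
        replay[replayLen++] = b;
        replayPos = replayLen;
    }

    @Override
    public boolean markSupported() {
        return true;
    }

    @Override
    public void mark(int limit) {
        // Buffered bytes not yet re-read come after the new mark, so keep them.
        replay = Arrays.copyOfRange(replay, replayPos, replayLen);
        replayLen = replay.length;
        replayPos = 0;
        readLimit = Math.max(limit, replayLen);
        marked = true;
    }

    @Override
    public void reset() throws IOException {
        if (!marked) {
            throw new IOException("mark not set or invalidated");
        }
        replayPos = 0;                         // property 3: works after auto-close
    }

    @Override
    public void close() throws IOException {
        finished = true;
        marked = false;
        replayLen = 0;
        replayPos = 0;
        if (source != null) {
            source.close();
            source = null;
        }
    }
}

class LazyMarkableDemo {
    static int opens; // how often the underlying stream was opened

    /** Marks, drains to EOF, resets, and returns what can be read again. */
    static String roundTrip(String text, int limit) {
        opens = 0;
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        StreamOpener opener = () -> { opens++; return new ByteArrayInputStream(data); };
        try (InputStream in = new LazyMarkableStream(opener)) {
            in.mark(limit);
            while (in.read() >= 0) { /* drain to EOF, which closes the source */ }
            in.reset();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int b = in.read(); b >= 0; b = in.read()) {
                out.write(b);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The sketch only overrides the single-byte `read()` for brevity; a real implementation would also want a bulk `read(byte[], int, int)` and a bound on how much memory the replay buffer may take.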

      12.03.2010 15:53:28 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 165)
      java.io.EOFException
      at org.apache.jackrabbit.core.data.db.DbInputStream.reset(DbInputStream.java:180)
      at org.apache.tika.io.ProxyInputStream.reset(ProxyInputStream.java:156)
      at org.apache.tika.io.ProxyInputStream.reset(ProxyInputStream.java:156)
      at org.apache.tika.parser.txt.CharsetDetector.setText(CharsetDetector.java:131)
      at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:77)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:114)
      at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:160)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      at java.lang.Thread.run(Thread.java:619)

      1. DbInputStream.patch
        16 kB
        Julian Sedding

        Activity

        Julian Sedding created issue -
        Thomas Mueller made changes -
        Field Original Value New Value
        Assignee Thomas Mueller [ tmueller ]
        Julian Sedding added a comment -

        I have started working on a patch, which is not fully functional yet. Unfortunately I currently don't have time to finish it off. It should illustrate a possible approach to solve the problem though.

        Julian Sedding made changes -
        Attachment DbInputStream.patch [ 12439140 ]
        Thomas Mueller added a comment -

        Thanks a lot for the patch! I think the only remaining issue is that closeOriginalStream() should not set originalStream to null.

        However, I would like to simplify things a bit by implementing the mark()/reset() features at a different layer (using BufferedInputStream if possible).

        A similar issue exists with TempFileInputStream by the way.
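
One possible reading of the BufferedInputStream idea, sketched under assumptions: the source stream below is a hypothetical stand-in (`selfClosing`) that closes itself at EOF and offers no mark support, and `BufferedReread` is an illustrative name, not a Jackrabbit class. `java.io.BufferedInputStream` implements `mark()`/`reset()` against its own buffer, so the wrapped stream is never asked to reset.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BufferedReread {
    /** Hypothetical source: no mark support of its own, closes its delegate at EOF. */
    static InputStream selfClosing(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            private boolean done;

            @Override
            public int read() throws IOException {
                byte[] one = new byte[1];
                return read(one, 0, 1) < 0 ? -1 : one[0] & 0xff;
            }

            @Override
            public int read(byte[] b, int off, int len) throws IOException {
                if (done) {
                    return -1;
                }
                int n = in.read(b, off, len);
                if (n < 0) {
                    done = true; // auto-close at EOF
                    in.close();
                }
                return n;
            }

            @Override
            public boolean markSupported() {
                return false;
            }
        };
    }

    /** Marks, drains to EOF, resets, and returns what can be read again. */
    static String rereadAfterEof(String text, int limit) {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new BufferedInputStream(selfClosing(data))) {
            in.mark(limit);
            while (in.read() >= 0) {
                // drain; the wrapped source closes itself on the way
            }
            in.reset(); // served from BufferedInputStream's own buffer
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int b = in.read(); b >= 0; b = in.read()) {
                out.write(b);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off is memory: the wrapper may hold up to the mark limit in its buffer, and the mark can become invalid once more than that many bytes have been read.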

        Thomas Mueller made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 2.0.1 [ 12314540 ]
        Resolution Fixed [ 1 ]
        Jukka Zitting made changes -
        Fix Version/s 2.1.0 [ 12314477 ]
        Fix Version/s 2.0.1 [ 12314540 ]
        Jukka Zitting made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition           Time In Source Status   Execution Times   Last Executer    Last Execution Date
        Open -> Resolved     4h 47m                  1                 Thomas Mueller   18/Mar/10 13:39
        Resolved -> Closed   39d 18h 43m             1                 Jukka Zitting    27/Apr/10 09:23

          People

          • Assignee:
            Thomas Mueller
            Reporter:
            Julian Sedding
          • Votes:
            0
            Watchers:
            0
