Solr
SOLR-1911

File descriptor leak while indexing, may cause index corruption

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.5
    • Fix Version/s: None
    • Component/s: update
    • Labels: None

      Description

      While adding documents to an already existing index using this build, the number of open file descriptors increases dramatically until the per-process open-file limit (1024) is reached, at which point there are error messages in the log to that effect. If the server is restarted, the index may be corrupt.

      Commits are handled by autocommit every 60 seconds or 500 documents (usually the time limit is reached first).
      mergeFactor is 10.

      It looks as though each time a commit takes place, the number of open files (obtained from "lsof -p `cat solr.pid` | egrep ' [0-9]+r '") increases by 40. There are several open file descriptors associated with each file in the index.

      Rerunning the same index updates with an older Solr (built from trunk in February 2010) doesn't show this problem: the number of open files fluctuates up and down as segments are created and merged, but stays basically constant.
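      A minimal shell loop along these lines can track the open-file count over time (a sketch, assuming the PID is stored in solr.pid as in the command above):

          # Sample the count of open read descriptors every 60 seconds,
          # matching the autocommit interval.
          while true; do
              printf '%s %s\n' "$(date +%T)" \
                  "$(lsof -p "$(cat solr.pid)" | egrep -c ' [0-9]+r ')"
              sleep 60
          done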

      Attachments

      1. indexlsof.tar.gz (33 kB, Simon Rosenthal)
      2. openafteropt.txt (18 kB, Simon Rosenthal)

        Activity

        Hoss Man added a comment -

        Moving log dump out of problem description and into a comment...

        May 13, 2010 12:37:04 PM org.apache.solr.update.DirectUpdateHandler2 commit
        INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
        May 13, 2010 12:37:04 PM org.apache.solr.update.DirectUpdateHandler2$CommitTracker run
        SEVERE: auto commit error...
        java.io.FileNotFoundException: /home/simon/rig2/solr/core1/data/index/_j2.nrm (Too many open files)
        	at java.io.RandomAccessFile.open(Native Method)
        	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
        	at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.<init>(SimpleFSDirectory.java:69)
        	at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.<init>(SimpleFSDirectory.java:90)
        	at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.<init>(NIOFSDirectory.java:80)
        	at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:67)
        	at org.apache.lucene.index.SegmentReader.openNorms(SegmentReader.java:1093)
        	at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:532)
        	at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:634)
        	at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:610)
        	at org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:1012)
        	at org.apache.lucene.index.IndexWriter.applyDeletes(IndexWriter.java:4563)
        	at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:3775)
        	at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3623)
        	at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3614)
        	at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1769)
        	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1732)
        	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1696)
        	at org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:230)
        	at org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:181)
        	at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:409)
        	at org.apache.solr.update.DirectUpdateHandler2$CommitTracker.run(DirectUpdateHandler2.java:602)
        	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
        	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        	at java.lang.Thread.run(Thread.java:619)
        May 13, 2010 12:37:04 PM org.apache.solr.update.processor.LogUpdateProcessor finish
        INFO: {} 0 1
        May 13, 2010 12:37:04 PM org.apache.solr.common.SolrException log
        SEVERE: java.io.IOException: directory '/home/simon/rig2/solr/core1/data/index' exists and is a directory, but cannot be listed: list() returned null
        	at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:223)
        	at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:234)
        	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:582)
        	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:535)
        	at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:316)
        	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1129)
        	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:999)
        	at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:191)
        	at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:99)
        	at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
        	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:220)
        	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
        	at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
        	at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
        	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1321)
        	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
        	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
        	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        	at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        	at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        	at org.mortbay.jetty.Server.handle(Server.java:285)
        	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        	at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
        	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
        	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
        	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        	at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:368)
        	at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
        
        
        
        Hoss Man added a comment -

        Simon: can you please provide a few more details:

        1) What mechanism are you using to index content? i.e., POSTing XML from a remote client? Using the stream.url or stream.file params? Using SolrCell? Using DIH? (And if you are using DIH, from what source? DB? HTTP? File? ...and with what transformers?)

        2) What files does lsof show are open after each successive commit until the limit is reached? Seeing how the file list grows over time, specifically which files are never getting closed, is really the only way to track down what code isn't closing files.

        (The stack trace you posted shows what it's doing when it runs out of file handles, but there isn't necessarily any correlation between that and what code should be closing files but isn't.)
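        A sketch of that capture procedure (snapshot names are illustrative; solr.pid is assumed to hold the server PID):

            # After the Nth commit, snapshot the open read descriptors,
            # then diff against the previous snapshot to see which files
            # were opened and never closed (skip the diff for N=0).
            lsof -p "$(cat solr.pid)" | egrep ' [0-9]+r ' > "lsof.$N"
            diff "lsof.$((N-1))" "lsof.$N"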

        Yonik Seeley added a comment -

        Going from February to May crosses the point when Solr went from Lucene 2.9 to Lucene 3.0.

        If there is a fd leak, one place to look would be the ref counting in IndexReader and in SolrIndexReader, which extends FilterIndexReader. IndexReader is notoriously hard to wrap (subclass and delegate), and any change in implementation can introduce or expose new bugs.

        Hoss Man added a comment -

        one place to look would be ref counting in IndexReader

        True, but at this point, without any idea what files we're dealing with, it could be anything (SOLR-1877 comes to mind).

        Simon Rosenthal added a comment -

        1) What mechanism are you using to index content? i.e., POSTing XML from a remote client? Using the stream.url or stream.file params? Using SolrCell? Using DIH? (And if you are using DIH, from what source? DB? HTTP? File? ...and with what transformers?)

        Posting XML from a local client; not using stream.url or stream.file.

        2) What files does lsof show are open after each successive commit until the limit is reached? Seeing how the file list grows over time, specifically which files are never getting closed, is really the only way to track down what code isn't closing files.

        Will attach the lsof output taken after it reached the limit.

        Simon Rosenthal added a comment -

        lsof output after the error occurred

        Yonik Seeley added a comment -

        Hmmm, not sure I see any smoking guns in that lsof output.

        Are you using the stock solrconfig.xml from the example?
        If not, can you try the latest trunk with just the example server? This will help narrow down whether it's a problem with some other component, such as the one Hoss pointed out: SOLR-1877.
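        For reference, a sketch of the usual steps at the time (the checkout URL is the post-merge trunk; paths may differ):

            # Check out trunk, build the example server, and start it
            # with the stock configuration.
            svn checkout http://svn.apache.org/repos/asf/lucene/dev/trunk
            cd trunk/solr
            ant example
            cd example && java -jar start.jar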

        Simon Rosenthal added a comment -

        OK. I built from the latest trunk and used the schema associated with the index and the example solrconfig.xml, as you asked.

        • Started with a snapshot of the index taken before this issue reared its head
        • used post.sh to add a file with around 800 documents (different one each time)
        • did a commit (no autocommit)
        • did an lsof on the process

        Repeated the add/commit/lsof cycle five times (a sketch of the loop follows).
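        A sketch of that cycle (batch file names are hypothetical; assumes the example post.sh, which issues a commit after posting; snapshot names follow the attached lsof.0 through lsof.5):

            # Add a batch of ~800 docs, then snapshot the open read
            # descriptors; repeat for six batches.
            for i in 0 1 2 3 4 5; do
                ./post.sh "docs$i.xml"
                lsof -p "$(cat solr.pid)" | egrep ' [0-9]+r ' > "lsof.$i"
            done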

        The attached tarball contains the lsof outputs, and we're still seeing the number of fds increasing by 38-40 after each commit. I didn't go to the bitter end, but I assume we'd get there...

        Here's a clue, maybe: in each lsof output, I looked for file descriptors associated with one .prx file that was present in the original index snapshot:

        grep -c _r8.prx lsof.*
        lsof.0:1
        lsof.1:2
        lsof.2:3
        lsof.3:4
        lsof.4:5
        lsof.5:6

        The .frq files seem to have the same pattern.

        I'm assuming that's not good...
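        The same check can be extended to every file in the index directory to see which ones accumulate descriptors (a sketch; substitute the index path actually in use):

            # Count open descriptors per index file in the latest snapshot;
            # files whose count grows with each commit are the leaked ones.
            for f in /path/to/data/index/*; do
                printf '%s %s\n' "$(basename "$f")" \
                    "$(grep -c "$(basename "$f")" lsof.5)"
            done | sort -k2 -rn | head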

        Yonik Seeley added a comment -

        Here's a clue, maybe: in each lsof output, I looked for file descriptors associated with one .prx file that was present in the original index snapshot:

        grep -c _r8.prx lsof.*
        lsof.0:1
        lsof.1:2
        lsof.2:3
        lsof.3:4
        lsof.4:5
        lsof.5:6

        OK... this definitely sounds like a leak then. And the fact that you're using the stock solrconfig.xml should rule out spellchecker, replication, etc.

        Simon Rosenthal added a comment -

        Not able to repeat the problem with a build from branch_3x done earlier today:

        Solr Specification Version: 3.0.0.2010.06.18.12.54.02
        Solr Implementation Version: 3.1-dev 956045 - simon - 2010-06-18 12:54:02
        Lucene Specification Version: 3.1-dev
        Lucene Implementation Version: 3.1-dev 956045 - 2010-06-18 12:54:56

        Yonik Seeley added a comment -

        Not able to repeat the problem with a build from branch_3x done earlier today

        Interesting... but do you still see the problem with trunk?

        Simon Rosenthal added a comment -

        No, it seems to have cleared up with trunk also.

        I'm OK with closing it, but I'm really curious to know what changed between mid-May and today to clear up the problem.


          People

          • Assignee: Unassigned
          • Reporter: Simon Rosenthal
          • Votes: 0
          • Watchers: 2
