ManifoldCF / CONNECTORS-576

Manifold gets repeated service interruptions and stops

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not a Problem
    • Affects Version/s: ManifoldCF next
    • Fix Version/s: ManifoldCF 1.1
    • Component/s: None
    • Labels:
      None
    • Environment:

      Solr 4.0, ManifoldCF v1.1-dev on Windows 7

      Description

      ManifoldCF gets repeated service interruptions and stops.
      Is there a way to get more detailed error information,
      such as the name, URL, or location of the document it's having a problem with?
      In v.5.1 these errors would appear at the very end (the last 130 to 184 documents) and then it would stop.
      The Solr logs always reported vague Tika errors.
      I'm unsure where the problem lies.
      Here's the ManifoldCF log:
      WARN 2012-12-04 10:27:40,722 (Worker thread '0') - Service interruption reported for job 1343845636068 connection 'LISA-DEV': Error 500 from ingestion request; ingestion will be retried again later
      ERROR 2012-12-04 10:27:40,754 (Worker thread '0') - Exception tossed: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500
      org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500
      at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
      Caused by: org.apache.manifoldcf.core.interfaces.ManifoldCFException: Ingestion HTTP error code 500
      at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:1386)
      WARN 2012-12-04 10:27:40,847 (Worker thread '24') - Service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active

      And here's the solr log if it helps:
      org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: XML parse error
      at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:215)
      at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
      at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)
      at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
      at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
      at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
      at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
      at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
      at

        Activity

        Karl Wright added a comment -

        The exception indicates that the cause of the problem is a 500 error from Solr. This is likely to be due to the Tika exceptions you are seeing.

        It is quite possible that Tika is misconfigured. Unfortunately I cannot help you with that problem; you will have to go to the Tika project for that.

        Karl Wright added a comment -

        You see these at the end because ManifoldCF retries a failed document whenever there is any possibility that it could succeed on a later attempt. But it only does this for so long before it concludes that the problem isn't going away. If specific documents are causing the issue, they all tend to accumulate at the end of a job run.

        If you want to see which documents are failing, just go to the Simple History report and you should see failed ingestion attempts, which provide the document URL.
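
The retry behaviour described in this comment can be illustrated with a toy model. This is a minimal sketch, not ManifoldCF code: the function `crawl_order`, the `max_attempts` limit, and the file names are all hypothetical, and the real crawler retries on a time basis rather than a fixed count. It only shows why documents that keep failing end up clustered at the tail of a run.

```python
from collections import deque


def crawl_order(docs, always_fails, max_attempts=3):
    """Return (processing log, given-up docs) for a toy retry queue.

    Hypothetical model, not ManifoldCF's implementation: a failed
    document goes to the back of the queue until it has used up
    max_attempts tries, after which it is abandoned.
    """
    queue = deque((doc, 1) for doc in docs)
    log, gave_up = [], []
    while queue:
        doc, attempt = queue.popleft()
        if doc in always_fails:
            log.append((doc, "fail"))
            if attempt < max_attempts:
                queue.append((doc, attempt + 1))  # try again later
            else:
                gave_up.append(doc)  # conclude the problem is not going away
        else:
            log.append((doc, "ok"))
    return log, gave_up


# Two documents that can never be ingested, mixed in with two that can.
log, gave_up = crawl_order(
    ["a.doc", "b.ppt", "c.pdf", "d.ppt"], always_fails={"b.ppt", "d.ppt"}
)
for doc, status in log:
    print(doc, status)
print("gave up on:", gave_up)
```

In this model every healthy document is finished early, so the final entries in the log are all failures from the same few documents.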

        Karl Wright added a comment -

        Not a ManifoldCF problem; see Tika

        David Morana added a comment -

        I agree; I checked the Simple History and I only saw Ok and 200; no errors.
        Would there be any value in making ManifoldCF work around these?
        Such as, finish everything else and then go back and try these later?
        Corral these docs and put them in a report? Documents successfully crawled but Solr couldn't take them...
        Let me know,
        Thanks,
        BTW, I really like ManifoldCF; I looked at a lot of other crawlers and this is the only one that does what I need it to do.
        Thanks again,

        Karl Wright added a comment -

        Any errors should also be printed in the manifoldcf.log, as warnings. Do you see anything there?

        I believe there is a configuration setting in Solr that will allow it to ignore Tika errors. But I don't recall what it is.
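
One candidate for the setting mentioned here is the `ignoreTikaException` request parameter documented for Solr's ExtractingRequestHandler (Solr Cell). Whether it is available in a particular Solr build, and whether it suppresses this specific XML parse error, are assumptions that need checking against that version's documentation. The sketch below only builds an extract-handler URL carrying the parameter. The helper name `extract_url`, the base URL, and the document id are hypothetical.

```python
from urllib.parse import urlencode


def extract_url(solr_base, doc_id, ignore_tika_errors=True):
    """Build a URL for Solr's /update/extract handler.

    Assumption: the target Solr version honours ignoreTikaException.
    """
    params = {"literal.id": doc_id}
    if ignore_tika_errors:
        # Ask Solr Cell to skip Tika parse failures instead of returning 500.
        params["ignoreTikaException"] = "true"
    return f"{solr_base}/update/extract?{urlencode(params)}"


url = extract_url("http://localhost:8983/solr", "doc-1")
print(url)
```

Because ManifoldCF issues the ingestion requests itself, the parameter would more likely be supplied as a default on the extract handler in `solrconfig.xml` than added by hand to each request.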

        David Morana added a comment -

        All I see in the log are vague errors.
        Is there any way to see exactly which docs are not getting into Solr?
        I can't seem to find that setting for ignoring Tika errors. If you happen to find it, please send it my way...
        I'm almost certain that the Tika error is a problem with PPT docs.

        David Morana added a comment -

        Here's the log; the errors just say it gave up...
        WARN 2012-12-04 11:54:20,556 (Worker thread '1') - Service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:40,689 (Worker thread '41') - Service interruption reported for job 1343845636068 connection 'LISA-DEV': Error 500 from ingestion request; ingestion will be retried again later
        ERROR 2012-12-04 12:55:40,709 (Worker thread '41') - Exception tossed: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500
        org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
        Caused by: org.apache.manifoldcf.core.interfaces.ManifoldCFException: Ingestion HTTP error code 500
        at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:1386)
        WARN 2012-12-04 12:55:41,899 (Worker thread '18') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:42,299 (Worker thread '30') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:43,456 (Worker thread '27') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:43,877 (Worker thread '19') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:44,876 (Worker thread '0') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:45,266 (Worker thread '31') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:46,420 (Worker thread '7') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:46,467 (Worker thread '43') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:46,826 (Worker thread '2') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:47,840 (Worker thread '16') - Service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:47,871 (Worker thread '17') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:47,902 (Worker thread '28') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:48,261 (Worker thread '9') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:48,682 (Worker thread '49') - Service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:49,118 (Worker thread '12') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:49,148 (Worker thread '20') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:49,178 (Worker thread '48') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:49,478 (Worker thread '33') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:49,878 (Worker thread '46') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:50,398 (Worker thread '35') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:51,760 (Worker thread '11') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:54,812 (Worker thread '47') - Service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:55,402 (Worker thread '24') - Service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
        WARN 2012-12-04 12:55:58,548 (Worker thread '29') - Service interruption reported for job 1343845636068 connection 'LISA-DEV': Error 500 from ingestion request; ingestion will be retried again later
        WARN 2012-12-04 12:55:59,328 (Worker thread '8') - Service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active
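
A log like the one above is easier to read once the real 500 errors are separated from the "Job no longer active" warnings that follow the abort. The sketch below is a minimal parser, assuming the line layout shown in this log. The regular expressions and the `summarize` helper are hypothetical and written only against these sample lines. These log lines carry no document URL, so the script can count reasons but cannot name the failing documents.

```python
import re
from collections import Counter

# Matches the prefix of a manifoldcf.log line as it appears in this thread.
LINE_RE = re.compile(
    r"^(?P<level>WARN|ERROR)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\(Worker thread '(?P<thread>\d+)'\) - "
    r"(?P<msg>.*)$"
)
# Matches both "Service interruption" and "Pre-ingest service interruption".
INTERRUPT_RE = re.compile(
    r"^(?:Pre-ingest service|Service) interruption reported for job "
    r"(?P<job>\d+) connection '(?P<conn>[^']+)': (?P<reason>.*)$"
)


def summarize(lines):
    """Count service-interruption warnings by their stated reason."""
    reasons = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # stack-trace lines and anything else unrecognised
        hit = INTERRUPT_RE.match(m.group("msg"))
        if hit:
            reasons[hit.group("reason")] += 1
    return reasons


sample = [
    "WARN 2012-12-04 12:55:40,689 (Worker thread '41') - Service interruption reported for job 1343845636068 connection 'LISA-DEV': Error 500 from ingestion request; ingestion will be retried again later",
    "ERROR 2012-12-04 12:55:40,709 (Worker thread '41') - Exception tossed: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500",
    "at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)",
    "WARN 2012-12-04 12:55:41,899 (Worker thread '18') - Pre-ingest service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active",
    "WARN 2012-12-04 12:55:47,840 (Worker thread '16') - Service interruption reported for job 1343845636068 connection 'LISA-DEV': Job no longer active",
]
counts = summarize(sample)
for reason, n in counts.most_common():
    print(n, reason)
```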

        Karl Wright added a comment -

        Well, if you want lots of output, you can turn on connector debugging. In your properties.xml, add the following:
        <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>

        then restart the agents process, and rerun the job.
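
The edit to properties.xml can also be scripted. This is a minimal sketch using Python's standard `xml.etree.ElementTree`. The helper name `enable_connector_debug` is hypothetical, and the `<configuration>` root element in the sample is an assumption about the file layout. Only the property name and value come from the comment above.

```python
import xml.etree.ElementTree as ET

DEBUG_PROPERTY = "org.apache.manifoldcf.connectors"


def enable_connector_debug(xml_text):
    """Return properties.xml content with connector debugging set to DEBUG."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.get("name") == DEBUG_PROPERTY:
            prop.set("value", "DEBUG")  # property already present: update it
            break
    else:
        # Property not present yet: append it under the root element.
        ET.SubElement(root, "property", {"name": DEBUG_PROPERTY, "value": "DEBUG"})
    return ET.tostring(root, encoding="unicode")


# Assumed minimal file layout; a real properties.xml holds other properties too.
updated = enable_connector_debug("<configuration></configuration>")
print(updated)
```

As the comment says, the agents process still has to be restarted and the job rerun before the extra output appears.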


          People

          • Assignee: Unassigned
          • Reporter: David Morana
          • Votes: 0
          • Watchers: 2
