ManifoldCF / CONNECTORS-1317

Crawling job hangs on some ZIP documents

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: ManifoldCF 2.3
    • Fix Version/s: ManifoldCF 2.5
    • Component/s: File system connector
    • Labels: None
    • Environment: Ubuntu 14.04 Linux 3.13.0-86-generic i686 i686

      java version "1.8.0_31"
      Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
      Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)

      DB: Postgres 9.5.1

    Description

      I use ManifoldCF as a file crawler, but I found that the crawling process hangs on some zip files, although other files are parsed normally.

      Steps:
      1. Run ManifoldCF via "example/start.sh" with Postgres as the DB
      2. Create a ManifoldCF pipeline: File -> Tika -> Solr
      3. Put the zip file (linked below) in the crawled folder
      4. Run job

      Here is the zip file that should reproduce the bug:
      "ManifoldCF_ISSUE_Dive.Into.Python.3.Mark.Pilgrim.2009.zip"
      https://yadi.sk/d/0uSdrR5GrsgmG
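
      To rule out a damaged archive before blaming the crawler, the file can be checked outside ManifoldCF. The following is only a minimal sketch using Python's standard zipfile module, not the Tika parser that the pipeline uses; the function name summarize_zip is hypothetical.

```python
import zipfile


def summarize_zip(path):
    """Return (entry_count, total_uncompressed_bytes, bad_entry) for a ZIP file.

    bad_entry is the name of the first member that fails its CRC check,
    or None if every member reads back cleanly.
    """
    with zipfile.ZipFile(path) as archive:
        infos = archive.infolist()
        total = sum(info.file_size for info in infos)
        # testzip() reads every member and verifies its CRC.
        bad_entry = archive.testzip()
    return len(infos), total, bad_entry
```

      If this returns quickly with no bad entry, the archive itself is probably readable, which would point at the crawler or the extraction step rather than at the file.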

      Note:
      Based on my investigation with strace, the crawler process tries to open and parse the same zip file again and again (apparently from different worker threads). It seems that the document is never removed from the queue.
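
      The repeated opens can be counted from a saved strace log. This is a rough sketch, assuming the log contains ordinary open()/openat() lines with the path in double quotes (for example, as captured with something like "strace -f -e trace=open,openat -p PID -o trace.log"); the name count_opens and the sample paths are hypothetical.

```python
import re
from collections import Counter

# Matches open()/openat() calls in strace output and captures the quoted path.
# Assumption: paths contain no escaped double quotes.
OPEN_CALL = re.compile(r'\bopen(?:at)?\((?:[^",]+,\s*)?"([^"]+)"')


def count_opens(strace_lines, suffix=".zip"):
    """Count how often each path ending in `suffix` appears in open calls.

    Failed opens are counted too, because only the call itself is matched,
    not its return value.
    """
    counts = Counter()
    for line in strace_lines:
        match = OPEN_CALL.search(line)
        if match and match.group(1).lower().endswith(suffix):
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '[pid 101] open("/data/docs/book.zip", O_RDONLY) = 45',
        '[pid 102] openat(AT_FDCWD, "/data/docs/book.zip", O_RDONLY) = 46',
        '[pid 101] open("/etc/hosts", O_RDONLY) = 47',
    ]
    for path, n in count_opens(sample).most_common():
        print(n, path)
```

      A count that keeps growing for a single zip file while the job makes no progress would be consistent with the document being picked up from the queue repeatedly.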

      I am new to ManifoldCF, so it is hard for me to find the problem in the source code.

      I can send some additional info if needed.

          People

            Assignee: Karl Wright (kwright@metacarta.com)
            Reporter: Mr.Keuz (mrkeuz)
            Votes: 0
            Watchers: 2
