Nutch / NUTCH-1854

./bin/crawl fails with a parsing fetcher

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.9
    • Fix Version/s: 1.10
    • Component/s: parser

    Description

      If you run ./bin/crawl with a parsing fetcher, i.e. with fetcher.parse set to true, e.g.

      <property>
        <name>fetcher.parse</name>
        <value>true</value>
        <description>If true, fetcher will parse content. Default is false,
        which means that a separate parsing step is required after fetching
        is finished.</description>
      </property>

      the separate parse step of the crawl script then fails with the following unhelpful exception:

      Exception in thread "main" java.io.IOException: Segment already parsed!

      We could improve this by making the log message more informative and by adding a check to the crawl script: if crawl_parse already exists for a given segment, the script skips the parse step for that segment.
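
      The check proposed above could look roughly like the following sketch. It is not the content of the attached patches. The helper names segment_is_parsed and parse_if_needed are hypothetical, and the directory test assumes the segment is on the local filesystem (on HDFS the test would have to go through something like `hadoop fs -test -d` instead).

```shell
#!/usr/bin/env bash
# Sketch only: skip the parse step when the fetcher already parsed the segment.
# Assumption: the segment directory is on the local filesystem.

# Hypothetical helper: succeeds if the segment already contains crawl_parse.
segment_is_parsed() {
  [ -d "$1/crawl_parse" ]
}

# Hypothetical helper: run the given parse command on the segment unless the
# segment has already been parsed.
# Usage: parse_if_needed <segment_dir> <command> [args...]
parse_if_needed() {
  local segment="$1"
  shift
  if segment_is_parsed "$segment"; then
    echo "Skipping parse: $segment/crawl_parse already exists"
  else
    # The segment path is appended as the last argument of the command.
    "$@" "$segment"
  fi
}
```

      Inside the crawl script this might be called as `parse_if_needed "$CRAWL_PATH/segments/$SEGMENT" bin/nutch parse`, where the two variable names are assumptions about how the script refers to the current segment.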

      Attachments

        1. NUTCH-1854ver1.patch
          2 kB
          Asitang Mishra
        2. NUTCH-1854ver2.patch
          3 kB
          Asitang Mishra
        3. NUTCH-1854ver3.patch
          2 kB
          Asitang Mishra
        4. NUTCH-1854ver4.patch
          2 kB
          Asitang Mishra

          People

            snagel Sebastian Nagel
            lewismc Lewis John McGibbney
            Votes: 1
            Watchers: 7
