Nutch / NUTCH-1893

Parse-tika fails to parse feed files


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3, 1.9
    • Fix Version/s: 1.10, 2.3.1
    • Component/s: parser
    • Labels: None
    • Environment: Windows 7 + Cygwin + JDK 7

    Description

      During the Nutch parse step, I received the following error. It seems the parse-tika plugin is broken.

      $ /cygdrive/d/nutch_trunk/runtime/local/bin/nutch parse -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1 crawlId/segments/20141118235323

      java.lang.ExceptionInInitializerError
      at com.sun.syndication.io.SyndFeedInput.build(SyndFeedInput.java:136)
      at org.apache.tika.parser.feed.FeedParser.parse(FeedParser.java:70)
      at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:103)
      at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
      at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:101)
      at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
      at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
      at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
      Caused by: java.lang.NullPointerException
      at java.util.Properties$LineReader.readLine(Properties.java:434)
      at java.util.Properties.load0(Properties.java:353)
      at java.util.Properties.load(Properties.java:341)
      at com.sun.syndication.io.impl.PropertiesLoader.<init>(PropertiesLoader.java:74)
      at com.sun.syndication.io.impl.PropertiesLoader.getPropertiesLoader(PropertiesLoader.java:46)
      at com.sun.syndication.io.impl.PluginManager.<init>(PluginManager.java:54)
      at com.sun.syndication.io.impl.PluginManager.<init>(PluginManager.java:46)
      at com.sun.syndication.feed.synd.impl.Converters.<init>(Converters.java:40)
      at com.sun.syndication.feed.synd.SyndFeedImpl.<clinit>(SyndFeedImpl.java:59)
      ... 10 more
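
      Reading the trace: the NullPointerException is thrown inside java.util.Properties.load, which is called from the constructor of ROME's PropertiesLoader. One plausible reading, which this report does not confirm, is that a classpath lookup for a properties resource returned null and the null stream was then passed to Properties.load. The sketch below reproduces only that general mechanism with the JDK alone; the class name and the resource name are hypothetical and are not taken from ROME, Tika, or Nutch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class NullStreamDemo {
    // Hypothetical resource name, chosen so that the classpath lookup fails.
    static final String MISSING = "no/such/resource-for-demo.properties";

    /** Returns true if loading a missing classpath resource ends in a NullPointerException. */
    static boolean loadMissingResourceThrowsNpe() {
        ClassLoader cl = NullStreamDemo.class.getClassLoader();
        // getResourceAsStream returns null (it does not throw) when the resource is not found.
        InputStream in = cl.getResourceAsStream(MISSING);
        try {
            // Passing a null stream to Properties.load raises NullPointerException.
            new Properties().load(in);
            return false;
        } catch (NullPointerException e) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE on missing resource: " + loadMissingResourceThrowsNpe());
    }
}
```

      If this is what happens here, the properties file is probably not visible to the class loader in use when the feed parser runs, rather than the feed document itself being malformed. That remains an assumption until checked against the attached patch.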

      Attachments

        1. NUTCH-1893.mywang.141209.txt
          1 kB
          Angela Wang
        2. NUTCH-1893-v1.patch
          1.0 kB
          Sebastian Nagel


          People

            Assignee: Sebastian Nagel (snagel)
            Reporter: Angela Wang (angela_wang)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved:
