  1. Tika
  2. TIKA-1516

Downgrade Rome dependency to 0.9 to avoid nasty NPE

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6
    • Fix Version/s: 1.12
    • Component/s: parser
    • Labels:
      None

      Description

      As documented in this thread, Nutch's parse-tika plugin uses Rome 1.0, which it inherits directly from the same dependency in Tika's pom.xml.
      A downgrade is required.

      java.lang.Exception: java.lang.ExceptionInInitializerError
              at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
      Caused by: java.lang.ExceptionInInitializerError
              at com.sun.syndication.io.SyndFeedInput.build(SyndFeedInput.java:136)
              at org.apache.tika.parser.feed.FeedParser.parse(FeedParser.java:70)
              at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:105)
              at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
              at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:101)
              at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
              at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
              at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
              at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
              at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
              at java.util.concurrent.FutureTask.run(FutureTask.java:138)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              at java.lang.Thread.run(Thread.java:662)
      Caused by: java.lang.NullPointerException
              at java.util.Properties$LineReader.readLine(Properties.java:418)
              at java.util.Properties.load0(Properties.java:337)
              at java.util.Properties.load(Properties.java:325)
              at com.sun.syndication.io.impl.PropertiesLoader.<init>(PropertiesLoader.java:74)
              at com.sun.syndication.io.impl.PropertiesLoader.getPropertiesLoader(PropertiesLoader.java:46)
              at com.sun.syndication.io.impl.PluginManager.<init>(PluginManager.java:54)
              at com.sun.syndication.io.impl.PluginManager.<init>(PluginManager.java:46)
              at com.sun.syndication.feed.synd.impl.Converters.<init>(Converters.java:40)
              at com.sun.syndication.feed.synd.SyndFeedImpl.<clinit>(SyndFeedImpl.java:59)
              ... 16 more
      
      1. TIKA-1516.patch
        0.4 kB
        Lewis John McGibbney

          Activity

          lewismc Lewis John McGibbney added a comment -

          Small change. I apologise for not including a test case. I cannot locate the RSS feed that triggers this; otherwise I would write the trivial test.
          My build fails with the following:

          Results :
          Failed tests:   testPPTXOCR(org.apache.tika.parser.ocr.TesseractOCRParserTest)
            testDOCXOCR(org.apache.tika.parser.ocr.TesseractOCRParserTest)
            testPDFOCR(org.apache.tika.parser.ocr.TesseractOCRParserTest)
            testSingleImage(org.apache.tika.parser.ocr.TesseractOCRParserTest): OCR Testing not found in:(..)
          
          lewismc Lewis John McGibbney added a comment -

          Can anyone approve this, folks?

          gagravarr Nick Burch added a comment -

          Is it not possible to upgrade to a newer version of Rome with a fix for the NPE in it?

          What would we lose with the downgrade?

          What % of feeds are you seeing that fail? If that's a low number, is adding a try / catch / log an option?
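
          On the try / catch / log idea, one detail from the trace above may matter: the failure surfaces as java.lang.ExceptionInInitializerError, which is an Error rather than an Exception, so a catch clause for Exception would not intercept it. The JVM also marks a class whose static initializer failed as unusable, so later uses raise NoClassDefFoundError instead of retrying. The sketch below illustrates this with a hypothetical BrokenHolder class; it is a stand-in for illustration, not Rome or Tika code.

```java
public class StaticInitFailureSketch {

    /** Hypothetical stand-in for a class whose static initializer fails. */
    static class BrokenHolder {
        // Evaluated once, during class initialization.
        static final Object CONFIG = load();

        private static Object load() {
            throw new NullPointerException("simulated missing default config");
        }

        static void touch() {
            // Calling any static method triggers class initialization.
        }
    }

    /** Uses BrokenHolder and reports which Throwable type was raised, if any. */
    public static String attempt() {
        try {
            BrokenHolder.touch();
            return "ok";
        } catch (LinkageError e) {
            // Both ExceptionInInitializerError and NoClassDefFoundError
            // are LinkageErrors, not Exceptions.
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println("first use:  " + attempt());
        System.out.println("second use: " + attempt());
    }
}
```

          If this applies here, a catch-and-log wrapper would have to catch LinkageError (or Throwable), and it would hide the failure rather than recover from it, because feed parsing would stay broken for the lifetime of that classloader.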

          grossws Konstantin Gribov added a comment -

          Upstream bug isn't fixed yet. See https://github.com/rometools/rome/issues/130.

          grossws Konstantin Gribov added a comment -

          Also, it seems to be a classloader issue, so it should depend only on the call stack, not on the feed data.

          lewismc Lewis John McGibbney added a comment -

          Nick Burch,
          This Rome library is not under heavy development.
          There is no new version as far as I know.
          The current Tika tests for the RSSParser give little reliable information about regressions.
          Konstantin Gribov, can you expand? I am curious about your use of the word "should". As I said, I do not have the feed that this failed on, or else I would make best efforts to reproduce it and provide a test. Is the statement you've made on classloading not specific to the business logic of the Web server --> RSS feed parsing using Rome in the case you've cited?

          grossws Konstantin Gribov added a comment -

          Lewis John McGibbney, if I understood you correctly (English isn't my strong point), you asked whether this case is general or specific to some input. My point is that the issue has a general cause, which can be reproduced only with a non-trivial classloading scheme. A short analysis follows.

          Rome 1.0 uses Thread.currentThread().getContextClassLoader() to load both com/sun/syndication/rome.properties (the default Rome config) and rome.properties (the user config). If the classloader that loaded Rome is the current thread's context classloader (CCL) or one of its ancestors, this works fine.

          Earlier Rome (0.9) used the same classloader that loaded Rome's PluginManager class (which invokes PropertiesLoader). This approach finds the default Rome config, since both are loaded from the same jar. But if user code is loaded by another classloader (e.g. for security reasons) in a different classloading branch, Rome can't find the user config. I don't know Hadoop's classloading scheme (and Nutch uses Hadoop), but such a case can easily be reproduced in a servlet container if Rome is loaded by the ext/common classloader and the app by the webapp classloader.

          I think this was the reason for switching to the CCL, but it led to a new problem. If the CCL is set to the system classloader and Rome is loaded by a descendant classloader (PluginClassLoader in this case), Rome can't load its default config and ends with an NPE as above.

          I'll try to create a test for this case soon.
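
          The visibility rule behind this analysis can be reproduced with the JDK alone. The sketch below is an illustration under stated assumptions, not Rome's code: the resource name demo-rome.properties, the property feed.parsers and all method names are hypothetical. It builds a child URLClassLoader that can see a properties file, shows that a lookup through the parent (system) loader returns null and makes Properties.load throw an NPE, and shows one possible fallback that tries several loaders in turn.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class ClassLoaderVisibilitySketch {

    // Hypothetical resource name, standing in for a library's bundled default config.
    public static final String RESOURCE = "demo-rome.properties";

    /** Mimics the failing pattern: the stream reaches load() without a null check. */
    public static Properties loadUnchecked(ClassLoader loader) throws IOException {
        Properties props = new Properties();
        // getResourceAsStream returns null when the resource is not visible.
        try (InputStream in = loader.getResourceAsStream(RESOURCE)) {
            props.load(in); // throws NullPointerException when in == null
        }
        return props;
    }

    /** Tries each candidate loader in order and fails with a clear message. */
    public static Properties loadWithFallback(ClassLoader... candidates) throws IOException {
        for (ClassLoader loader : candidates) {
            if (loader == null) {
                continue;
            }
            try (InputStream in = loader.getResourceAsStream(RESOURCE)) {
                if (in != null) {
                    Properties props = new Properties();
                    props.load(in);
                    return props;
                }
            }
        }
        throw new IOException(RESOURCE + " is not visible to any candidate classloader");
    }

    /** Builds a child of the system loader whose extra classpath entry holds the resource. */
    public static URLClassLoader childLoaderWithResource() throws IOException {
        Path dir = Files.createTempDirectory("cl-sketch");
        Files.write(dir.resolve(RESOURCE), "feed.parsers=demo\n".getBytes(StandardCharsets.UTF_8));
        return new URLClassLoader(new URL[] {dir.toUri().toURL()}, ClassLoader.getSystemClassLoader());
    }

    public static void main(String[] args) throws IOException {
        ClassLoader system = ClassLoader.getSystemClassLoader();
        try (URLClassLoader child = childLoaderWithResource()) {
            // The child delegates to its parent first, then searches its own URLs.
            System.out.println("via child: " + loadUnchecked(child).getProperty("feed.parsers"));
            try {
                loadUnchecked(system); // the parent cannot see the child's resources
            } catch (NullPointerException e) {
                System.out.println("via parent only: NullPointerException");
            }
            System.out.println("with fallback: "
                    + loadWithFallback(system, child).getProperty("feed.parsers"));
        }
    }
}
```

          In the scenario described above, the system loader plays the role of the CCL and the child plays the role of the plugin classloader that loaded Rome. Trying the CCL first and the library's own classloader second would cover both the 0.9 and the 1.0 cases, though whether that ordering suits every deployment is an open question.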

          grossws Konstantin Gribov added a comment -

          Also, the patch from their bug tracker reverts the default behavior to pre-1.0, so it should fix the problem in some environments. But it looks like a kludge to me.

          davemeikle Dave Meikle added a comment -
          • Pushed to 1.11 following 1.10 release
          lewismc Lewis John McGibbney added a comment - - edited

          Committed revision 1722935 in trunk with patch from TIKA-1820

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #898 (See https://builds.apache.org/job/tika-trunk-jdk1.7/898/)
          TIKA-1820 Upgrade rome to 1.5.1 && TIKA-1516 Downgrade Rome dependency to 0.9 to avoid nasty NPE (lewismc: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1722935)

          • trunk/CHANGES.txt
          • trunk/tika-bundle/pom.xml
          • trunk/tika-parsers/pom.xml
          • trunk/tika-parsers/src/main/java/org/apache/tika/parser/feed/FeedParser.java

            People

            • Assignee: Lewis John McGibbney (lewismc)
            • Reporter: Lewis John McGibbney (lewismc)
            • Votes: 0
            • Watchers: 5
