Tika / TIKA-106

Remove dependency on Jakarta ORO - use JDK 1.4 Regex

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1-incubating
    • Component/s: general
    • Labels: None

    Description

      Jakarta ORO is used in only one place in Tika: the extract() method of RegexUtils, which is itself called from only one place, ParserPostProcessor. JDK 1.4 introduced built-in regular expression support (java.util.regex), so changing RegexUtils to use it would remove the need for Jakarta ORO as a dependency.
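
      As a rough illustration of the proposed change, a regex-based extraction using only java.util.regex might look like the sketch below. The class name, the extract() signature, and the URL pattern are all hypothetical and are not taken from Tika's RegexUtils or Nutch's OutlinkExtractor. The sketch uses generics for readability, which need Java 5 or later; the Pattern and Matcher calls themselves have been available since JDK 1.4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: not the actual Tika RegexUtils class.
class RegexExtractSketch {

    // Assumed, simplified URL pattern for illustration only.
    static final Pattern URL_PATTERN =
            Pattern.compile("(?:https?|ftp)://[^\\s<>\"']+", Pattern.CASE_INSENSITIVE);

    // Returns every substring of content matched by pattern, in order of appearance.
    // A null content yields an empty list.
    static List<String> extract(String content, Pattern pattern) {
        List<String> matches = new ArrayList<String>();
        if (content == null) {
            return matches;
        }
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "see http://example.org/a and ftp://example.org/b here";
        for (String url : extract(text, URL_PATTERN)) {
            System.out.println(url);
        }
    }
}
```

      Because Pattern instances are immutable and thread-safe, compiling the pattern once into a static field avoids recompiling it on every call.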

      From the comments in RegexUtils it appears that this code was copied from Nutch's OutlinkExtractor [1]. Nutch made a similar move back in March in r516754 [2], but it was reverted the next day in r517015 [3]. I couldn't find anything on the Nutch dev list to explain the revert, except possibly this post: http://tinyurl.com/2s2y9r

      [1] http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java
      [2] http://svn.apache.org/viewvc?view=rev&revision=516754
      [3] http://svn.apache.org/viewvc?view=rev&revision=517015

          People

            Assignee: Jukka Zitting (jukkaz)
            Reporter: Niall Pemberton (niallp)