  1. Nutch
  2. NUTCH-20

Extract urls from plain texts

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8
    • Component/s: fetcher
    • Labels: None

    Description

      Some parsers, e.g. the Word parser, return no Outlinks.
      The attached class, OutlinkExtractor, extracts (absolute) hyperlinks from a plain String (the content) and generates Outlinks from them.
      This would be very useful for parsers that have no explicit extraction of hyperlinks.

      Example:

      Outlink[] links = OutlinkExtractor.getOutlinks("Nutch is located at http://www.apache.org and ...");

      This returns an array of Outlinks containing a single element, "http://www.apache.org".
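
The general idea of pulling absolute URLs out of plain text can be sketched with a regular expression. This is only a minimal illustration, not the attached OutlinkExtractor: the class name PlainTextUrlExtractor, the method extractUrls, the URL pattern, and the trailing-punctuation handling are all assumptions made for this sketch, and it returns plain Strings rather than Nutch Outlink objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for illustration only; not the attached OutlinkExtractor. */
public class PlainTextUrlExtractor {

    // Assumed pattern: a scheme (http, https, ftp), "://", then every character
    // up to the next whitespace or common delimiter.
    private static final Pattern URL_PATTERN = Pattern.compile(
        "\\b(?:https?|ftp)://[^\\s<>\"'()]+", Pattern.CASE_INSENSITIVE);

    /** Returns the absolute URLs found in plainText, in order of appearance. */
    public static List<String> extractUrls(String plainText) {
        List<String> urls = new ArrayList<>();
        if (plainText == null) {
            return urls;
        }
        Matcher matcher = URL_PATTERN.matcher(plainText);
        while (matcher.find()) {
            // Drop sentence punctuation that directly follows the URL (assumption).
            String url = matcher.group().replaceAll("[.,;:!?]+$", "");
            urls.add(url);
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(
            extractUrls("Nutch is located at http://www.apache.org and ..."));
    }
}
```

A parser without its own link extraction could pass its extracted text through such a method and wrap each result in an Outlink.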


      transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=1109328&group_id=59548&atid=491356
      submitted by: Stephan Strittmatter

      Attachments

        1. OutlinkExtractor.java
          7 kB
          Stephan Strittmatter
        2. OutlinkExtractor.java
          7 kB
          Stephan Strittmatter
        3. OutlinkExtractor.java
          6 kB
          Stephan Strittmatter
        4. patch.txt
          7 kB
          Stephan Strittmatter
        5. TestOutlink.java
          3 kB
          Stephan Strittmatter
        6. TestOutlink.java
          2 kB
          Stephan Strittmatter

          People

            Assignee: Unassigned
            Reporter: Stefan Groschupf (joa23)
            Votes: 3
            Watchers: 1
