[NUTCH-20] Extract urls from plain texts - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Trivial
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.8
Component/s: fetcher
Labels: None

Description

Some parsers return no Outlinks, e.g. the Word parser.
This class extracts (absolute) hyperlinks from a plain String (content) and generates outlinks from them.
This would be very useful for parsers that have no explicit extraction of hyperlinks.

Example:

Outlink[] links = OutlinkExtractor.getOutlinks("Nutch is located at http://www.apache.org and ...");

This will return an array of Outlinks containing the single element "http://www.apache.org".
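
The attached OutlinkExtractor.java is not reproduced in this issue text, so the following is only a minimal sketch of how absolute URLs might be pulled out of plain text with java.util.regex. The class name PlainTextUrlSketch, the method extractUrls, and the URL pattern are assumptions made for illustration; the sketch returns plain Strings rather than Nutch Outlink objects and does not claim to match the attached implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is NOT the attached OutlinkExtractor.java.
class PlainTextUrlSketch {

    // Assumed pattern: an http, https, or ftp scheme followed by characters
    // that are not whitespace, angle brackets, or quotes.
    private static final Pattern URL_PATTERN =
        Pattern.compile("\\b(?:https?|ftp)://[^\\s<>\"']+", Pattern.CASE_INSENSITIVE);

    // Returns the distinct absolute URLs found in the text, in order of appearance.
    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        if (text == null) {
            return urls;
        }
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            String url = stripTrailingPunctuation(matcher.group());
            if (!urls.contains(url)) {
                urls.add(url);
            }
        }
        return urls;
    }

    // Drops sentence punctuation that directly follows a URL in running text.
    private static String stripTrailingPunctuation(String url) {
        int end = url.length();
        while (end > 0 && ".,;:!?)]".indexOf(url.charAt(end - 1)) >= 0) {
            end--;
        }
        return url.substring(0, end);
    }

    public static void main(String[] args) {
        // Prints: [http://www.apache.org]
        System.out.println(extractUrls("Nutch is located at http://www.apache.org and ..."));
    }
}
```

A parser without its own link extraction could, under these assumptions, pass its extracted text to such a method and wrap each returned String in an Outlink.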

transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=1109328&group_id=59548&atid=491356
submitted by: Stephan Strittmatter

Attachments

OutlinkExtractor.java    02/Aug/05 20:58    7 kB    Stephan Strittmatter
OutlinkExtractor.java    22/Apr/05 21:17    7 kB    Stephan Strittmatter
OutlinkExtractor.java    29/Mar/05 17:32    6 kB    Stephan Strittmatter
patch.txt                29/Mar/05 05:12    7 kB    Stephan Strittmatter
TestOutlink.java         22/Apr/05 21:17    3 kB    Stephan Strittmatter
TestOutlink.java         29/Mar/05 17:32    2 kB    Stephan Strittmatter

People

Assignee: Unassigned

Reporter: Stefan Groschupf

Votes: 3

Watchers: 1

Dates

Created: 26/Mar/05 23:35

Updated: 20/Aug/05 06:21

Resolved: 20/Aug/05 06:22