[CONNECTORS-1550] HTML Tag mapping - ASF JIRA

Details

Type: Wish
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: ManifoldCF 2.10
Fix Version/s: None
Component/s: Elastic Search connector, Tika extractor, Web connector
Labels: None

Description

I’ll be crawling a website with the standard Web connector. I want to extract only certain HTML tags, such as <h1>, <h2> and <p>.
I’ve set up an HTML extractor transformation connector and the internal Tika transformation connector, but I can’t find any place to map these tags to the output.

Do I have to write my own transformation connector to extract the content of these tags? Or is there a built-in solution?
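
To illustrate the kind of extraction meant here, the following is a minimal standalone sketch using only Python's standard `html.parser` module. It is not ManifoldCF, Tika, or Elasticsearch connector code; the names `TagTextExtractor` and `extract_tag_fields` are hypothetical, and it assumes reasonably well-formed markup in which the wanted tags are explicitly closed and not nested inside each other.

```python
from html.parser import HTMLParser

# Tags named in this issue; any other set of tag names could be passed instead.
WANTED_TAGS = {"h1", "h2", "p"}


class TagTextExtractor(HTMLParser):
    """Collect the text content of selected tags, grouped by tag name.

    Hypothetical helper for illustration only; assumes the wanted tags are
    explicitly closed and not nested inside one another.
    """

    def __init__(self, wanted=WANTED_TAGS):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted)
        self.fields = {tag: [] for tag in self.wanted}
        self._current = None  # wanted tag we are currently inside, if any
        self._buffer = []     # text pieces seen inside the current tag

    def handle_starttag(self, tag, attrs):
        # Start collecting only when entering a wanted tag at top level.
        if self._current is None and tag in self.wanted:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Text inside inline children (e.g. <em>) is collected as well.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.fields[tag].append(text)
            self._current = None


def extract_tag_fields(html, wanted=WANTED_TAGS):
    """Return a dict mapping each wanted tag name to a list of its texts."""
    parser = TagTextExtractor(wanted)
    parser.feed(html)
    parser.close()
    return parser.fields
```

Each key of the returned dictionary (`h1`, `h2`, `p`) could then serve as a separate output field, which is the mapping being asked about.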

People

Assignee: Unassigned
Reporter: Donald Van den Driessche
Votes: 0
Watchers: 2

Dates

Created: 19/Oct/18 11:25
Updated: 19/Oct/18 11:31
Resolved: 19/Oct/18 11:31