  1. ManifoldCF
  2. CONNECTORS-1500

HTML Extractor transformation connector contribution


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: ManifoldCF 2.9.1
    • Fix Version/s: ManifoldCF 2.10
    • Component/s: None
    • Labels: None

    Description

      Hi,

      I developed a transformation connector based on Jsoup. Its goal is simply to let you choose an enclosing tag in an HTML document from which the text is extracted. Inside that tag, the connector lets you remove the subparts you do not want, for example all tags of the declared types or tags with specific attribute names.
      The code is released under the Apache License, Version 2.0, and is attached.

      It still needs some work, including code refactoring, class renaming, and unit tests, which I can do if you are interested in the code.
      The documentation is here:

      https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector

       

      It does not use any libraries other than those already present in the MCF project. It relies on the Jsoup library in the lib folder.

      Best regards,

      Olivier
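
The behaviour described in this report (pick one enclosing tag, then drop unwanted subparts before extracting text) can be illustrated with a minimal sketch. This is not the attached connector code, which is Java and based on Jsoup; it uses only Python's standard `html.parser` module, and the names `EnclosedTextExtractor`, `extract_text`, `excluded_tags`, and `excluded_classes` are hypothetical. It also assumes that "attribute names" means CSS class values and that the markup is reasonably well formed.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change the depth counters.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class EnclosedTextExtractor(HTMLParser):
    """Collect text found inside one enclosing tag, skipping excluded subtrees.

    Hypothetical helper for illustration; assumes well-formed markup.
    """

    def __init__(self, enclosing_tag, excluded_tags=(), excluded_classes=()):
        super().__init__(convert_charrefs=True)
        self.enclosing_tag = enclosing_tag
        self.excluded_tags = set(excluded_tags)
        self.excluded_classes = set(excluded_classes)
        self.inside_depth = 0  # > 0 while inside the enclosing tag
        self.skip_depth = 0    # > 0 while inside an excluded subtree
        self.chunks = []

    def _is_excluded(self, tag, attrs):
        # Assumption: exclusion by attribute is modelled as a match on class values.
        classes = (dict(attrs).get("class") or "").split()
        return (tag in self.excluded_tags
                or not self.excluded_classes.isdisjoint(classes))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1          # nested tag inside an excluded subtree
        elif self.inside_depth and self._is_excluded(tag, attrs):
            self.skip_depth = 1           # start skipping this subtree
        elif tag == self.enclosing_tag:
            self.inside_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif tag == self.enclosing_tag and self.inside_depth:
            self.inside_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.inside_depth and not self.skip_depth:
            self.chunks.append(text)


def extract_text(html, enclosing_tag, excluded_tags=(), excluded_classes=()):
    """Return the text inside enclosing_tag, minus the excluded subparts."""
    parser = EnclosedTextExtractor(enclosing_tag, excluded_tags, excluded_classes)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = """
<html><body>
<div id="header">Site menu</div>
<article>
  <h1>Title</h1>
  <p>Useful text.</p>
  <div class="ads banner">Buy now</div>
  <script>var x = 1;</script>
  <p>More text.<br>Last line.</p>
</article>
<footer>Copyright</footer>
</body></html>
"""

print(extract_text(sample, "article", {"script"}, {"ads"}))
# prints: Title Useful text. More text. Last line.
```

With Jsoup, the same idea would more naturally be expressed with CSS selectors for both the enclosing element and the parts to remove; the depth counters here only stand in for that because the standard-library parser is event based.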

      Attachments

        1. fix_englobing_tag_selection.txt
          1 kB
          Olivier Tavard
        2. global_patch.txt
          49 kB
          Olivier Tavard
        3. html_extractor_transformation_connector.txt
          116 kB
          Olivier Tavard
        4. patch_html_extractor_08_14_18.txt
          1 kB
          Olivier Tavard
        5. patch_HTML_extractor_connector_05_06_19.txt
          0.9 kB
          Olivier Tavard
        6. patch_html_extractor_fix_logs_08_10_18.txt
          3 kB
          Olivier Tavard


            People

              Assignee: Karl Wright (kwright@metacarta.com)
              Reporter: Olivier Tavard (olivierfl)
              Votes: 0
              Watchers: 3
