  Apache Any23 (Retired) / ANY23-280

Refactor ContentExtractor to improve extraction flexibility

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.1
    • Fix Version/s: 2.2
    • Component/s: core, extractors
    • Labels: None

    Description

      As discussed on ANY23-247, the ContentExtractor is simply not fit for purpose. The problem was discovered and its cause has plagued our builds ever since. Any extractor that implements BaseRDFExtractor is based on Extractor.ContentExtractor and hence works off an 'unfixed' raw data stream, as opposed to a more flexible model such as the TagSoupDOMExtractor.
      This issue should refactor the RDF extractors to allow more flexibility and to avoid the problems we encounter with the strict SAX parsing logic.

            People

              Assignee: Lewis John McGibbney
              Reporter: Lewis John McGibbney
              Votes: 1
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved: