Tika / TIKA-2391

Extract <script> elements in html as "attachment" type MACRO like we do in the PDFParser

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.16
    • Component/s: None
    • Labels: None
    Attachments

    1. proposed_output.txt (28 kB, Tim Allison)
    2. testScripts.htm (53 kB, Tim Allison)
    3. TIKA_2391____first_draft.patch (8 kB, Tim Allison)

        Activity

        tallison@mitre.org Tim Allison added a comment -

        I downloaded the html page suggested by Jayesh Shende on TIKA-2382, and I've dumped the proposed output in the RecursiveParserWrapper format.

        There are 10 metadata objects. The first contains the main page, and then there are 9 scripts.

        I'm not sure what we should do with the src= info when a script relies on an external resource rather than inlining the code.

        Dumb question: what other types besides js can we have? Should we have a mapping from type= to mimetype that we can pass in to the child's metadata?
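
        A minimal sketch of what such a type= to mimetype mapping might look like, in plain Java with no Tika dependency. The class name ScriptTypeMapper, the table entries, and the fallback values are all assumptions for illustration; none of this is taken from the patch. It treats a missing or empty type= as JavaScript, strips any parameters such as charset, and passes through unrecognized values that look like a media type.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of Tika: maps a script type= value to a mime type. */
final class ScriptTypeMapper {

    // Assumed defaults; a real implementation might choose differently.
    static final String DEFAULT_SCRIPT_MIME = "application/javascript";
    static final String UNKNOWN_MIME = "application/octet-stream";

    // Illustrative table only; keys are lowercased type= values.
    private static final Map<String, String> KNOWN = new HashMap<>();
    static {
        KNOWN.put("text/javascript", "application/javascript");
        KNOWN.put("application/javascript", "application/javascript");
        KNOWN.put("module", "application/javascript");
        KNOWN.put("application/json", "application/json");
        KNOWN.put("application/ld+json", "application/ld+json");
    }

    static String toMimeType(String typeAttr) {
        // No type= attribute, or an empty one: assume JavaScript.
        if (typeAttr == null || typeAttr.trim().isEmpty()) {
            return DEFAULT_SCRIPT_MIME;
        }
        String t = typeAttr.trim().toLowerCase(Locale.ROOT);
        // Drop parameters such as "; charset=utf-8".
        int semi = t.indexOf(';');
        if (semi >= 0) {
            t = t.substring(0, semi).trim();
        }
        String mapped = KNOWN.get(t);
        if (mapped != null) {
            return mapped;
        }
        // Unknown value: pass it through if it looks like a media type.
        return t.contains("/") ? t : UNKNOWN_MIME;
    }

    public static void main(String[] args) {
        System.out.println(toMimeType(null));
        System.out.println(toMimeType("Text/JavaScript; charset=utf-8"));
        System.out.println(toMimeType("text/x-handlebars-template"));
    }
}
```

        The returned string could then be set as the content type on the child's metadata before it is handed to the embedded document extractor.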

        For now, we're still ignoring <style> elements.

        I'd want to require users to turn this behavior on via an HTMLParserConfig.

        Big question: what do you think? Are there other areas for improvement?

        tallison@mitre.org Tim Allison added a comment -

        draft of patch

        tallison@mitre.org Tim Allison added a comment -

        For now, this requires that there be content in the <script></script> for there to be an embedded document. If there's a src attribute, as in <script src="something.js"></script>, this is represented in the xhtml as it was before, but no MACRO embedded document/attachment is created.
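
        The rule in this comment can be sketched in plain Java as a predicate over a script element. The Script class and both method names are hypothetical stand-ins, not the HtmlHandler code from the patch. Two points are assumptions: whitespace-only content is treated as empty, and the decision looks only at the content, so a script with both a src attribute and inline code would still count.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the "inline content only" rule; not the patch itself. */
final class ScriptEmbedSketch {

    /** Stand-in for a parsed script element. */
    static final class Script {
        final String src;   // value of src=, or null if absent
        final String body;  // text between the opening and closing tags

        Script(String src, String body) {
            this.src = src;
            this.body = body;
        }
    }

    /** True when the element would yield an embedded document under this sketch. */
    static boolean producesEmbeddedDocument(Script s) {
        // Assumption: whitespace-only content counts as empty.
        return s.body != null && !s.body.trim().isEmpty();
    }

    /** One metadata object for the page itself, plus one per inline script. */
    static int countMetadataObjects(List<Script> scripts) {
        int count = 1; // the container page
        for (Script s : scripts) {
            if (producesEmbeddedDocument(s)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Script> scripts = Arrays.asList(
                new Script(null, "var x = 1;"),      // inline: embedded
                new Script("something.js", ""),      // src only: not embedded
                new Script(null, "alert('hi');"));   // inline: embedded
        System.out.println(countMetadataObjects(scripts)); // prints 3
    }
}
```

        Under the same counting, a page with 9 inline scripts gives 10 metadata objects, which matches the proposed output described earlier in this thread.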

        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build Tika-trunk #1293 (See https://builds.apache.org/job/Tika-trunk/1293/)
        TIKA-2391 – extract <script> elements as embedded documents (tallison: https://github.com/apache/tika/commit/132d3e7ff0387234c65cfc2983dd2b6ba59ff38b)

        • (add) tika-parsers/src/test/resources/org/apache/tika/parser/html/tika-config.xml
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlParser.java
        • (edit) CHANGES.txt
        • (add) tika-core/src/main/java/org/apache/tika/metadata/HTML.java

          People

          • Assignee: tallison@mitre.org Tim Allison
          • Reporter: tallison@mitre.org Tim Allison
          • Votes: 0
          • Watchers: 3
