  1. Tika
  2. TIKA-2362

Skipping Header and Footer data from documents

    Details

    • Type: Wish
    • Status: Open
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: general, handler
    • Labels: None

      Description

      Is there a way to skip header and footer data when extracting text from documents (pdf, docx, doc, odt)?

        Activity

        tallison@mitre.org Tim Allison added a comment -

        There isn't, and it shouldn't be hard to add. It probably won't make it into 1.15 (unless another dev wants to take this), but it should land shortly thereafter.

        mujahidateeb Mujahid Ateeb Khan added a comment -

        Is there any alternative way to skip headers and footers in the current version?

        ThejanWijesinghe Thejan Wijesinghe added a comment -

        Can't we use regular expressions to detect headers & footers in a document?

        gagravarr Nick Burch added a comment -

        On the whole, the headers and footers should be in their own div tags with sensible-sounding names. As long as you're working at the XHTML level, you should be able to filter those out with an XPath content handler. (You can then turn that back into plain text later if you want.)
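A minimal sketch of the filtering idea above, using only the JDK's SAX classes rather than any Tika-specific API: a plain ContentHandler that collects text but skips every element whose class attribute is header or footer, whatever the tag name (div or p). The class name HeaderFooterSkippingHandler, the extractText helper and the two class values are assumptions for illustration; inspect the XHTML your Tika version actually emits before relying on them.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects text from XHTML, skipping any element whose class is in SKIPPED. */
final class HeaderFooterSkippingHandler extends DefaultHandler {
    // Assumed class names; they are not confirmed for every format.
    private static final Set<String> SKIPPED = Set.of("header", "footer");

    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // > 0 while inside a skipped element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (skipDepth > 0) {
            skipDepth++; // nested element inside a skipped one
        } else {
            String cls = atts.getValue("class");
            if (cls != null && SKIPPED.contains(cls)) {
                skipDepth = 1;
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = localName.isEmpty() ? qName : localName;
        if (skipDepth > 0) {
            skipDepth--;
        } else if ("p".equals(name) || "div".equals(name)) {
            text.append('\n'); // keep block boundaries as line breaks
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    String getText() {
        return text.toString().trim();
    }

    /** Hypothetical helper: runs the handler over an XHTML string. */
    static String extractText(String xhtml) {
        HeaderFooterSkippingHandler handler = new HeaderFooterSkippingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
        return handler.getText();
    }
}
```

Because it is an ordinary org.xml.sax.ContentHandler, an instance could in principle be handed to a parser that emits XHTML SAX events, and getText() read afterwards. Note that it would also drop any body text that ends up under class header, which is the problem reported later in this thread.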

        mujahidateeb Mujahid Ateeb Khan added a comment - - edited

        @Nick Burch Yes, I tried that method using the XHTML handler, but the headers are not in div tags. They are in p tags with class header, and some body data also ends up in p tags with class header.

        gagravarr Nick Burch added a comment -

        Which format(s) are you having that problem with? Is that all documents of that format, or just one?

        mujahidateeb Mujahid Ateeb Khan added a comment -

        I tried with the odt, doc, docx and pdf formats.

        tallison@mitre.org Tim Allison added a comment -

        I added configurability for doc, docx, xls and xlsx.

        PDFs are their own ball of wax...they store text with x/y coordinates on a page so we'd have to apply heuristics or, gasp, machine learning to guess the headers/footers. This is not insurmountable, but I think this is beyond the scope of Tika.

        If you attach an example odt, I can look into adding configurability for that as well.
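To make the PDF point concrete, here is a rough sketch of one possible heuristic, written outside Tika and not part of it. It does not use x/y coordinates; it swaps in a simpler repeated-line rule: if the first or last line of a page recurs (ignoring digits such as page numbers) on at least a given share of pages, it is treated as a header or footer. The class RepeatedLineStripper, its strip method, the minShare parameter and the per-page list-of-lines input are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Heuristic: a first or last line that repeats on most pages is dropped. */
final class RepeatedLineStripper {

    /**
     * @param pages    text of each page, already split into lines (assumed input)
     * @param minShare share of pages (0..1) a line must recur on to be dropped
     */
    static List<List<String>> strip(List<List<String>> pages, double minShare) {
        Map<String, Integer> firstCounts = new HashMap<>();
        Map<String, Integer> lastCounts = new HashMap<>();
        for (List<String> page : pages) {
            if (page.isEmpty()) {
                continue;
            }
            firstCounts.merge(normalize(page.get(0)), 1, Integer::sum);
            lastCounts.merge(normalize(page.get(page.size() - 1)), 1, Integer::sum);
        }

        // Require at least two occurrences so single-page documents are left alone.
        int threshold = Math.max(2, (int) Math.ceil(minShare * pages.size()));

        List<List<String>> result = new ArrayList<>();
        for (List<String> page : pages) {
            List<String> kept = new ArrayList<>(page);
            if (!kept.isEmpty()
                    && firstCounts.getOrDefault(normalize(kept.get(0)), 0) >= threshold) {
                kept.remove(0); // likely header
            }
            if (!kept.isEmpty()
                    && lastCounts.getOrDefault(normalize(kept.get(kept.size() - 1)), 0) >= threshold) {
                kept.remove(kept.size() - 1); // likely footer
            }
            result.add(kept);
        }
        return result;
    }

    // Replace digit runs so that "Page 1" and "Page 2" compare equal.
    private static String normalize(String line) {
        return line.trim().replaceAll("\\d+", "#");
    }
}
```

A rule like this will misfire on documents whose body text legitimately starts or ends each page with the same line, which is part of why guessing headers and footers in PDFs is hard to do reliably.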

        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build Tika-trunk #1290 (See https://builds.apache.org/job/Tika-trunk/1290/)
        TIKA-2362 – Allow users to turn off extraction of headers and footers (tallison: https://github.com/apache/tika/commit/5cbaed87235c2cee49c9d4fa15d84158d000e986)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
        • (edit) CHANGES.txt
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFBExcelExtractorDecorator.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
        tallison@mitre.org Tim Allison added a comment -

        Builds are failing because of this: https://issues.jenkins-ci.org/browse/JENKINS-43446?

        mujahidateeb Mujahid Ateeb Khan added a comment -

        @Tim Allison In odt files, the header data is displayed at the end of the extracted string.


          People

          • Assignee: Tim Allison (tallison@mitre.org)
          • Reporter: Mujahid Ateeb Khan (mujahidateeb)
          • Votes: 0
          • Watchers: 5
