Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      The PDF parser currently extracts and outputs document content as a single string. PDFBox could be used to structure the output at least down to the page and paragraph level (it is not clear how accurate paragraph detection would be).

        Activity

        David vandendriessche added a comment -

        At the moment I'm using PDFBox to upload my data to Solr (search engine), since it doesn't support page extraction.

        I'm pretty sure that if Tika gets this (Solr uses Tika if you use the extract handler), Solr might be changed so that it can return page hits for PDFs.

        Malik Hemani added a comment -

        Since PDFTextStripper can extract at page level, here is one possible solution that can let Tika extract text for a single page or a range of pages (excuse the formatting lost in translation):

        1. Add a new method to the Parser interface:
        void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context, int startPage, int endPage)
                throws IOException, SAXException, TikaException;

        2. Implement the method in the PDFParser class:
        public void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context, int startPage, int endPage)
                throws IOException, SAXException, TikaException {
            PDDocument pdfDocument = PDDocument.load(stream, true);
            try {
                if (pdfDocument.isEncrypted()) {
                    try {
                        String password = metadata.get(PASSWORD);
                        if (password == null) {
                            password = "";
                        }
                        pdfDocument.decrypt(password);
                    } catch (Exception e) {
                        // Ignore
                    }
                }
                metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
                extractMetadata(pdfDocument, metadata);
                PDF2XHTML.process(pdfDocument, handler, metadata, startPage, endPage);
            } finally {
                pdfDocument.close();
            }
        }

        3. Add a new method in the PDF2XHTML class:
        public static void process(
                PDDocument document, ContentHandler handler, Metadata metadata, int startPage, int endPage)
                throws SAXException, TikaException {
            try {
                // Extract text using a dummy Writer as we override the
                // key methods to output to the given content handler.
                PDF2XHTML pdf2XHTML = new PDF2XHTML(handler, metadata);

                // Set start and end page
                if (startPage > 0) {
                    pdf2XHTML.setStartPage(startPage);
                }
                if (endPage > 0) {
                    pdf2XHTML.setEndPage(endPage);
                }

                pdf2XHTML.writeText(document, new Writer() {
                    @Override
                    public void write(char[] cbuf, int off, int len) {
                    }
                    @Override
                    public void flush() {
                    }
                    @Override
                    public void close() {
                    }
                });
            } catch (IOException e) {
                if (e.getCause() instanceof SAXException) {
                    throw (SAXException) e.getCause();
                } else {
                    throw new TikaException("Unable to extract PDF content", e);
                }
            }
        }

        4. Example of a call to extract page 2 of a PDF:
        ...
        int startPage = 2;
        int endPage = 2;
        PDFParser parser = new PDFParser();
        parser.parse(input, textHandler, metadata, new ParseContext(), startPage, endPage);
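
        The error handling in step 3 can be looked at in isolation. The following is a minimal sketch using only JDK classes, not the Tika code itself: ExtractionException is a hypothetical stand-in for TikaException, TextWork and runDiscardingOutput are made-up names, and Writer.nullWriter() (Java 11+) is swapped in for the anonymous no-op Writer. It shows a SAXException that was wrapped in an IOException being unwrapped and rethrown, while any other IOException is reported as an extraction failure.

```java
import java.io.IOException;
import java.io.Writer;
import org.xml.sax.SAXException;

class UnwrapSketch {
    /** Hypothetical stand-in for TikaException, so the sketch needs no Tika jar. */
    static class ExtractionException extends Exception {
        ExtractionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical callback representing the text-stripping work. */
    interface TextWork {
        void run(Writer sink) throws IOException;
    }

    static void runDiscardingOutput(TextWork work)
            throws SAXException, ExtractionException {
        // Writer.nullWriter() discards everything written to it (Java 11+).
        try (Writer sink = Writer.nullWriter()) {
            work.run(sink);
        } catch (IOException e) {
            // A SAXException raised by the content handler arrives wrapped
            // in an IOException; rethrow the original.
            if (e.getCause() instanceof SAXException) {
                throw (SAXException) e.getCause();
            }
            throw new ExtractionException("Unable to extract PDF content", e);
        }
    }
}
```

        Rethrowing the original SAXException lets the caller's content handler errors surface unchanged instead of being reported as PDF extraction failures.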

        Gregory Kanevsky added a comment -

        The issue with 'sortByPosition' is addressed by TIKA-612.

        Gregory Kanevsky added a comment - edited

        This issue seems to be partially fixed. PDF2XHTML generates <div><p> and </p></div> to start and end each page.
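
        To illustrate how a consumer could use those per-page <div> elements, here is a minimal sketch of a SAX handler that collects the text of each page separately. It uses only JDK classes; PageCollector and split are hypothetical names, and the sketch assumes that every <div> in the output marks exactly one page and that <div> elements are not nested.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PageCollector extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    // Non-null only while the parser is inside a page <div>.
    private StringBuilder current;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Assumption: each <div> opens one page and divs are never nested.
        if ("div".equals(qName) && current == null) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("div".equals(qName) && current != null) {
            pages.add(current.toString().trim());
            current = null;
        }
    }

    List<String> pages() {
        return pages;
    }

    /** Convenience helper for trying the handler on an XHTML string. */
    static List<String> split(String xhtml) throws Exception {
        PageCollector collector = new PageCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.pages();
    }
}
```

        The same handler could in principle be passed to a parser as the ContentHandler, in which case pages() would hold one entry per page once parsing finishes.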

        Another part of this issue is the ordering of PDF content. PDF2XHTML extends the PDFBox PDFTextStripper to extract text. By default (for performance reasons), 'sortByPosition' mode is turned off in PDFTextStripper.

        I propose introducing an input metadata property that would turn it on if desired. I am not sure what the conventions are for defining such metadata properties (if there are any), though. The mode would be set in the PDF2XHTML constructor:

        private PDF2XHTML(ContentHandler handler, Metadata metadata)
                throws IOException {
            // Constant-first comparison avoids a NullPointerException when the property is unset.
            if ("true".equalsIgnoreCase(metadata.get("SortByPosition"))) {
                setSortByPosition(true);
            }

        ....
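
        As a side note on reading such a flag, here is a minimal sketch assuming the property arrives as a string that may be missing. A plain Map stands in for Tika's Metadata so the snippet needs only the JDK; SortFlagSketch is a hypothetical name and "SortByPosition" is the key proposed above, not an established Tika property.

```java
import java.util.Map;

class SortFlagSketch {
    // Key proposed in the comment above; not an established Tika property.
    static final String SORT_BY_POSITION = "SortByPosition";

    /**
     * Returns true only when the property is present and equals "true",
     * ignoring case. Boolean.parseBoolean(null) returns false, so a missing
     * property leaves sorting off, which matches the PDFTextStripper default.
     */
    static boolean sortByPosition(Map<String, String> metadata) {
        return Boolean.parseBoolean(metadata.get(SORT_BY_POSITION));
    }
}
```

        Treating a missing value as "off" keeps the current behaviour for callers that never set the property.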


          People

          • Assignee:
            Unassigned
            Reporter:
            Jukka Zitting
          • Votes:
            5
            Watchers:
            7
