TIKA-1699: Integrate the GROBID PDF extractor in Tika


    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.11
    • Component/s: parser

      Description

      GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting, parsing, and restructuring raw documents such as PDFs into structured TEI-encoded documents, with a particular focus on technical and scientific publications.
      It has a Java API that can be used to augment PDF parsing for journal articles and to extract additional metadata about a paper, such as authors, publication, and citations.

      It would be nice to have this integrated into Tika. I have tried it locally and will issue a pull request soon.
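
To illustrate the kind of metadata described above, here is a minimal sketch of pulling a title and author names out of a TEI-encoded header using only the JDK's XML and XPath APIs. It does not call GROBID or Tika. The class name TeiHeaderSketch, its helper methods, and the SAMPLE_TEI fragment are all hypothetical: the fragment is a hand-written, simplified TEI header for illustration, and real GROBID output may be structured differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of GROBID or Tika.
class TeiHeaderSketch {

    // Hand-written, simplified TEI header used only as sample input (an assumption).
    static final String SAMPLE_TEI =
        "<TEI xmlns=\"http://www.tei-c.org/ns/1.0\">"
        + "<teiHeader><fileDesc>"
        + "<titleStmt><title level=\"a\" type=\"main\">An Example Paper</title></titleStmt>"
        + "<sourceDesc><biblStruct><analytic>"
        + "<author><persName><forename type=\"first\">Jane</forename>"
        + "<surname>Doe</surname></persName></author>"
        + "<author><persName><forename type=\"first\">John</forename>"
        + "<surname>Roe</surname></persName></author>"
        + "</analytic></biblStruct></sourceDesc>"
        + "</fileDesc></teiHeader>"
        + "</TEI>";

    // Parse a TEI string into a namespace-aware DOM document.
    static Document parse(String teiXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(teiXml)));
    }

    // Read the title from titleStmt; local-name() sidesteps TEI namespace binding.
    static String title(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(
            "normalize-space(//*[local-name()='titleStmt']/*[local-name()='title'])", doc);
    }

    // Collect "forename surname" for every author/persName element.
    static List<String> authors(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList names = (NodeList) xpath.evaluate(
            "//*[local-name()='author']/*[local-name()='persName']",
            doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < names.getLength(); i++) {
            result.add(xpath.evaluate(
                "normalize-space(concat(*[local-name()='forename'], ' ', "
                + "*[local-name()='surname']))",
                names.item(i)));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE_TEI);
        System.out.println("Title: " + title(doc));
        System.out.println("Authors: " + authors(doc));
    }
}
```

In an actual integration, values like these would presumably be copied into the parser's metadata output rather than printed; how that mapping is done is not specified in this issue.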


            People

            • Assignee: Chris A. Mattmann (chrismattmann)
            • Reporter: Sujen Shah (sujenshah)

