
[TIKA-1699] Integrate the GROBID PDF extractor in Tika

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.11
    • Component/s: parser
    • Labels:

      Description

      GROBID (http://grobid.readthedocs.org/en/latest/) is a machine learning library for extracting, parsing and re-structuring raw documents such as PDFs into structured TEI-encoded documents, with a particular focus on technical and scientific publications.
      It has a Java API that can be used to augment PDF parsing for journal articles and to extract additional metadata about a paper, such as its authors, publication venue, and citations.

      It would be nice to have this integrated into Tika. I have tried it locally and will issue a pull request soon.
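
      For illustration, here is a minimal, hypothetical sketch of how the new parser might be driven through the standard Tika parse API once integrated. The class name org.apache.tika.parser.journal.JournalParser matches the parser registered in the pull request, but the no-arg constructor, the GROBID setup it needs (models and properties on the classpath), and the exact metadata keys it populates are assumptions rather than the final API.

      import java.io.InputStream;
      import java.nio.file.Files;
      import java.nio.file.Paths;

      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.ParseContext;
      import org.apache.tika.parser.journal.JournalParser;
      import org.apache.tika.sax.BodyContentHandler;

      public class JournalParserExample {
          public static void main(String[] args) throws Exception {
              JournalParser parser = new JournalParser(); // assumed no-arg constructor
              Metadata metadata = new Metadata();
              try (InputStream stream = Files.newInputStream(Paths.get("ICSE06.pdf"))) {
                  // Standard Tika parse call; the GROBID-backed parser is expected to
                  // augment the metadata for journal-style PDFs.
                  parser.parse(stream, new BodyContentHandler(), metadata, new ParseContext());
              }
              for (String name : metadata.names()) {
                  System.out.println(name + " = " + metadata.get(name));
              }
          }
      }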

        Issue Links

          Activity

          sujenshah Sujen Shah added a comment -

          Working towards publishing GROBID to Maven Central through Sonatype.

          Sonatype issue - https://issues.sonatype.org/browse/OSSRH-16837
          Grobid issue - https://github.com/kermitt2/grobid/issues/59

          githubbot ASF GitHub Bot added a comment -

          GitHub user sujen1412 opened a pull request:

          https://github.com/apache/tika/pull/55

          Fix for TIKA-1699 contributed by Sujen Shah

          Waiting for GROBID to get published to maven central.
          Sonatype issue - https://issues.sonatype.org/browse/OSSRH-16837

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/sujen1412/tika TIKA-1699

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/55.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #55


          commit 4f067107d01e99bd81a66c78163f2a4baf3f817f
          Author: Sujen Shah <sujen1412@gmail.com>
          Date: 2015-07-29T13:49:00Z

          Added grobid dependencies

          commit 323ba33816a9beabe22d351c8eac4350fa010be0
          Author: Sujen Shah <sujen1412@gmail.com>
          Date: 2015-07-29T13:49:36Z

          Registering journal parser

          commit 71cdd0970fb17aeec85469d07dc1ee6460d2f4da
          Author: Sujen Shah <sujen1412@gmail.com>
          Date: 2015-07-29T13:54:07Z

          Code for integrating GROBID Parser in to Tika

          commit b6e9f8724b308e0c830f73702994cbe1c5932cd2
          Author: Sujen Shah <sujen1412@gmail.com>
          Date: 2015-07-29T13:58:08Z

          Grobid properties files

          commit 57b70ce38a77cc349588d2f513938bc4f18d4ad4
          Author: Sujen Shah <sujen1412@gmail.com>
          Date: 2015-07-29T13:58:58Z

          Added unit test for journal parser

          Corrected formatting

          Corrected formatting

          Corrected formatting


          chrismattmann Chris A. Mattmann added a comment -

          Sujen, please update the PR with my 2 comments/updates, and also please let me know when the rest of the JAR files are on Maven Central; then I think we can integrate this. We should also make a custom tika-config to override the default PDF parser, or better yet somehow combine it with this. That's one thing I thought too - would it make sense to combine these, or are they really separate parsers? It seems like they should be separate because they potentially have overlapping keys, right?

          We also need to make a page on the Tika wiki that describes how to install Grobid: http://wiki.apache.org/tika/GrobidParser maybe?
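
          As a rough sketch of the custom tika-config idea above (these are assumptions, not the configuration that will ship with the PR), a config file could exclude application/pdf from the default parser and route it to the JournalParser instead; that config can then be passed to tika-server with --config or loaded programmatically:

          import java.io.File;
          import java.io.InputStream;
          import java.nio.file.Files;
          import java.nio.file.Paths;

          import org.apache.tika.config.TikaConfig;
          import org.apache.tika.metadata.Metadata;
          import org.apache.tika.parser.AutoDetectParser;
          import org.apache.tika.parser.ParseContext;
          import org.apache.tika.sax.BodyContentHandler;

          public class GrobidConfigExample {
              public static void main(String[] args) throws Exception {
                  // Hypothetical tika-config.xml routing PDFs to the GROBID-backed parser:
                  //
                  //   <properties>
                  //     <parsers>
                  //       <parser class="org.apache.tika.parser.DefaultParser">
                  //         <mime-exclude>application/pdf</mime-exclude>
                  //       </parser>
                  //       <parser class="org.apache.tika.parser.journal.JournalParser">
                  //         <mime>application/pdf</mime>
                  //       </parser>
                  //     </parsers>
                  //   </properties>
                  TikaConfig config = new TikaConfig(new File("tika-config.xml"));
                  AutoDetectParser parser = new AutoDetectParser(config);

                  Metadata metadata = new Metadata();
                  try (InputStream stream = Files.newInputStream(Paths.get("ICSE06.pdf"))) {
                      parser.parse(stream, new BodyContentHandler(), metadata, new ParseContext());
                  }
                  System.out.println(metadata);
              }
          }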

          chrismattmann Chris A. Mattmann added a comment -

          I got this working!

          Starting Tika Server

          java -Dorg.apache.tika.service.error.warn=true -classpath $HOME/git/grobidparser-resources/:$HOME/src/tika-server/target/tika-server-1.11-SNAPSHOT.jar:$HOME/grobid/lib/\* org.apache.tika.server.TikaServerCli --config tika-config.xml
          

          cURL command to test

          curl -T $HOME/git/grobid/papers/ICSE06.pdf -H "Content-Disposition: attachment;filename=ICSE06.pdf" http://localhost:9998/rmeta | python -mjson.tool
          

          Output

          [
              {
                  "Author": "End User Computing Services",
                  "Company": "ACM",
                  "Content-Type": "application/pdf",
                  "Creation-Date": "2006-02-15T21:13:58Z",
                  "Last-Modified": "2006-02-15T21:16:01Z",
                  "Last-Save-Date": "2006-02-15T21:16:01Z",
                  "SourceModified": "D:20060215211344",
                  "X-Parsed-By": [
                      "org.apache.tika.parser.CompositeParser",
                      "org.apache.tika.parser.journal.JournalParser"
                  ],
                  "X-TIKA:content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nProceedings Template - WORD\n\n\nA Software Architecture-Based Framework for Highly \nDistributed and Data Intensive Scientific Applications \n\n \nChris A. Mattmann1, 2        Daniel J. Crichton1        Nenad Medvidovic2        Steve Hughes1 \n\n \n1Jet Propulsion Laboratory \n\nCalifornia Institute of Technology \nPasadena, CA 91109, USA \n\n{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov \n\n2Computer Science Department \nUniversity of Southern California  \n\nLos Angeles, CA 90089, USA \n{mattmann,neno}@usc.edu \n\n \nABSTRACT \nModern scientific research is increasingly conducted by virtual \ncommunities of scientists distributed around the world. The data \nvolumes created by these communities are extremely large, and \ngrowing rapidly. The management of the resulting highly \ndistributed, virtual data systems is a complex task, characterized \nby a number of formidable technical challenges, many of which \nare of a software engineering nature.  In this paper we describe \nour experience over the past seven years in constructing and \ndeploying OODT, a software framework that supports large, \ndistributed, virtual scientific communities. We outline the key \nsoftware engineering challenges that we faced, and addressed, \nalong the way. We argue that a major contributor to the success of \nOODT was its explicit focus on software architecture. We \ndescribe several large-scale, real-world deployments of OODT, \nand the manner in which OODT helped us to address the domain-\nspecific challenges induced by each deployment.  \n\nCategories and Subject Descriptors \nD.2 Software Engineering, D.2.11 Domain Specific Architectures \n\nKeywords \nOODT, Data Management, Software Architecture. \n\n1. INTRODUCTION \nSoftware systems of today are very large, highly complex, \n\noften widely distributed, increasingly decentralized, dynamic, and \nmobile.  There are many causes behind this, spanning virtually all \nfacets of human endeavor: desired advances in education, \nentertainment, medicine, military technology, \ntelecommunications, transportation, and so on.   \n\nOne major driver of software\u2019s growing complexity is \nscientific research and exploration.  Today\u2019s scientists are solving \nproblems of until recently unimaginable complexity with the help \nof software.  They also actively and regularly collaborate with \n\ncolleagues around the world, something that has become possible \nonly relatively recently, again ultimately thanks to software. They \nare collecting, producing, sharing, and disseminating large \namounts of data, which are growing by orders of magnitude in \nvolume in remarkably short time periods. \n\nIt is this latter problem that NASA\u2019s Jet Propulsion \nLaboratory (JPL) began facing several years ago.  Until recently, \nJPL would disseminate data collected by various instruments \n(Earth-based, orbiting, and in outer space) to the interested \nscientists around the United States by \u201cburning\u201d CD-ROMs and \nmailing them via the U.S. Postal Service.  In addition to being \nslow, sequential, unidirectional, and lacking interactivity, this \nmethod was expensive, costing hundreds of thousands of dollars. \nFurthermore, the method was prone to security breaches, and the \nexact data distribution (determining which data goes to which \ndestinations) had to be calculated for each individual shipment. 
It \nhad become increasingly difficult to manage this process as the \nnumber of projects and missions, as well as involved scientists, \ngrew.  An even more critical limiting factor became the sheer \nvolume of data that the current (e.g., Planetary Data System, or \nPDS), pending (e.g., Mars Reconnaissance Orbiter, or MRO), and \nplanned (e.g., Lunar Reconnaissance Orbiter, or LRO) missions \nwould produce: from terabytes (PDS), to hundreds of terabytes \n(MRO), to petabytes or more (LRO).  Clearly, spending millions \nof dollars just to distribute the data to scientists is impractical. \n\nThis prompted NASA\u2019s Office of Space Science to explore \nconstruction of an end-to-end software framework that would \nlower the cost of distributing and managing scientific data, from \nthe inception of data at a science processing center to its ultimate \narrival on the desks of interested users. Because of increasing data \nvolumes, the framework had to be scalable and have native \nsupport for evolution to hundreds of sites and thousands of data \ntypes. Additionally, the framework had to enable the \nvirtualization of heterogeneous data (and processing) sources, and \nto address wide-scale (national and international) distribution of \ndata. The framework needed to be flexible: it needed to support \nfully automated processing of data throughout its lifecycle, while \nstill allowing interactivity and intervention from an operator when \nneeded. Furthermore because data is itself distributed across \nNASA agencies, any software framework that distributes NASA\u2019s \ndata would require the capability for tailorable levels of security \nand for varying types of users belonging to multiple \norganizations. \n\nThere were also miscellaneous issues of data ownership that \nneeded to be overcome. Ultimately, because NASA\u2019s science data \nis so distributed, the owners of data systems (e.g., a Planetary \n\n \n\nPermission to make digital or hard copies of all or part of this work for \npersonal or classroom use is granted without fee provided that copies are \nnot made or distributed for profit or commercial advantage and that \ncopies bear this notice and the full citation on the first page. To copy \notherwise, or republish, to post on servers or to redistribute to lists, \nrequires prior specific permission and/or a fee. \nICSE06\u2019, May 20\u201328, 2006, Shanghai, China. \nCopyright 2006 ACM 1-58113-000-0/00/0004\u2026$5.00. \n \n\n\n\nScience Principal Investigator) feel hard pressed to control their \ndata, as the successful operation and maintenance of their data \nsystems are essential services that they provide. As such, any \nframework that virtualizes science data sources across NASA \nshould be transparent and unobtrusive: it should enable \ndissemination and retrieval of data across data systems, each of \nwhich may have their own external interfaces and services; at the \nsame time, it should enable scientists to maintain and operate their \ndata systems independently. Finally, to lower costs, once the \nframework was built and installed, it needed to be reusable, free, \nand distributable to other NASA sites and centers for use. \n\nOver the past seven years we have designed, implemented \nand deployed a framework called OODT (Object Oriented Data \nTechnology) that has met these rigorous demands. In this paper \nwe discuss the significant software engineering challenges we \nfaced in developing OODT.  
The primary objective of the paper is \nto demonstrate how OODT\u2019s explicit software architectural basis \nenabled us to effectively address these challenges.  In particular, \nwe will detail the architectural decisions we found most difficult \nand/or critical to OODT\u2019s ultimate success. We highlight several \nrepresentative examples of OODT\u2019s use to date both at NASA \nand externally. We contrast our solution with related approaches, \nand argue that a major differentiator of this work, in addition to its \nexplicit architectural foundation, is its native support for \narchitecture-based development of distributed scientific \napplications. \n\n2. SOFTWARE ENGINEERING \nCHALLENGES \n\nTo develop OODT, we needed to address several significant \nsoftware engineering challenges, the bulk of which surfaced in \nlight of the complex data management and distribution issues \nregularly faced within a distributed, large-scale government \norganization such as NASA. In this paper we will focus on nine \nkey challenges: Complexity, Heterogeneity, Location \nTransparency, Autonomy, Dynamism, Scalability, Distribution, \nDecentralization, and Performance. \n\nComplexity \u2013 We envisioned OODT to be a large, multi-site, \nmulti-user, complex system. At the software level, complexity \nranged from understanding how to install, integrate, and manage \nthe software remotely deployed at participating organizations, to \nunderstanding how to manage information such as access \nprivileges and security credentials across both NASA and non-\nNASA sites. There were also complexities at the software \nnetworking layer, including varying firewall capabilities at each \ninstitution, and data repositories that would periodically go offline \nand needed to be remotely restarted. Just understanding the \nvarying types of data held at sites linked together via OODT was \na significant task. Even sites within the same science domain \n(e.g., planetary science) describe similar data sets in decidedly \ndifferent ways. Discerning in what ways these different data \nmodels were common and what attributes of data could be shared, \ndone away with, or amended, was a huge challenge. Finally, the \ndifferent interfaces to data, ranging from third-party, well-\nengineered database management systems, to in-house data \nsystems, ultimately to flat text file-based data was a particularly \ndifficult challenge that we had to hurdle. \n\nHeterogeneity \u2013 In order to drive down the data management \ncosts for science missions, the same OODT framework needed to \n\nspan multiple science domains. The domains initially targeted \nwere earth and planetary; this has subsequently been expanded to \nspace, biomedical sciences, and the modeling and simulation \ncommunities. As such, the same core set of OODT software \ncomponents, system designs, and implementation-level facilities \nhad to work across widely varying science domains.  \n\nThe data management processes within the organizations that \nuse OODT also added to its heterogeneity. For instance, OODT \ncomponents needed to have interfaces with end users and support \ninteractive sessions, but also with scientific instruments, which \nmost likely were automatic and non-interactive. Scientific \ninstruments could push data to certain components in OODT, \nwhile other OODT components would need to distribute data to \nusers outside of OODT. 
End-users in some cases wanted to \nperform transformations on the data sent to them by OODT, and \nthen to return the data back into OODT. The framework needed to \nsupport scenarios such as these seamlessly. \n\nMany other constraints also imposed the heterogeneity \nrequirement on OODT. We can group these constraints into two \nmajor categories: \n\u2022 Organizational \u2013 As we briefly alluded above, discipline \n\nexperts who wanted to disseminate their data via OODT \nreally wanted the data to reside at their respective \ninstitutions. This constraint non-negotiable, and significantly \nimpacted the space of technical solutions that we could \ninvestigate for OODT.  \n\n\u2022 Technical \u2013 Since OODT had to federate many different data \nholdings and catalogs, we faced the constraints of linking \nthem together and federating very different schemas and \nvarying levels of sophistication in the data system interfaces \n(e.g., flat files, DBMS, web pages). Even those systems \nmanaging data through \u201chigher level APIs\u201d and middleware \n(e.g., RMI, CORBA, SOAP) proved non-trivial to integrate. \nThe constraints enjoined by heterogeneity alone led us to \n\nrealize that the OODT framework would need to draw heavily \nfrom multiple areas. Database systems, although used \nsuccessfully for many years to manage large amounts of data at \nmany sites, lacked the flexibility and interface capability to \nintegrate data from other more crude APIs and storage systems \n(such as a PI-led web site). Databases also did not address the \ndistribution of data and \u201cownership\u201d issues. The advent of the \nweb, although a promising means for providing openness and \nflexible interfaces to data, would not alone address the issues such \nas multi-institutional security and access. Furthermore, its \nrequest/reply nature would not easily handle other distribution \nscenarios, e.g., subscribe/notify. Research in the area of grid \ncomputing [1] has defined \u201cout of the box\u201d services for managing \ndata systems (e.g., GridFTP), but which utilized alone would not \naddress our other challenges (e.g., complexity). \n\nLocation Transparency \u2013 Even though data could potentially \nbe input into and output from the system from many \ngeographically disparate and distributed sites, it should appear to \nthe end-users as if the data flow occurred from a single location. \nThis requirement was reinforced by the need to dynamically add \ndata producers and consumers to a system supported by OODT, \nas will be further discussed below. \n\nAutonomy \u2013 When designing the OODT framework, we could \nnot dictate how data providers should store, process, find, evolve, \nor retire their data. Instead, the framework needed to be \n\n\n\ntransparent, allowing data providers to continue with their regular \nbusiness processes, while managing and disseminating their \ninformation unobtrusively.  \n\nDynamism \u2013 It is expected that data providers for the most part \nwill be stable organizations. However, there are cases in which \nnew data producing (occasionally) and consuming (frequently) \nnodes will need to be brought on-line. Back-end data sources need \nto be pluggable, with little or no direct impact on the end-user of \nthe OODT system, or on the organization that owns the data \nsource. New end-users (or client hosts) should also be able to \n\u201ccome and go\u201d without any disruption to the rest of the system. 
In \nthe end, we realized this meant the whole infrastructure must be \ncapable of some level of dynamism in order to meet these \nconstraints. \n\nScalability \u2013 OODT needed to manage large volumes of data, \nfrom at least hundreds of gigabytes at its inception to the current \nmissions which will produce hundreds of terabytes. The \nframework needed to support at least dozens of institutional data \nproviders (which themselves may have subordinate data system \nproviders), dozens of user types (e.g., scientists, teachers, \nstudents, policy makers), thousands of users, hundreds of \ngeographic sites, and thousands of different data types to manage \nand disseminate. \n\nDistribution \u2013 The framework should be able to handle the \nphysical distribution of data across sites nationally and \ninternationally, and ultimately the physical distribution of the \nsystem interfaces which provide the data. \n\nDecentralization \u2013 Each site may have its own data \nmanagement processes, interfaces and data types, which were \noperating independently for some time. We needed to devise a \nway of coordinating and managing data between these data sites \nand providers without centralizing control of their systems, or \ninformation. In other words, the requirement was that the different \nsites retain their full autonomy, and that OODT adapts instead. \n\nPerformance \u2013 Despite its scale and interaction with many \norganizations, data systems, and providers, OODT still needed to \nperform under stringent demands. Queries for information needed \nto be serviced quickly: in many cases response time under five \nseconds was used as a baseline. Additionally, OODT needed to be \noperational whenever any of the participating scientists wanted to \nlocate, access, or process their data. \n\n3. BACKGROUND AND RELATED WORK \nSeveral large-scale software technologies that distribute, \n\nmanage, and process information have been constructed over the \npast decade. Each of these technologies falls into one or more of \nfour distinct areas: grid-computing, information integration, \ndatabases, and middleware. In this section, we briefly survey \nrelated projects in each of these areas and compare their foci and \naccomplishments to those of OODT. Additionally, since a major \nfocal point of OODT is software architecture, we start out by \nproviding some brief software architecture background and \nterminology to set the context. \n\nTraditionally, software architecture has referred to the \nabstraction of a software system into its fundamental building \nblocks: software components, their methods of interaction (or \nsoftware connectors), and the governing rules that guide the \n\ncomposition of software components and software connectors \n(configurations) [2, 3]. Software architecture has been recognized \nin many ways to be the linchpin of the software development \nprocess. Ideally, the software requirements are reflected within \nthe software system\u2019s components and interactions; the \ncomponents and interactions are captured within the system\u2019s \narchitecture; and the architecture is used to guide the design, \nimplementation, and evolution of the system. Design guidelines \nthat have been proven effective are often codified into \narchitectural styles, while specific architectural solutions (e.g., \nconcrete system structures, component types and interfaces, and \ninteraction facilities) within specific domains are captured as \nreusable reference architectures. 
\n\nGrid computing deals with highly complex and distributed \ncomputational problems and large volume data management \ntasks. Massive parallel computation, distributed workflow, and \npetabyte scale data distribution are only a small cross-section of \nthe grid\u2019s capabilities. Grid projects are usually broken down into \ntwo areas. Computational grid systems are concerned with \nsolving complex scientific problems involving supercomputing \nscale resources dispersed across various organizational \nboundaries. The representative computational grid system is the \nGlobus Toolkit [4]. Globus is built on top of a web-services [5] \nsubstrate and provides resource management components, \ndistributed workflow and security infrastructure. Other \ncomputational grid systems provide similar capabilities. For \nexample, Alchemi [6] is a .NET-based grid technology that \nsupports distributed job scheduling and an object-oriented grid \ndevelopment environment. JCGrid [7] is a light weight, Java-\nbased open source computational grid project whose goal is to \nsupport distributed job scheduling and the splitting of CPU-\nintensive tasks across multiple machines.  \n\nThe other class of grid systems, Data grids, is involved in the \nmanagement, processing, and distribution of large data volumes to \ndisbursed and heterogeneous users, user types, and geographic \nlocations. There are several major data grid projects. The LHC \nComputing Grid [8] is a system whose main goal is to provide a \ndata management and processing infrastructure for the high \nenergy physics community. The Earth System Grid [9] is geared \ntowards supporting climate modeling research and distribution of \nclimate data sets and metadata to the climate and weather \nscientific community.  \n\nTwo independently conducted studies [10, 11] have \nidentified three key areas that the current grid implementations \nmust address more effectively in order to promote data and \nsoftware interoperability: (1) formality in grid requirements \nspecification, (2) rigorous architectural description, and (3) \ninteroperability between grid solutions. As we will discuss in this \npaper, our work to date on OODT has the potential to be a \nstepping stone in each of these areas: its explicit focus on \narchitectures for data-intensive, \u201cgrid-like\u201d systems naturally \naddresses the three concerns.  \n\nThere have been several well-known efforts within the AI \nand database communities that have delved into the topic of \ninformation integration, or the shared access, search, and retrieval \nof distributed, heterogeneous information resources. Within the \npast decade, there has been significant interest in building \ninformation mediators that can integrate information from \nmultiple data sources. Mediators federate information by querying \nmultiple data sources, and fusing back the gathered results. The \nrepresentative systems using this approach include TSIMMS [12], \n\n\n\nInformation Manifold [13], The Internet Softbot [14], InfoSleuth \n[15], Infomaster [16], DISCO [17], SIMS [18] and Ariadne  [19]. \nEach of these approaches focuses on fundamental algorithmic \ncomponents of information integration: (1) formulating \nexpressive, efficient query languages (such as Theseus [20]) that \nquery many heterogeneous data stores; (2) accurately and reliably \ndescribing both global, and source data models (e.g. 
the Global-\nas-view [12] and Local-as-view [21] approaches); (3) providing a \nmeans for global-to-source data model integration; and (4) \nimproving queries and deciding which data sources to query (e.g. \nquery reformulation [22] and query rewriting [22, 23]).  \n\nHowever, these algorithmic techniques fail to address the \nsoftware engineering side of information integration. For instance, \nexisting literature fails to answer questions such as, which of the \ncomponents in the different systems\u2019 architectures are common; \nhow can they be reused; which portions of their implementations \nare tied to (which) software components; which software \nconnectors are the components using to interact; are the \ninteraction mechanisms replaceable (e.g., can a client-server \ninteraction in Ariadne become a peer-to-peer interaction); and so \non. Additionally, none of the above related mediator systems have \nformalized a process for designing, implementing, deploying, and \nmaintaining the software components belonging to each system.  \n\nSeveral middleware technologies such as CORBA, \nEnterprise Java Beans [24], Java RMI [25], and more recently \nSOAP and Web services [5] have been suggested as \u201csilver \nbullets\u201d that address the problem of integrating and utilizing \nheterogeneous software computing and data resources. Each of \nthese technologies provides three basic services: (1) an \n\nimplementation and composition framework for software \ncomponents, possibly written in different languages but \nconforming to a specific middleware interface; (2) a naming \nregistry used to locate components; and (3) a set of basic services \nsuch as (un-)marshalling of data, concurrency, distribution and \nsecurity.  \n\nAlthough middleware is very useful \u201cglue\u201d that can connect \nsoftware components written in different languages or deployed \nin heterogeneous environments, middleware technologies do not \nprovide any \u201cout of the box\u201d services that deal with computing \nand data resource management across organizational boundaries \nand across computing environments at a national scale. These \nkinds of services usually have to be engineered into the \nmiddleware itself. We should note that in grid computing such \nservices are explicitly called out and provided at a higher layer of \nabstraction. In fact, the combination of these higher-level grid \nservices and an underlying middleware platform is typically \nreferred to as a \u201cgrid technology\u201d [11].  \n\n4. OODT ARCHITECTURE \nOODT\u2019s architecture is a reference architecture that is \n\nintended to be instantiated and tailored for use across science \ndomains and projects. The reference architecture comprises \nseveral components and connectors.  A particular instance of this \nreference architecture, that of NASA\u2019s planetary data system \n(PDS) project, is shown in Figure 1. OODT is installed on a given \nhost inside a \u201csandbox\u201d, and is aware of and interacts only with \nthe designated external data sources outside its sandbox. OODT\u2019s \n\nm\nessaging layer (H\n\nTTP)\n\n\u2026\n.. \u2026..\n\n \nFigure 1. The Planetary Data System (PDS) OODT Architecture Instantiation \n\n\n\ncomponents are responsible for delivering data from \nheterogeneous data stores, identifying and locating data within the \nsystem, and ingesting and processing data into underlying data \nstores. 
The connectors are responsible for integrating OODT with \nheterogeneous data sources; providing reliable messaging to the \nsoftware components; marshalling resource descriptions and \ntransferring data between components; transactional \ncommunication between components; and security related issues \nsuch as identification, authorization, and authentication. In this \nsection, we describe the guiding principles behind the reference \narchitecture. We then describe each of the OODT reference \ncomponents and connectors in detail. In Section 5, we describe \nspecific instantiations of the reference architecture in the context \nof several projects that are using OODT. \n\n4.1 Guiding Principles \nThe software engineering challenges discussed in Section 2 \n\nmotivated and framed the development of OODT. Conquering \nthese challenges led us to a set of four guiding principles behind \nthe OODT reference architecture.  \n\nThe first guiding principle is division of labor. Each \ncapability provided by OODT (e.g., processing, ingestion, search, \nand retrieval of data, access to heterogeneous data, and so on) is \ncarefully divided among separate, independent architectural \ncomponents and connectors. As will be further detailed below, the \nprinciple is upheld through OODT\u2019s rigorous separation of \nconcerns, and modularity enforced by explicit interfaces. This \nprinciple addresses the complexity, heterogeneity, dynamism, and \ndecentralization challenges. \n\nClosely related to the preceding principle is technology \nindependence. This principle involves keeping up-to-date with the \nevolution of software technology (both in-house and third-party), \nwhile avoiding tying the OODT architecture to any specific \nimplementation. By allowing us to select the technology most \nappropriate to a given task or specific need, this principle helps us \nto address the challenges of complexity, scalability, security, \ndistribution, location transparency, performance, and dynamism.  \nFor instance, OODT\u2019s initial reference implementation used \nCORBA as the substrate for its messaging layer connector. When \nthe CORBA vendor decided to begin charging JPL significant \nlicense fees (thus violating NASA\u2019s objective of producing a \nsolution that would be free to its users), the principle of \ntechnology independence came into play. Because the OODT \nmessaging layer connector supports a wrapper interface around \nthe lower-level distribution technology, we were able to replace \nour initial CORBA-based connector with one using Java\u2019s open \nsource RMI middleware, and redeploy the new connector to the \nOODT user sites, within three person days.  \n\nAnother guiding principle of OODT is the distinguishing of \nmetadata as a first-class citizen in the reference architecture, and \nseparating metadata from data. The job of metadata (i.e., \u201cdata \nabout data\u201d) is to describe the data universe in which the system \nis operating. Since OODT is meant to be a technology that \nintegrates diverse data sources, this data universe is highly \nheterogeneous and possibly dynamic. Metadata in OODT is \nmeant to catalog information, allowing a user to locate and \ndescribe the actual data in which she is interested. On the other \nhand, the job of data in OODT is to describe physical or scientific \nphenomena; it is the ultimate end user product that an OODT \nsystem should deliver. 
This principle helps to address the \n\nchallenges of heterogeneity, autonomy of data providers, and \ndecentralization. \n\nSeparating the data model from the software is another key \nprinciple behind the reference architecture. Akin to ontology/data-\ndriven systems, OODT components should not be tied to the data \nand metadata that they manipulate. Instead, the components \nshould be flexible enough to understand many (meta-)data models \nused across different scientific domains, without reengineering or \ntailoring of the component implementations. This principle helps \nto address the challenges of complexity and heterogeneity. \n\nThese four guiding principles are reified in a reference \narchitecture comprising four pairs of component types and two \nclasses of connectors organized in a canonical structure. One \ninstantiation of the reference architecture reflecting the canonical \nstructure is depicted in Figure 1.  Each OODT architectural \nelement (component and connector) serves a specific purpose, \nwith its functionality exported through a well-defined interface.  \nThis supports OODT\u2019s constant evolution, allowing us to add, \nremove, and substitute, if necessary dynamically (i.e., at runtime), \nelements of a given type. It also allows us to introduce flexibility \nin the individual instances of the reference architecture while, at \nthe same time, controlling the legal system configurations.  \nFinally, the explicit connectors and well-defined component \ninterfaces allow OODT in principle to integrate with a wide \nvariety of third-party systems (e.g., [26]).  The outcome of the \nguiding principles (described above) and design decisions \n(detailed below) is an architecture that is \u201ceasy to build, hard to \nbreak\u201d. \n\n4.2 OODT Components \n4.2.1 Product Server and Product Client \n\nThe Product Server is used to retrieve data from \nheterogeneous data stores. The product server accepts a query \nstructure that identifies a set of zero or more products which \nshould be returned the issuer of the query. A product is a unit of \ndata in OODT and represents anything that a user of the system is \ninterested in retrieving: a JPEG image of Mars, an MS Word \ndocument, a zip file containing text file results of a cancer study, \nand so on. Product servers can be located at remote data sites, \ngeographically and/or institutionally disparate from other OODT \ncomponents. Alternatively, product servers can be centralized, \nlocated at a single site. The objective of the product server is to \ndeliver data from otherwise heterogeneous data stores and \nsystems. As long as a data store (or system) provides some kind \nof access interface to get its data, a product server can \u201cwrap\u201d \nthose interfaces with the help of Handler connectors described in \nSection 4.3 below. \n\nThe Product Client component communicates with a product \nserver via the Messaging Layer connectors described in Section \n4.3. A product client resides at the end-user\u2019s (e.g., scientist\u2019s) \nsite.  It must know the location of at least one product server, and \nthe query structure that identifies the set of products that the user \nwants to retrieve. At the same time, it is completely insulated \nfrom any changes in the physical location or actual representation \nof the data; its only interface is to the product server(s).  Many \nproduct clients may communicate with the same product server, \nand many product servers can return data to the same product \nclient. 
This adds flexibility to the architecture without introducing \nunwanted long-term dependencies: a product client can be added, \n\n\n\nremoved, or replaced with another one that depends on different \nproduct servers, without any effect on the rest of the architecture. \n\n4.2.2 Profile Server and Profile Client \nThe Profile Server manages resource description \n\ninformation, i.e., metadata, in a system built with OODT. \nResource description information is divided into three main \ncategories: \n\u2022 Housekeeping Information \u2013 Metadata such as ID, Last \n\nModified Date, Last Revised By. This information is kept \nabout the resource descriptions themselves and is used by the \nprofile server to inventory and catalog resource descriptions. \nThis is a fixed set of metadata. \n\n\u2022 Resource Information \u2013 This includes metadata such as Title, \nAuthor, Creator, Publisher, Resource Type, and Resource \nLocation. This information is kept for all the data in the \nsystem, and is an extended version of the Dublin Core \nMetadata for describing electronic resources [27]. This is \nalso a fixed set of metadata. \n\n\u2022 Domain-Specific Information \u2013 This includes metadata \nspecific to a particular data domain. For instance, in a cancer \nresearch system this may include metadata such as Blood \nSpecimen Type, Site ID, and Protocol/Study Description. \nThis set of metadata is flexible and is expected to change. \n\nAs with product servers, profile servers can be decentralized at \nmultiple sites or centralized at a single site. The objective of the \nprofile server is to deliver metadata that gives a user enough \ninformation to locate the actual data within OODT regardless of \nthe underlying system\u2019s exact configuration, and degrees of \ncomplexity and heterogeneity; the user then retrieves the data via \none or more product servers. Because profile servers do not serve \nthe actual data, they need not have a direct interface to the data \nthat they describe. In addition to the complete separation of duties \nbetween profile and product servers, this ensures their location \nindependence, allows their separate evolution, and minimizes the \neffects of component and/or network failures in an OODT system. \n\nProfile Client components communicate with profile servers \nover the messaging layer connectors. The client must know the \nlocation of the profile server, and must provide a query that \nidentifies the metadata that a user is interested in retrieving. There \ncan be many profile clients speaking with a single profile server, \nand many profile servers speaking with a single profile client.  \nThe architectural effects are analogous to those in the case of \nproduct clients and servers. \n\n4.2.3 Query Server and Query Client \nThe Query Server component provides an integrated search \n\nand retrieval capability for the OODT reference architecture. \nQuery servers interact with profile and product servers to retrieve \nmetadata and data requested by system users. A query server is \nseeded with an initial set of references to profile servers. Upon \nreceiving a query from a user, the query server passes it along to \neach profile server from its list, and collects the metadata \nreturned. Part of this metadata is a resource location (recall \nSection 4.2.2) in the form of a URI [28]. A URI can be a link to a \nproduct server, to a web site with the actual data, or to some \nexternal data providing system. 
This directly supports \nheterogeneity, location transparency, and autonomy of data \nproviders in OODT.  \n\nAnother novel aspect of OODT\u2019s architecture is that if a \nprofile server is unable to service the query, or if it believes that \n\nother profile servers it is aware of may contain relevant metadata, \nit will return the URIs of those profile servers; the query server \nmay then forward the query to them. As a result, query servers are \ncompletely decoupled from product servers (and from any \n\u201cexposed\u201d external data sources), and are also decoupled from \nmost of the profile servers. In turn, this lessens the complexity of \nimplementing, integrating, and evolving query servers. Once the \nresource metadata is returned, the query server will either allow \nthe user herself to use the supplied URIs to find the data in which \nshe was interested (interactive mode), or it will retrieve, package, \nand deliver the data to the user (non-interactive mode). As with \nthe product and profile servers, query servers can be centrally \nlocated at a single site, or they can be decentralized across \nmultiple sites.   \n\nQuery Client components communicate with the query \nservers. The query client must provide a query server with a query \nthat identifies the data in which the user is interested, and it must \nset a mode for the query server (interactive or non-interactive \nmode). The query client may know the location of the query \nserver that it wants to contact, or it may rely on the messaging \nlayer connector to route its queries to one or more query servers.   \n\n4.2.4 Catalog and Archive Server and Client \nThe Catalog and Archive Server (CAS) component in OODT \n\nis responsible for providing a common mechanism for ingestion \nof data into a data store, including any processing required as a \nresult of ingestion. For instance, prior to the ingestion of a poor-\nresolution image of Mars, the image may need to be refined and \nthe resolution improved. CAS would handle this type of \nprocessing. Any data ingested into CAS must include associated \nmetadata information so that the data can be cataloged for search \nand retrieval purposes. Upon ingestion, the data is sent to a data \nstore for preservation, and the corresponding metadata is sent to \nthe associated catalog. The data store and catalog need not be \nlocated on the same host; they may be located on remote sites \nprovided there is an access mechanism to store and retrieve data \nfrom each. The goal of CAS is to streamline and standardize the \nprocess of adding data to an OODT-aware system.  Note that a \nsystem whose data stores were populated prior to its integration \ninto OODT can still use CAS for its new data.  Since the CAS \ncomponent populates data stores and catalogs with both data and \nmetadata, specialized product and profile server components have \nbeen developed to serve data and metadata from the CAS backend \ndata stores and catalogs more efficiently. Any older data can still \nbe served with existing product and profile servers. \n\nThe Archive Client component communicates with CAS. The \narchive client must know the location of the CAS component, and \nmust provide it with data to ingest. Many archive clients can \ncommunicate with a single CAS component, and vice versa.  Both \nthe archive client and CAS components are completely \nindependent of the preceding three pairs of component types in \nthe OODT reference architecture. 
\n\n4.3 OODT Connectors \n4.3.1 Handler Connectors \n\nHandler connectors are responsible for enabling the \ninteraction between OODT\u2019s components and third-party data \nstores.  A handler connector performs the transformation between \nan underlying (meta-)data store\u2019s internal API for retrieving data \nand its (meta-)data format on the one hand, and the OODT system \n\n\n\non the other. Each handler connector is typically developed for a \nclass of data stores and metadata systems. For example, for a \ngiven DBMS such as Oracle, and a given internal representation \nschema for metadata, a generic Oracle handler connector is \ntypically developed and then reused. Similarly, for a given \nfilesystem scheme for storing data, a generic filesystem handler \nconnector is developed and reused across like filesystem data \nstores.  \n\nEach profile server and product server relies on one or more \nhandler connectors. Profile servers use profile handlers, and \nproduct servers use query handlers. Handler connectors thereby \ncompletely insulate product and profile servers from the third-\nparty data stores.  Handlers also allow for different types of \ntransformations on (meta-)data to be introduced dynamically \nwithout any effect on the rest of OODT components. For \nexample, a product server that distributes Mars image data might \nbe serviced by a query handler connector that returns high-\nresolution (e.g., 10 GB) JPEG image files of the latest summit \nclimbed by a Mars rover; if the system ends up experiencing \nperformance problems, another handler may be (temporarily) \nadded to return lower-resolution (e.g., 1 MB) JPEG image files of \nthe same scenario. Likewise, a profile server may have two \nprofile handler connectors, one that returns image-quality \nmetadata (e.g., resolution and bits/pixel) and another that returns \ninstrument metadata about Mars rover images (e.g., instrument \nname or image creation date). \n\n4.3.2 Messaging Layer Connector \nThe Messaging Layer connector is responsible for \n\nmarshalling data and metadata between components in an OODT \nsystem. The messaging layer must keep track of the locations of \nthe components, what types of components reside in which \nlocations, and if components are still running or not. Additionally, \nthe messaging layer is responsible for taking care of any needed \nsecurity mechanisms such as authentication against an LDAP \ndirectory service, or authorization of a user to perform certain \nrole-based actions. \n\nThe messaging layer in OODT provides synchronous \ninteraction among the components, and some delivery guarantees \non messages transferred between the software components. \nTypically in any large-scale data system, the asynchronous mode \nof interaction is not encouraged because partial data transfers are \nof no use to users such as scientists who need to make analysis on \nentire data sets. \n\nThe messaging layer supports communication between any \nnumber of connected OODT software components. In addition, \nthe messaging layer natively supports connections to other \nmessaging layer connectors as well.  This provides us with the \nability to extend and adapt an OODT system\u2019s architecture, as \nwell as easily tailor the architecture for any specific interaction \nneeds (e.g., by adding data encryption and/or compression \ncapabilities to the connector). \n\n5. EXPERIENCE AND CASE STUDIES \nThe OODT framework has been used both within and \n\noutside NASA. 
JPL, NASA\u2019s Ames Research Center, the \nNational Institutes of Health (NIH), the National Cancer Institute \n(NCI), several research universities, and U.S. Federally Funded \nResearch and Development Centers (FFRDCs) are all using \nOODT in some form or fashion. OODT is also available for \ndownload through a large open-source software distributor [29]. \n\nOODT components are found in planetary science, earth science, \nbiomedical, and clinical research projects. In this section, we \ndiscuss our experience with OODT in several representative \nprojects within these scientific areas. We compare and contrast \nhow the projects were handled before and after OODT. We sketch \nsome of the domain-specific technical challenges we encountered \nand identify how OODT helped to solve them. \n\nTo begin using OODT, a user designs a deployment \narchitecture from one or more of the reference OODT \ncomponents (e.g., product and profile servers), and the reference \nOODT connectors. The user must determine if any existing \nhandler connectors can be reused, or if specialized handler \nconnectors need to be developed. Once all the components are \nready, the user has two options for deploying her architecture to \nthe target hosts: (1) the user may translate her design into a \nspecialized OODT deployment descriptor XML file, which can \nthen be used to start each program on the target host(s); or (2) the \nuser can deploy her OODT architecture using a remote server \ncontrol component, adding components, and connectors via a \ngraphical user interface. The GUI allows the user to send \ncomponent and connector code to the target hosts, to start, shut-\ndown, and restart the components and connectors, and to monitor \ntheir health during execution. \n\n5.1 Planetary Data System \nOne of the flagship deployments of OODT has been for \n\nNASA\u2019s Planetary Data System (PDS) [30]. PDS consists of \nseven \u201cdiscipline nodes\u201d and an engineering and management \nnode. Each node resides at a different U.S. university or \ngovernment agency, and is managed autonomously.  \n\nFor many years PDS distributed its data and metadata on \nphysical media, primarily CD-ROM. Each CD-ROM was \nformatted a according to a \u201chome-grown\u201d directory layout \nstructure called an archive volume, which later was turned into a \nPDS standard. PDS metadata was constructed using a common, \nwell-structured set of 1200 metadata elements, such as Target \nName and Instrument Type, that were identified from the onset of \nthe PDS project by planetary scientists. Beginning in the late \n1990s the advent of the WWW and the increasing data volumes of \nmissions led NASA managers to impose a new paradigm for \ndistributing data to the users of the PDS: data and metadata were \nnow to be distributed electronically, via a single, unified web \nportal. The web portal and accompanying infrastructure to \ndistribute PDS data and metadata was built in 2001 using OODT \nin the manner depicted in Figure 1. \n\nWe faced several technical challenges deploying OODT to \nPDS. PDS data and metadata were highly distributed, spanning all \nseven of the scientific discipline nodes across the country. \nAlthough the entire data volume across PDS at the time was \naround 7 terabytes, it was estimated that the volume would grow \nto 10 terabytes by 2004. Consequently, the system needed to be \nscalable and respond to large growth spurts caused by new data \nproducing missions. 
The flexibility and modularity of the OODT \nproduct and profile server components were particularly useful in \nthis regard. Using a product and/or profile server, each new data \nproducing system in the PDS could be dynamically \u201cplugged in\u201d \nto the existing PDS infrastructure that we constructed, without \ndisturbing existing components and processes.  \n\nWe also faced the problem of heterogeneity. Almost every \nnode within PDS had a different operating system, ranging from \nLinux, to Windows, to Solaris, to Mac OS X.  Each node \n\n\n\nEDRN \nQuery \nServer\n\nm\nessaging layer (R\n\nM\nI)\n\nProduct \nServer\n\nDBMS \n(Specimen \nMetadata)\n\nmoffitt.usf.edu (win2k server)\n\nMS SQL DBMS \n(Specimen \nProducts)\n\nSpecimen \nQuery \n\nHandler\n\nSpecimen Profile \nHandler (MS SQL)\n\nOODT \u201cSandbox\u201d\n\nOODT \u201cSandbox\u201d\n\nProduct \nServer\n\nProfile \nServer\n\nanother.erne.server (AnotherOS)\n\nCAS Profile \nHandler\n\nCAS Query \nHandler\n\nOODT \u201cSandbox\u201d\nCatalog and \n\nArchive Server\n\nLung Images \n(Filesystem)\n\nOther \nApplications\n\nginger.fhcrc.org (win2k)\n\nOther Applications\n\nERNE Web \nPortal\n\n(Query Client)\n\nuser host\n\nProfile \nClient\n\nProduct \nClient\n\nProfile ServerOther \nApplications\n\nOther \nApplications\n\nOther Applications\n\nOther Applications\n\nSpecimen Inventory\n(MS SQL)\n\nOther Applications\n\nOther Applications\n\npds.jpl.nasa.gov (Linux)\nLegend:\n\nOODT \nComponent\n\nData/metadata \nstore\n\nOODT Connector Hardware \nhost\n\nOODT \ncontrolled \nportion of \nmachine\n\ndata/control flow\nBlack Box\n\n \n \n\nFigure 2. The Early Detection Research Network (EDRN) OODT Architecture Instantiation \n\nmaintained its own local catalog system. Although each node in \nPDS had different file system implementations dictated by their \nOS, each node stored their data and metadata according to the \narchive volume structure. Because of this, we were able to write a \nsingle, reusable PDS Query Handler which could serve back \nproducts from a PDS archive volume structure located on a file \nsystem. Plugging into each node\u2019s catalog system proved to be a \nsignificant challenge. For nearly all of the nodes, specialized \nprofile handler connectors were constructed to interface with the \nunderlying catalog systems, which ranged from static text files \ncalled PDS label files to dynamic web site inventory systems \nconstructed using Java Server Pages. Because each of the catalogs \ntagged PDS data using the common set of 1200 elements, we \nwere able to share much of the code base among the profile \nhandler connectors, ultimately only changing the portion of the \ncode that made the particular JSP page call, or read the selected \nset of metadata from the label file. The entire code base of the \nPDS including all the domain specific handler connectors is only \nslightly over 15 KSLOC, illustrating the high degree of \nreusability provided by the OODT framework. \n\n5.2 Early Detection Research Network \nOODT is also supporting the National Cancer Institute\u2019s \n\n(NCI) Early Detection Research Network (EDRN). EDRN is a \ndistributed research program that unites researchers from over \nthirty institutions across the United States. Tens of thousands of \nscientists participate in the EDRN. Each institution is focused on \nthe discovery of cancer biomarkers as indicators for disease [31]. 
\n\nA critical need for the EDRN is an electronic infrastructure to \nsupport discovery and validation of these markers.  \n\nIn 2001 we worked with the EDRN program to develop the \nfirst component of their electronic biomarker infrastructure called \nthe EDRN Resource Network Exchange (ERNE). The (partial) \ncorresponding architecture is depicted in Figure 2. One of the \nmajor goals of ERNE was to provide real-time access to bio-\nspecimen information across the institutions of the EDRN. Bio-\nspecimen information typically consisted of gigabytes of \nspecimen images, and location and contact metadata for obtaining \nthe specimen from its origin study institution. The previous \nmethod of obtaining bio-specimen information was very human-\nintensive: it involved phone calls and some forms of electronic \ncommunication such as email. Specimen information was not \nsearchable across institutions participating in the EDRN. The bio-\nspecimen catalogs were largely out-of-date, and out-of-synch with \ncurrent holdings at each participating institution.  \n\nOne of the initial technical challenges we faced with EDRN \nwas scale. The EDRN was over three times as large as the PDS. \nBecause of this we chose to target ten institutions initially, rather \nthan the entire set of thirty one. Again, OODT\u2019s modularity and \nscalability came into play as we could phase deployment at each \ndeployment institution. As we instantiated new product, profile, \nquery, and archive servers at each institution, we could do so \nwithout interrupting any existing OODT infrastructure already \ndeployed.  \n\nAnother challenge that we encountered was dealing with \neach participating site\u2019s Institutional Review Board (IRB). An \nIRB is required to review and ensure compliance of projects with \n\n\n\nfederal laws related to working with data from research projects \ninvolving human subjects. To satisfy the IRB, any OODT \ncomponents deployed at an EDRN site had to provide an adequate \nsecurity capability in order to get approval to share the data \nexternally from an institution. OODT\u2019s separation of data and \nmetadata explicitly allowed us to satisfy this requirement. We \ndesigned ERNE so that each institution could remain in control of \ntheir specimen holding data by instantiating product server \ncomponents at each site, rather than distributing the information \nacross ERNE which would have violated the IRB agreements.  \n\nAnother significant challenge we faced in developing ERNE \nwas lack of a consistent metadata model for each ERNE site. We \nwere forced to develop a common specimen metadata model and \nthen to create specific mappings to link each local site to the \ncommon model. OODT aided us once again in this endeavor as \nthe common mappings we developed were easily codified into a \nquery handler connector, and reused across each ERNE site.  \n\nThe entire code base of ERNE, including all its specialized \nhandler connectors is only slightly over 5.3 KSLOC, highlighting \nthe high degree of reusability of the shared framework code base \nand the handler code base. \n\n \n\n5.3 Science Processing Systems \nOODT has also been deployed in several science processing \n\nsystem missions both, operational and under development. Due to \nspace limitations, we can only briefly summarize each of the \nOODT deployments in these systems.  
\n\nSeaWinds, a NASA-funded earth science instrument flying \non the Japanese ADEOS-II spacecraft, used the OODT CAS \ncomponent as a workflow and processing component for its \nProcessing and Analysis Center (SeaPAC). SeaWinds produced \nseveral gigabytes of data during its six year mission. CAS was \nused to control the execution and data flow of mission-specific \ndata processor components, which calibrated and created derived \ndata products from raw instrument data, and archived those \nproducts for distribution into the data store managed by CAS. A \nmajor challenge we faced during the development of SeaPAC was \nthat  the processor components were developed by a group \noutside of the SeaWinds project. We had to provide a mechanism \nfor integrating their source code into the OODT SeaPAC \nframework. OODT\u2019s separation of concerns allowed us to address \nthis issue with relative ease: once the data processors were \nfinished, we were able wrap and tailor them internally within \nCAS, without disturbing the existing SeaPaC infrastructure. \n\nThe success of the CAS within SeaWinds led to its reuse on \nseveral different missions. Another earth science mission called \nQuikSCAT retrofitted and replaced some of their existing \nprocessing components with CAS, using the SeaWinds experience \nas an example. The Orbiting Carbon Observatory (OCO) mission \nthat will fly in 2009, and that is currently under development, is \nalso utilizing CAS to ingest and process existing FTS CO2 \nspectrometer data from earth-based instruments. The James Web \nTelescope (JWT) is using the CAS for to implement its workflow \nand processing capabilities for astrophysics data and metadata. \nEach of these science processing systems will face similar \ntechnical challenges, including separation of concerns between \nthe actual processing framework and the developers writing the \nprocessor code, the volume of data that must be handled by the \nprocessing system (OCO is projected to produce over 150 \nterabytes), and the flexibility and tailorability of the workflow \n\nneeded to process the data. We believe that OODT is uniquely \npositioned to address these difficult challenges. \n\n5.4 Computer Modeling Simulation and \nVisualization \n\nOODT has also been deployed to aid the Computer \nModeling Simulation and Visualization (CMSV) community at \nJPL, by linking together several institutional model repositories \nacross the organizations within the lab, and creating a web portal \ninterface to query the integrated model repositories. We \ndeveloped specialized profile server components that locate and \nlink to different model resources across JPL, such as power \nsubsystem models of the Mars Exploration Rovers (MER), CAD-\ndrawing models of different spacecraft assembly parts, and \nsystems architecture models for engineering and design of \nspacecraft. Each of these different model types lived in separate \nindependent repositories across JPL. For instance, the CAD \nmodels were stored in a commercial product called TeamCenter \nEnterprise [32], while the power and systems architecture models \nwere stored in a commercial product called Xerox Docushare \n[33].  \n\nTo integrate these model repositories for CMSV, we had to \nderive a common set of metadata across the wide spectrum of \ndifferent model types that existed at JPL. 
OODT\u2019s separation of \ndata from metadata allowed us to rapidly instantiate our common \nmetadata model once we developed it, by constructing specialized \nprofile handler connectors that mapped each repository\u2019s local \nmodel to the common model. Reusability levels were high across \nthe connectors, resulting in an extremely small code base of 2.57 \nKSLOC.  \n\nAnother challenge in light of this mapping activity was \ninterfacing with the APIs of the underlying model repositories. In \nthe above two cases, the APIs were commercial products, and \npoorly documented. In some cases, such as the Docushare \nrepository, the APIs did not fully conform to their stated \nspecifications. The division of labor amongst OODT components \ncame into play on this task. It allowed us to focus on deploying \nthe rest of the OODT supporting infrastructure, such as the web \nportal, and the profile handler connectors, and not getting stalled \nwaiting for the support teams from each of the commercial \nvendors to debug our API problems. Once the OODT CMSV \ninfrastructure was deployed, the modeling and simulation \ncommunity at JPL immediately began adopting it and sharing \ntheir models across the lab. During the past year, the system has \nreceived around 40,000 hits on the web portal, and over 9,000 \nqueries for models. \n\n6. CONCLUSIONS \nWhen the need arose at NASA seven years ago for a data \n\ndistribution and management solution that satisfied the formidable \nrequirements outlined in this paper, it was not clear to us initially \nhow to approach the problem.  On the surface, several applicable \nsolutions already existed (middleware, information integration \nsystems, and the emerging grid technologies).  Adopting one of \nthem seemed to be a preferable path because it would have saved \nus precious time.  However, upon closer inspection we realized \nthat each of these options could be instructive, but that none of \nthem solved the problem we were facing (and that even some of \nthese technologies themselves were facing). \n\nThe observation that directly inspired OODT was that we \nwere dealing with software engineering challenges, and that those \n\n\n\nchallenges naturally required a software engineering solution.  \nOODT is a large, complex, dynamic system, distributed across \nmany sites, servicing many different users, and classes of users, \nwith large amounts of heterogeneous data, possibly spanning \nmultiple domains. Software engineering research and practice \nboth suggest that success in developing such a system will be \ndetermined to a large extent by the system\u2019s software \narchitecture.  It therefore became imperative that we rely on our \nexperience within the domain of data-intensive systems (e.g., \nJPL\u2019s PDS project), as well as our study of related research and \npractice, in order to develop an architecture for OODT that will \naddress the challenges we discussed in Section 2.  Once the \narchitecture was designed and evaluated, OODT\u2019s initial \nimplementation and its subsequent adaptations followed naturally. \n\nAs OODT\u2019s developers we are heartened, but as software \nengineering researchers and practitioners disappointed, that \nOODT still appears to be the only system of its kind. The \nintersection of middleware, information management, and grid \ncomputing is rapidly growing, yet it is still characterized by one-\noff solutions targeted at very specific problems in specific \ndomains. 
Unfortunately, these solutions are sometimes clever by \naccident and more frequently little more than \u201chacks\u201d.  We \nbelieve that OODT\u2019s approach is more appropriate, more \neffective, more broadly applicable, and certainly more helpful to \ndevelopers of future systems in this area.  We consider OODT\u2019s \ndemonstrated ability to evolve and its applicability in a growing \nnumber of science domains to be a testament to its explicit, \ncarefully crafted software architecture. \n\n7. ACKNOWLEDGEMENTS \nThis material is based upon work supported by the Jet \n\nPropulsion Laboratory, managed by the California Institute of \nTechnology. Effort also supported by the National Science \nFoundation under Grant Numbers CCR-9985441 and ITR-\n0312780.  \n\n8. REFERENCES \n[1] A. Chervenak, I. Foster, et al., \"The Data Grid: Towards an \n\nArchitecture for the Distributed Management and Analysis of \nLarge Scientific Data Sets,\" J. of Network and Computer \nApplications, vol. 23, pp. 187-200, 2000. \n\n[2] N. Medvidovic and R. N. Taylor, \"A Classification and \nComparison Framework for Software Architecture Description \nLanguages,\" IEEE TSE, vol. 26, pp. 70-93, 2000. \n\n[3] D. E. Perry and A. L. Wolf, \"Foundations for the Study of \nSoftware Architecture,\" Software Engineering Notes (SEN), \nvol. 17, pp. 40-52, 1992. \n\n[4] \"The Globus Alliance (http://www.globus.org),\" 2005. \n[5] \"Webservices.org (http://www.webservices.org),\" 2005. \n[6] A. Luther, R. Buyya, et al., \"Alchemi: A .NET-based \n\nEnterprise Grid Computing System,\" in Proc. of 6th \nInternational Conference on Internet Computing, Las Vegas, \nNV, USA, 2005. \n\n[7] \"JCGrid Web Site (http://jcgrid.sourceforge.net),\" 2005. \n[8] \"LHC Computing Grid (http://lcg.web.cern.ch/LCG/),\" 2005. \n[9] D. Bernholdt, S. Bharathi, et al., \"The Earth System Grid: \n\nSupporting the Next Generation of Climate Modeling \nResearch,\" Proceedings of the IEEE, vol. 93, pp. 485-495, \n2005. \n\n[10] A. Finkelstein, C. Gryce, et al., \"Relating Requirements and \nArchitectures: A Study of Data Grids,\" J. of Grid Computing, \nvol. 2, pp. 207-222, 2004. \n\n[11] C. A. Mattmann, N. Medvidovic, et al., \"Unlocking the Grid,\" \nin Proc. of CBSE, St. Louis, MO, pp. 322-336, 2005. \n\n[12] J. Hammer, H. Garcia-Molina, et al., \"Information translation, \nmediation, and mosaic-based browsing in the tsimmis system,\" \nin Proc. of ACM SIGMOD International Conference on \nManagement of Data, San Jose, CA, pp. 483-487, 1995. \n\n[13] T. Kirk, A. Y. Levy, et al., \"The information manifold,\" \nWorking Notes of the AAAI Spring Symposium on Information \nGathering in Heterogeneous, Distributed Environment, Menlo \nPark, CA, Technical Report SS-95-08, 1995. \n\n[14] O. Etzioni and D. S. Weld, \"A softbot-based interface to the \nInternet,\" CACM, vol. 37, pp. 72-76, 1994. \n\n[15] A. Go\u00f1i, A. Illarramendi, et al., \"An optimal cache for a \nfederated database system,\" Journal of Intelligent Information \nSystems, vol. 9, pp. 125-155, 1997. \n\n[16] M. R. Genesereth, A. Keller, et al., \"Infomaster: An \ninformation integration system,\" in Proc. of ACM SIGMOD \nInternational Conference on Management of Data, Tucson, \nAZ, pp. 539-542, 1997. \n\n[17] A. Tomasic, L. Raschid, et al., \"A data model and query \nprocessing techniques for scaling access to distributed \nheterogeneous databases in disco,\" IEEE Transactions on \nComputers, 1997. \n\n[18] Y. Arens, C. A. 
Knoblock, et al., \"Query Reformulation for \nDynamic Information Integration,\" Journal of Intelligent \nInformation Systems, vol. 6, pp. 99-130, 1996. \n\n[19] J. Ambite, N. Ashish, et al., \"Ariadne: A system for \nconstructing mediators for internet sources,\" in Proc. of ACM \nSIGMOD International Conference on Management of Data, \nSeattle, WA, pp. 561-563, 1998. \n\n[20] G. Barish and C. A. Knoblock, \"An Expressive and Efficient \nLanguage for Information Gathering on the Web,\" in Proc. of \n6th International Conference on AI Planning and Scheduling \n(AIPS-2002) Workshop, Toulouse, France, 2002. \n\n[21] A. Y. Halevy, \"Answering queries using views: A survey,\" \nVLDB Journal, vol. 10, pp. 270-294, 2001. \n\n[22] J. L. Ambite, C. A. Knoblock, et al., \"Compiling Source \nDescriptions for Efficient and Flexible Information \nIntegration,\" Information Systems Journal, vol. 16, pp. 149-\n187, 2001. \n\n[23] E. Lambrecht and S. Kambhampati, \"Planning for Information \nGathering:  A Tutorial Survey,\" ASU CSE Technical Report \n96-017, May 1997. \n\n[24] \"Enterprise Java Beans (http://java.sun.com/ejb),\" pp. 2005. \n[25] \"Java RMI (http://java.sun.com/rmi/),\" 2005. \n[26] C. A. Mattmann, S. Malek, et al., \"GLIDE:  A Grid-based \n\nLightweight Infrastructure for Data-intensive Environments,\" \nin Proc. of European Grid Conference, Amsterdam, the \nNetherlands, pp. 68-77, 2005. \n\n[27] DCMI, \"Dublin Core Metadata Element Set,\" 1999. \n[28] T. Berners-Lee, R. Fielding, et al., \"Uniform Resource \n\nIdentifiers (URI): Generic Syntax,\" 1998. \n[29] \"Open Channel Foundation: Request Object Oriented Data \n\nTechnology (OODT) - \n(http://openchannelsoftware.com/orders/index.php?group_id=3\n32),\" 2005. \n\n[30] J. S. Hughes and S. K. McMahon, \"The Planetary Data System. \nA Case Study in the Development and Management of Meta-\nData for a Scientific Digital Library.,\" in Proc. of ECDL, pp. \n335-350, 1998. \n\n[31] S. Srivastava, Informatics in proteomics. Boca Raton, FL: \nTaylor & Francis/CRC Press, 2005. \n\n[32] \"UGS Products: TeamCenter \n(http://www.ugs.com/products/teamcenter/),\" 2005. \n\n[33] \"Document Management | Xerox Docushre \n(http://docushare.xerox.com/ds/),\" 2005. \n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\tINTRODUCTION\n\tSOFTWARE ENGINEERING CHALLENGES\n\tBACKGROUND AND RELATED WORK\n\tOODT ARCHITECTURE\n\tGuiding Principles\n\tOODT Components\n\tProduct Server and Product Client\n\tProfile Server and Profile Client\n\tQuery Server and Query Client\n\tCatalog and Archive Server and Client\n\n\tOODT Connectors\n\tHandler Connectors\n\tMessaging Layer Connector\n\n\n\tEXPERIENCE AND CASE STUDIES\n\tPlanetary Data System\n\tEarly Detection Research Network\n\tScience Processing Systems\n\tComputer Modeling Simulation and Visualization\n\n\tCONCLUSIONS\n\tACKNOWLEDGEMENTS\n\tREFERENCES\n\n",
                  "X-TIKA:parse_time_millis": "11123",
                  "access_permission:assemble_document": "true",
                  "access_permission:can_modify": "true",
                  "access_permission:can_print": "true",
                  "access_permission:can_print_degraded": "true",
                  "access_permission:extract_content": "true",
                  "access_permission:extract_for_accessibility": "true",
                  "access_permission:fill_in_form": "true",
                  "access_permission:modify_annotations": "true",
                  "created": "Wed Feb 15 13:13:58 PST 2006",
                  "creator": "End User Computing Services",
                  "date": "2006-02-15T21:16:01Z",
                  "dc:creator": "End User Computing Services",
                  "dc:format": "application/pdf; version=1.4",
                  "dc:title": "Proceedings Template - WORD",
                  "dcterms:created": "2006-02-15T21:13:58Z",
                  "dcterms:modified": "2006-02-15T21:16:01Z",
                  "grobid:header_Abstract": "Modern scientific research is increasingly conducted by virtual communities of scientists distributed around the world. The data volumes created by these communities are extremely large, and growing rapidly. The management of the resulting highly distributed, virtual data systems is a complex task, characterized by a number of formidable technical challenges, many of which are of a software engineering nature. In this paper we describe our experience over the past seven years in constructing and deploying OODT, a software framework that supports large, distributed, virtual scientific communities. We outline the key software engineering challenges that we faced, and addressed, along the way. We argue that a major contributor to the success of OODT was its explicit focus on software architecture. We describe several large-scale, real-world deployments of OODT, and the manner in which OODT helped us to address the domain-specific challenges induced by each deployment.",
                  "grobid:header_AbstractHeader": "ABSTRACT",
                  "grobid:header_Address": "Pasadena, CA 91109, USA Los Angeles, CA 90089, USA",
                  "grobid:header_Affiliation": "1 Jet Propulsion Laboratory California Institute of Technology ; 2 Computer Science Department University of Southern California",
                  "grobid:header_Authors": "Chris A. Mattmann 1, 2 Daniel J. Crichton 1 Nenad Medvidovic 2 Steve Hughes 1",
                  "grobid:header_BeginPage": "-1",
                  "grobid:header_Class": "class org.grobid.core.data.BiblioItem",
                  "grobid:header_Email": "{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov ; {mattmann,neno}@usc.edu",
                  "grobid:header_EndPage": "-1",
                  "grobid:header_Error": "true",
                  "grobid:header_FirstAuthorSurname": "Mattmann",
                  "grobid:header_FullAffiliations": "[Affiliation{name='null', url='null', institutions=[California Institute of Technology], departments=null, laboratories=[Jet Propulsion Laboratory], country='USA', postCode='91109', postBox='null', region='CA', settlement='Pasadena', addrLine='null', marker='1', addressString='null', affiliationString='null', failAffiliation=false}, Affiliation{name='null', url='null', institutions=[University of Southern California], departments=[Computer Science Department], laboratories=null, country='USA', postCode='90089', postBox='null', region='CA', settlement='Los Angeles', addrLine='null', marker='2', addressString='null', affiliationString='null', failAffiliation=false}]",
                  "grobid:header_FullAuthors": "[Chris A Mattmann, Daniel J Crichton, Nenad Medvidovic, Steve Hughes]",
                  "grobid:header_Item": "-1",
                  "grobid:header_Keyword": "Categories and Subject Descriptors D2 Software Engineering, D211 Domain Specific Architectures Keywords OODT, Data Management, Software Architecture",
                  "grobid:header_Keywords": "[D2 Software Engineering, D211 Domain Specific Architectures  (type:subject-headers), Keywords  (type:subject-headers), OODT, Data Management, Software Architecture  (type:subject-headers)]",
                  "grobid:header_Language": "en",
                  "grobid:header_NbPages": "-1",
                  "grobid:header_OriginalAuthors": "Chris A. Mattmann 1, 2 Daniel J. Crichton 1 Nenad Medvidovic 2 Steve Hughes 1",
                  "grobid:header_Title": "A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications",
                  "meta:author": "End User Computing Services",
                  "meta:creation-date": "2006-02-15T21:13:58Z",
                  "meta:save-date": "2006-02-15T21:16:01Z",
                  "modified": "2006-02-15T21:16:01Z",
                  "pdf:PDFVersion": "1.4",
                  "pdf:encrypted": "false",
                  "producer": "Acrobat Distiller 6.0 (Windows)",
                  "resourceName": "ICSE06.pdf",
                  "title": "Proceedings Template - WORD",
                  "xmp:CreatorTool": "Acrobat PDFMaker 6.0 for Word",
                  "xmpTPg:NPages": "10"
              }
          ]
          

          Great work, Sujen Shah. I'm going to commit this now and start work on the Wiki page!
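
          For anyone who wants to script this check rather than eyeball the cURL output, here is a minimal Python sketch. It assumes the tika-server started with the command above is still listening on localhost:9998, that the GROBID config/classpath is in place, and that the third-party requests library is installed; the grobid:* key names are the ones visible in the JSON output above.

          import requests  # third-party HTTP client; assumed to be installed

          # PUT a PDF to the running tika-server /rmeta endpoint (the same endpoint
          # the cURL command above hits) and print only the GROBID-derived keys
          # from the returned JSON array of metadata maps.
          url = "http://localhost:9998/rmeta"
          pdf_path = "ICSE06.pdf"  # any journal-style PDF

          with open(pdf_path, "rb") as f:
              resp = requests.put(
                  url,
                  data=f,
                  headers={"Content-Disposition": "attachment;filename=ICSE06.pdf"},
              )
          resp.raise_for_status()

          for doc in resp.json():  # /rmeta returns a list, one entry per parsed document
              for key, value in sorted(doc.items()):
                  if key.startswith("grobid:"):
                      print(key, "=", value)

          Run against the server above, this should print entries such as grobid:header_Title, grobid:header_FullAuthors, and grobid:header_Abstract, matching the metadata shown in the output.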

          Show
          chrismattmann Chris A. Mattmann added a comment - I got this working! Starting Tika Server java -Dorg.apache.tika.service.error.warn=true -classpath $HOME/git/grobidparser-resources/:$HOME/src/tika-server/target/tika-server-1.11-SNAPSHOT.jar:$HOME/grobid/lib/\* org.apache.tika.server.TikaServerCli --config tika-config.xml cURL command to test curl -T $HOME/git/grobid/papers/ICSE06.pdf -H "Content-Disposition: attachment;filename=ICSE06.pdf" http://localhost:9998/rmeta | python -mjson.tool Output [ { "Author": "End User Computing Services", "Company": "ACM", "Content-Type": "application/pdf", "Creation-Date": "2006-02-15T21:13:58Z", "Last-Modified": "2006-02-15T21:16:01Z", "Last-Save-Date": "2006-02-15T21:16:01Z", "SourceModified": "D:20060215211344", "X-Parsed-By": [ "org.apache.tika.parser.CompositeParser", "org.apache.tika.parser.journal.JournalParser" ], "X-TIKA:content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nProceedings Template - WORD\n\n\nA Software Architecture-Based Framework for Highly \nDistributed and Data Intensive Scientific Applications \n\n \nChris A. Mattmann1, 2 Daniel J. Crichton1 Nenad Medvidovic2 Steve Hughes1 \n\n \n1Jet Propulsion Laboratory \n\nCalifornia Institute of Technology \nPasadena, CA 91109, USA \n\n{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov \n\n2Computer Science Department \nUniversity of Southern California \n\nLos Angeles, CA 90089, USA \n{mattmann,neno}@usc.edu \n\n \nABSTRACT \nModern scientific research is increasingly conducted by virtual \ncommunities of scientists distributed around the world. The data \nvolumes created by these communities are extremely large, and \ngrowing rapidly. The management of the resulting highly \ndistributed, virtual data systems is a complex task, characterized \nby a number of formidable technical challenges, many of which \nare of a software engineering nature. In this paper we describe \nour experience over the past seven years in constructing and \ndeploying OODT, a software framework that supports large, \ndistributed, virtual scientific communities. We outline the key \nsoftware engineering challenges that we faced, and addressed, \nalong the way. We argue that a major contributor to the success of \nOODT was its explicit focus on software architecture. We \ndescribe several large-scale, real-world deployments of OODT, \nand the manner in which OODT helped us to address the domain-\nspecific challenges induced by each deployment. \n\nCategories and Subject Descriptors \nD.2 Software Engineering, D.2.11 Domain Specific Architectures \n\nKeywords \nOODT, Data Management, Software Architecture. \n\n1. INTRODUCTION \nSoftware systems of today are very large, highly complex, \n\noften widely distributed, increasingly decentralized, dynamic, and \nmobile. There are many causes behind this, spanning virtually all \nfacets of human endeavor: desired advances in education, \nentertainment, medicine, military technology, \ntelecommunications, transportation, and so on. \n\nOne major driver of software\u2019s growing complexity is \nscientific research and exploration. Today\u2019s scientists are solving \nproblems of until recently unimaginable complexity with the help \nof software. They also actively and regularly collaborate with \n\ncolleagues around the world, something that has become possible \nonly relatively recently, again ultimately thanks to software. 
They \nare collecting, producing, sharing, and disseminating large \namounts of data, which are growing by orders of magnitude in \nvolume in remarkably short time periods. \n\nIt is this latter problem that NASA\u2019s Jet Propulsion \nLaboratory (JPL) began facing several years ago. Until recently, \nJPL would disseminate data collected by various instruments \n(Earth-based, orbiting, and in outer space) to the interested \nscientists around the United States by \u201cburning\u201d CD-ROMs and \nmailing them via the U.S. Postal Service. In addition to being \nslow, sequential, unidirectional, and lacking interactivity, this \nmethod was expensive, costing hundreds of thousands of dollars. \nFurthermore, the method was prone to security breaches, and the \nexact data distribution (determining which data goes to which \ndestinations) had to be calculated for each individual shipment. It \nhad become increasingly difficult to manage this process as the \nnumber of projects and missions, as well as involved scientists, \ngrew. An even more critical limiting factor became the sheer \nvolume of data that the current (e.g., Planetary Data System, or \nPDS), pending (e.g., Mars Reconnaissance Orbiter, or MRO), and \nplanned (e.g., Lunar Reconnaissance Orbiter, or LRO) missions \nwould produce: from terabytes (PDS), to hundreds of terabytes \n(MRO), to petabytes or more (LRO). Clearly, spending millions \nof dollars just to distribute the data to scientists is impractical. \n\nThis prompted NASA\u2019s Office of Space Science to explore \nconstruction of an end-to-end software framework that would \nlower the cost of distributing and managing scientific data, from \nthe inception of data at a science processing center to its ultimate \narrival on the desks of interested users. Because of increasing data \nvolumes, the framework had to be scalable and have native \nsupport for evolution to hundreds of sites and thousands of data \ntypes. Additionally, the framework had to enable the \nvirtualization of heterogeneous data (and processing) sources, and \nto address wide-scale (national and international) distribution of \ndata. The framework needed to be flexible: it needed to support \nfully automated processing of data throughout its lifecycle, while \nstill allowing interactivity and intervention from an operator when \nneeded. Furthermore because data is itself distributed across \nNASA agencies, any software framework that distributes NASA\u2019s \ndata would require the capability for tailorable levels of security \nand for varying types of users belonging to multiple \norganizations. \n\nThere were also miscellaneous issues of data ownership that \nneeded to be overcome. Ultimately, because NASA\u2019s science data \nis so distributed, the owners of data systems (e.g., a Planetary \n\n \n\nPermission to make digital or hard copies of all or part of this work for \npersonal or classroom use is granted without fee provided that copies are \nnot made or distributed for profit or commercial advantage and that \ncopies bear this notice and the full citation on the first page. To copy \notherwise, or republish, to post on servers or to redistribute to lists, \nrequires prior specific permission and/or a fee. \nICSE06\u2019, May 20\u201328, 2006, Shanghai, China. \nCopyright 2006 ACM 1-58113-000-0/00/0004\u2026$5.00. 
\n \n\n\n\nScience Principal Investigator) feel hard pressed to control their \ndata, as the successful operation and maintenance of their data \nsystems are essential services that they provide. As such, any \nframework that virtualizes science data sources across NASA \nshould be transparent and unobtrusive: it should enable \ndissemination and retrieval of data across data systems, each of \nwhich may have their own external interfaces and services; at the \nsame time, it should enable scientists to maintain and operate their \ndata systems independently. Finally, to lower costs, once the \nframework was built and installed, it needed to be reusable, free, \nand distributable to other NASA sites and centers for use. \n\nOver the past seven years we have designed, implemented \nand deployed a framework called OODT (Object Oriented Data \nTechnology) that has met these rigorous demands. In this paper \nwe discuss the significant software engineering challenges we \nfaced in developing OODT. The primary objective of the paper is \nto demonstrate how OODT\u2019s explicit software architectural basis \nenabled us to effectively address these challenges. In particular, \nwe will detail the architectural decisions we found most difficult \nand/or critical to OODT\u2019s ultimate success. We highlight several \nrepresentative examples of OODT\u2019s use to date both at NASA \nand externally. We contrast our solution with related approaches, \nand argue that a major differentiator of this work, in addition to its \nexplicit architectural foundation, is its native support for \narchitecture-based development of distributed scientific \napplications. \n\n2. SOFTWARE ENGINEERING \nCHALLENGES \n\nTo develop OODT, we needed to address several significant \nsoftware engineering challenges, the bulk of which surfaced in \nlight of the complex data management and distribution issues \nregularly faced within a distributed, large-scale government \norganization such as NASA. In this paper we will focus on nine \nkey challenges: Complexity, Heterogeneity, Location \nTransparency, Autonomy, Dynamism, Scalability, Distribution, \nDecentralization, and Performance. \n\nComplexity \u2013 We envisioned OODT to be a large, multi-site, \nmulti-user, complex system. At the software level, complexity \nranged from understanding how to install, integrate, and manage \nthe software remotely deployed at participating organizations, to \nunderstanding how to manage information such as access \nprivileges and security credentials across both NASA and non-\nNASA sites. There were also complexities at the software \nnetworking layer, including varying firewall capabilities at each \ninstitution, and data repositories that would periodically go offline \nand needed to be remotely restarted. Just understanding the \nvarying types of data held at sites linked together via OODT was \na significant task. Even sites within the same science domain \n(e.g., planetary science) describe similar data sets in decidedly \ndifferent ways. Discerning in what ways these different data \nmodels were common and what attributes of data could be shared, \ndone away with, or amended, was a huge challenge. Finally, the \ndifferent interfaces to data, ranging from third-party, well-\nengineered database management systems, to in-house data \nsystems, ultimately to flat text file-based data was a particularly \ndifficult challenge that we had to hurdle. 
\n\nHeterogeneity \u2013 In order to drive down the data management \ncosts for science missions, the same OODT framework needed to \n\nspan multiple science domains. The domains initially targeted \nwere earth and planetary; this has subsequently been expanded to \nspace, biomedical sciences, and the modeling and simulation \ncommunities. As such, the same core set of OODT software \ncomponents, system designs, and implementation-level facilities \nhad to work across widely varying science domains. \n\nThe data management processes within the organizations that \nuse OODT also added to its heterogeneity. For instance, OODT \ncomponents needed to have interfaces with end users and support \ninteractive sessions, but also with scientific instruments, which \nmost likely were automatic and non-interactive. Scientific \ninstruments could push data to certain components in OODT, \nwhile other OODT components would need to distribute data to \nusers outside of OODT. End-users in some cases wanted to \nperform transformations on the data sent to them by OODT, and \nthen to return the data back into OODT. The framework needed to \nsupport scenarios such as these seamlessly. \n\nMany other constraints also imposed the heterogeneity \nrequirement on OODT. We can group these constraints into two \nmajor categories: \n\u2022 Organizational \u2013 As we briefly alluded above, discipline \n\nexperts who wanted to disseminate their data via OODT \nreally wanted the data to reside at their respective \ninstitutions. This constraint non-negotiable, and significantly \nimpacted the space of technical solutions that we could \ninvestigate for OODT. \n\n\u2022 Technical \u2013 Since OODT had to federate many different data \nholdings and catalogs, we faced the constraints of linking \nthem together and federating very different schemas and \nvarying levels of sophistication in the data system interfaces \n(e.g., flat files, DBMS, web pages). Even those systems \nmanaging data through \u201chigher level APIs\u201d and middleware \n(e.g., RMI, CORBA, SOAP) proved non-trivial to integrate. \nThe constraints enjoined by heterogeneity alone led us to \n\nrealize that the OODT framework would need to draw heavily \nfrom multiple areas. Database systems, although used \nsuccessfully for many years to manage large amounts of data at \nmany sites, lacked the flexibility and interface capability to \nintegrate data from other more crude APIs and storage systems \n(such as a PI-led web site). Databases also did not address the \ndistribution of data and \u201cownership\u201d issues. The advent of the \nweb, although a promising means for providing openness and \nflexible interfaces to data, would not alone address the issues such \nas multi-institutional security and access. Furthermore, its \nrequest/reply nature would not easily handle other distribution \nscenarios, e.g., subscribe/notify. Research in the area of grid \ncomputing [1] has defined \u201cout of the box\u201d services for managing \ndata systems (e.g., GridFTP), but which utilized alone would not \naddress our other challenges (e.g., complexity). \n\nLocation Transparency \u2013 Even though data could potentially \nbe input into and output from the system from many \ngeographically disparate and distributed sites, it should appear to \nthe end-users as if the data flow occurred from a single location. 
\nThis requirement was reinforced by the need to dynamically add \ndata producers and consumers to a system supported by OODT, \nas will be further discussed below. \n\nAutonomy \u2013 When designing the OODT framework, we could \nnot dictate how data providers should store, process, find, evolve, \nor retire their data. Instead, the framework needed to be \n\n\n\ntransparent, allowing data providers to continue with their regular \nbusiness processes, while managing and disseminating their \ninformation unobtrusively. \n\nDynamism \u2013 It is expected that data providers for the most part \nwill be stable organizations. However, there are cases in which \nnew data producing (occasionally) and consuming (frequently) \nnodes will need to be brought on-line. Back-end data sources need \nto be pluggable, with little or no direct impact on the end-user of \nthe OODT system, or on the organization that owns the data \nsource. New end-users (or client hosts) should also be able to \n\u201ccome and go\u201d without any disruption to the rest of the system. In \nthe end, we realized this meant the whole infrastructure must be \ncapable of some level of dynamism in order to meet these \nconstraints. \n\nScalability \u2013 OODT needed to manage large volumes of data, \nfrom at least hundreds of gigabytes at its inception to the current \nmissions which will produce hundreds of terabytes. The \nframework needed to support at least dozens of institutional data \nproviders (which themselves may have subordinate data system \nproviders), dozens of user types (e.g., scientists, teachers, \nstudents, policy makers), thousands of users, hundreds of \ngeographic sites, and thousands of different data types to manage \nand disseminate. \n\nDistribution \u2013 The framework should be able to handle the \nphysical distribution of data across sites nationally and \ninternationally, and ultimately the physical distribution of the \nsystem interfaces which provide the data. \n\nDecentralization \u2013 Each site may have its own data \nmanagement processes, interfaces and data types, which were \noperating independently for some time. We needed to devise a \nway of coordinating and managing data between these data sites \nand providers without centralizing control of their systems, or \ninformation. In other words, the requirement was that the different \nsites retain their full autonomy, and that OODT adapts instead. \n\nPerformance \u2013 Despite its scale and interaction with many \norganizations, data systems, and providers, OODT still needed to \nperform under stringent demands. Queries for information needed \nto be serviced quickly: in many cases response time under five \nseconds was used as a baseline. Additionally, OODT needed to be \noperational whenever any of the participating scientists wanted to \nlocate, access, or process their data. \n\n3. BACKGROUND AND RELATED WORK \nSeveral large-scale software technologies that distribute, \n\nmanage, and process information have been constructed over the \npast decade. Each of these technologies falls into one or more of \nfour distinct areas: grid-computing, information integration, \ndatabases, and middleware. In this section, we briefly survey \nrelated projects in each of these areas and compare their foci and \naccomplishments to those of OODT. Additionally, since a major \nfocal point of OODT is software architecture, we start out by \nproviding some brief software architecture background and \nterminology to set the context. 
\n\nTraditionally, software architecture has referred to the \nabstraction of a software system into its fundamental building \nblocks: software components, their methods of interaction (or \nsoftware connectors), and the governing rules that guide the \n\ncomposition of software components and software connectors \n(configurations) [2, 3]. Software architecture has been recognized \nin many ways to be the linchpin of the software development \nprocess. Ideally, the software requirements are reflected within \nthe software system\u2019s components and interactions; the \ncomponents and interactions are captured within the system\u2019s \narchitecture; and the architecture is used to guide the design, \nimplementation, and evolution of the system. Design guidelines \nthat have been proven effective are often codified into \narchitectural styles, while specific architectural solutions (e.g., \nconcrete system structures, component types and interfaces, and \ninteraction facilities) within specific domains are captured as \nreusable reference architectures. \n\nGrid computing deals with highly complex and distributed \ncomputational problems and large volume data management \ntasks. Massive parallel computation, distributed workflow, and \npetabyte scale data distribution are only a small cross-section of \nthe grid\u2019s capabilities. Grid projects are usually broken down into \ntwo areas. Computational grid systems are concerned with \nsolving complex scientific problems involving supercomputing \nscale resources dispersed across various organizational \nboundaries. The representative computational grid system is the \nGlobus Toolkit [4]. Globus is built on top of a web-services [5] \nsubstrate and provides resource management components, \ndistributed workflow and security infrastructure. Other \ncomputational grid systems provide similar capabilities. For \nexample, Alchemi [6] is a .NET-based grid technology that \nsupports distributed job scheduling and an object-oriented grid \ndevelopment environment. JCGrid [7] is a light weight, Java-\nbased open source computational grid project whose goal is to \nsupport distributed job scheduling and the splitting of CPU-\nintensive tasks across multiple machines. \n\nThe other class of grid systems, Data grids, is involved in the \nmanagement, processing, and distribution of large data volumes to \ndisbursed and heterogeneous users, user types, and geographic \nlocations. There are several major data grid projects. The LHC \nComputing Grid [8] is a system whose main goal is to provide a \ndata management and processing infrastructure for the high \nenergy physics community. The Earth System Grid [9] is geared \ntowards supporting climate modeling research and distribution of \nclimate data sets and metadata to the climate and weather \nscientific community. \n\nTwo independently conducted studies [10, 11] have \nidentified three key areas that the current grid implementations \nmust address more effectively in order to promote data and \nsoftware interoperability: (1) formality in grid requirements \nspecification, (2) rigorous architectural description, and (3) \ninteroperability between grid solutions. As we will discuss in this \npaper, our work to date on OODT has the potential to be a \nstepping stone in each of these areas: its explicit focus on \narchitectures for data-intensive, \u201cgrid-like\u201d systems naturally \naddresses the three concerns. 
\n\nThere have been several well-known efforts within the AI \nand database communities that have delved into the topic of \ninformation integration, or the shared access, search, and retrieval \nof distributed, heterogeneous information resources. Within the \npast decade, there has been significant interest in building \ninformation mediators that can integrate information from \nmultiple data sources. Mediators federate information by querying \nmultiple data sources, and fusing back the gathered results. The \nrepresentative systems using this approach include TSIMMS [12], \n\n\n\nInformation Manifold [13], The Internet Softbot [14], InfoSleuth \n[15], Infomaster [16], DISCO [17], SIMS [18] and Ariadne [19]. \nEach of these approaches focuses on fundamental algorithmic \ncomponents of information integration: (1) formulating \nexpressive, efficient query languages (such as Theseus [20]) that \nquery many heterogeneous data stores; (2) accurately and reliably \ndescribing both global, and source data models (e.g. the Global-\nas-view [12] and Local-as-view [21] approaches); (3) providing a \nmeans for global-to-source data model integration; and (4) \nimproving queries and deciding which data sources to query (e.g. \nquery reformulation [22] and query rewriting [22, 23]). \n\nHowever, these algorithmic techniques fail to address the \nsoftware engineering side of information integration. For instance, \nexisting literature fails to answer questions such as, which of the \ncomponents in the different systems\u2019 architectures are common; \nhow can they be reused; which portions of their implementations \nare tied to (which) software components; which software \nconnectors are the components using to interact; are the \ninteraction mechanisms replaceable (e.g., can a client-server \ninteraction in Ariadne become a peer-to-peer interaction); and so \non. Additionally, none of the above related mediator systems have \nformalized a process for designing, implementing, deploying, and \nmaintaining the software components belonging to each system. \n\nSeveral middleware technologies such as CORBA, \nEnterprise Java Beans [24], Java RMI [25], and more recently \nSOAP and Web services [5] have been suggested as \u201csilver \nbullets\u201d that address the problem of integrating and utilizing \nheterogeneous software computing and data resources. Each of \nthese technologies provides three basic services: (1) an \n\nimplementation and composition framework for software \ncomponents, possibly written in different languages but \nconforming to a specific middleware interface; (2) a naming \nregistry used to locate components; and (3) a set of basic services \nsuch as (un-)marshalling of data, concurrency, distribution and \nsecurity. \n\nAlthough middleware is very useful \u201cglue\u201d that can connect \nsoftware components written in different languages or deployed \nin heterogeneous environments, middleware technologies do not \nprovide any \u201cout of the box\u201d services that deal with computing \nand data resource management across organizational boundaries \nand across computing environments at a national scale. These \nkinds of services usually have to be engineered into the \nmiddleware itself. We should note that in grid computing such \nservices are explicitly called out and provided at a higher layer of \nabstraction. 
In fact, the combination of these higher-level grid \nservices and an underlying middleware platform is typically \nreferred to as a \u201cgrid technology\u201d [11]. \n\n4. OODT ARCHITECTURE \nOODT\u2019s architecture is a reference architecture that is \n\nintended to be instantiated and tailored for use across science \ndomains and projects. The reference architecture comprises \nseveral components and connectors. A particular instance of this \nreference architecture, that of NASA\u2019s planetary data system \n(PDS) project, is shown in Figure 1. OODT is installed on a given \nhost inside a \u201csandbox\u201d, and is aware of and interacts only with \nthe designated external data sources outside its sandbox. OODT\u2019s \n\nm\nessaging layer (H\n\nTTP)\n\n\u2026\n.. \u2026..\n\n \nFigure 1. The Planetary Data System (PDS) OODT Architecture Instantiation \n\n\n\ncomponents are responsible for delivering data from \nheterogeneous data stores, identifying and locating data within the \nsystem, and ingesting and processing data into underlying data \nstores. The connectors are responsible for integrating OODT with \nheterogeneous data sources; providing reliable messaging to the \nsoftware components; marshalling resource descriptions and \ntransferring data between components; transactional \ncommunication between components; and security related issues \nsuch as identification, authorization, and authentication. In this \nsection, we describe the guiding principles behind the reference \narchitecture. We then describe each of the OODT reference \ncomponents and connectors in detail. In Section 5, we describe \nspecific instantiations of the reference architecture in the context \nof several projects that are using OODT. \n\n4.1 Guiding Principles \nThe software engineering challenges discussed in Section 2 \n\nmotivated and framed the development of OODT. Conquering \nthese challenges led us to a set of four guiding principles behind \nthe OODT reference architecture. \n\nThe first guiding principle is division of labor. Each \ncapability provided by OODT (e.g., processing, ingestion, search, \nand retrieval of data, access to heterogeneous data, and so on) is \ncarefully divided among separate, independent architectural \ncomponents and connectors. As will be further detailed below, the \nprinciple is upheld through OODT\u2019s rigorous separation of \nconcerns, and modularity enforced by explicit interfaces. This \nprinciple addresses the complexity, heterogeneity, dynamism, and \ndecentralization challenges. \n\nClosely related to the preceding principle is technology \nindependence. This principle involves keeping up-to-date with the \nevolution of software technology (both in-house and third-party), \nwhile avoiding tying the OODT architecture to any specific \nimplementation. By allowing us to select the technology most \nappropriate to a given task or specific need, this principle helps us \nto address the challenges of complexity, scalability, security, \ndistribution, location transparency, performance, and dynamism. \nFor instance, OODT\u2019s initial reference implementation used \nCORBA as the substrate for its messaging layer connector. When \nthe CORBA vendor decided to begin charging JPL significant \nlicense fees (thus violating NASA\u2019s objective of producing a \nsolution that would be free to its users), the principle of \ntechnology independence came into play. 
Because the OODT \nmessaging layer connector supports a wrapper interface around \nthe lower-level distribution technology, we were able to replace \nour initial CORBA-based connector with one using Java\u2019s open \nsource RMI middleware, and redeploy the new connector to the \nOODT user sites, within three person days. \n\nAnother guiding principle of OODT is the distinguishing of \nmetadata as a first-class citizen in the reference architecture, and \nseparating metadata from data. The job of metadata (i.e., \u201cdata \nabout data\u201d) is to describe the data universe in which the system \nis operating. Since OODT is meant to be a technology that \nintegrates diverse data sources, this data universe is highly \nheterogeneous and possibly dynamic. Metadata in OODT is \nmeant to catalog information, allowing a user to locate and \ndescribe the actual data in which she is interested. On the other \nhand, the job of data in OODT is to describe physical or scientific \nphenomena; it is the ultimate end user product that an OODT \nsystem should deliver. This principle helps to address the \n\nchallenges of heterogeneity, autonomy of data providers, and \ndecentralization. \n\nSeparating the data model from the software is another key \nprinciple behind the reference architecture. Akin to ontology/data-\ndriven systems, OODT components should not be tied to the data \nand metadata that they manipulate. Instead, the components \nshould be flexible enough to understand many (meta-)data models \nused across different scientific domains, without reengineering or \ntailoring of the component implementations. This principle helps \nto address the challenges of complexity and heterogeneity. \n\nThese four guiding principles are reified in a reference \narchitecture comprising four pairs of component types and two \nclasses of connectors organized in a canonical structure. One \ninstantiation of the reference architecture reflecting the canonical \nstructure is depicted in Figure 1. Each OODT architectural \nelement (component and connector) serves a specific purpose, \nwith its functionality exported through a well-defined interface. \nThis supports OODT\u2019s constant evolution, allowing us to add, \nremove, and substitute, if necessary dynamically (i.e., at runtime), \nelements of a given type. It also allows us to introduce flexibility \nin the individual instances of the reference architecture while, at \nthe same time, controlling the legal system configurations. \nFinally, the explicit connectors and well-defined component \ninterfaces allow OODT in principle to integrate with a wide \nvariety of third-party systems (e.g., [26]). The outcome of the \nguiding principles (described above) and design decisions \n(detailed below) is an architecture that is \u201ceasy to build, hard to \nbreak\u201d. \n\n4.2 OODT Components \n4.2.1 Product Server and Product Client \n\nThe Product Server is used to retrieve data from \nheterogeneous data stores. The product server accepts a query \nstructure that identifies a set of zero or more products which \nshould be returned the issuer of the query. A product is a unit of \ndata in OODT and represents anything that a user of the system is \ninterested in retrieving: a JPEG image of Mars, an MS Word \ndocument, a zip file containing text file results of a cancer study, \nand so on. Product servers can be located at remote data sites, \ngeographically and/or institutionally disparate from other OODT \ncomponents. 
Alternatively, product servers can be centralized, \nlocated at a single site. The objective of the product server is to \ndeliver data from otherwise heterogeneous data stores and \nsystems. As long as a data store (or system) provides some kind \nof access interface to get its data, a product server can \u201cwrap\u201d \nthose interfaces with the help of Handler connectors described in \nSection 4.3 below. \n\nThe Product Client component communicates with a product \nserver via the Messaging Layer connectors described in Section \n4.3. A product client resides at the end-user\u2019s (e.g., scientist\u2019s) \nsite. It must know the location of at least one product server, and \nthe query structure that identifies the set of products that the user \nwants to retrieve. At the same time, it is completely insulated \nfrom any changes in the physical location or actual representation \nof the data; its only interface is to the product server(s). Many \nproduct clients may communicate with the same product server, \nand many product servers can return data to the same product \nclient. This adds flexibility to the architecture without introducing \nunwanted long-term dependencies: a product client can be added, \n\n\n\nremoved, or replaced with another one that depends on different \nproduct servers, without any effect on the rest of the architecture. \n\n4.2.2 Profile Server and Profile Client \nThe Profile Server manages resource description \n\ninformation, i.e., metadata, in a system built with OODT. \nResource description information is divided into three main \ncategories: \n\u2022 Housekeeping Information \u2013 Metadata such as ID, Last \n\nModified Date, Last Revised By. This information is kept \nabout the resource descriptions themselves and is used by the \nprofile server to inventory and catalog resource descriptions. \nThis is a fixed set of metadata. \n\n\u2022 Resource Information \u2013 This includes metadata such as Title, \nAuthor, Creator, Publisher, Resource Type, and Resource \nLocation. This information is kept for all the data in the \nsystem, and is an extended version of the Dublin Core \nMetadata for describing electronic resources [27]. This is \nalso a fixed set of metadata. \n\n\u2022 Domain-Specific Information \u2013 This includes metadata \nspecific to a particular data domain. For instance, in a cancer \nresearch system this may include metadata such as Blood \nSpecimen Type, Site ID, and Protocol/Study Description. \nThis set of metadata is flexible and is expected to change. \n\nAs with product servers, profile servers can be decentralized at \nmultiple sites or centralized at a single site. The objective of the \nprofile server is to deliver metadata that gives a user enough \ninformation to locate the actual data within OODT regardless of \nthe underlying system\u2019s exact configuration, and degrees of \ncomplexity and heterogeneity; the user then retrieves the data via \none or more product servers. Because profile servers do not serve \nthe actual data, they need not have a direct interface to the data \nthat they describe. In addition to the complete separation of duties \nbetween profile and product servers, this ensures their location \nindependence, allows their separate evolution, and minimizes the \neffects of component and/or network failures in an OODT system. \n\nProfile Client components communicate with profile servers \nover the messaging layer connectors. 
The client must know the \nlocation of the profile server, and must provide a query that \nidentifies the metadata that a user is interested in retrieving. There \ncan be many profile clients speaking with a single profile server, \nand many profile servers speaking with a single profile client. \nThe architectural effects are analogous to those in the case of \nproduct clients and servers. \n\n4.2.3 Query Server and Query Client \nThe Query Server component provides an integrated search \n\nand retrieval capability for the OODT reference architecture. \nQuery servers interact with profile and product servers to retrieve \nmetadata and data requested by system users. A query server is \nseeded with an initial set of references to profile servers. Upon \nreceiving a query from a user, the query server passes it along to \neach profile server from its list, and collects the metadata \nreturned. Part of this metadata is a resource location (recall \nSection 4.2.2) in the form of a URI [28]. A URI can be a link to a \nproduct server, to a web site with the actual data, or to some \nexternal data providing system. This directly supports \nheterogeneity, location transparency, and autonomy of data \nproviders in OODT. \n\nAnother novel aspect of OODT\u2019s architecture is that if a \nprofile server is unable to service the query, or if it believes that \n\nother profile servers it is aware of may contain relevant metadata, \nit will return the URIs of those profile servers; the query server \nmay then forward the query to them. As a result, query servers are \ncompletely decoupled from product servers (and from any \n\u201cexposed\u201d external data sources), and are also decoupled from \nmost of the profile servers. In turn, this lessens the complexity of \nimplementing, integrating, and evolving query servers. Once the \nresource metadata is returned, the query server will either allow \nthe user herself to use the supplied URIs to find the data in which \nshe was interested (interactive mode), or it will retrieve, package, \nand deliver the data to the user (non-interactive mode). As with \nthe product and profile servers, query servers can be centrally \nlocated at a single site, or they can be decentralized across \nmultiple sites. \n\nQuery Client components communicate with the query \nservers. The query client must provide a query server with a query \nthat identifies the data in which the user is interested, and it must \nset a mode for the query server (interactive or non-interactive \nmode). The query client may know the location of the query \nserver that it wants to contact, or it may rely on the messaging \nlayer connector to route its queries to one or more query servers. \n\n4.2.4 Catalog and Archive Server and Client \nThe Catalog and Archive Server (CAS) component in OODT \n\nis responsible for providing a common mechanism for ingestion \nof data into a data store, including any processing required as a \nresult of ingestion. For instance, prior to the ingestion of a poor-\nresolution image of Mars, the image may need to be refined and \nthe resolution improved. CAS would handle this type of \nprocessing. Any data ingested into CAS must include associated \nmetadata information so that the data can be cataloged for search \nand retrieval purposes. Upon ingestion, the data is sent to a data \nstore for preservation, and the corresponding metadata is sent to \nthe associated catalog. 
The data store and catalog need not be \nlocated on the same host; they may be located on remote sites \nprovided there is an access mechanism to store and retrieve data \nfrom each. The goal of CAS is to streamline and standardize the \nprocess of adding data to an OODT-aware system. Note that a \nsystem whose data stores were populated prior to its integration \ninto OODT can still use CAS for its new data. Since the CAS \ncomponent populates data stores and catalogs with both data and \nmetadata, specialized product and profile server components have \nbeen developed to serve data and metadata from the CAS backend \ndata stores and catalogs more efficiently. Any older data can still \nbe served with existing product and profile servers. \n\nThe Archive Client component communicates with CAS. The \narchive client must know the location of the CAS component, and \nmust provide it with data to ingest. Many archive clients can \ncommunicate with a single CAS component, and vice versa. Both \nthe archive client and CAS components are completely \nindependent of the preceding three pairs of component types in \nthe OODT reference architecture. \n\n4.3 OODT Connectors \n4.3.1 Handler Connectors \n\nHandler connectors are responsible for enabling the \ninteraction between OODT\u2019s components and third-party data \nstores. A handler connector performs the transformation between \nan underlying (meta-)data store\u2019s internal API for retrieving data \nand its (meta-)data format on the one hand, and the OODT system \n\n\n\non the other. Each handler connector is typically developed for a \nclass of data stores and metadata systems. For example, for a \ngiven DBMS such as Oracle, and a given internal representation \nschema for metadata, a generic Oracle handler connector is \ntypically developed and then reused. Similarly, for a given \nfilesystem scheme for storing data, a generic filesystem handler \nconnector is developed and reused across like filesystem data \nstores. \n\nEach profile server and product server relies on one or more \nhandler connectors. Profile servers use profile handlers, and \nproduct servers use query handlers. Handler connectors thereby \ncompletely insulate product and profile servers from the third-\nparty data stores. Handlers also allow for different types of \ntransformations on (meta-)data to be introduced dynamically \nwithout any effect on the rest of OODT components. For \nexample, a product server that distributes Mars image data might \nbe serviced by a query handler connector that returns high-\nresolution (e.g., 10 GB) JPEG image files of the latest summit \nclimbed by a Mars rover; if the system ends up experiencing \nperformance problems, another handler may be (temporarily) \nadded to return lower-resolution (e.g., 1 MB) JPEG image files of \nthe same scenario. Likewise, a profile server may have two \nprofile handler connectors, one that returns image-quality \nmetadata (e.g., resolution and bits/pixel) and another that returns \ninstrument metadata about Mars rover images (e.g., instrument \nname or image creation date). \n\n4.3.2 Messaging Layer Connector \nThe Messaging Layer connector is responsible for \n\nmarshalling data and metadata between components in an OODT \nsystem. The messaging layer must keep track of the locations of \nthe components, what types of components reside in which \nlocations, and if components are still running or not. 
Additionally, \nthe messaging layer is responsible for taking care of any needed \nsecurity mechanisms such as authentication against an LDAP \ndirectory service, or authorization of a user to perform certain \nrole-based actions. \n\nThe messaging layer in OODT provides synchronous \ninteraction among the components, and some delivery guarantees \non messages transferred between the software components. \nTypically in any large-scale data system, the asynchronous mode \nof interaction is not encouraged because partial data transfers are \nof no use to users such as scientists who need to make analysis on \nentire data sets. \n\nThe messaging layer supports communication between any \nnumber of connected OODT software components. In addition, \nthe messaging layer natively supports connections to other \nmessaging layer connectors as well. This provides us with the \nability to extend and adapt an OODT system\u2019s architecture, as \nwell as easily tailor the architecture for any specific interaction \nneeds (e.g., by adding data encryption and/or compression \ncapabilities to the connector). \n\n5. EXPERIENCE AND CASE STUDIES \nThe OODT framework has been used both within and \n\noutside NASA. JPL, NASA\u2019s Ames Research Center, the \nNational Institutes of Health (NIH), the National Cancer Institute \n(NCI), several research universities, and U.S. Federally Funded \nResearch and Development Centers (FFRDCs) are all using \nOODT in some form or fashion. OODT is also available for \ndownload through a large open-source software distributor [29]. \n\nOODT components are found in planetary science, earth science, \nbiomedical, and clinical research projects. In this section, we \ndiscuss our experience with OODT in several representative \nprojects within these scientific areas. We compare and contrast \nhow the projects were handled before and after OODT. We sketch \nsome of the domain-specific technical challenges we encountered \nand identify how OODT helped to solve them. \n\nTo begin using OODT, a user designs a deployment \narchitecture from one or more of the reference OODT \ncomponents (e.g., product and profile servers), and the reference \nOODT connectors. The user must determine if any existing \nhandler connectors can be reused, or if specialized handler \nconnectors need to be developed. Once all the components are \nready, the user has two options for deploying her architecture to \nthe target hosts: (1) the user may translate her design into a \nspecialized OODT deployment descriptor XML file, which can \nthen be used to start each program on the target host(s); or (2) the \nuser can deploy her OODT architecture using a remote server \ncontrol component, adding components, and connectors via a \ngraphical user interface. The GUI allows the user to send \ncomponent and connector code to the target hosts, to start, shut-\ndown, and restart the components and connectors, and to monitor \ntheir health during execution. \n\n5.1 Planetary Data System \nOne of the flagship deployments of OODT has been for \n\nNASA\u2019s Planetary Data System (PDS) [30]. PDS consists of \nseven \u201cdiscipline nodes\u201d and an engineering and management \nnode. Each node resides at a different U.S. university or \ngovernment agency, and is managed autonomously. \n\nFor many years PDS distributed its data and metadata on \nphysical media, primarily CD-ROM. 
Each CD-ROM was \nformatted a according to a \u201chome-grown\u201d directory layout \nstructure called an archive volume, which later was turned into a \nPDS standard. PDS metadata was constructed using a common, \nwell-structured set of 1200 metadata elements, such as Target \nName and Instrument Type, that were identified from the onset of \nthe PDS project by planetary scientists. Beginning in the late \n1990s the advent of the WWW and the increasing data volumes of \nmissions led NASA managers to impose a new paradigm for \ndistributing data to the users of the PDS: data and metadata were \nnow to be distributed electronically, via a single, unified web \nportal. The web portal and accompanying infrastructure to \ndistribute PDS data and metadata was built in 2001 using OODT \nin the manner depicted in Figure 1. \n\nWe faced several technical challenges deploying OODT to \nPDS. PDS data and metadata were highly distributed, spanning all \nseven of the scientific discipline nodes across the country. \nAlthough the entire data volume across PDS at the time was \naround 7 terabytes, it was estimated that the volume would grow \nto 10 terabytes by 2004. Consequently, the system needed to be \nscalable and respond to large growth spurts caused by new data \nproducing missions. The flexibility and modularity of the OODT \nproduct and profile server components were particularly useful in \nthis regard. Using a product and/or profile server, each new data \nproducing system in the PDS could be dynamically \u201cplugged in\u201d \nto the existing PDS infrastructure that we constructed, without \ndisturbing existing components and processes. \n\nWe also faced the problem of heterogeneity. Almost every \nnode within PDS had a different operating system, ranging from \nLinux, to Windows, to Solaris, to Mac OS X. Each node \n\n\n\nEDRN \nQuery \nServer\n\nm\nessaging layer (R\n\nM\nI)\n\nProduct \nServer\n\nDBMS \n(Specimen \nMetadata)\n\nmoffitt.usf.edu (win2k server)\n\nMS SQL DBMS \n(Specimen \nProducts)\n\nSpecimen \nQuery \n\nHandler\n\nSpecimen Profile \nHandler (MS SQL)\n\nOODT \u201cSandbox\u201d\n\nOODT \u201cSandbox\u201d\n\nProduct \nServer\n\nProfile \nServer\n\nanother.erne.server (AnotherOS)\n\nCAS Profile \nHandler\n\nCAS Query \nHandler\n\nOODT \u201cSandbox\u201d\nCatalog and \n\nArchive Server\n\nLung Images \n(Filesystem)\n\nOther \nApplications\n\nginger.fhcrc.org (win2k)\n\nOther Applications\n\nERNE Web \nPortal\n\n(Query Client)\n\nuser host\n\nProfile \nClient\n\nProduct \nClient\n\nProfile ServerOther \nApplications\n\nOther \nApplications\n\nOther Applications\n\nOther Applications\n\nSpecimen Inventory\n(MS SQL)\n\nOther Applications\n\nOther Applications\n\npds.jpl.nasa.gov (Linux)\nLegend:\n\nOODT \nComponent\n\nData/metadata \nstore\n\nOODT Connector Hardware \nhost\n\nOODT \ncontrolled \nportion of \nmachine\n\ndata/control flow\nBlack Box\n\n \n \n\nFigure 2. The Early Detection Research Network (EDRN) OODT Architecture Instantiation \n\nmaintained its own local catalog system. Although each node in \nPDS had different file system implementations dictated by their \nOS, each node stored their data and metadata according to the \narchive volume structure. Because of this, we were able to write a \nsingle, reusable PDS Query Handler which could serve back \nproducts from a PDS archive volume structure located on a file \nsystem. Plugging into each node\u2019s catalog system proved to be a \nsignificant challenge. 
For nearly all of the nodes, specialized \nprofile handler connectors were constructed to interface with the \nunderlying catalog systems, which ranged from static text files \ncalled PDS label files to dynamic web site inventory systems \nconstructed using Java Server Pages. Because each of the catalogs \ntagged PDS data using the common set of 1200 elements, we \nwere able to share much of the code base among the profile \nhandler connectors, ultimately only changing the portion of the \ncode that made the particular JSP page call, or read the selected \nset of metadata from the label file. The entire code base of the \nPDS including all the domain specific handler connectors is only \nslightly over 15 KSLOC, illustrating the high degree of \nreusability provided by the OODT framework. \n\n5.2 Early Detection Research Network \nOODT is also supporting the National Cancer Institute\u2019s \n\n(NCI) Early Detection Research Network (EDRN). EDRN is a \ndistributed research program that unites researchers from over \nthirty institutions across the United States. Tens of thousands of \nscientists participate in the EDRN. Each institution is focused on \nthe discovery of cancer biomarkers as indicators for disease [31]. \n\nA critical need for the EDRN is an electronic infrastructure to \nsupport discovery and validation of these markers. \n\nIn 2001 we worked with the EDRN program to develop the \nfirst component of their electronic biomarker infrastructure called \nthe EDRN Resource Network Exchange (ERNE). The (partial) \ncorresponding architecture is depicted in Figure 2. One of the \nmajor goals of ERNE was to provide real-time access to bio-\nspecimen information across the institutions of the EDRN. Bio-\nspecimen information typically consisted of gigabytes of \nspecimen images, and location and contact metadata for obtaining \nthe specimen from its origin study institution. The previous \nmethod of obtaining bio-specimen information was very human-\nintensive: it involved phone calls and some forms of electronic \ncommunication such as email. Specimen information was not \nsearchable across institutions participating in the EDRN. The bio-\nspecimen catalogs were largely out-of-date, and out-of-synch with \ncurrent holdings at each participating institution. \n\nOne of the initial technical challenges we faced with EDRN \nwas scale. The EDRN was over three times as large as the PDS. \nBecause of this we chose to target ten institutions initially, rather \nthan the entire set of thirty one. Again, OODT\u2019s modularity and \nscalability came into play as we could phase deployment at each \ndeployment institution. As we instantiated new product, profile, \nquery, and archive servers at each institution, we could do so \nwithout interrupting any existing OODT infrastructure already \ndeployed. \n\nAnother challenge that we encountered was dealing with \neach participating site\u2019s Institutional Review Board (IRB). An \nIRB is required to review and ensure compliance of projects with \n\n\n\nfederal laws related to working with data from research projects \ninvolving human subjects. To satisfy the IRB, any OODT \ncomponents deployed at an EDRN site had to provide an adequate \nsecurity capability in order to get approval to share the data \nexternally from an institution. OODT\u2019s separation of data and \nmetadata explicitly allowed us to satisfy this requirement. 
We \ndesigned ERNE so that each institution could remain in control of \ntheir specimen holding data by instantiating product server \ncomponents at each site, rather than distributing the information \nacross ERNE which would have violated the IRB agreements. \n\nAnother significant challenge we faced in developing ERNE \nwas lack of a consistent metadata model for each ERNE site. We \nwere forced to develop a common specimen metadata model and \nthen to create specific mappings to link each local site to the \ncommon model. OODT aided us once again in this endeavor as \nthe common mappings we developed were easily codified into a \nquery handler connector, and reused across each ERNE site. \n\nThe entire code base of ERNE, including all its specialized \nhandler connectors is only slightly over 5.3 KSLOC, highlighting \nthe high degree of reusability of the shared framework code base \nand the handler code base. \n\n \n\n5.3 Science Processing Systems \nOODT has also been deployed in several science processing \n\nsystem missions both, operational and under development. Due to \nspace limitations, we can only briefly summarize each of the \nOODT deployments in these systems. \n\nSeaWinds, a NASA-funded earth science instrument flying \non the Japanese ADEOS-II spacecraft, used the OODT CAS \ncomponent as a workflow and processing component for its \nProcessing and Analysis Center (SeaPAC). SeaWinds produced \nseveral gigabytes of data during its six year mission. CAS was \nused to control the execution and data flow of mission-specific \ndata processor components, which calibrated and created derived \ndata products from raw instrument data, and archived those \nproducts for distribution into the data store managed by CAS. A \nmajor challenge we faced during the development of SeaPAC was \nthat the processor components were developed by a group \noutside of the SeaWinds project. We had to provide a mechanism \nfor integrating their source code into the OODT SeaPAC \nframework. OODT\u2019s separation of concerns allowed us to address \nthis issue with relative ease: once the data processors were \nfinished, we were able wrap and tailor them internally within \nCAS, without disturbing the existing SeaPaC infrastructure. \n\nThe success of the CAS within SeaWinds led to its reuse on \nseveral different missions. Another earth science mission called \nQuikSCAT retrofitted and replaced some of their existing \nprocessing components with CAS, using the SeaWinds experience \nas an example. The Orbiting Carbon Observatory (OCO) mission \nthat will fly in 2009, and that is currently under development, is \nalso utilizing CAS to ingest and process existing FTS CO2 \nspectrometer data from earth-based instruments. The James Web \nTelescope (JWT) is using the CAS for to implement its workflow \nand processing capabilities for astrophysics data and metadata. \nEach of these science processing systems will face similar \ntechnical challenges, including separation of concerns between \nthe actual processing framework and the developers writing the \nprocessor code, the volume of data that must be handled by the \nprocessing system (OCO is projected to produce over 150 \nterabytes), and the flexibility and tailorability of the workflow \n\nneeded to process the data. We believe that OODT is uniquely \npositioned to address these difficult challenges. 
\n\n5.4 Computer Modeling Simulation and \nVisualization \n\nOODT has also been deployed to aid the Computer \nModeling Simulation and Visualization (CMSV) community at \nJPL, by linking together several institutional model repositories \nacross the organizations within the lab, and creating a web portal \ninterface to query the integrated model repositories. We \ndeveloped specialized profile server components that locate and \nlink to different model resources across JPL, such as power \nsubsystem models of the Mars Exploration Rovers (MER), CAD-\ndrawing models of different spacecraft assembly parts, and \nsystems architecture models for engineering and design of \nspacecraft. Each of these different model types lived in separate \nindependent repositories across JPL. For instance, the CAD \nmodels were stored in a commercial product called TeamCenter \nEnterprise [32], while the power and systems architecture models \nwere stored in a commercial product called Xerox Docushare \n[33]. \n\nTo integrate these model repositories for CMSV, we had to \nderive a common set of metadata across the wide spectrum of \ndifferent model types that existed at JPL. OODT\u2019s separation of \ndata from metadata allowed us to rapidly instantiate our common \nmetadata model once we developed it, by constructing specialized \nprofile handler connectors that mapped each repository\u2019s local \nmodel to the common model. Reusability levels were high across \nthe connectors, resulting in an extremely small code base of 2.57 \nKSLOC. \n\nAnother challenge in light of this mapping activity was \ninterfacing with the APIs of the underlying model repositories. In \nthe above two cases, the APIs were commercial products, and \npoorly documented. In some cases, such as the Docushare \nrepository, the APIs did not fully conform to their stated \nspecifications. The division of labor amongst OODT components \ncame into play on this task. It allowed us to focus on deploying \nthe rest of the OODT supporting infrastructure, such as the web \nportal, and the profile handler connectors, and not getting stalled \nwaiting for the support teams from each of the commercial \nvendors to debug our API problems. Once the OODT CMSV \ninfrastructure was deployed, the modeling and simulation \ncommunity at JPL immediately began adopting it and sharing \ntheir models across the lab. During the past year, the system has \nreceived around 40,000 hits on the web portal, and over 9,000 \nqueries for models. \n\n6. CONCLUSIONS \nWhen the need arose at NASA seven years ago for a data \n\ndistribution and management solution that satisfied the formidable \nrequirements outlined in this paper, it was not clear to us initially \nhow to approach the problem. On the surface, several applicable \nsolutions already existed (middleware, information integration \nsystems, and the emerging grid technologies). Adopting one of \nthem seemed to be a preferable path because it would have saved \nus precious time. However, upon closer inspection we realized \nthat each of these options could be instructive, but that none of \nthem solved the problem we were facing (and that even some of \nthese technologies themselves were facing). \n\nThe observation that directly inspired OODT was that we \nwere dealing with software engineering challenges, and that those \n\n\n\nchallenges naturally required a software engineering solution. 
\nOODT is a large, complex, dynamic system, distributed across \nmany sites, servicing many different users, and classes of users, \nwith large amounts of heterogeneous data, possibly spanning \nmultiple domains. Software engineering research and practice \nboth suggest that success in developing such a system will be \ndetermined to a large extent by the system\u2019s software \narchitecture. It therefore became imperative that we rely on our \nexperience within the domain of data-intensive systems (e.g., \nJPL\u2019s PDS project), as well as our study of related research and \npractice, in order to develop an architecture for OODT that will \naddress the challenges we discussed in Section 2. Once the \narchitecture was designed and evaluated, OODT\u2019s initial \nimplementation and its subsequent adaptations followed naturally. \n\nAs OODT\u2019s developers we are heartened, but as software \nengineering researchers and practitioners disappointed, that \nOODT still appears to be the only system of its kind. The \nintersection of middleware, information management, and grid \ncomputing is rapidly growing, yet it is still characterized by one-\noff solutions targeted at very specific problems in specific \ndomains. Unfortunately, these solutions are sometimes clever by \naccident and more frequently little more than \u201chacks\u201d. We \nbelieve that OODT\u2019s approach is more appropriate, more \neffective, more broadly applicable, and certainly more helpful to \ndevelopers of future systems in this area. We consider OODT\u2019s \ndemonstrated ability to evolve and its applicability in a growing \nnumber of science domains to be a testament to its explicit, \ncarefully crafted software architecture. \n\n7. ACKNOWLEDGEMENTS \nThis material is based upon work supported by the Jet \n\nPropulsion Laboratory, managed by the California Institute of \nTechnology. Effort also supported by the National Science \nFoundation under Grant Numbers CCR-9985441 and ITR-\n0312780. \n\n8. REFERENCES \n[1] A. Chervenak, I. Foster, et al., \"The Data Grid: Towards an \n\nArchitecture for the Distributed Management and Analysis of \nLarge Scientific Data Sets,\" J. of Network and Computer \nApplications, vol. 23, pp. 187-200, 2000. \n\n[2] N. Medvidovic and R. N. Taylor, \"A Classification and \nComparison Framework for Software Architecture Description \nLanguages,\" IEEE TSE, vol. 26, pp. 70-93, 2000. \n\n[3] D. E. Perry and A. L. Wolf, \"Foundations for the Study of \nSoftware Architecture,\" Software Engineering Notes (SEN), \nvol. 17, pp. 40-52, 1992. \n\n[4] \"The Globus Alliance (http://www.globus.org),\" 2005. \n[5] \"Webservices.org (http://www.webservices.org),\" 2005. \n[6] A. Luther, R. Buyya, et al., \"Alchemi: A .NET-based \n\nEnterprise Grid Computing System,\" in Proc. of 6th \nInternational Conference on Internet Computing, Las Vegas, \nNV, USA, 2005. \n\n[7] \"JCGrid Web Site (http://jcgrid.sourceforge.net),\" 2005. \n[8] \"LHC Computing Grid (http://lcg.web.cern.ch/LCG/),\" 2005. \n[9] D. Bernholdt, S. Bharathi, et al., \"The Earth System Grid: \n\nSupporting the Next Generation of Climate Modeling \nResearch,\" Proceedings of the IEEE, vol. 93, pp. 485-495, \n2005. \n\n[10] A. Finkelstein, C. Gryce, et al., \"Relating Requirements and \nArchitectures: A Study of Data Grids,\" J. of Grid Computing, \nvol. 2, pp. 207-222, 2004. \n\n[11] C. A. Mattmann, N. Medvidovic, et al., \"Unlocking the Grid,\" \nin Proc. of CBSE, St. Louis, MO, pp. 322-336, 2005. \n\n[12] J. Hammer, H. 
Garcia-Molina, et al., \"Information translation, \nmediation, and mosaic-based browsing in the tsimmis system,\" \nin Proc. of ACM SIGMOD International Conference on \nManagement of Data, San Jose, CA, pp. 483-487, 1995. \n\n[13] T. Kirk, A. Y. Levy, et al., \"The information manifold,\" \nWorking Notes of the AAAI Spring Symposium on Information \nGathering in Heterogeneous, Distributed Environment, Menlo \nPark, CA, Technical Report SS-95-08, 1995. \n\n[14] O. Etzioni and D. S. Weld, \"A softbot-based interface to the \nInternet,\" CACM, vol. 37, pp. 72-76, 1994. \n\n[15] A. Go\u00f1i, A. Illarramendi, et al., \"An optimal cache for a \nfederated database system,\" Journal of Intelligent Information \nSystems, vol. 9, pp. 125-155, 1997. \n\n[16] M. R. Genesereth, A. Keller, et al., \"Infomaster: An \ninformation integration system,\" in Proc. of ACM SIGMOD \nInternational Conference on Management of Data, Tucson, \nAZ, pp. 539-542, 1997. \n\n[17] A. Tomasic, L. Raschid, et al., \"A data model and query \nprocessing techniques for scaling access to distributed \nheterogeneous databases in disco,\" IEEE Transactions on \nComputers, 1997. \n\n[18] Y. Arens, C. A. Knoblock, et al., \"Query Reformulation for \nDynamic Information Integration,\" Journal of Intelligent \nInformation Systems, vol. 6, pp. 99-130, 1996. \n\n[19] J. Ambite, N. Ashish, et al., \"Ariadne: A system for \nconstructing mediators for internet sources,\" in Proc. of ACM \nSIGMOD International Conference on Management of Data, \nSeattle, WA, pp. 561-563, 1998. \n\n[20] G. Barish and C. A. Knoblock, \"An Expressive and Efficient \nLanguage for Information Gathering on the Web,\" in Proc. of \n6th International Conference on AI Planning and Scheduling \n(AIPS-2002) Workshop, Toulouse, France, 2002. \n\n[21] A. Y. Halevy, \"Answering queries using views: A survey,\" \nVLDB Journal, vol. 10, pp. 270-294, 2001. \n\n[22] J. L. Ambite, C. A. Knoblock, et al., \"Compiling Source \nDescriptions for Efficient and Flexible Information \nIntegration,\" Information Systems Journal, vol. 16, pp. 149-\n187, 2001. \n\n[23] E. Lambrecht and S. Kambhampati, \"Planning for Information \nGathering: A Tutorial Survey,\" ASU CSE Technical Report \n96-017, May 1997. \n\n[24] \"Enterprise Java Beans (http://java.sun.com/ejb),\" pp. 2005. \n[25] \"Java RMI (http://java.sun.com/rmi/),\" 2005. \n[26] C. A. Mattmann, S. Malek, et al., \"GLIDE: A Grid-based \n\nLightweight Infrastructure for Data-intensive Environments,\" \nin Proc. of European Grid Conference, Amsterdam, the \nNetherlands, pp. 68-77, 2005. \n\n[27] DCMI, \"Dublin Core Metadata Element Set,\" 1999. \n[28] T. Berners-Lee, R. Fielding, et al., \"Uniform Resource \n\nIdentifiers (URI): Generic Syntax,\" 1998. \n[29] \"Open Channel Foundation: Request Object Oriented Data \n\nTechnology (OODT) - \n(http://openchannelsoftware.com/orders/index.php?group_id=3\n32),\" 2005. \n\n[30] J. S. Hughes and S. K. McMahon, \"The Planetary Data System. \nA Case Study in the Development and Management of Meta-\nData for a Scientific Digital Library.,\" in Proc. of ECDL, pp. \n335-350, 1998. \n\n[31] S. Srivastava, Informatics in proteomics. Boca Raton, FL: \nTaylor & Francis/CRC Press, 2005. \n\n[32] \"UGS Products: TeamCenter \n(http://www.ugs.com/products/teamcenter/),\" 2005. \n\n[33] \"Document Management | Xerox Docushre \n(http://docushare.xerox.com/ds/),\" 2005. 
\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\tINTRODUCTION\n\tSOFTWARE ENGINEERING CHALLENGES\n\tBACKGROUND AND RELATED WORK\n\tOODT ARCHITECTURE\n\tGuiding Principles\n\tOODT Components\n\tProduct Server and Product Client\n\tProfile Server and Profile Client\n\tQuery Server and Query Client\n\tCatalog and Archive Server and Client\n\n\tOODT Connectors\n\tHandler Connectors\n\tMessaging Layer Connector\n\n\n\tEXPERIENCE AND CASE STUDIES\n\tPlanetary Data System\n\tEarly Detection Research Network\n\tScience Processing Systems\n\tComputer Modeling Simulation and Visualization\n\n\tCONCLUSIONS\n\tACKNOWLEDGEMENTS\n\tREFERENCES\n\n", "X-TIKA:parse_time_millis": "11123", "access_permission:assemble_document": "true", "access_permission:can_modify": "true", "access_permission:can_print": "true", "access_permission:can_print_degraded": "true", "access_permission:extract_content": "true", "access_permission:extract_for_accessibility": "true", "access_permission:fill_in_form": "true", "access_permission:modify_annotations": "true", "created": "Wed Feb 15 13:13:58 PST 2006", "creator": "End User Computing Services", "date": "2006-02-15T21:16:01Z", "dc:creator": "End User Computing Services", "dc:format": "application/pdf; version=1.4", "dc:title": "Proceedings Template - WORD", "dcterms:created": "2006-02-15T21:13:58Z", "dcterms:modified": "2006-02-15T21:16:01Z", "grobid:header_Abstract": "Modern scientific research is increasingly conducted by virtual communities of scientists distributed around the world. The data volumes created by these communities are extremely large, and growing rapidly. The management of the resulting highly distributed, virtual data systems is a complex task, characterized by a number of formidable technical challenges, many of which are of a software engineering nature. In this paper we describe our experience over the past seven years in constructing and deploying OODT, a software framework that supports large, distributed, virtual scientific communities. We outline the key software engineering challenges that we faced, and addressed, along the way. We argue that a major contributor to the success of OODT was its explicit focus on software architecture. We describe several large-scale, real-world deployments of OODT, and the manner in which OODT helped us to address the domain-specific challenges induced by each deployment.", "grobid:header_AbstractHeader": "ABSTRACT", "grobid:header_Address": "Pasadena, CA 91109, USA Los Angeles, CA 90089, USA", "grobid:header_Affiliation": "1 Jet Propulsion Laboratory California Institute of Technology ; 2 Computer Science Department University of Southern California", "grobid:header_Authors": "Chris A. Mattmann 1, 2 Daniel J. 
Crichton 1 Nenad Medvidovic 2 Steve Hughes 1", "grobid:header_BeginPage": "-1", "grobid:header_Class": "class org.grobid.core.data.BiblioItem", "grobid:header_Email": "{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov ; {mattmann,neno}@usc.edu", "grobid:header_EndPage": "-1", "grobid:header_Error": "true", "grobid:header_FirstAuthorSurname": "Mattmann", "grobid:header_FullAffiliations": "[Affiliation{name='null', url='null', institutions=[California Institute of Technology], departments=null, laboratories=[Jet Propulsion Laboratory], country='USA', postCode='91109', postBox='null', region='CA', settlement='Pasadena', addrLine='null', marker='1', addressString='null', affiliationString='null', failAffiliation=false}, Affiliation{name='null', url='null', institutions=[University of Southern California], departments=[Computer Science Department], laboratories=null, country='USA', postCode='90089', postBox='null', region='CA', settlement='Los Angeles', addrLine='null', marker='2', addressString='null', affiliationString='null', failAffiliation=false}]", "grobid:header_FullAuthors": "[Chris A Mattmann, Daniel J Crichton, Nenad Medvidovic, Steve Hughes]", "grobid:header_Item": "-1", "grobid:header_Keyword": "Categories and Subject Descriptors D2 Software Engineering, D211 Domain Specific Architectures Keywords OODT, Data Management, Software Architecture", "grobid:header_Keywords": "[D2 Software Engineering, D211 Domain Specific Architectures (type:subject-headers), Keywords (type:subject-headers), OODT, Data Management, Software Architecture (type:subject-headers)]", "grobid:header_Language": "en", "grobid:header_NbPages": "-1", "grobid:header_OriginalAuthors": "Chris A. Mattmann 1, 2 Daniel J. Crichton 1 Nenad Medvidovic 2 Steve Hughes 1", "grobid:header_Title": "A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications", "meta:author": "End User Computing Services", "meta:creation-date": "2006-02-15T21:13:58Z", "meta:save-date": "2006-02-15T21:16:01Z", "modified": "2006-02-15T21:16:01Z", "pdf:PDFVersion": "1.4", "pdf:encrypted": "false", "producer": "Acrobat Distiller 6.0 (Windows)", "resourceName": "ICSE06.pdf", "title": "Proceedings Template - WORD", "xmp:CreatorTool": "Acrobat PDFMaker 6.0 for Word", "xmpTPg:NPages": "10" } ] Great work, Sujen Shah . I'm going to commit this now and start work on the Wiki page!
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/55

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/55
          Hide
          sujenshah Sujen Shah added a comment -

          Awesome, Chris A. Mattmann!! Thank you! Will start work on the wiki.

          Show
          sujenshah Sujen Shah added a comment - Awesome, Chris A. Mattmann!! Thank you! Will start work on the wiki.
          Hide
          chrismattmann Chris A. Mattmann added a comment -
          • fixed in r1695816. Great work Sujen Shah!
          Show
          chrismattmann Chris A. Mattmann added a comment - fixed in r1695816. Great work Sujen Shah!
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in tika-trunk-jdk1.7 #821 (See https://builds.apache.org/job/tika-trunk-jdk1.7/821/)
          Changes.txt for TIKA-1699. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1695817)

          • /tika/trunk/CHANGES.txt
            fix for TIKA-1699: Integrate the GROBID PDF extractor in Tika contributed by Sujen Shah <sujen1412@gmail.com> this closes #55. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1695816)
          • /tika/trunk/tika-parsers/pom.xml
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidConfig.java
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidHeaderMetadata.java
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidParser.java
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java
          • /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal/GrobidExtractor.properties
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java
          • /tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf
          Show
          hudson Hudson added a comment - FAILURE: Integrated in tika-trunk-jdk1.7 #821 (See https://builds.apache.org/job/tika-trunk-jdk1.7/821/ ) Changes.txt for TIKA-1699 . (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1695817 ) /tika/trunk/CHANGES.txt fix for TIKA-1699 : Integrate the GROBID PDF extractor in Tika contributed by Sujen Shah <sujen1412@gmail.com> this closes #55. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1695816 ) /tika/trunk/tika-parsers/pom.xml /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidConfig.java /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidHeaderMetadata.java /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidParser.java /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal/GrobidExtractor.properties /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java /tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf
          Hide
          chrismattmann Chris A. Mattmann added a comment -

          docs are here: https://wiki.apache.org/tika/GrobidJournalParser

          Show
          chrismattmann Chris A. Mattmann added a comment - docs are here: https://wiki.apache.org/tika/GrobidJournalParser
          Hide
          gagravarr Nick Burch added a comment - - edited

          A build from trunk is now failing for me:

          [ERROR] Failed to execute goal on project tika-parsers: Could not resolve dependencies for project org.apache.tika:tika-parsers:bundle:1.11-SNAPSHOT: Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 -> org.chasen:crfpp:jar:1.0.2: Failed to read artifact descriptor for org.chasen:crfpp:jar:1.0.2: Could not transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo (file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and cannot be created. -> [Help 1]
          

          With -X showing

          Caused by: org.eclipse.aether.collection.DependencyCollectionException: Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 -> org.chasen:crfpp:jar:1.0.2
          Caused by: org.eclipse.aether.resolution.ArtifactDescriptorException: Failed to read artifact descriptor for org.chasen:crfpp:jar:1.0.2
          Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: Could not transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo (file:///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and cannot be created.
          

          Can we get this broken GROBID dependency pom fixed / an exclusion in place, so that trunk builds again?

          Show
          gagravarr Nick Burch added a comment - - edited A build from trunk is now failing for me: [ERROR] Failed to execute goal on project tika-parsers: Could not resolve dependencies for project org.apache.tika:tika-parsers:bundle:1.11-SNAPSHOT: Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 -> org.chasen:crfpp:jar:1.0.2: Failed to read artifact descriptor for org.chasen:crfpp:jar:1.0.2: Could not transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo (file: ///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and cannot be created. -> [Help 1] With -X showing Caused by: org.eclipse.aether.collection.DependencyCollectionException: Failed to collect dependencies at org.grobid:grobid-core:jar:0.3.4 -> org.chasen:crfpp:jar:1.0.2 Caused by: org.eclipse.aether.resolution.ArtifactDescriptorException: Failed to read artifact descriptor for org.chasen:crfpp:jar:1.0.2 Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: Could not transfer artifact org.chasen:crfpp:pom:1.0.2 from/to 3rd-party-local-repo (file: ///${basedir}/lib/): Repository path /${basedir}/lib does not exist, and cannot be created. Can we get this broken GROBID dependency pom fixed / an exclusion in place, so that trunk builds again?
          Hide
          gagravarr Nick Burch added a comment -

          I've tried to exclude the grobid transitive dependencies to work around this problem, but even an exclude of * still breaks the build on org.apache.maven.plugins:maven-remote-resources-plugin with the broken repo definition. Unfortunately, I've therefore had to back out your r1695816, in order to unbreak the build. Hopefully we can get the grobid community to sort that shortly, and we can restore it!

          One other possible issue spotted while failing to work around the broken pom - the grobid-core jar seems to be almost 15MB in size! Plus its dependencies themselves. That means we'll increase the size of the tika-app, tika-server and tika-bundle jars by almost half! Is there perhaps a smaller grobid jar we could depend on instead, which doesn't cause such a bump in our dependency sizes and jars?

          Show
          gagravarr Nick Burch added a comment - I've tried to exclude the grobid transient dependencies to work around this problem, but even an exclude of * still breaks the build on org.apache.maven.plugins:maven-remote-resources-plugin with the broken repo definition. Unfortunately, I've therefore had to back out your r1695816, in order to unbreak the build. Hopefully we can get the grobid community to sort that shortly, and we can restore it! On other possible issue spotted while failing to work around the broken pom - the grobid-core jar seems to be almost 15mb in size! Plus its dependencies themselves. That means we'll increase the size of the tika-app, tika-server and tika-bundle jars by almost half! Is there perhaps a smaller grobid jar we could depend on instead, which doesn't cause such a bump in our dependency sizes and jars?
          Hide
          chrismattmann Chris A. Mattmann added a comment -

          I've tried to exclude the grobid transitive dependencies to work around this problem, but even an exclude of * still breaks the build on org.apache.maven.plugins:maven-remote-resources-plugin with the broken repo definition. Unfortunately, I've therefore had to back out your r1695816, in order to unbreak the build. Hopefully we can get the grobid community to sort that shortly, and we can restore it!

          Yeah, we're working with them to get this fixed.

          One other possible issue spotted while failing to work around the broken pom - the grobid-core jar seems to be almost 15MB in size! Plus its dependencies themselves. That means we'll increase the size of the tika-app, tika-server and tika-bundle jars by almost half! Is there perhaps a smaller grobid jar we could depend on instead, which doesn't cause such a bump in our dependency sizes and jars?

          Looking at: http://repo1.maven.org/maven2/org/apache/tika/tika-app/1.10/

          Tika-app is ~48MB it seems, so it's actually closer to a 30% size increase. As for depending on a smaller core jar, I had an idea here. Grobid has a server - I wonder if we should just connect to its REST server? Sujen Shah, in that fashion we could avoid adding any dependencies beyond CXF and its WebClient. I'll investigate this.

          Show
          chrismattmann Chris A. Mattmann added a comment - I've tried to exclude the grobid transient dependencies to work around this problem, but even an exclude of * still breaks the build on org.apache.maven.plugins:maven-remote-resources-plugin with the broken repo definition. Unfortunately, I've therefore had to back out your r1695816, in order to unbreak the build. Hopefully we can get the grobid community to sort that shortly, and we can restore it! yeah we're working with them to getting this fixed. On other possible issue spotted while failing to work around the broken pom - the grobid-core jar seems to be almost 15mb in size! Plus its dependencies themselves. That means we'll increase the size of the tika-app, tika-server and tika-bundle jars by almost half! Is there perhaps a smaller grobid jar we could depend on instead, which doesn't cause such a bump in our dependency sizes and jars? Looking at: http://repo1.maven.org/maven2/org/apache/tika/tika-app/1.10/ Tika-app is ~48MB it seems so closer to 30% actually size increase. As for depending on a smaller core Jar, I had an idea here. Grobid has a server, I wonder if we should just connect to its REST server? Sujen Shah In that fashion we could omit adding really any dependencies beyond CXF and its WebClient. I'll investigate this.
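          For anyone following along, the REST idea above would boil down to something like the sketch below, using only the CXF WebClient. This is a hedged, minimal sketch rather than the code that eventually gets committed: the server address, the processHeaderDocument path, and the multipart field name "input" are assumptions about the GROBID service, so double-check them against the GROBID REST documentation.

          import java.io.InputStream;
          import java.nio.file.Files;
          import java.nio.file.Paths;

          import javax.ws.rs.core.MediaType;
          import javax.ws.rs.core.Response;

          import org.apache.cxf.jaxrs.client.WebClient;
          import org.apache.cxf.jaxrs.ext.multipart.Attachment;
          import org.apache.cxf.jaxrs.ext.multipart.ContentDisposition;
          import org.apache.cxf.jaxrs.ext.multipart.MultipartBody;

          public class GrobidRestClientSketch {
            public static void main(String[] args) throws Exception {
              // Assumed GROBID service location and endpoint -- adjust to the actual install.
              String grobidUrl = "http://localhost:8080/processHeaderDocument";

              try (InputStream pdf = Files.newInputStream(Paths.get("testJournalParser.pdf"))) {
                // Assumed multipart field name for the uploaded PDF.
                ContentDisposition cd = new ContentDisposition(
                    "form-data; name=\"input\"; filename=\"testJournalParser.pdf\"");
                Attachment att = new Attachment("input", pdf, cd);

                Response response = WebClient.create(grobidUrl)
                    .accept(MediaType.APPLICATION_XML)
                    .type(MediaType.MULTIPART_FORM_DATA)
                    .post(new MultipartBody(att));

                // GROBID answers with TEI XML describing the paper's header metadata.
                System.out.println(response.readEntity(String.class));
              }
            }
          }

          The attraction of this shape is that the only new runtime dependency is the CXF JAX-RS client, and the GROBID installation (models and all) stays on whatever host runs the service.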
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in tika-trunk-jdk1.7 #825 (See https://builds.apache.org/job/tika-trunk-jdk1.7/825/)
          Back out r1695816, so the build can pass again, pending a fix of the broken grobid poms. Fix being tracked in TIKA-1699 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696054)

          • /tika/trunk/tika-parsers/pom.xml
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf
          Show
          hudson Hudson added a comment - FAILURE: Integrated in tika-trunk-jdk1.7 #825 (See https://builds.apache.org/job/tika-trunk-jdk1.7/825/ ) Back out r1695816, so the build can pass again, pending a fix of the broken grobid poms. Fix being tracked in TIKA-1699 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696054 ) /tika/trunk/tika-parsers/pom.xml /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf
          Hide
          chrismattmann Chris A. Mattmann added a comment -

          Filed issues to publish all of the grobid-core deps:
          Wapiti jar fork:
          https://issues.sonatype.org/browse/OSSRH-17124
          EUGFC ImageIO plugin:
          https://issues.sonatype.org/browse/OSSRH-17126
          Language Detection:
          https://issues.sonatype.org/browse/OSSRH-17127
          Chasen CRFPP:
          https://issues.sonatype.org/browse/OSSRH-17128
          WIPO analysers:
          https://issues.sonatype.org/browse/OSSRH-17129

          That should be all of them. Will let everyone know once it's published.

          Show
          chrismattmann Chris A. Mattmann added a comment - All filed issues to publish all grobid-core deps: Wapiti jar fork: https://issues.sonatype.org/browse/OSSRH-17124 EUGFC ImageIO plugin: https://issues.sonatype.org/browse/OSSRH-17126 Language Detection: https://issues.sonatype.org/browse/OSSRH-17127 Chasen CRFPP: https://issues.sonatype.org/browse/OSSRH-17128 WIPO analysers: https://issues.sonatype.org/browse/OSSRH-17129 That should be all of them. Will let everyone know once it's published.
          Hide
          chrismattmann Chris A. Mattmann added a comment -
          • here's the patch that Nick backed out in case folks want to use it while we get the Jars published to Central.
          Show
          chrismattmann Chris A. Mattmann added a comment - here's the patch that Nick backed out in case folks want to use it while we get the Jars published to Central.
          Hide
          gagravarr Nick Burch added a comment -

          Tika-app is ~48MB it seems so closer to 30% actually size increase.

          I added a bit on for the dependency jars that I can't get to!

          As for depending on a smaller core Jar, I had an idea here. Grobid has a server, I wonder if we should just connect to its REST server?

          I know that for some of the dependencies so far, we've worked with them to produce a -min version or equivalent, with just the key parts in for size reasons. My first choice would be for something like that here.

          If not, could we follow the sqlite pattern: bundle the base Java code as standard, but require people to download the large, bulky native platform code to fully enable the support? (Assuming I've got the right idea about the bulk being from the CRF native stuff?)

          Show
          gagravarr Nick Burch added a comment - Tika-app is ~48MB it seems so closer to 30% actually size increase. I added a bit on for the dependency jars that I can't get to! As for depending on a smaller core Jar, I had an idea here. Grobid has a server, I wonder if we should just connect to its REST server? I know that for some of the dependencies so far, we've worked with them to produce a -min version or equivalent, with just the key parts in for size reasons. My first choice would be for something like that here. If not, could we follow the sqlite patterns, bundle the base java code as standard, but require people to download the large bulky native platform code to fully enable the support? (Assuming I've got the right idea about the bulk being from the CRF native stuff?)
          Hide
          chrismattmann Chris A. Mattmann added a comment -

          To use this patch, first follow the instructions here: https://wiki.apache.org/tika/GrobidJournalParser to install Grobid, and then apply it.

          Show
          chrismattmann Chris A. Mattmann added a comment - To use this patch, follow the instructions first here: https://wiki.apache.org/tika/GrobidJournalParser to install Grobid, and then apply this patch.
          Hide
          chrismattmann Chris A. Mattmann added a comment -
          • here's a WIP patch to convert the Grobid parser to use its REST services. Tests are passing. I need to add the rest of the GROBID header XML metadata elements. Just got a bit tired. Sujen Shah, if you want to finish this off, it's all yours. Else, if you don't beat me to it, maybe I'll finish it tomorrow.
          Show
          chrismattmann Chris A. Mattmann added a comment - here's a WIP patch to convert the Grobid parser to use its REST services. Tests are passing. I need to add the rest of the GROBID header XML metadata elements. Just got a bit tired Sujen Shah if you want to finish this off, all you. Else if you don't beat me to it, maybe I'll finish it tomorrow.
          Hide
          chrismattmann Chris A. Mattmann added a comment -

          OK, I got the fully REST-based version of the GROBID PDF parser implemented. Tests are passing and I'm going to commit it within the next few minutes. Basically it only adds the CXF REST client dependency and the org.json dependency. A lot better, and a lot smaller. Also, GROBID can now live on another machine. Will update the docs shortly.

          Show
          chrismattmann Chris A. Mattmann added a comment - OK I got the fully REST services version of the GROBID PDF parser implemented. Tests are passing and I'm going to commit it within the next few minutes. Basically it only adds the CXF rest client dependency and also the org.json dependency. Lot better, and lot smaller. Also GROBID can exist on another machine now. Will update the docs shortly.
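          Once a GROBID service is running, exercising the new parser from Java should look roughly like the sketch below. This is hedged guesswork around the committed code: the no-arg JournalParser constructor and the grobid:header_* key names are simply what the file list and the sample output pasted earlier in this issue suggest, so treat the wiki page as the authoritative usage.

          import java.io.InputStream;
          import java.nio.file.Files;
          import java.nio.file.Paths;

          import org.apache.tika.metadata.Metadata;
          import org.apache.tika.parser.ParseContext;
          import org.apache.tika.parser.journal.JournalParser;
          import org.apache.tika.sax.BodyContentHandler;
          import org.xml.sax.ContentHandler;

          public class JournalParserSketch {
            public static void main(String[] args) throws Exception {
              JournalParser parser = new JournalParser();
              Metadata metadata = new Metadata();
              ContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit

              try (InputStream stream = Files.newInputStream(Paths.get("testJournalParser.pdf"))) {
                parser.parse(stream, handler, metadata, new ParseContext());
              }

              // Key names taken from the sample output pasted earlier in this issue.
              System.out.println(metadata.get("grobid:header_Title"));
              System.out.println(metadata.get("grobid:header_FullAuthors"));
            }
          }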
          Hide
          chrismattmann Chris A. Mattmann added a comment -
          • committed and fixed in r1696191 and r1696192. Cheers!
          Show
          chrismattmann Chris A. Mattmann added a comment - committed and fixed in r1696191 and r1696192. Cheers!
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in tika-trunk-jdk1.7 #830 (See https://builds.apache.org/job/tika-trunk-jdk1.7/830/)
          fix typo: TIKA-1699 (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696192)

          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java
            TIKA-1699: refactored GROBID parser to use GROBID rest API. Only introduced 2 deps, CXF client, and also org.json. very small and works great. Thanks to Sujen Shah for his initial work on the GROBID patch. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696191)
          • /tika/trunk/tika-parsers/pom.xml
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidRESTParser.java
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/TEIParser.java
          • /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal/GrobidExtractor.properties
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java
          • /tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf
          Show
          hudson Hudson added a comment - FAILURE: Integrated in tika-trunk-jdk1.7 #830 (See https://builds.apache.org/job/tika-trunk-jdk1.7/830/ ) fix typo: TIKA-1699 (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696192 ) /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java TIKA-1699 : refactored GROBID parser to use GROBID rest API. Only introduced 2 deps, CXF client, and also org.json. very small and works great. Thanks to Sujen Shah for his initial work on the GROBID patch. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696191 ) /tika/trunk/tika-parsers/pom.xml /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidRESTParser.java /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/TEIParser.java /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/main/resources/org/apache/tika/parser/journal/GrobidExtractor.properties /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/journal/JournalParserTest.java /tika/trunk/tika-parsers/src/test/resources/test-documents/testJournalParser.pdf
          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in tika-trunk-jdk1.7 #832 (See https://builds.apache.org/job/tika-trunk-jdk1.7/832/)
          TIKA-1699: fix bundle for GROBID parser deps. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696319)

          • /tika/trunk/tika-bundle/pom.xml
            TIKA-1699: statically load the rest URL properties inside of GROBIDRESTParser (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696286)
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidRESTParser.java
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in tika-trunk-jdk1.7 #832 (See https://builds.apache.org/job/tika-trunk-jdk1.7/832/ ) TIKA-1699 : fix bundle for GROBID parser deps. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696319 ) /tika/trunk/tika-bundle/pom.xml TIKA-1699 : statically load the rest URL properties inside of GROBIDRESTParser (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1696286 ) /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidRESTParser.java /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java
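          The "statically load the rest URL properties" change presumably amounts to reading GrobidExtractor.properties from the classpath once and falling back to a sensible default. A rough sketch of that idea follows; the property key and default URL here are made-up placeholders for illustration, not the actual contents of the committed properties file.

          import java.io.InputStream;
          import java.util.Properties;

          public class GrobidRestConfigSketch {
            // Hypothetical property key and default -- the real GrobidExtractor.properties may differ.
            private static final String SERVER_KEY = "grobid.server.url";
            private static final String DEFAULT_SERVER = "http://localhost:8080";

            // Loaded once, statically, when the class is first used.
            private static final String GROBID_SERVER = loadServerUrl();

            private static String loadServerUrl() {
              Properties props = new Properties();
              try (InputStream in = GrobidRestConfigSketch.class.getResourceAsStream(
                  "/org/apache/tika/parser/journal/GrobidExtractor.properties")) {
                if (in != null) {
                  props.load(in);
                }
              } catch (Exception e) {
                // Ignore and fall back to the default below.
              }
              return props.getProperty(SERVER_KEY, DEFAULT_SERVER);
            }

            public static void main(String[] args) {
              System.out.println("GROBID REST endpoint: " + GROBID_SERVER);
            }
          }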
          Hide
          gagravarr Nick Burch added a comment -

          Quick one - the wiki mentions needing to do a 600MB git checkout and then a build. Is it possible to just download a smaller pre-built package of GROBID to skip this step? And if not, could we maybe suggest it to them for their next release? (A 10s-of-MB download is probably easier and more beginner-friendly than a huge checkout + having to build!)

          Show
          gagravarr Nick Burch added a comment - Quick one - the wiki mentions needing to do a 600MB git checkout and then a build. Is it possible to just download a smaller pre-built package of GROBID to skip this step? And if not, could we maybe suggest it to them for their next release? (A 10s-of-MB download is probably easier and more beginner-friendly than a huge checkout + having to build!)
          Hide
          chrismattmann Chris A. Mattmann added a comment - - edited

          Agreed. We have suggested it in #59. Please feel free to join the convo there.

          Show
          chrismattmann Chris A. Mattmann added a comment - - edited Agreed. We have suggested it in #59 . Please feel free to join the convo there.

            People

            • Assignee:
              chrismattmann Chris A. Mattmann
              Reporter:
              sujenshah Sujen Shah
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue
