Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.12
Fix Version/s: None
Component/s: None
Description
I am using tika-server to extract content from Word files via .NET.
For the extraction I use the following REST endpoint: https://wiki.apache.org/tika/TikaJAXRS#Get_the_Text_of_a_Document
When I extract the content of a DOCX file, the output contains hidden bookmarks, for example: "[bookmark:_GoBack] hello world"
When I do the same with tika-app via the console, I get "hello world"
I could not find a way to prevent tika-server from extracting the hidden bookmarks. Specifying the MIME type did not help either.
Here is a test file (only a few characters): http://en.file-upload.net/download-11584028/ContentWord.docx.html
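
For anyone trying to reproduce this, the request described above might look roughly like the following Python sketch. It assumes tika-server is running locally on its default port 9998 and that the text endpoint is `PUT /tika` with `Accept: text/plain`. The helper names (`build_text_request`, `strip_bookmark_markers`, `extract_text`) are made up for illustration. The regex cleanup is only a possible client-side workaround, not a fix in tika-server, and it would also remove any genuine document text that happens to look like a bookmark marker.

```python
import re
import urllib.request

# Assumption: tika-server runs locally on its default port 9998.
TIKA_URL = "http://localhost:9998/tika"

# MIME type for DOCX files.
DOCX_MIME = (
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document"
)

# Matches markers such as "[bookmark:_GoBack]" or "[bookmark: _GoBack]",
# plus any whitespace that directly follows the marker.
BOOKMARK_MARKER = re.compile(r"\[bookmark:\s*[^\]]*\]\s*")


def build_text_request(docx_bytes, url=TIKA_URL):
    """Build (but do not send) a PUT request asking for plain text."""
    return urllib.request.Request(
        url,
        data=docx_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": DOCX_MIME},
    )


def strip_bookmark_markers(text):
    """Client-side workaround: drop bookmark markers from the output."""
    return BOOKMARK_MARKER.sub("", text)


def extract_text(path, url=TIKA_URL):
    """Send the file to tika-server and return the cleaned text."""
    with open(path, "rb") as f:
        request = build_text_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return strip_bookmark_markers(response.read().decode("utf-8"))
```

With a running server, `extract_text("ContentWord.docx")` would send the file and return the text with the markers removed. The cleanup step can also be applied on its own to output obtained some other way.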