[TIKA-2900] Removing comments from *.docx, *.pdf files - ASF JIRA

Details

Type: Wish
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.21
Fix Version/s: None
Component/s: app, example
Labels: None

Description

Hello,

I use Apache Tika to extract text, mostly from *.doc, *.docx, and *.pdf files. Some files contain comments, and Tika extracts them and appends them to the end of the extracted text. Is there a way to exclude comments during text extraction?

Here is the code I am using:

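    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.RecursiveParserWrapper;
    import org.apache.tika.parser.microsoft.OfficeParserConfig;
    import org.apache.tika.sax.BasicContentHandlerFactory;
    import org.apache.tika.sax.ContentHandlerFactory;
    import org.xml.sax.helpers.DefaultHandler;

    StringBuilder fileContent = new StringBuilder();
    Parser parser = new AutoDetectParser();
    // HTML output with no write limit (-1)
    ContentHandlerFactory factory = new BasicContentHandlerFactory(
            BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
    // inputFileName is the path of the document to parse (defined elsewhere)
    InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
    RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
    Metadata metadata = new Metadata();

    ParseContext parseContext = new ParseContext();
    OfficeParserConfig officeParserConfig = new OfficeParserConfig();
    officeParserConfig.setUseSAXDocxExtractor(true);
    officeParserConfig.setIncludeDeletedContent(false);
    officeParserConfig.setIncludeMoveFromContent(false);
    officeParserConfig.setIncludeHeadersAndFooters(false);
    parseContext.set(OfficeParserConfig.class, officeParserConfig);

    wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
    String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);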

Attachments

Document_with_Comments_Text_extarction_Tika_APP.docx | 08/Jul/19 16:16 | 14 kB | Md
Document_with_Comments_Text_extarction_Tika_APP.docx.txt | 08/Jul/19 16:50 | 0.1 kB | Md

Issue Links

links to GitHub Pull Request #294

People

Assignee: Dave Meikle

Reporter: Md

Votes: 0

Watchers: 2

Dates

Created: 08/Jul/19 16:12

Updated: 27/Oct/19 09:24