[TIKA-2901] Tika extracting points data from Chart - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.21
Fix Version/s: None
Component/s: app
Labels: None

Description

I am using Tika to extract content from *.docx and other files. I have noticed that Tika extracts the data points from charts and places them at the end of the extracted content.
I am using the following code for extraction:

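    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.RecursiveParserWrapper;
    import org.apache.tika.parser.microsoft.OfficeParserConfig;
    import org.apache.tika.sax.BasicContentHandlerFactory;
    import org.apache.tika.sax.ContentHandlerFactory;
    import org.xml.sax.helpers.DefaultHandler;

    StringBuilder fileContent = new StringBuilder();
    Parser parser = new AutoDetectParser();
    // Produce HTML output; -1 means no write limit.
    ContentHandlerFactory factory = new BasicContentHandlerFactory(
            BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
    // inputStream is assumed to be opened elsewhere, for example:
    //InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
    RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
    Metadata metadata = new Metadata();

    // Use the SAX-based docx extractor and skip deleted text,
    // move-from text, and headers/footers.
    ParseContext parseContext = new ParseContext();
    OfficeParserConfig officeParserConfig = new OfficeParserConfig();
    officeParserConfig.setUseSAXDocxExtractor(true);
    officeParserConfig.setIncludeDeletedContent(false);
    officeParserConfig.setIncludeMoveFromContent(false);
    officeParserConfig.setIncludeHeadersAndFooters(false);
    parseContext.set(OfficeParserConfig.class, officeParserConfig);

    // The extracted content is read back from the metadata under TIKA_CONTENT.
    wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
    String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);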

Please find attached the input file and the corresponding output from Tika.
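To check whether a given .docx actually carries chart parts that could be the source of these points, a rough diagnostic is to list the ZIP entries of the file. This is only a sketch and does not use Tika: it assumes the charts are stored under `word/charts/` (the location Word normally uses), and the class name `DocxChartParts` and the helper `zipOf` are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxChartParts {

    /** Returns the names of the XML entries under word/charts/ (assumed chart location). */
    public static List<String> listChartParts(byte[] docxBytes) {
        List<String> chartParts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Keep chart XML parts only; relationship files (.rels) are skipped.
                if (name.startsWith("word/charts/") && name.endsWith(".xml")) {
                    chartParts.add(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chartParts;
    }

    /** Hypothetical helper: builds a tiny in-memory ZIP with empty entries, standing in for a .docx. */
    public static byte[] zipOf(String... entryNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Stand-in for a real document; a real file could be read with Files.readAllBytes.
        byte[] fakeDocx = zipOf("word/document.xml", "word/charts/chart1.xml");
        System.out.println(listChartParts(fakeDocx));
    }
}
```

If the list is non-empty for the attached document, the points at the end of the output most likely come from those chart parts rather than from the body text.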

Attachments

Chart_data_sample_text_possible_issue.docx (17 kB, added 08/Jul/19 16:38 by Md)
Chart_data_sample_text_possible_issue.docx.txt (2 kB, added 08/Jul/19 16:38 by Md)

People

Assignee: Unassigned
Reporter: Md
Votes: 0
Watchers: 1

Dates

Created: 08/Jul/19 16:36
Updated: 08/Jul/19 16:44