Summary: | Content on master slide is not extracted | ||
---|---|---|---|
Product: | POI | Reporter: | mikemccand <lucene> |
Component: | HSLF | Assignee: | POI Developers List <dev> |
Status: | RESOLVED FIXED | ||
Severity: | normal | ||
Priority: | P2 | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | All | ||
Attachments: |
Simple PPT with a master slide w/ footer text "Master footer is here"
Patch w/ fix, plus test PPT for test case (replaces current test-data/slideshow/master_text.ppt). |
Description
mikemccand
2011-09-13 15:01:50 UTC
Turns out it's actually already supported by HSLF, you just need to set the flag to include master text (off by default) I think there is still a problem here: with the example PPT I attached, I see boiler-plate text when I run PowerPointExtract (which does set to flag to include master slide text, in its static main method). I see code in HSLF for detecting that a given Shape is a placeholder (MasterSheet.isPlaceholder), so it seems possible we can avoid extracting such text? But I'm not familiar enough with the APIs, eg when Sheet.findTextRuns is invoked for a MasterSlide, how can it get the Shape for each run and then skip its text if it's a placeholder? MasterSlide.getTextRuns is used in other places and should return all runs including boilerplate ones from placeholders. I think the correct fix would be as follows: The line 224 in PowerPointExtractor.java invokes textRunsToText, but this method can't tell runs from placeholders from normal text. textRunsToText(ret, master.getTextRuns()); It is better to re-write it and iterate over shapes in the master sheet: for(Shape sh : master.getShapes()){ if(sh instanceof TextShape){ if(MasterSheet.isPlaceholder(sh)) { // don't bother about boiler plate text on master sheets continue; } TextShape tsh = (TextShape)sh; String text = tsh.getText(); ret.append(text); if (!text.endsWith("\n")) { ret.append("\n"); } } } Any volunteers to help me with testing? I never worked with text extractors and don't want to occasionally break things. I guess the best would be to apply this fix, build and test from inside Tika. Yegor (In reply to comment #2) > I think there is still a problem here: with the example PPT I > attached, I see boiler-plate text when I run PowerPointExtract (which > does set to flag to include master slide text, in its static main > method). > > I see code in HSLF for detecting that a given Shape is a placeholder > (MasterSheet.isPlaceholder), so it seems possible we can avoid > extracting such text? But I'm not familiar enough with the APIs, eg > when Sheet.findTextRuns is invoked for a MasterSlide, how can it get > the Shape for each run and then skip its text if it's a placeholder? (In reply to comment #2) > I think there is still a problem here: with the example PPT I > attached, I see boiler-plate text when I run PowerPointExtract (which > does set to flag to include master slide text, in its static main > method). > > I see code in HSLF for detecting that a given Shape is a placeholder > (MasterSheet.isPlaceholder), so it seems possible we can avoid > extracting such text? But I'm not familiar enough with the APIs, eg > when Sheet.findTextRuns is invoked for a MasterSlide, how can it get > the Shape for each run and then skip its text if it's a placeholder? Hi Yegor, That sounds like a great fix! (Iterating over shapes ourselves). I'm happy to test a patch to confirm we're skipping the boiler plate text. Thanks! In fact you already described exactly what to test, so I'll take your proposed change, test it, and report back! Thanks. Created attachment 27907 [details]
Patch w/ fix, plus test
OK the change works well! I see no more placeholder text from the
master slide.
I worked out a patch (attached), containing the above fix, and fixes
to the tests, and some minor improvements to Sheet.java (cutover from
Vector -> List<TextRun>).
One test failed with this change (TestExtracter.testMasterText),
because the master text in the test PPT for that test was in fact
placeholder text.
So, I replaced that test file (test-data/slideshow/master_text.ppt)
with a new PPT, that has non-placeholder text from the master slide,
and also added an assert to verify placeholder text doesn't come
through.
The assert fails on trunk today and passes with the patch, so I think
we are good.
Thanks Yegor!
Created attachment 27908 [details]
PPT for test case (replaces current test-data/slideshow/master_text.ppt).
Hi, I think this patch is ready to go in? It has a patch with a test case... is there a committer who can double check and commit? Thanks! Sorry for the delay. Patch applied in r1203295 Thanks, Yegor Awesome, thanks Yegor! |