Bug 51803

Summary: Content on master slide is not extracted
Product: POI Reporter: mikemccand <lucene>
Component: HSLFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: Simple PPT with a master slide w/ footer text "Master footer is here"
Patch w/ fix, plus test
PPT for test case (replaces current test-data/slideshow/master_text.ppt).

Description mikemccand 2011-09-13 15:01:50 UTC
Created attachment 27490 [details]
Simple PPT with a master slide w/ footer text "Master footer is here"

I hit this in Tika https://issues.apache.org/jira/browse/TIKA-712 but it sounds like we need to fix it in POI first.

Ideally each slide's master slide would have the elements from its corresponding master slide inlined, but if this is problematic, a fallback would just be to see each master slide once.

The attached PPT has a slide whose master slide has "Master footer is here" footer; you can see it rendered when you view this in Office.  But when I extract text using Tika, the footer isn't included.

This also happens for PPTX; I'll open a separate issue.
Comment 1 Nick Burch 2011-09-21 16:53:16 UTC
Turns out it's actually already supported by HSLF, you just need to set the flag to include master text (off by default)
Comment 2 mikemccand 2011-11-05 10:55:13 UTC
I think there is still a problem here: with the example PPT I
attached, I see boiler-plate text when I run PowerPointExtract (which
does set to flag to include master slide text, in its static main
method).

I see code in HSLF for detecting that a given Shape is a placeholder
(MasterSheet.isPlaceholder), so it seems possible we can avoid
extracting such text?  But I'm not familiar enough with the APIs, eg
when Sheet.findTextRuns is invoked for a MasterSlide, how can it get
the Shape for each run and then skip its text if it's a placeholder?
Comment 3 Yegor Kozlov 2011-11-07 13:43:00 UTC
MasterSlide.getTextRuns is used in other places and should return all runs including boilerplate ones from placeholders. I think the correct fix would be as follows: The line 224 in PowerPointExtractor.java invokes textRunsToText, but this method can't tell runs from placeholders from normal text.

                    textRunsToText(ret, master.getTextRuns());

It is better to re-write it and iterate over shapes in the master sheet:

                    for(Shape sh : master.getShapes()){
                        if(sh instanceof TextShape){
                            if(MasterSheet.isPlaceholder(sh)) {
                                // don't bother about boiler plate text on master sheets
                                continue;
                            }
                            TextShape tsh = (TextShape)sh;
                            String text = tsh.getText();
                            ret.append(text);
                            if (!text.endsWith("\n")) {
                                ret.append("\n");
                            }
                        }
                    }
 
Any volunteers to help me with testing? I never worked with text extractors and don't want to occasionally break things. I guess the best would be  to apply this fix, build and test from inside Tika. 

Yegor

(In reply to comment #2)
> I think there is still a problem here: with the example PPT I
> attached, I see boiler-plate text when I run PowerPointExtract (which
> does set to flag to include master slide text, in its static main
> method).
> 
> I see code in HSLF for detecting that a given Shape is a placeholder
> (MasterSheet.isPlaceholder), so it seems possible we can avoid
> extracting such text?  But I'm not familiar enough with the APIs, eg
> when Sheet.findTextRuns is invoked for a MasterSlide, how can it get
> the Shape for each run and then skip its text if it's a placeholder?

(In reply to comment #2)
> I think there is still a problem here: with the example PPT I
> attached, I see boiler-plate text when I run PowerPointExtract (which
> does set to flag to include master slide text, in its static main
> method).
> 
> I see code in HSLF for detecting that a given Shape is a placeholder
> (MasterSheet.isPlaceholder), so it seems possible we can avoid
> extracting such text?  But I'm not familiar enough with the APIs, eg
> when Sheet.findTextRuns is invoked for a MasterSlide, how can it get
> the Shape for each run and then skip its text if it's a placeholder?
Comment 4 mikemccand 2011-11-07 13:50:31 UTC
Hi Yegor,

That sounds like a great fix!  (Iterating over shapes ourselves).

I'm happy to test a patch to confirm we're skipping the boiler plate text.  Thanks!
Comment 5 mikemccand 2011-11-07 13:55:40 UTC
In fact you already described exactly what to test, so I'll take your proposed change, test it, and report back!  Thanks.
Comment 6 mikemccand 2011-11-07 17:41:06 UTC
Created attachment 27907 [details]
Patch w/ fix, plus test

OK the change works well!  I see no more placeholder text from the
master slide.

I worked out a patch (attached), containing the above fix, and fixes
to the tests, and some minor improvements to Sheet.java (cutover from
Vector -> List<TextRun>).

One test failed with this change (TestExtracter.testMasterText),
because the master text in the test PPT for that test was in fact
placeholder text.

So, I replaced that test file (test-data/slideshow/master_text.ppt)
with a new PPT, that has non-placeholder text from the master slide,
and also added an assert to verify placeholder text doesn't come
through.

The assert fails on trunk today and passes with the patch, so I think
we are good.

Thanks Yegor!
Comment 7 mikemccand 2011-11-07 17:42:08 UTC
Created attachment 27908 [details]
PPT for test case (replaces current test-data/slideshow/master_text.ppt).
Comment 8 mikemccand 2011-11-17 16:18:35 UTC
Hi, I think this patch is ready to go in?  It has a patch with a test case... is there a committer who can  double check and commit?  Thanks!
Comment 9 Yegor Kozlov 2011-11-17 17:36:38 UTC
Sorry for the delay. Patch applied in r1203295

Thanks,
Yegor
Comment 10 mikemccand 2011-11-17 22:02:04 UTC
Awesome, thanks Yegor!