TIKA-712: Master slide text isn't extracted

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      It looks like we are not getting text from the master slide for PPT
      and PPTX.

      1. TIKA-712.patch (4 kB, Michael McCandless)
      2. TIKA-712-master-slide.xml (14 kB, Michael McCandless)
      3. testPPT_masterFooter2.pptx (34 kB, Michael McCandless)
      4. testPPT_masterFooter2.ppt (137 kB, Michael McCandless)
      5. testPPT_masterFooter.pptx (29 kB, Michael McCandless)
      6. testPPT_masterFooter.ppt (113 kB, Michael McCandless)
      7. TIKA-712.patch (3 kB, Michael McCandless)

          Activity

          Michael McCandless added a comment -

          Test case that fails.

          Nick Burch added a comment -

          We'll probably need to add this in POI, but it shouldn't be too hard.

          Do you have a feeling for whether we should process all the master slides after the regular ones, or if we should try to tie each slide back to its master and place the master text inline with the slide's own text?

          Michael McCandless added a comment -

          I think ideally we'd have each slide inline the text from its corresponding master?

          But if this is too hard then I think outputting text for each master slide just once somewhere is better than nothing?

          Nick Burch added a comment -

          Makes sense to me. Any chance you could open two POI bugs, one for HSLF and one for XSLF, and include the test files there too?

          Michael McCandless added a comment -

          > Any chance you could open two POI bugs, one for HSLF and one for XSLF, and include the test files there too?

          Will do!

          Michael McCandless added a comment -

          OK, I opened https://issues.apache.org/bugzilla/show_bug.cgi?id=51803 (PPT) and https://issues.apache.org/bugzilla/show_bug.cgi?id=51804 (PPTX).

          Michael McCandless added a comment -

          Corrected attachments – the last attachments didn't actually render the master slide's footer text onto the slide.

          Nick Burch added a comment -

          POI enhancements done, and Tika code (some interim) committed in r1173761.

          Michael - any chance you could test, and then commit your unit test if all looks fine for you too?

          Michael McCandless added a comment -

          > Michael - any chance you could test, and then commit your unit test if all looks fine for you too?

          Excellent, thanks Nick! I'll test & commit.

          Michael McCandless added a comment -

          OK, the good news is: I now see the master slide's text being
          extracted; thanks Nick!

          But the bad news is: we are now also extracting all the "boilerplate"
          text that is included in the master slide by default.

          For example, if I open PowerPoint 2007, make no changes and just save
          that one blank slide as PPTX, then get the text from it using TikaCLI, I see
          this:

          Click to edit Master title style
          Click to edit Master text styles
          Second level
          Third level
          Fourth level
          Fifth level
          

          This is the boilerplate text from that initial title slide's master
          slide. I think we should somehow not include it, but I have no idea
          how... does PPT/X somehow note that this is "fake" boilerplate text!?
          Somehow PowerPoint knows not to display this when I view the slide...

          Nick Burch added a comment -

          I'd suggest you take the pptx file (it'll be simpler to poke around in than the ppt one), and unzip it. Then, look at the xml file for the master slide, and see how the text you've added differs from the boilerplate parts. Are there any obvious differences between the two? Are they in different sections? Different xml? Anything we could filter on?
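
A minimal sketch of that kind of inspection, using only the Python standard library rather than POI or Tika. It opens a PPTX as a zip archive, reads every part under ppt/slideMasters/, and lists each shape's name, placeholder type, and text. The function names and the in-memory sample archive are hypothetical; the sample is a stripped-down stand-in, not a valid presentation.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Standard OOXML namespace URIs for PresentationML (p:) and DrawingML (a:).
NS = {
    "p": "http://schemas.openxmlformats.org/presentationml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
}


def master_shapes(pptx):
    """Return (part, shape name, placeholder type, text) for each p:sp on each master.

    `pptx` is a path or file-like object. The placeholder type is None when the
    shape has no p:ph element, and "" when p:ph is present without a type.
    """
    rows = []
    with zipfile.ZipFile(pptx) as zf:
        for part in sorted(zf.namelist()):
            if not (part.startswith("ppt/slideMasters/") and part.endswith(".xml")):
                continue  # skips the _rels/*.rels files stored next to the masters
            root = ET.fromstring(zf.read(part))
            for sp in root.iter("{%s}sp" % NS["p"]):
                name = sp.find("p:nvSpPr/p:cNvPr", NS)
                ph = sp.find("p:nvSpPr/p:nvPr/p:ph", NS)
                text = "".join(t.text or "" for t in sp.iterfind(".//a:t", NS))
                rows.append((
                    part,
                    name.get("name") if name is not None else None,
                    ph.get("type", "") if ph is not None else None,
                    text.strip(),
                ))
    return rows


def sample_pptx():
    """Build a tiny hypothetical archive holding one simplified master slide."""
    shape = (
        '<p:sp><p:nvSpPr><p:cNvPr id="{id}" name="{name}"/><p:cNvSpPr/>'
        '<p:nvPr>{ph}</p:nvPr></p:nvSpPr><p:spPr/>'
        '<p:txBody><a:bodyPr/><a:p><a:r><a:t>{text}</a:t></a:r></a:p></p:txBody></p:sp>'
    )
    master = (
        '<p:sldMaster xmlns:p="%s" xmlns:a="%s"><p:cSld><p:spTree>' % (NS["p"], NS["a"])
        + shape.format(id=2, name="Title Placeholder 1",
                       ph='<p:ph type="title"/>',
                       text="Click to edit Master title style")
        + shape.format(id=5, name="Footer Placeholder 4",
                       ph='<p:ph type="ftr" sz="quarter" idx="3"/>',
                       text="Slide footer is right here")
        + '</p:spTree></p:cSld></p:sldMaster>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("ppt/slideMasters/slideMaster1.xml", master)
    buf.seek(0)
    return buf


for row in master_shapes(sample_pptx()):
    print(row)
```

Pointing master_shapes at a real file path instead of the sample gives a quick side-by-side view of which master shapes carry a p:ph element and which do not.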

          Michael McCandless added a comment -

          Good idea! Nice how approachable OOXML is...

          In theory the answer is here:
          http://www.ecma-international.org/publications/standards/Ecma-376.htm
          but I have not tried to dig.

          So, here's a boilerplate-only chunk from the master slide (PowerPoint does not display this on the slide):

            <p:sp>
              <p:nvSpPr>
                <p:cNvPr id="2" name="Title Placeholder 1"/>
                <p:cNvSpPr>
                  <a:spLocks noGrp="1"/>
                </p:cNvSpPr>
                <p:nvPr>
                  <p:ph type="title"/>
                </p:nvPr>
              </p:nvSpPr>
              <p:spPr>
                <a:xfrm>
                  <a:off x="457200" y="274638"/>
                  <a:ext cx="8229600" cy="1143000"/>
                </a:xfrm>
                <a:prstGeom prst="rect">
                  <a:avLst/>
                </a:prstGeom>
              </p:spPr>
              <p:txBody>
                <a:bodyPr vert="horz" lIns="91440" tIns="45720" rIns="91440" bIns="45720" rtlCol="0" anchor="ctr">
                  <a:normAutofit/>
                </a:bodyPr>
                <a:lstStyle/>
                <a:p>
                  <a:r>
                    <a:rPr lang="en-US" smtClean="0"/>
                    <a:t>Click to edit Master title style</a:t>
                  </a:r>
                  <a:endParaRPr lang="en-US"/>
                </a:p>
              </p:txBody>
            </p:sp>
          

          And here's the footer I edited (PowerPoint does display this on the slide):

            <p:sp>
              <p:nvSpPr>
                <p:cNvPr id="5" name="Footer Placeholder 4"/>
                <p:cNvSpPr>
                  <a:spLocks noGrp="1"/>
                </p:cNvSpPr>
                <p:nvPr>
                  <p:ph type="ftr" sz="quarter" idx="3"/>
                </p:nvPr>
              </p:nvSpPr>
              <p:spPr>
                <a:xfrm>
                  <a:off x="3124200" y="6356350"/>
                  <a:ext cx="2895600" cy="365125"/>
                </a:xfrm>
                <a:prstGeom prst="rect">
                  <a:avLst/>
                </a:prstGeom>
              </p:spPr>
              <p:txBody>
                <a:bodyPr vert="horz" lIns="91440" tIns="45720" rIns="91440" bIns="45720" rtlCol="0" anchor="ctr"/>
                <a:lstStyle>
                  <a:lvl1pPr algn="ctr">
                    <a:defRPr sz="1200">
                      <a:solidFill>
                        <a:schemeClr val="tx1">
                          <a:tint val="75000"/>
                        </a:schemeClr>
                      </a:solidFill>
                    </a:defRPr>
                  </a:lvl1pPr>
                </a:lstStyle>
                <a:p>
                  <a:r>
                    <a:rPr lang="en-US" smtClean="0"/>
                    <a:t>Slide footer is right here</a:t>
                  </a:r>
                  <a:endParaRPr lang="en-US"/>
                </a:p>
              </p:txBody>
            </p:sp>
          

          I can't spot any obvious difference at a quick glance... I'll attach the full
          master slide XML (there's lots of other stuff); the difference could
          be elsewhere in there.

          Michael McCandless added a comment -

          Full master slide XML.

          Michael McCandless added a comment -

          I suppose a hackish solution would be to explicitly filter out the known boilerplate text that PowerPoint includes. But this is scary of course because in theory a PPT/PPTX may in fact legitimately have this text on its master slides, and dropping it would be rather confusing. Hmm, lemme try actually making that my text, saving, and diffing the two.

          Michael McCandless added a comment -

          Maybe, until we work this out, we should turn off extracting anything
          from the master slides? Chris is about to build the release bits for
          0.10...

          So I did some sleuthing. This is all new to me so this is really just
          speculative but I think I learned a few things:

          • Each slide refers to a slideLayouts/slideLayoutN.xml, from the
            _rels/slideN.xml.rels file.
          • In turn, each slideLayoutN.xml refers to a
            slideMasters/slideMasterN.xml, from the _rels/slideLayoutN.xml.rels
            file.
          • Simply editing footer text on the slide's master is not sufficient
            to see that text on the slide; you must also go to Insert ->
            Header & Footer and check the box to display footer/slide
            number/date and time.
          • If I enable footers like that, the slideN.xml actually includes
            the footer text; now, I'm not sure why Tika didn't see this before
            we changed anything.
          • If, instead, I go to the slide master and manually insert my own
            text box, then it comes through on the slides, however Tika
            (current trunk) fails to extract this onto the slide even though
            PowerPoint renders it... so we are still missing something here,
            maybe because we only render the master for the slide and not
            its layout?
          • That manually inserted element has a unique
            <p:nvPr userDrawn="1"/> under p:sp -> p:nvSpPr... maybe POI/Tika can
            interpret that to mean "include this text".
          • I suspect the p:ph element (under p:sp -> p:nvSpPr -> p:nvPr) may
            be important here... it seems to specify the "type" of the
            element, and it seems to be included in all the "boilerplate"
            elements but NOT in the new element I added to the master. You
            can see it in my examples above (type="ftr" and type="title").
            Maybe POI/Tika can interpret the presence of this p:ph element
            to mean that text should not be included in the slide?

          I'm not yet sure how to boil this all down to what POI/Tika can
          concretely use to identify what should be included and what should
          not but it seems like progress...
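
The slide -> layout -> master links described in the first two bullets can be followed with a short sketch like the one below. It is standard-library Python, not POI or Tika code; the helper names and the sample relationship data are hypothetical, and it assumes relative Target values as in the sample.

```python
import posixpath
import xml.etree.ElementTree as ET

# Element name of a relationship entry in an OPC .rels part.
REL = "{http://schemas.openxmlformats.org/package/2006/relationships}Relationship"


def rels_part(part):
    """Map ppt/slides/slide1.xml to ppt/slides/_rels/slide1.xml.rels."""
    folder, name = posixpath.split(part)
    return posixpath.join(folder, "_rels", name + ".rels")


def related(parts, part, kind):
    """Return the part that `part` links to via a relationship type ending in `kind`.

    `parts` maps part names to XML text. Assumes relative Target values.
    """
    rels_xml = parts.get(rels_part(part))
    if rels_xml is None:
        return None
    for rel in ET.fromstring(rels_xml).iter(REL):
        if rel.get("Type", "").endswith("/" + kind):
            target = posixpath.join(posixpath.dirname(part), rel.get("Target"))
            return posixpath.normpath(target)
    return None


def inheritance_chain(parts, slide):
    """Return [slide, layout, master]; missing links are None."""
    layout = related(parts, slide, "slideLayout")
    master = related(parts, layout, "slideMaster") if layout else None
    return [slide, layout, master]


def sample_rels(kind, target):
    """Build a hypothetical one-entry .rels document for the demo."""
    return (
        '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
        '<Relationship Id="rId1" '
        'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/%s" '
        'Target="%s"/></Relationships>' % (kind, target)
    )


SAMPLE_PARTS = {
    "ppt/slides/_rels/slide1.xml.rels":
        sample_rels("slideLayout", "../slideLayouts/slideLayout2.xml"),
    "ppt/slideLayouts/_rels/slideLayout2.xml.rels":
        sample_rels("slideMaster", "../slideMasters/slideMaster1.xml"),
}

print(inheritance_chain(SAMPLE_PARTS, "ppt/slides/slide1.xml"))
```

With a real file, the parts mapping could be filled from zipfile.ZipFile(...).namelist() and read(), so the same lookup runs against the unzipped package contents.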

          Michael McCandless added a comment -

          I committed a change to temporarily turn off master text.

          Curiously, the new unit tests still passed. Somehow we are now extracting the footer text properly for both PPT and PPTX! I think this is because footer is somehow "special".

          I'll make a new unit test that shows we are failing to extract master text...

          Michael McCandless added a comment -

          OK I committed four new failing (disabled) test cases, showing that we
          don't extract text elements inherited from master/layout slide.

          I played around some more with the master layout/slides and I think I
          know what we need to do for XSLF (but I have no idea for HSLF;
          hopefully it's somehow "parallel"):

          • We'll have to look at the inheritance from slide -> layout ->
            master, so that a slide's text is the union of its actual text,
            plus text from its slide layout, plus text from the master. The
            files in the _rels dir link a slide to its slideLayout, and a
            slideLayout to its slideMaster.
          • For each text element on slideLayout and slideMaster, we must
            check for the presence of the p:sp -> p:nvSpPr -> p:nvPr -> p:ph
            element. For example, <p:ph type="body" idx="1"/>. The ph
            stands for "placeholder", and it seems to mean it's not really
            rendered. When I manually edited the XML in my doc to insert a
            p:ph on text I had added, and viewed that in PowerPoint, it indeed
            stopped rendering it. So if p:ph is present we should skip that text.

          I think that should work! But I don't know where/how to do this;
          likely we need to do this first in POI? Should I open an issue there?
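
The rule proposed above (a slide's own text, plus any layout or master shape that has no p:ph element) could look roughly like this. It is a sketch in standard-library Python, not the POI or Tika implementation; the function names, the simplified sample XML, and the sample strings "Quarterly results" and "Text box added to the master" are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard OOXML namespace URIs for PresentationML (p:) and DrawingML (a:).
NS = {
    "p": "http://schemas.openxmlformats.org/presentationml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
}


def shapes(part_xml):
    """Iterate over every p:sp shape in a slide, layout, or master part."""
    return ET.fromstring(part_xml).iter("{%s}sp" % NS["p"])


def shape_text(sp):
    """Concatenate the a:t text runs of one shape."""
    return "".join(t.text or "" for t in sp.iterfind(".//a:t", NS)).strip()


def is_placeholder(sp):
    """True when the shape carries p:sp -> p:nvSpPr -> p:nvPr -> p:ph."""
    return sp.find("p:nvSpPr/p:nvPr/p:ph", NS) is not None


def slide_text(slide_xml, layout_xml, master_xml):
    """Slide text plus inherited layout/master text that is not a placeholder."""
    lines = [shape_text(sp) for sp in shapes(slide_xml)]
    for inherited in (layout_xml, master_xml):
        lines += [shape_text(sp) for sp in shapes(inherited) if not is_placeholder(sp)]
    return [line for line in lines if line]


def sample_part(root, *specs):
    """Build a simplified, hypothetical part from (nvPr XML, text) pairs."""
    body = "".join(
        '<p:sp><p:nvSpPr><p:cNvPr id="1" name="s"/><p:cNvSpPr/>%s</p:nvSpPr>'
        '<p:txBody><a:p><a:r><a:t>%s</a:t></a:r></a:p></p:txBody></p:sp>' % spec
        for spec in specs
    )
    return '<p:%s xmlns:p="%s" xmlns:a="%s"><p:cSld><p:spTree>%s</p:spTree></p:cSld></p:%s>' % (
        root, NS["p"], NS["a"], body, root)


PH_TITLE = '<p:nvPr><p:ph type="title"/></p:nvPr>'
USER_DRAWN = '<p:nvPr userDrawn="1"/>'

slide = sample_part("sld", (PH_TITLE, "Quarterly results"))
layout = sample_part("sldLayout", (PH_TITLE, "Click to edit Master title style"))
master = sample_part(
    "sldMaster",
    (PH_TITLE, "Click to edit Master title style"),
    (USER_DRAWN, "Text box added to the master"),
)

print(slide_text(slide, layout, master))
```

Placeholders on the slide itself are kept because their text is the user's content; only the inherited parts are filtered.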

          Nick Burch added a comment -

          It looks like we only want to exclude the placeholder ones on the layout and master slides, and only then if they're not custom.

          Well, unless there isn't a matching placeholder on the slide itself...

          Ideally we'll want to expand POI to have a full model for this. For now, I've got something roughly working in POI in XSLFPowerPointExtractor. If the logic in there seems ok, we can implement the same in Tika when we move to POI 3.8 beta 5.

          Michael McCandless added a comment -

          I tested the current XSLFPowerPointExtractor on POI's trunk and it works great (preserves the footer text and no placeholder text for my PPTX test case).

          But for PPT files (using PowerPointExtractor) we still pull the boilerplate text. That's expected, right? (I.e., we haven't fixed that case yet.)

          Michael McCandless added a comment -

          I think I found a committable workaround (patch) for including text from the master slide for PPT documents: I uncommented the existing code, but then exclude text that is type 0 (TITLE_TYPE) or 1 (BODY_TYPE), just for the master slide. In my ad-hoc testing this eliminates the boilerplate text but lets other user changes to the master slide come through correctly ... this isn't perfect but I think it's a good step forward.
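
The filter described in this comment amounts to the following, shown here as a Python sketch rather than the committed Tika patch. The type values 0 (TITLE_TYPE) and 1 (BODY_TYPE) come from the comment; the (type, text) pair representation, the function name, and the value 4 used for the footer in the sample are assumptions made for illustration.

```python
# Run types named in the comment above.
TITLE_TYPE = 0
BODY_TYPE = 1

# Types treated as boilerplate when they appear on a master slide.
SKIPPED_ON_MASTER = frozenset({TITLE_TYPE, BODY_TYPE})


def master_text(runs):
    """Keep master-slide text whose run type is neither title nor body.

    `runs` is an iterable of (run_type, text) pairs, a hypothetical stand-in
    for whatever the PPT parser exposes.
    """
    return [text for run_type, text in runs if run_type not in SKIPPED_ON_MASTER]


# Hypothetical master-slide runs; 4 is an arbitrary non-title, non-body value.
sample_runs = [
    (TITLE_TYPE, "Click to edit Master title style"),
    (BODY_TYPE, "Click to edit Master text styles"),
    (4, "Slide footer is right here"),
]

print(master_text(sample_runs))
```

As the comment notes, this is a heuristic: a user who deliberately edits the master's title or body text would have that text dropped as well.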

          Michael McCandless added a comment -

          I committed the patch; I'll leave this issue open for a possible future correct fix where we can detect boilerplate text in PPT.

          Tim Allison added a comment -

          Borrowed code from POI's PowerPointExtractor to extract Shapes instead of Runs. This fixed TIKA-1171, and the existing tests pass.


            People

            • Assignee: Unassigned
            • Reporter: Michael McCandless
            • Votes: 1
            • Watchers: 3
