To quote from the javadoc of this single class:

 * This class will get all the text from a Powerpoint Document, including
 * all the bits you didn't want, and in a somewhat random order, but will
 * do it very fast.
 * The class ignores most of the hslf classes, and doesn't use
 * HSLFSlideShow. Instead, it just does a very basic scan through the
 * file, grabbing all the text records as it goes. It then returns the
 * text, either as a single string, or as a vector of all the individual
 * strings.
 * Because of how it works, it will return a lot of "crud" text that you
 * probably didn't want! It will return text from master slides. It will
 * return duplicate text, and some mangled text (powerpoint files often
 * have duplicate copies of slide text in them). You don't get any idea
 * what the text was associated with.
 * Almost everyone will want to use @see PowerPointExtractor instead. There
 * are only a very small number of cases (eg some performance sensitive
 * lucene indexers) that would ever want to use this!

File should go in org.apache.poi.hslf.extractor.

Also needs a single line change in org.apache.poi.hslf.record.Record:

Index: Record.java
===================================================================
RCS file: /home/cvspublic/jakarta-poi/src/scratchpad/src/org/apache/poi/hslf/record/Record.java,v
retrieving revision 1.1
diff -u -r1.1 Record.java
--- Record.java	28 May 2005 05:36:00 -0000	1.1
+++ Record.java	3 Jun 2005 16:31:00 -0000
@@ -122,7 +122,7 @@
 	 * (not including the size of the header), this code assumes you're
 	 * passing in corrected lengths
 	 */
-	protected static Record createRecordForType(long type, byte[] b, int start, int len) {
+	public static Record createRecordForType(long type, byte[] b, int start, int len) {
 		// Default is to use UnknownRecordPlaceholder
 		// When you create classes for new Records, add them here
 		switch((int)type) {
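
For illustration only, the "very basic scan" described in the javadoc quoted above might look roughly like the sketch below. It is not the code from the attachment. It assumes the caller has already read the raw bytes of the "PowerPoint Document" stream out of the OLE2 file, and the names RecordScanSketch and extractText are made up for this example. It relies on the usual 8-byte PowerPoint record header (2 bytes version/instance, 2 bytes type, 4 bytes length, all little-endian) and collects the two plain text atom types, TextCharsAtom (4000) and TextBytesAtom (4008).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch class, not part of POI.
class RecordScanSketch {
    static final int TEXT_CHARS_ATOM = 4000; // text stored as UTF-16LE
    static final int TEXT_BYTES_ATOM = 4008; // text stored one byte per character

    /** Returns every text atom found in the stream, in file order. */
    static List<String> extractText(byte[] stream) {
        List<String> found = new ArrayList<>();
        scan(stream, 0, stream.length, found);
        return found;
    }

    private static void scan(byte[] b, int start, int end, List<String> found) {
        int pos = start;
        while (pos + 8 <= end) {
            int verInst = readUShort(b, pos);
            int type = readUShort(b, pos + 2);
            long len = readUInt(b, pos + 4); // length excludes the 8-byte header
            int bodyStart = pos + 8;
            if (len > end - bodyStart) {
                break; // truncated or garbage data: stop rather than guess
            }
            int bodyEnd = bodyStart + (int) len;
            if ((verInst & 0x0F) == 0x0F) {
                // Version nibble 0xF marks a container; its body holds child records.
                scan(b, bodyStart, bodyEnd, found);
            } else if (type == TEXT_CHARS_ATOM) {
                found.add(new String(b, bodyStart, (int) len, StandardCharsets.UTF_16LE));
            } else if (type == TEXT_BYTES_ATOM) {
                found.add(new String(b, bodyStart, (int) len, StandardCharsets.ISO_8859_1));
            }
            pos = bodyEnd; // skip to the next sibling record
        }
    }

    private static int readUShort(byte[] b, int p) {
        return (b[p] & 0xFF) | ((b[p + 1] & 0xFF) << 8);
    }

    private static long readUInt(byte[] b, int p) {
        return (b[p] & 0xFFL) | ((b[p + 1] & 0xFFL) << 8)
             | ((b[p + 2] & 0xFFL) << 16) | ((b[p + 3] & 0xFFL) << 24);
    }
}
```

Calling RecordScanSketch.extractText(bytes) returns the strings in the order they appear in the stream, with no attempt to drop duplicates or master slide text, which is the trade-off the javadoc warns about.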
Created attachment 15292: org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor
Added to CVS.