Bug 35208 - [PATCH] HSLF Update: new (quicker but greedy) text extractor

Status:	RESOLVED FIXED

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	POI Overall
Version:	unspecified
Hardware:	Other other

Importance:	P2 normal
Target Milestone:	---
Assignee:	POI Developers List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2005-06-03 18:33 UTC by Nick Burch
Modified:	2005-06-09 09:15 UTC
CC List:	0 users

Attachments
org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor (6.20 KB, text/x-java) 2005-06-03 18:34 UTC, Nick Burch

Description Nick Burch 2005-06-03 18:33:48 UTC

To quote from the javadoc of this single class:
 * This class will get all the text from a PowerPoint Document, including
 *  all the bits you didn't want, and in a somewhat random order, but will
 *  do it very fast.
 * The class ignores most of the hslf classes, and doesn't use 
 *  HSLFSlideShow. Instead, it just does a very basic scan through the
 *  file, grabbing all the text records as it goes. It then returns the
 *  text, either as a single string, or as a vector of all the individual
 *  strings.
 * Because of how it works, it will return a lot of "crud" text that you
 *  probably didn't want! It will return text from master slides. It will
 *  return duplicate text, and some mangled text (PowerPoint files often
 *  have duplicate copies of slide text in them). You don't get any idea
 *  what the text was associated with.
 * Almost everyone will want to use @see PowerPointExtractor instead. There
 *  are only a very small number of cases (e.g. some performance-sensitive
 *  Lucene indexers) that would ever want to use this!
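
For readers without the attachment to hand, the "very basic scan" described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the attached class: the class and method names (GreedyTextScanSketch, scan) are hypothetical, and it assumes the usual PowerPoint record layout of an 8 byte header (2 bytes of version/instance, a 2 byte little-endian type, a 4 byte little-endian length) with text held in TextBytesAtom (type 4008) and TextCharsAtom (type 4000) records. It steps into containers rather than parsing them, and skips over every other atom.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a greedy text scan; not the attached extractor.
class GreedyTextScanSketch {
    static final int TEXT_CHARS_ATOM = 4000; // text stored as UTF-16LE
    static final int TEXT_BYTES_ATOM = 4008; // text stored one byte per character

    /** Walks the raw record stream and collects every text atom it meets. */
    static List<String> scan(byte[] data) {
        List<String> found = new ArrayList<>();
        int pos = 0;
        while (pos + 8 <= data.length) {
            int opt = data[pos] & 0xFF;
            int type = (data[pos + 2] & 0xFF) | ((data[pos + 3] & 0xFF) << 8);
            long len = (data[pos + 4] & 0xFFL)
                     | ((data[pos + 5] & 0xFFL) << 8)
                     | ((data[pos + 6] & 0xFFL) << 16)
                     | ((data[pos + 7] & 0xFFL) << 24);
            int body = pos + 8;

            // Containers have 0x0F in the low nibble of the first header byte.
            // Step past the header only, so the children get scanned next.
            if ((opt & 0x0F) == 0x0F) {
                pos = body;
                continue;
            }

            // Stop if the declared length runs past the end of the data.
            if (len > data.length - body) {
                break;
            }
            int n = (int) len;

            if (type == TEXT_BYTES_ATOM) {
                found.add(new String(data, body, n, StandardCharsets.ISO_8859_1));
            } else if (type == TEXT_CHARS_ATOM) {
                found.add(new String(data, body, n, StandardCharsets.UTF_16LE));
            }
            // The length excludes the 8 byte header, so skip header plus body.
            pos = body + n;
        }
        return found;
    }
}
```

Because nothing here tracks which slide, master or notes page a record belongs to, the strings come back in file order with duplicates included, which is the trade-off the javadoc warns about.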


The file should go in org.apache.poi.hslf.extractor. It also needs a single-line
change in org.apache.poi.hslf.record.Record, making createRecordForType public
so that it can be called from outside the record package:


Index: Record.java
===================================================================
RCS file: /home/cvspublic/jakarta-poi/src/scratchpad/src/org/apache/poi/hslf/record/Record.java,v
retrieving revision 1.1
diff -u -r1.1 Record.java
--- Record.java 28 May 2005 05:36:00 -0000      1.1
+++ Record.java 3 Jun 2005 16:31:00 -0000
@@ -122,7 +122,7 @@
         *  (not including the size of the header), this code assumes you're
         *  passing in corrected lengths
         */
-       protected static Record createRecordForType(long type, byte[] b, int start, int len) {
+       public static Record createRecordForType(long type, byte[] b, int start, int len) {
                // Default is to use UnknownRecordPlaceholder
                // When you create classes for new Records, add them here
                switch((int)type) {

Comment 1 Nick Burch 2005-06-03 18:34:20 UTC

Created attachment 15292
org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor

Comment 2 Nick Burch 2005-06-09 17:15:52 UTC

Added to CVS.