Attached patch, with a first cut at using a simple (shallow) tokenizer
to interpret the specific RTF control words that determine what text
is rendered. I built this using the 1.9.1 RTF specification.
It's still rough (many nocommits) but I think it's close. All tests
pass, including a few new RTF test cases I've added.
I just created a custom tokenizer (the allowed RTF tokens are very
simple) and shallow parser. I think later we can/should cut over to a
"real" tokenizer/parser (eg JFlex)...
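
To make "shallow tokenizer" concrete, here is a minimal sketch of the
idea. It is not the code in the attached patch: the class and method
names are hypothetical, and it assumes only the basic RTF token shapes
(group braces, control words with an optional numeric parameter,
control symbols such as \'hh, and plain text runs).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a shallow RTF tokenizer; not the patch's code. */
class RtfTokenizerSketch {

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Returns tokens such as "{", "}", "\par", "\u-3913", "\'e9", or a text run. */
    static List<String> tokenize(String rtf) {
        List<String> tokens = new ArrayList<>();
        int n = rtf.length();
        int i = 0;
        while (i < n) {
            char c = rtf.charAt(i);
            if (c == '{' || c == '}') {
                // Group open/close.
                tokens.add(String.valueOf(c));
                i++;
            } else if (c == '\\') {
                int start = i++;
                if (i < n && isAsciiLetter(rtf.charAt(i))) {
                    // Control word: letters, then an optional signed numeric parameter.
                    while (i < n && isAsciiLetter(rtf.charAt(i))) i++;
                    if (i + 1 < n && rtf.charAt(i) == '-' && isDigit(rtf.charAt(i + 1))) i++;
                    while (i < n && isDigit(rtf.charAt(i))) i++;
                    tokens.add(rtf.substring(start, i));
                    // A single space after a control word is a delimiter, not text.
                    if (i < n && rtf.charAt(i) == ' ') i++;
                } else if (i < n) {
                    // Control symbol: one character, except \'hh which carries two hex digits.
                    int end = rtf.charAt(i) == '\'' ? Math.min(n, i + 3) : i + 1;
                    tokens.add(rtf.substring(start, end));
                    i = end;
                }
                // A lone trailing backslash is silently dropped in this sketch.
            } else if (c == '\r' || c == '\n') {
                // Raw line breaks in the RTF source are not rendered text.
                i++;
            } else {
                // Plain text run up to the next special character.
                int start = i;
                while (i < n && "{}\\\r\n".indexOf(rtf.charAt(i)) < 0) i++;
                tokens.add(rtf.substring(start, i));
            }
        }
        return tokens;
    }
}
```

A shallow parser on top of this would only need to track group depth
and react to the handful of control words that affect rendered text,
ignoring the rest.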
The new parser does a better job at extracting some doc structure; the
current parser just makes a single paragraph, but the new one makes a
paragraph whenever the doc said there was one. But it doesn't give
structure for tables or lists (it does extract their text).
It finds text that the old parser missed, eg footnotes, hyperlinks,
header/footer, text inside a picture, and [generally] does not add
extra whitespace (the old one sometimes breaks a word by inserting a
space). Finally the new parser fixes the Unicode character doubling.
One thing I still have to fix is that it can output mismatched tags
for i/b styles (spookily, nothing failed; maybe we should add simple
validation (under asserts), eg to XHTMLContentHandler?).
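
As a rough illustration of the kind of validation I mean, the sketch
below tracks open elements on a stack and complains when an end tag
does not match. It is a standalone, hypothetical handler built on the
JDK's DefaultHandler, not a change to XHTMLContentHandler, and it
throws SAXException rather than using asserts so it is easy to try in
isolation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: fails fast on mismatched or unclosed elements. */
class TagBalanceCheckingHandler extends DefaultHandler {

    // Names of the elements that are currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        String expected = open.peek();
        if (!qName.equals(expected)) {
            throw new SAXException(
                "Mismatched end tag </" + qName + ">, expected </" + expected + ">");
        }
        open.pop();
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("Unclosed elements at end of document: " + open);
        }
    }
}
```

With this in place, a sequence like <b><i>...</b> fails at the </b>
instead of passing silently, which is the i/b case described above.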