To assess the relative performance of the modified StandardTokenizerImpl, I ran luceneutil's TestAnalyzerPerf (historical results from the 4.x version of which are charted here: http://people.apache.org/~mikemccand/lucenebench/analyzers.html).
Here are the raw results of ten runs (after an initial run to populate the OS filesystem cache) on Linux with Oracle JDK 1.7.0_60, using enwiki-20130102-lines.txt, first against unmodified trunk:
Standard time=48581.34 msec hash=-16468587987622665 tokens=203498795
Standard time=48103.02 msec hash=-16468587987622665 tokens=203498795
Standard time=44514.19 msec hash=-16468587987622665 tokens=203498795
Standard time=48997.35 msec hash=-16468587987622665 tokens=203498795
Standard time=47794.26 msec hash=-16468587987622665 tokens=203498795
Standard time=48973.45 msec hash=-16468587987622665 tokens=203498795
Standard time=52409.88 msec hash=-16468587987622665 tokens=203498795
Standard time=49674.48 msec hash=-16468587987622665 tokens=203498795
Standard time=48257.42 msec hash=-16468587987622665 tokens=203498795
Standard time=48075.62 msec hash=-16468587987622665 tokens=203498795
Mean time=48538.10 msec
and the patched results:
Standard time=49561.77 msec hash=-16468594357435165 tokens=203498791
Standard time=49465.50 msec hash=-16468594357435165 tokens=203498791
Standard time=50194.16 msec hash=-16468594357435165 tokens=203498791
Standard time=48548.19 msec hash=-16468594357435165 tokens=203498791
Standard time=49449.01 msec hash=-16468594357435165 tokens=203498791
Standard time=52377.06 msec hash=-16468594357435165 tokens=203498791
Standard time=52433.60 msec hash=-16468594357435165 tokens=203498791
Standard time=50495.17 msec hash=-16468594357435165 tokens=203498791
Standard time=46098.29 msec hash=-16468594357435165 tokens=203498791
Standard time=48078.95 msec hash=-16468594357435165 tokens=203498791
Mean time=49670.17 msec
Comparing mean times, the patched version is ~2.3% slower (49670.17 vs. 48538.10 msec).
Comparing the fastest runs (i.e. highest throughput), the patched version is ~3.5% slower (46098.29 vs. 44514.19 msec).
I believe the reason for the relative slowdown is the use of Java's code point APIs (Character.codePointAt(), Character.charCount(), etc.) to walk the input char buffer. I think this is an acceptable performance cost in exchange for more easily maintainable single-source specifications.
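For illustration, here is a minimal sketch of the two scanning styles; this is not the actual JFlex-generated code, and classify() is a hypothetical stand-in for the scanner's word-break property lookup:

class CodePointScanSketch {
  // Code-point-aware scan over a char buffer, roughly what the patched
  // scanner does: Character.codePointAt() decodes surrogate pairs, and
  // Character.charCount() reports whether one or two chars were consumed.
  static void scanByCodePoint(char[] buf, int start, int limit) {
    for (int i = start; i < limit; ) {
      int cp = Character.codePointAt(buf, i, limit);
      classify(cp);
      i += Character.charCount(cp);
    }
  }

  // Char-at-a-time scan, roughly how the unmodified scanner consumes
  // input; it avoids the per-character method calls above, which I
  // suspect accounts for the 2-3% difference measured here.
  static void scanByChar(char[] buf, int start, int limit) {
    for (int i = start; i < limit; i++) {
      classify(buf[i]);
    }
  }

  // Hypothetical stand-in for the generated word-break property lookup.
  static void classify(int codePoint) { /* ... */ }
}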
The number of tokens, and the hash (calculated over the token text, positions, and offsets), differ slightly. I tracked this down to an unrelated change I made to the specification: I changed the ComplexContext rule, a specialization for Southeast Asian scripts, to include following WB:Format and/or WB:Extend characters, as most other rules in the specification do, following UAX#29 rule WB4. All of the tokenization differences are caused by the original specification triggering breaks at U+200C ZERO WIDTH NON-JOINER, which is a WB:Extend character, after and between Myanmar characters. When I reverted the changes to that rule in the patched version, it produced the same hash and number of tokens as the original unpatched version.
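To make the difference concrete, here is a hedged sketch of how one could inspect the tokens produced for text containing U+200C; it assumes the trunk API, where Tokenizer has a no-arg constructor plus setReader() (earlier releases pass the Reader to the constructor), and the two-letter Myanmar sample string is purely illustrative:

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class ZwnjTokenizationCheck {
  public static void main(String[] args) throws Exception {
    // Illustrative sample: two Myanmar letters separated by U+200C
    // ZERO WIDTH NON-JOINER (a WB:Extend character).
    String text = "\u1000\u200C\u1001";
    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader(text));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      // Original spec: a break is triggered at the ZWNJ, yielding two tokens;
      // with the ComplexContext rule change, the ZWNJ is absorbed per WB4.
      System.out.println(term.toString()
          + " [" + offset.startOffset() + "-" + offset.endOffset() + "]");
    }
    tokenizer.end();
    tokenizer.close();
  }
}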