Thanks again for a thorough review. New patch is attached.
in Solr, it's bad form to call Class.forName() ..
minor: you're using 4-spaces indent instead of 2 in the main value loop in your URP
Fixed (it was actually a left-over from removing an outer-level loop, all other indents are 2 spaces).
in those log.debug() calls, it's creating the string to potentially not even log it
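For reference, the usual remedy is to guard the call so the message is only built when debug logging is on. Below is a minimal, self-contained sketch using java.util.logging as a stand-in for Solr's logger (the SLF4J equivalent of the guard would be log.isDebugEnabled()). The class name, describe() and the buildCount counter are hypothetical and exist only to make the effect observable; this is not the code from the patch.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class DebugGuardSketch {
  private static final Logger log = Logger.getLogger(DebugGuardSketch.class.getName());

  // Hypothetical counter: how many times the debug message was actually built.
  static int buildCount = 0;

  // Hypothetical stand-in for a message that is costly to assemble.
  static String describe(String fieldName, Object value) {
    buildCount++;
    return "pre-analyzed field '" + fieldName + "' value: " + value;
  }

  static void process(String fieldName, Object value) {
    // Guard: build the string only if FINE (debug-level) logging is enabled.
    if (log.isLoggable(Level.FINE)) {
      log.fine(describe(fieldName, value));
    }
  }

  public static void main(String[] args) {
    log.setLevel(Level.INFO);        // debug disabled
    process("title", "hello");
    System.out.println(buildCount);  // prints 0: the message was never built
  }
}
```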
Looking at what FieldType.createField() is doing, I propose you do the same in this URP ...
Funny thing, I had this in one version of the patch, and then decided to reuse SF.createField(..) to avoid code duplication. The problem is that SchemaField.isTokenized() is package-private, so it's not visible from the URP's package. I fixed this by adding a utility method to PreAnalyzedField that creates a FieldType. I also moved there the chunk of logic that sets / resets the Field content and type flags based on SchemaField. Overall, this simplifies the URP.
The resulting "Field" instance can be shared from one document to the next I believe, and so you can cache this in the URP and reset its value & tokenStream.
Hmm, this doesn't seem feasible at all. First, this cache would have to be thread safe, and it would have to prevent reuse of Field instances until the document is actually processed by IndexWriter. I don't think there's a mechanism to enforce this in the context of this class? It would also have to cache multiple instances of Field, because processing a single document may create several of them (at least one per pre-analyzed field, more if fields are multi-valued).
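To illustrate the reuse concern in isolation, here is a small sketch that assumes a document may still be pending when the processor moves on to the next one. MutableField is a hypothetical stand-in, not Lucene's Field, and the list stands in for whatever holds documents before they are indexed.

```java
import java.util.ArrayList;
import java.util.List;

public class SharedFieldSketch {
  // Hypothetical stand-in for a reusable, mutable field object.
  static final class MutableField {
    String value;
  }

  // One cached instance is reset for each document before the earlier one is consumed.
  static List<String> queueWithSharedInstance() {
    MutableField shared = new MutableField();
    List<MutableField> pending = new ArrayList<>();
    shared.value = "doc1";
    pending.add(shared);
    shared.value = "doc2"; // reset for the next document; doc1 is still pending
    pending.add(shared);
    return readAll(pending);
  }

  // A fresh instance per document keeps each pending value intact.
  static List<String> queueWithFreshInstances() {
    List<MutableField> pending = new ArrayList<>();
    for (String v : new String[] {"doc1", "doc2"}) {
      MutableField f = new MutableField();
      f.value = v;
      pending.add(f);
    }
    return readAll(pending);
  }

  // Reads the values the way a later consumer would see them.
  private static List<String> readAll(List<MutableField> pending) {
    List<String> seen = new ArrayList<>();
    for (MutableField f : pending) {
      seen.add(f.value);
    }
    return seen;
  }

  public static void main(String[] args) {
    System.out.println(queueWithSharedInstance()); // prints [doc2, doc2]
    System.out.println(queueWithFreshInstances()); // prints [doc1, doc2]
  }
}
```

With multiple threads and multi-valued fields, the shared variant would additionally need synchronization and more than one cached instance per document, which is the complication described above.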