Varun, thank you for the comments.
I'm curious as to why the core.properties file is empty in the tar that you uploaded. Even the existing rss example has an empty core.properties. Maybe I am missing something here?
What would you expect in that file? By default the core name is the same as the directory name. The file is present, so Solr autodiscovers the core on startup, and no extra configuration is needed.
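To illustrate why an empty file is enough, here is a minimal Python sketch of the discovery rule described above. It is not Solr's actual implementation, and the function name discover_cores is made up for this example; it only assumes that a directory holding core.properties is a core and that a missing name property falls back to the directory name.

```python
import os

def discover_cores(solr_home):
    """Return {core_name: core_dir} for each directory that holds core.properties."""
    cores = {}
    for dirpath, dirnames, filenames in os.walk(solr_home):
        if "core.properties" not in filenames:
            continue
        props = {}
        with open(os.path.join(dirpath, "core.properties"), encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                # Skip blanks and comments; keep simple key=value pairs.
                if line and not line.startswith("#") and "=" in line:
                    key, value = line.split("=", 1)
                    props[key.strip()] = value.strip()
        # An empty file is fine: the name falls back to the directory name.
        default_name = os.path.basename(os.path.normpath(dirpath))
        cores[props.get("name", default_name)] = dirpath
        # Assumption for this sketch: do not search for cores below a core.
        dirnames[:] = []
    return cores
```

With an empty core.properties in a directory called atom, the sketch reports a core named atom, which is the behaviour the example relies on.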
I personally don't like the concept of these catch-all fields. I understand that this is helpful, as "/select" can then use "df=text".
If we switch to eDisMax to search the original fields, then string fields such as author will not be easily searchable and/or will require a secondary copy into a text field to be searched properly. As it is, one can facet on the string field and search on the catch-all text field.
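As a rough sketch of the two options, the Python below builds the request parameters for each. df, defType and qf are standard Solr query parameters; the helper names and the author_t field (a text copy of the string field author) are hypothetical and only show what the eDisMax route would need.

```python
from urllib.parse import urlencode

def catch_all_query(terms):
    """Search the single catch-all field, as "/select" with df=text does."""
    return urlencode({"q": terms, "df": "text"})

def edismax_query(terms, fields):
    """Search the listed fields directly; each must be a tokenized text field."""
    return urlencode({"q": terms, "defType": "edismax", "qf": " ".join(fields)})

# Catch-all: one field to search, author stays a plain string for faceting.
print(catch_all_query("solr lucene"))

# eDisMax: author would need a text copy (author_t here, hypothetical).
print(edismax_query("lucene", ["summary", "author_t"]))
```

The second form is more explicit about what is searched, but every string field added to qf needs its own text copy, which is the extra schema work mentioned above.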
I would change these three fieldTypes
I will look into that. I don't know much about points for now, so this is definitely a good suggestion to check.
I did not want to create another type unless needed (that was my big problem with the Tika example), so instead I have kept protwords.txt and put 'lucene' in there. However, if another type is better, I have no objections.
Do we need to strip out HTML? When I look at a sample summary on http://stackoverflow.com/feeds/tag/solr I see HTML characters in there.
The HTML is stripped by two DIH transformers, so the text ends up without any HTML. There is also a new-style URP in solrconfig.xml to trim the whitespace left after DIH and, importantly in my opinion, to show that it is possible to use URPs with DIH. The stored summary field content at the end looks quite presentable.
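For anyone who wants to see the effect of the two stages outside Solr, here is a small Python stand-in. It is not the DIH or URP code and uses only the Python standard library; strip_html plays the part of the DIH transformers and trim the part of the URP. Both function names are made up for this sketch.

```python
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Collects text content and drops every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(value):
    """Stage 1 (stand-in for the DIH transformers): remove markup, decode entities."""
    parser = _TextOnly()
    parser.feed(value)
    parser.close()  # flush any text still buffered after the last tag
    return "".join(parser.parts)

def trim(value):
    """Stage 2 (stand-in for the URP): drop leading and trailing whitespace."""
    return value.strip()

raw_summary = "  <p>Solr &amp; <b>Lucene</b></p>\n "
print(repr(trim(strip_html(raw_summary))))
```

Stripping alone leaves the whitespace that surrounded the tags, which is why the trim step is applied afterwards.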