Can this be customized to accommodate those languages?
Maybe, but we have to do some work first. The dictionary is limited to the GB2312 encoding, so we can't add support for new languages until this is fixed.
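To make the GB2312 limitation concrete, here is a small sketch in Python (not the analyzer's own Java code). It only checks whether a string can be represented in GB2312 at all, using Python's standard gb2312 codec. The helper name fits_in_gb2312 is made up for this illustration and is not part of Lucene.

```python
def fits_in_gb2312(text):
    """Return True if every character in text can be encoded as GB2312.

    Hypothetical helper for illustration only; it is not a Lucene API.
    """
    try:
        text.encode("gb2312")
        return True
    except UnicodeEncodeError:
        # At least one character has no GB2312 code point.
        return False


if __name__ == "__main__":
    # Simplified Chinese text versus Korean Hangul.
    for sample in ["中文", "한국어"]:
        print(sample, fits_in_gb2312(sample))
```

Any text that fails this check could not even be stored in a dictionary keyed by GB2312 code points, which is why the encoding has to change before other languages can be considered.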
Is there any wiki link or document to help us understand how this tool works? Sort of behind the scenes....
There are some sparse javadocs and code comments. Also see the original JIRA ticket:
What exactly does the dictionary contain? Is it an ordinary Chinese dictionary or some sort of customized/lemmatized dictionary?
There are two dictionaries: a word dictionary and a bigram dictionary.
These dictionaries contain words and bigrams respectively, along with their frequencies, in a "trie"-like structure organized by Chinese character.
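As a rough picture of what "trie-like, organized by character, with frequencies" can mean, here is a minimal Python sketch. It is an assumption-laden illustration, not the analyzer's actual data layout or file format: the class names, method names, and frequency values are all invented for the example, and the bigram table is simplified to a plain mapping keyed by word pairs.

```python
class TrieNode:
    """One node per character; frequency > 0 marks the end of a word."""

    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.frequency = 0   # 0 means "no word ends here"


class WordDictionary:
    """Hypothetical word dictionary: a character trie carrying frequencies."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, word, frequency):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.frequency = frequency

    def frequency(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.frequency

    def words_starting_at(self, text, start=0):
        """List (word, frequency) for each dictionary word beginning at text[start]."""
        node = self.root
        found = []
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                break
            if node.frequency > 0:
                found.append((text[start:end + 1], node.frequency))
        return found


class BigramDictionary:
    """Hypothetical bigram dictionary: (first word, second word) -> frequency."""

    def __init__(self):
        self.counts = {}

    def add(self, first, second, frequency):
        self.counts[(first, second)] = frequency

    def frequency(self, first, second):
        return self.counts.get((first, second), 0)


# Made-up entries and frequencies, purely for demonstration.
words = WordDictionary()
words.add("中", 10)
words.add("中国", 50)
words.add("中国人", 20)

bigrams = BigramDictionary()
bigrams.add("中国", "人民", 7)
```

A segmenter can walk the trie from each position in the text to collect candidate words, then use the word and bigram frequencies to choose between competing segmentations.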
Also, how can one add new words to the dictionary?
This is currently really difficult. Please see LUCENE-1817 for some background information.
For the moment you will have to compile your own custom jar file and be familiar with the file formats the analyzer uses.
Note that we have added strong warnings because we would like to change the file formats in an upcoming release to something based on Unicode.
This way, we can support more languages, and perhaps also make it easier to customize the dictionary data.