I've looked into the source code to find the cause of the encoding problems.
1. The encoding detection of the input files relies heavily on
   the default system encoding.
2. In the site generation process, String to byte array conversions occur many times.
This leads to problems that are difficult to solve.
For problem 1, I have some ideas about solutions.
There are several types of input files, for example:
- property resource files
- XML files
- apt files
and there should be a way
of specifying the encoding according to the file type.
For property resource files, I would like to use native2ascii.
Admittedly, the result is not human readable, but it rarely causes encoding problems.
And the readability problem can be avoided by automating
the native2ascii processing. The build lifecycle phase
"process-resources" would be
a good place to run such a process.
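To illustrate why the escaped form is safe, here is a minimal sketch of the conversion that native2ascii performs. The class and method names are hypothetical and this is not the tool's actual implementation; it only shows that the output is pure ASCII, so it reads the same under any ASCII-compatible default encoding.

```java
import java.util.Locale;

// Hypothetical sketch of a native2ascii-style conversion; not the real tool.
final class Native2AsciiSketch {

    /** Replaces every char outside 7-bit ASCII with a backslash-u hex escape. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);                       // plain ASCII passes through
            } else {
                // four lowercase hex digits of the UTF-16 code unit
                out.append(String.format(Locale.ROOT, "\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The literal below contains one non-ASCII character (e with acute accent).
        System.out.println(escape("title=caf\u00e9"));
    }
}
```

java.util.Properties.load(InputStream) reads ISO 8859-1 and understands these escapes, so a file converted this way loads identically on every platform.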
For XML files, I think the encoding detection should
follow the W3C XML specification.
So, MXParser should be changed to support automatic encoding detection.
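As a rough sketch of what detection along the lines of the XML specification could look like: check for a byte order mark first, then look for the encoding pseudo-attribute in the XML declaration, and fall back to UTF-8. This is a simplified illustration with hypothetical names, not MXParser code, and it assumes an ASCII-compatible encoding when no byte order mark is present (it ignores UTF-32, EBCDIC, and UTF-16 without a byte order mark).

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified encoding sniffer; not part of MXParser.
final class XmlEncodingSniffer {

    // Matches the encoding pseudo-attribute inside a leading XML declaration.
    private static final Pattern DECL = Pattern.compile(
            "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Guesses the encoding name from the first bytes of an XML document. */
    static String detect(byte[] head) {
        // 1. A byte order mark decides the encoding family.
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        // 2. Otherwise read the declaration as Latin-1, which maps bytes 1:1 to chars.
        String prefix = new String(head, 0, Math.min(head.length, 256),
                StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(prefix);
        // 3. Without a declared encoding, UTF-8 is the default.
        return m.find() ? m.group(1) : "UTF-8";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"Shift_JIS\"?><site/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(xml));
    }
}
```

The point is that the file itself says how it is encoded, so the default system encoding never has to be consulted.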
For apt files, I think the encoding detection should follow
the POM configuration. The configuration would look like the following:
For problem 2, I don't have a good solution yet.
String to byte array conversions occur many times
in the process of getting the site descriptor. In that process,
the characters seem to be converted incorrectly.
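To show how repeated conversions can corrupt text, here is a small sketch of a single String to byte array round trip where the two sides disagree about the charset. The class and method names are hypothetical, and this is not the actual site descriptor code path; it only demonstrates the kind of damage one mismatched step can do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration; not taken from the site plugin sources.
final class RoundTripSketch {

    /** Encodes text with one charset and decodes the bytes with another. */
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // ends with e-acute, one char

        // Same charset on both sides: the text survives.
        String ok = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(ok.equals(original));

        // Mismatched charsets: the two UTF-8 bytes of e-acute become two chars.
        String broken = roundTrip(original, StandardCharsets.UTF_8,
                StandardCharsets.ISO_8859_1);
        System.out.println(broken.length());
    }
}
```

Every conversion that falls back to the default system encoding is a chance for this kind of mismatch, and the errors compound when the conversion happens several times.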