The Leipzig language data files are sorted by the first token of each sentence, and the output is also sorted by language.
To improve this, the following should be done:
- The samples should be built from randomly picked lines taken from a sentences file
- The samples in the stream should be shuffled
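
The two points above could be sketched roughly as follows. This is only an illustration in Python, not the project's actual implementation: the function names (`build_samples`, `shuffled_stream`), the parameters (`lines_per_sample`, `num_samples`, `seed`) and the in-memory dict of sentences per language are all assumptions made for the example.

```python
import random


def build_samples(sentences, lines_per_sample, num_samples, rng):
    """Build samples by joining randomly picked lines from one sentences file.

    `sentences` is assumed to be a list of sentence strings that has
    already been read from a single language's file.
    """
    # Never ask for more lines than the file provides.
    k = min(lines_per_sample, len(sentences))
    samples = []
    for _ in range(num_samples):
        # Pick k distinct lines at random instead of taking consecutive
        # lines, so the file's sort order does not leak into the sample.
        picked = rng.sample(sentences, k)
        samples.append(" ".join(picked))
    return samples


def shuffled_stream(sentences_by_language, lines_per_sample, num_samples, seed=0):
    """Return (language, text) samples for all languages in shuffled order."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    stream = []
    for language, sentences in sentences_by_language.items():
        for text in build_samples(sentences, lines_per_sample, num_samples, rng):
            stream.append((language, text))
    # Shuffle across languages so the output is no longer grouped by language.
    rng.shuffle(stream)
    return stream
```

Passing a fixed `seed` keeps the sample order reproducible between runs, which can help when comparing results; whether that is desirable here is a design choice left open.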