Details
Improvement
Status: Closed
Minor
Resolution: Fixed
None
None
New
Description
DataSplitter currently creates three indexes (train/test/cv) out of an original index for the evaluation of Classifiers; however, "class coverage" in the generated indexes is not guaranteed. For example, the training index could end up containing only documents belonging to 50% of the class set, and hence classifiers trained on it may not be very effective. In order to provide a more consistent evaluation, each generated index should contain split-ratio * |docs in c| documents for each class c.
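To illustrate the per-class rule, here is a minimal sketch of a stratified split in plain Java. It is not the DataSplitter implementation and does not use any Lucene API: the class name `StratifiedSplitSketch`, the `Split` holder, the `split` method, its parameters, and the use of integer document ids grouped by class label are all assumptions made for illustration. It also assumes the per-class count is rounded down and that whatever remains after train and test goes to cv.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch, not the actual DataSplitter code.
class StratifiedSplitSketch {

    /** Holds the document ids assigned to each of the three splits. */
    static final class Split {
        final List<Integer> train = new ArrayList<>();
        final List<Integer> test = new ArrayList<>();
        final List<Integer> cv = new ArrayList<>();
    }

    /**
     * Splits documents class by class, so that each class c contributes
     * floor(ratio * |docs in c|) documents to the train and test splits;
     * the remaining documents of c go to the cv split.
     */
    static Split split(Map<String, List<Integer>> docsByClass,
                       double trainRatio, double testRatio, long seed) {
        if (trainRatio < 0 || testRatio < 0 || trainRatio + testRatio > 1.0) {
            throw new IllegalArgumentException(
                "ratios must be non-negative and sum to at most 1");
        }
        Random random = new Random(seed);
        Split result = new Split();
        for (List<Integer> docs : docsByClass.values()) {
            // Shuffle a copy so the caller's lists are left untouched.
            List<Integer> shuffled = new ArrayList<>(docs);
            Collections.shuffle(shuffled, random);
            int n = shuffled.size();
            int trainEnd = (int) Math.floor(trainRatio * n);
            int testEnd = trainEnd + (int) Math.floor(testRatio * n);
            result.train.addAll(shuffled.subList(0, trainEnd));
            result.test.addAll(shuffled.subList(trainEnd, testEnd));
            result.cv.addAll(shuffled.subList(testEnd, n));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up example data: 8 docs in one class, 4 in another.
        Map<String, List<Integer>> docsByClass = new LinkedHashMap<>();
        docsByClass.put("sports", Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7));
        docsByClass.put("politics", Arrays.asList(8, 9, 10, 11));

        Split s = split(docsByClass, 0.5, 0.25, 42L);
        System.out.println("train=" + s.train.size()
            + " test=" + s.test.size() + " cv=" + s.cv.size());
    }
}
```

With these example inputs every class appears in every split in proportion to its size, which is the "class coverage" property the description asks for. Very small classes can still contribute zero documents to a split because of the rounding down, so a real implementation might need a different rounding policy.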