Description
Sentence boundaries matter for the sliding window: we shouldn't train the model on a window that spans two sentences.
The current behavior of treating every 1000 words as a hard sentence split doesn't really make sense, and it is not consistent with either the original C version or other implementations such as deeplearning4j.
The maximum sentence length is fixed and not tunable. This change makes it tunable as well.
I made changes to address the above issues.
Here is the pull request: https://github.com/apache/spark/pull/10152
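
To illustrate the behavior described above, here is a minimal Python sketch. It is not the code from the pull request: the function names `split_long_sentences` and `context_pairs` and the parameter `max_sentence_length` are hypothetical, and a fixed window is used for simplicity where word2vec implementations typically randomize the window size. It only shows the two ideas: context windows stop at sentence boundaries, and the maximum sentence length is a parameter rather than a hard-coded 1000.

```python
from typing import Iterable, Iterator, List, Tuple


def split_long_sentences(
    sentences: Iterable[List[str]], max_sentence_length: int
) -> Iterator[List[str]]:
    """Yield each sentence in chunks of at most max_sentence_length words.

    Chunks never merge words from two different input sentences.
    """
    if max_sentence_length <= 0:
        raise ValueError("max_sentence_length must be positive")
    for sentence in sentences:
        for start in range(0, len(sentence), max_sentence_length):
            yield sentence[start:start + max_sentence_length]


def context_pairs(sentence: List[str], window: int) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs, clipping the window at the sentence ends."""
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, sentence[j]


# Two sentences; no training pair should ever link "sat" with "dogs".
corpus = [["the", "cat", "sat"], ["dogs", "bark"]]

# Hypothetical tunable limit: long sentences are cut, short ones are kept whole.
chunks = list(split_long_sentences(corpus, max_sentence_length=2))

# Pairs are generated per sentence, so the window never crosses a boundary.
pairs = [p for s in corpus for p in context_pairs(s, window=2)]
```

With a flattened word stream cut every N words, a pair such as ("sat", "dogs") would be produced; generating pairs per sentence avoids that.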