The following issues were found with CommonCrawlDumper:
1. Documents get duplicated in dump files
How to reproduce:
The first file written contains one document.
The second file contains two documents.
The third file contains the first three documents, and the count keeps growing linearly with each new file.
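
The root cause is not identified in this report. As a hedged illustration only, the sketch below shows one pattern that would produce exactly this 1, 2, 3, ... growth: a buffer that is reused for every output file but never cleared. It does not use any Nutch code; `dump_with_shared_buffer` and `dump_with_fresh_buffer` are hypothetical names made up for this example, and the real defect may be something else.

```python
# Illustrative only: no Nutch code or API is used here.
# Each "file" is modelled as the list of documents written into it.

def dump_with_shared_buffer(documents):
    """Hypothetical faulty pattern: one buffer reused and never cleared."""
    buffer = []
    files = []
    for doc in documents:
        buffer.append(doc)
        # The whole buffer is written, including documents from earlier files.
        files.append(list(buffer))
    return files


def dump_with_fresh_buffer(documents):
    """Expected behaviour: every file starts from an empty buffer."""
    files = []
    for doc in documents:
        buffer = [doc]
        files.append(buffer)
    return files


docs = ["doc1", "doc2", "doc3"]
print([len(f) for f in dump_with_shared_buffer(docs)])  # prints [1, 2, 3]
print([len(f) for f in dump_with_fresh_buffer(docs)])   # prints [1, 1, 1]
```

If the dumper behaved like the second function, every dump file would contain exactly one document.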
2. If a segment has many parts (part-00000, part-00001, ...), only the first part (part-00000) is dumped
How to reproduce:
Create a segment with two parts (part-00000 and part-00001)
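
Before running the dumper it helps to confirm that the segment really has more than one part. The helper below is a minimal sketch that assumes the usual local layout `<segment>/content/part-NNNNN`; `list_content_parts` is a hypothetical name, and the demo only builds empty directories to show the layout, which is not a valid segment that the dumper could read.

```python
import tempfile
from pathlib import Path


def list_content_parts(segment_dir):
    """Return the sorted names of the part-* directories under <segment>/content."""
    content = Path(segment_dir) / "content"
    if not content.is_dir():
        return []
    return sorted(
        p.name for p in content.iterdir()
        if p.is_dir() and p.name.startswith("part-")
    )


# Demo on a throwaway directory tree (layout only, no real segment data).
with tempfile.TemporaryDirectory() as tmp:
    segment = Path(tmp) / "segment"  # stand-in for a real segment directory
    for part in ("part-00000", "part-00001"):
        (segment / "content" / part).mkdir(parents=True)
    print(list_content_parts(segment))  # prints ['part-00000', 'part-00001']
```

If this lists two parts for the real segment but the dump only contains documents from part-00000, the issue is reproduced.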