Description
When dealing with larger Avro files (>100MB), it would be nice to have a way to quickly count the number of records contained in the file.
With the current state of avro-tools, the only way to achieve this (to my knowledge) is to dump the data to JSON and count the records. For larger files this can take a while, because of the serialization overhead and because every record has to be read.
I added a new tool that is optimized for counting records: it does not decode or serialize the records and reads only the block count of each block.
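To illustrate why reading only the block counts is cheap, here is a minimal sketch that walks the Avro object container layout by hand (magic bytes, metadata map, sync marker, then blocks of record count, byte size, payload and sync marker). It is not the code of the tool in this patch; the class name `AvroBlockCounter` and its helpers are hypothetical, and the sketch assumes a well-formed file and does not verify sync markers.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not part of avro-tools: sums the per-block record
// counts of an Avro container file without decoding any record.
final class AvroBlockCounter {
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};
    private static final int SYNC_SIZE = 16;

    // Reads one Avro long (variable-length, zig-zag encoded).
    static long readLong(InputStream in) throws IOException {
        long raw = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b < 0) throw new EOFException("truncated varint");
            raw |= (long) (b & 0x7f) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
            if (shift > 63) throw new IOException("varint too long");
        }
        return (raw >>> 1) ^ -(raw & 1);
    }

    // Skips exactly n bytes, or fails if the stream ends first.
    static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                if (in.read() < 0) throw new EOFException("truncated file");
                skipped = 1;
            }
            n -= skipped;
        }
    }

    static long countRecords(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(new BufferedInputStream(raw));
        byte[] magic = new byte[MAGIC.length];
        in.readFully(magic);
        if (!Arrays.equals(magic, MAGIC)) throw new IOException("not an Avro data file");

        // Skip the metadata map (schema, codec, ...): a series of map blocks,
        // terminated by a block with an entry count of zero.
        for (long n = readLong(in); n != 0; n = readLong(in)) {
            if (n < 0) {          // negative count is followed by the block's byte size
                n = -n;
                readLong(in);
            }
            for (long i = 0; i < n; i++) {
                skipFully(in, readLong(in)); // key (string)
                skipFully(in, readLong(in)); // value (bytes)
            }
        }
        skipFully(in, SYNC_SIZE); // the file's sync marker

        long total = 0;
        while (true) {
            in.mark(1);
            if (in.read() < 0) break; // clean end of file
            in.reset();
            long blockCount = readLong(in); // number of records in this block
            long blockBytes = readLong(in); // size of the (possibly compressed) payload
            skipFully(in, blockBytes + SYNC_SIZE); // never touch the records themselves
            total += blockCount;
        }
        return total;
    }
}
```

Calling `AvroBlockCounter.countRecords(new FileInputStream("sample.avro"))` would return the total. With the Avro library itself, the same loop can presumably be expressed through `DataFileStream` using `hasNext()`, `getBlockCount()` and `nextBlock()`, which avoids hand-parsing the header.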
# the input file had a size of ~300MB
$ du -sh sample.avro
323M sample.avro

# using the new count tool
$ time java -jar avro-tools.jar count sample.avro
331439
real 0m4.670s
user 0m6.167s
sys 0m0.513s

# the current way of counting records
$ time java -jar avro-tools.jar tojson sample.avro | wc
331439 54904484 1838231743
real 0m52.760s
user 1m42.317s
sys 0m3.209s

# the overhead of wc is rather minor
$ time java -jar avro-tools.jar tojson sample.avro > /dev/null
real 0m47.834s
user 0m53.317s
sys 0m1.194s
This tool uses the HDFS API to handle files from any supported filesystem. I added the unit tests to the existing TestDataFileTools, since it provides convenient utility functions that I could reuse for my test scenarios.
Attachments
Issue Links
- is related to AVRO-1917 DataFileStream Skips Blocks with hasNext and nextBlock calls (Open)