[AVRO-1720] Add an avro-tool to count records in an avro file - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.10.0
Component/s: java
Labels:
- starter

Description

When dealing with larger Avro files (>100MB), it would be nice to have a way to quickly count the number of records in the file.

With the current state of avro-tools, the only way to achieve this (to my knowledge) is to dump the data to JSON and count the resulting records. For larger files this can take a while, because every record has to be read and serialized to JSON.

I added a new tool that is optimized for counting records: it neither decodes the records nor serializes them to JSON, and reads only the record count stored with each block.
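To illustrate why reading only the per-block count is cheap, here is a minimal sketch that walks an Avro object container file using only the JDK. It is not the code from the attached patch. It assumes the container layout described in the Avro specification: a 4-byte magic, a metadata map, a 16-byte sync marker, then blocks that each start with a record count and a byte size. The class name `AvroBlockCounter` and its methods are hypothetical, and the sketch does not verify sync markers.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of avro-tools: counts records by reading block headers only.
final class AvroBlockCounter {
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};
    private static final int SYNC_SIZE = 16;

    /** Sums the record count of every block without decoding any record. */
    static long countRecords(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[MAGIC.length];
        in.readFully(magic);
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IOException("not an Avro container file");
        }
        skipMetadata(in);
        skipFully(in, SYNC_SIZE);                 // header sync marker (not verified here)
        long total = 0;
        int first;
        while ((first = in.read()) != -1) {       // clean EOF between blocks ends the loop
            total += readLong(in, first);         // number of records in this block
            long size = readLong(in, in.read());  // size of the (possibly compressed) payload
            skipFully(in, size + SYNC_SIZE);      // jump over payload and trailing sync marker
        }
        return total;
    }

    /** Decodes one zigzag varint long whose first byte has already been read. */
    static long readLong(InputStream in, int b) throws IOException {
        long value = 0;
        int shift = 0;
        while (true) {
            if (b < 0) throw new EOFException();
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
            if (shift > 63) throw new IOException("varint too long");
            b = in.read();
        }
        return (value >>> 1) ^ -(value & 1);
    }

    /** Skips the header metadata, an Avro map of string keys to bytes values. */
    private static void skipMetadata(InputStream in) throws IOException {
        long n;
        while ((n = readLong(in, in.read())) != 0) {
            if (n < 0) {                          // negative count: a byte size follows
                skipFully(in, readLong(in, in.read()));
                continue;
            }
            for (long i = 0; i < n; i++) {
                skipFully(in, readLong(in, in.read())); // key
                skipFully(in, readLong(in, in.read())); // value
            }
        }
    }

    private static void skipFully(InputStream in, long n) throws IOException {
        if (n < 0) throw new IOException("negative length");
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {                   // skip() may make no progress; fall back to read()
                if (in.read() == -1) throw new EOFException();
                skipped = 1;
            }
            n -= skipped;
        }
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            System.out.println(countRecords(in));
        }
    }
}
```

Because the record count sits outside the block payload, the compression codec does not matter for counting: the payload is skipped, never decompressed.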

Naive benchmark

# the input file had a size of ~300MB
$ du -sh sample.avro 
323M    sample.avro

# using the new count tool
$ time java -jar avro-tools.jar count sample.avro
331439

real    0m4.670s
user    0m6.167s
sys     0m0.513s

# the current way of counting records
$ time java -jar avro-tools.jar tojson sample.avro | wc
331439 54904484 1838231743

real    0m52.760s
user    1m42.317s
sys     0m3.209s

# the overhead of wc is rather minor
$ time java -jar avro-tools.jar tojson sample.avro > /dev/null

real    0m47.834s
user    0m53.317s
sys     0m1.194s

This tool uses the HDFS API to handle files from any supported filesystem. I added the unit tests to the existing TestDataFileTools, since it provides convenient utility functions that I could reuse for my test scenarios.

Attachments

- AVRO-1720.patch (24/Aug/15 11:57, 5 kB, Janosch Woschitz)
- AVRO-1720-with-extended-unittests.patch (25/Aug/15 12:51, 7 kB, Janosch Woschitz)

Issue Links

is related to

AVRO-1917 DataFileStream Skips Blocks with hasNext and nextBlock calls

Open

People

Assignee: Unassigned

Reporter: Janosch Woschitz

Votes: 5

Watchers: 12

Dates

Created: 24/Aug/15 11:54

Updated: 02/Jul/20 12:23

Resolved: 22/May/20 18:05