Description
It would be nice to have an avro-tools subcommand that picks only some of the records from Avro files.
I implemented a new avro-tools subcommand, cat, which takes a list of Avro files with identical schemas and concatenates them into a single file, with options to discard the first n records, to limit the number of output records, and to sample records at a given rate.
This tool allows a quick peek into large Avro files, e.g.:
java -jar avro-tools.jar cat input.avro output.avro --offset 50 --limit 10
# creates output.avro that contains records 51 to 60 from input.avro.
java -jar avro-tools.jar cat input.avro output.avro --offset 1000 --limit 100 --samplerate .01
# samples every hundredth record from input, beginning at the
# 1000th record and limiting the output to 100 records.
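The record-selection behavior behind --offset, --limit, and --samplerate can be sketched as below. This is an illustrative outline over a plain iterator, not the tool's actual internals (the real tool reads records with Avro's file reader); the class and method names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

public class CatSelection {
    /**
     * Skips the first {@code offset} records, then copies records until
     * {@code limit} records have been kept; with samplerate below 1.0,
     * each candidate record is kept with that probability.
     */
    static <T> List<T> select(Iterator<T> records, long offset, long limit,
                              double samplerate, Random rng) {
        List<T> out = new ArrayList<>();
        long seen = 0;
        while (records.hasNext() && out.size() < limit) {
            T record = records.next();
            if (seen++ < offset) {
                continue; // discard the first 'offset' records
            }
            if (samplerate >= 1.0 || rng.nextDouble() < samplerate) {
                out.add(record); // keep this record
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 1; i <= 100; i++) input.add(i);
        // Mirrors the first example: offset 50, limit 10 keeps records 51 to 60.
        List<Integer> picked = select(input.iterator(), 50, 10, 1.0, new Random());
        System.out.println(picked);
        // prints [51, 52, 53, 54, 55, 56, 57, 58, 59, 60]
    }
}
```

With samplerate set to .01, the probabilistic branch keeps roughly every hundredth record on average, which is why the sampled examples above describe the output in approximate terms.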
The tool accepts multiple input files or folders; for a folder, every file inside it is used as input.
java -jar avro-tools.jar cat data_folder output.avro --samplerate .01
# reads all the files from the data folder and
# writes every 100th record into the output file.
This tool uses the Hadoop FileSystem API to handle files, so input and output can live on any filesystem Hadoop supports.