[PIG-3865] Remodel the XMLLoader to work to be faster and more maintainable - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.13.0
Component/s: piggybank
Labels: None
Hadoop Flags: Reviewed

Description

I rewrote the XMLLoader in PiggyBank to work line by line instead of character by character. This makes it more efficient, because it applies precompiled regular expressions to each line instead of checking the input one character at a time. The code is also significantly smaller, which makes it more maintainable.
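To illustrate the line-by-line idea, here is a minimal sketch. It is not the code in the attached XMLLoader.java; the class name `LineTagScanner` and its `extract` method are hypothetical. It compiles the opening and closing tag patterns once and then tests each line against them. It assumes at most one record starts on any given line and that the tag is not nested inside itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the XMLLoader shipped in PiggyBank.
public class LineTagScanner {
    private final Pattern open;   // compiled once, reused for every line
    private final Pattern close;

    public LineTagScanner(String tag) {
        String q = Pattern.quote(tag);
        // Group 1 is "/" when the element is self-closing, e.g. <node id="1"/>.
        this.open = Pattern.compile("<" + q + "(?:\\s[^>]*?)?(/?)>");
        this.close = Pattern.compile("</" + q + "\\s*>");
    }

    // Returns every <tag>...</tag> record found in the given lines.
    // Assumption: at most one record starts per line, and no nesting.
    public List<String> extract(Iterable<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null; // non-null while inside a record
        for (String line : lines) {
            String rest = line;
            if (current == null) {
                Matcher m = open.matcher(line);
                if (!m.find()) {
                    continue; // line is outside any record
                }
                if (!m.group(1).isEmpty()) {
                    records.add(m.group()); // self-closing element
                    continue;
                }
                current = new StringBuilder();
                rest = line.substring(m.start());
            } else {
                current.append('\n');
            }
            Matcher c = close.matcher(rest);
            if (c.find()) {
                current.append(rest, 0, c.end());
                records.add(current.toString());
                current = null;
            } else {
                current.append(rest);
            }
        }
        return records;
    }
}
```

A real loader also has to deal with records that span input split boundaries (see PIG-3373), which this sketch ignores.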

To give some context: I'm a PhD student at the University of Minnesota. I built SpatialHadoop http://spatialhadoop.cs.umn.edu, an extension to Hadoop that adds spatial data types and indexes in HDFS. The system is open source and has been downloaded more than 75,000 times so far. Part of it is a simple high-level language for working with spatial data.

I proposed Pigeon http://spatialhadoop.cs.umn.edu/pigeon as a spatial extension to Pig. My case study is the planet file from OpenStreetMap, a 450GB XML file that contains all the information about the whole planet. I previously used XMLLoader to parse it, found some bugs, and fixed them in previous issues. I then found that parsing the XML file takes a lot of time. To be a good citizen, I remodeled the XMLLoader to work line by line and use precompiled regular expressions, which makes it faster. The parsing time of the compressed OSM planet file drops from 5 hours 30 minutes to 3 hours 30 minutes on my cluster setup with Hadoop 1.2.1. By the way, Pigeon was presented at ICDE 2014 http://ieee-icde2014.eecs.northwestern.edu/program.html, a top conference in data engineering.

The code is now more maintainable. For example, I can easily modify it to accept a regular expression for the XML identifier, so that it matches all tags that satisfy the regular expression instead of only a single fixed tag. In this version I didn't add any new features, but they can be added in the future.
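As a rough sketch of what such a future extension might look like (this is not part of the attached patch, and the names `TagPatternSketch`, `openingTag`, and `firstMatchingTag` are hypothetical), the opening-tag pattern could be built from a user-supplied regular expression over tag names rather than from a quoted literal:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of a regex-based tag identifier; not in the patch.
public class TagPatternSketch {

    // Builds a pattern for opening tags whose name matches tagNameRegex.
    // The lookahead requires the name to end at whitespace, '/' or '>',
    // so "way" does not match "<wayside>".
    static Pattern openingTag(String tagNameRegex) {
        return Pattern.compile("<((?:" + tagNameRegex + "))(?=[\\s/>])[^>]*>");
    }

    // Returns the name of the first matching opening tag on the line,
    // or null when the line contains none.
    static String firstMatchingTag(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.find() ? m.group(1) : null;
    }
}
```

For example, `openingTag("node|way")` would accept both `<node ...>` and `<way ...>` elements while skipping `<relation ...>`. The pattern is still compiled once and reused per line, so the line-by-line approach is unchanged.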

Attachments

XMLLoader.java, 05/Apr/14 16:51, 12 kB, Ahmed Eldawy
PIG-3865-test.txt, 05/Apr/14 17:04, 11 kB, Ahmed Eldawy
PIG-3865-2.patch, 11/Apr/14 01:52, 44 kB, Daniel Dai
bad-file.xml.bz2, 23/Apr/14 15:33, 209 kB, Ahmed Eldawy
test-file-2.xml.bz2, 23/Apr/14 15:34, 119 kB, Ahmed Eldawy

Issue Links

breaks: PIG-4617 XML loader is not working fine with pig 0.14 version (Open)

incorporates: PIG-3373 XMLLoader returns non-matching nodes when a tag name spans through the block boundary (Closed)

People

Assignee: Ahmed Eldawy

Reporter: Ahmed Eldawy

Votes: 0

Watchers: 2

Dates

Created: 03/Apr/14 15:52

Updated: 28/Dec/15 07:25

Resolved: 24/Apr/14 23:59