Issue Details (XML | Word | Printable)

Key: HADOOP-2826
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Amareshwari Sriramadasu
Reporter: Amareshwari Sriramadasu
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

FileSplit.getFile(), LineRecordReader. readLine() need to be removed

Created: 14/Feb/08 05:54 AM   Updated: 08/Jul/09 04:52 PM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: 0.17.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works patch-2826.txt 2008-04-03 04:22 AM Amareshwari Sriramadasu 14 kB
Text File Licensed for inclusion in ASF works patch-2826.txt 2008-04-02 11:44 AM Amareshwari Sriramadasu 14 kB
Text File Licensed for inclusion in ASF works patch-2826.txt 2008-04-02 04:59 AM Amareshwari Sriramadasu 13 kB
Text File Licensed for inclusion in ASF works patch-2826.txt 2008-03-31 10:51 AM Amareshwari Sriramadasu 12 kB

Hadoop Flags: Reviewed, Incompatible change
Release Note:
The deprecated methods, public File org.apache.hadoop.mapred.FileSplit.getFile() and
  public static long org.apache.hadoop.mapred.LineRecordReader.readLine(InputStream in, OutputStream out)
are removed.
The constructor org.apache.hadoop.mapred.LineRecordReader.LineReader(InputStream in, Configuration conf) 's visibility is made public.
The signature of the public org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLIne(InputStream) method is changed to UTF8ByteArrayUtils.readLIne(LineReader, Text). Since the old signature is not deprecated, any code using the old method must be changed to use the new method.
Resolution Date: 04/Apr/08 03:50 PM


 Description  « Hide
The methods FileSplit.getFile(), LineRecordReader. readLine() need to be removed as they are deprecated.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Amareshwari Sriramadasu added a comment - 31/Mar/08 10:51 AM
Here is patch removing FileSplit.getFile() and LineRecordReader.readLine()

Hadoop QA added a comment - 31/Mar/08 11:01 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378944/patch-2826.txt
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 6 new or modified tests.

patch -1. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2102/console

This message is automatically generated.


Mahadev konar added a comment - 31/Mar/08 06:45 PM
instead of public static byte [] readLine(LineReader lineReader)

why not pass a text object ?

public static readLine(LineReader lineReader, Text t)

read into t
}

the call would be something like ....
t.clear()
call readline(reader, t)

that would save creating object Text for each readline invocation?


Amareshwari Sriramadasu added a comment - 01/Apr/08 05:39 AM
Canceling patch to address Mahadev's comment.

Amareshwari Sriramadasu added a comment - 02/Apr/08 04:59 AM
Here is patch addressing Mahadev's suggestion.

Hadoop QA added a comment - 02/Apr/08 06:22 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12379102/patch-2826.txt
against trunk revision 643282.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 6 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2119/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2119/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2119/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2119/console

This message is automatically generated.


Amareshwari Sriramadasu added a comment - 02/Apr/08 11:37 AM
Cancelling the patch, because In the patch pipemapred does a arrayCopy to get byteArray of the Text's length. But Text is doing a arrayCopy for readLine. This involves two array copies for each readline call of streaming.

Amareshwari Sriramadasu added a comment - 02/Apr/08 11:44 AM
This patch removes the arrayCopy added in earlier patch and byte array is passed along with its for the split method. This solution should be
better than the earlier since we are avoiding the additional arraycopy.

Hadoop QA added a comment - 02/Apr/08 01:13 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12379123/patch-2826.txt
against trunk revision 643282.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 6 new or modified tests.

javadoc -1. The javadoc tool appears to have generated 1 warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2122/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2122/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2122/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2122/console

This message is automatically generated.


Mahadev konar added a comment - 02/Apr/08 05:49 PM
+1 . code looks good.

Amareshwari Sriramadasu added a comment - 03/Apr/08 04:22 AM
Removed javadoc warning.

Hadoop QA added a comment - 03/Apr/08 06:14 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12379229/patch-2826.txt
against trunk revision 643282.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 6 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2141/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2141/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2141/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2141/console

This message is automatically generated.


Devaraj Das added a comment - 04/Apr/08 03:50 PM
I just committed this. Thanks, Amareshwari!

Hudson added a comment - 05/Apr/08 12:13 PM

Joydeep Sen Sarma added a comment - 14/Apr/08 10:54 PM
i was pulling these changes into our production environment. quick comment:

we copy over an entire line (from readLine) and then we break it into two strings by splitting on tab. So there is an extra scan of the input data and an extra copy based on splitting by tab.

instead if we generalized LineReader to instead read until it hits a delimiter - then we can do it with one less scan and copy. Something like:

byte [] tabDelimiter = new byte [1]; tabDelimiter[0] = '\t';
byte [] newlineDelimiter = new byte[2]; newlineDelimiter[0] = '\n'; newlineDelimiter[1] = '\r';

while() {
lineReader.setDelimiter(tabDelimiter);
lineReader.readLine(key);
lineReader.setDelimiter(newlineDelimiter);
lineReader.readLine(value);
}

getting rid of one copy and scan would be pretty big!