Issue Details (XML | Word | Printable)

Key: HADOOP-4437
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Tsz Wo (Nicholas), SZE
Reporter: Tsz Wo (Nicholas), SZE
Votes: 0
Watchers: 2
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Use qMC sequence to improve the accuracy of PiEstimator

Created: 17/Oct/08 03:02 AM   Updated: 14/Jul/09 06:00 PM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: 0.20.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works 4437_20081019.patch 2008-10-19 02:26 PM Tsz Wo (Nicholas), SZE 6 kB
Text File Licensed for inclusion in ASF works 4437_20081103.patch 2008-11-04 01:00 AM Tsz Wo (Nicholas), SZE 6 kB
Issue Links:
Reference
 

Hadoop Flags: Reviewed
Resolution Date: 04/Nov/08 11:40 PM


 Description  « Hide
Currently, PiEstimator uses java.util.Random to generate random 2d-points for estimating pi. The numbers generated by java.util.Random are uniformly distributed. The 2d-points generated tense to have clump and gap. So the accuracy of the estimated pi is low. The accuracy can be improved by using a quasi-Monte Carlo (qMC) sequence.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Tsz Wo (Nicholas), SZE added a comment - 19/Oct/08 02:26 PM
4437_20081019.patch: replace java.util.Random with Halton sequence.

Try totally 100000000 samples

  • Before the patch:
    Job Finished in 22.422 seconds
    Estimated value of PI is 3.14145832
  • After the patch:
    Job Finished in 13.375 seconds
    Estimated value of PI is 3.14159256000000000000

Chris Douglas added a comment - 04/Nov/08 12:09 AM
Since it's in the examples, a few brief notes on the initialization of HaltonSequence and on nextPoint would help orient readers. It doesn't need to be a course in statistics- the existing code isn't, either- but even a sentence or two on why PiEstimator is using it and perhaps a couple comments identifying the variables would be helpful.

+1 on the patch, though; documentation is just a suggestion.


Tsz Wo (Nicholas), SZE added a comment - 04/Nov/08 01:00 AM
4437_20081103.patch: added javadoc, no code changes.

Konstantin Shvachko added a comment - 04/Nov/08 07:34 PM
This is nice. I understand it is just an example but if we run more maps for a longer period of time can we get more Pi digits?
Now, as an example it should imo have much better documentation.

Hadoop QA added a comment - 04/Nov/08 07:38 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12393275/4437_20081103.patch
against trunk revision 709609.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3527/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3527/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3527/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3527/console

This message is automatically generated.


Tsz Wo (Nicholas), SZE added a comment - 04/Nov/08 08:07 PM
> This is nice. I understand it is just an example but if we run more maps for a longer period of time can we get more Pi digits?

Yes, the more samples used the more digits will get in both Monte Carlo method (java.util.Random) and qMC method (Halton sequence).

However, the discrepancy for Halton sequence is smaller than java.util.Random. The expected error of java.util.Random is O(1/sqrt(N)) while the expected error of using Halton sequence is O((ln N)/N), where N is the number for samples. For estimating Pi with 100,000,000 samples, the accuracy of Halton is ~7 digits but java.util.Random is only ~4 digits as shown previously.

> Now, as an example it should imo have much better documentation.

This is a good point. I plan to further improve the PiEstimator. Let me also improve the documentation in the next issue.


Tsz Wo (Nicholas), SZE added a comment - 04/Nov/08 11:40 PM
Tested manually with a 100-nodes cluster. The biggest sample set I have run was
  • 1000 maps and 10000000 samples per map.
    Job Finished in 67.337 seconds
    Estimated value of PI is 3.14159264520000000000

I just committed this.