Open Relevance Project / ORP-1

Use existing collections for relevance testing

    Details

      Description

      I created a list of existing collections with queries and judgements on the wiki here: http://cwiki.apache.org/ORP/existingcollections.html
      These can be downloaded from the internet.
      (Please add more if you know of any.)

      I've created source code (Ant and Java) to download these collections and reformat them to the TREC format that the Lucene benchmark package expects.
      Each collection has its own Ant script to download the collection and Java code to reformat it, although I have some shared code at the top level.
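
      The reformatting step can be pictured with a minimal sketch like the one below. It is not taken from the patch: the class name TrecCorpusSketch and the DOC/DOCNO/TEXT tag layout are assumptions chosen for illustration, and the exact layout that TrecContentSource accepts is described in the Lucene benchmark package and may differ. The sketch only shows the general idea of writing documents in a TREC-style layout into one gzipped corpus.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper for illustration only; not part of the ORP patch. */
final class TrecCorpusSketch {

    /** Appends one document in an ASSUMED TREC-style SGML layout. */
    static void writeDoc(Writer out, String docNo, String text) throws IOException {
        out.write("<DOC>\n");
        out.write("<DOCNO>" + docNo + "</DOCNO>\n");
        out.write("<TEXT>\n" + text + "\n</TEXT>\n");
        out.write("</DOC>\n");
    }

    /** Gzips the given {docNo, text} pairs into a single in-memory corpus. */
    static byte[] gzipCorpus(String[][] docs) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the writer also finishes the gzip stream.
        try (Writer out = new OutputStreamWriter(
                new GZIPOutputStream(bytes), StandardCharsets.UTF_8)) {
            for (String[] doc : docs) {
                writeDoc(out, doc[0], doc[1]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decompresses a gzipped corpus back into a UTF-8 string. */
    static String gunzip(byte[] gz) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] gz = gzipCorpus(new String[][] {{"DOC-1", "contoh teks"}});
        System.out.print(gunzip(gz));
    }
}
```

      Writing with an explicit UTF-8 encoder rather than the platform default is a deliberate choice here, since the collections under discussion are not English.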

      The resulting output for each collection is a "corpus.gz" file, a queries file, and a judgements file.
      The corpus.gz is a gzipped file that can be indexed with Ant via the Lucene benchmark package (using TrecContentSource).
      Once the index is created, the command-line tool QueryDriver under the Lucene benchmark quality/trec package can be used to run the evaluation.
      It prints some summary output to stdout and also creates a submission file that can be fed to trec_eval.
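
      For readers unfamiliar with trec_eval, the sketch below shows the line layouts it conventionally reads: a judgements (qrels) line is "topic iteration docno relevance", and a submission (run) line is "topic Q0 docno rank score runtag". The class TrecEvalLines and its method names are hypothetical and are not part of the patch or of Lucene; the actual files are produced by the converters and by QueryDriver.

```java
import java.util.Locale;

/** Hypothetical formatter for illustration only; not part of the ORP patch. */
final class TrecEvalLines {

    /** qrels line: topic, iteration (conventionally 0), docno, relevance. */
    static String qrel(String topic, String docNo, int relevance) {
        return topic + " 0 " + docNo + " " + relevance;
    }

    /** run line: topic, the literal Q0, docno, rank, score, run tag. */
    static String run(String topic, String docNo, int rank, double score, String tag) {
        // Locale.ROOT keeps the decimal separator a '.' on every platform.
        return String.format(Locale.ROOT, "%s Q0 %s %d %.4f %s",
                topic, docNo, rank, score, tag);
    }

    public static void main(String[] args) {
        System.out.println(qrel("1", "DOC-1", 1));
        System.out.println(run("1", "DOC-1", 1, 2.5, "demo"));
    }
}
```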

      For starters, the patch will only support one collection, the Indonesian "Tempo" collection (around 23,000 docs).
      We can simply add subdirectories for additional collections (the build does something like contrib-crawl).

      Once I finish wrapping up some documentation (a description of the formats, some javadocs, and an example), I'll upload the patch.
      These formats are actually documented in the lucene-java benchmark package already, but I think it would be nice to document them here for non-Java users.

      ORP-1.patch (46 kB, Robert Muir)

        Activity

        Robert Muir added a comment -

        Here's the patch that supports the Tempo collection (initially).

        There's a README.txt in the root that describes how to use this with the Lucene benchmark package.

        (You will need to use the Lucene trunk for this; I just committed 2 minor patches so this will work.)

        Simon Willnauer added a comment -

        Good stuff, Robert! I guess we should split it up and have one issue for the basic stuff like base Ant scripts, LICENSE files, etc., and another one for the first collection code.

        Thoughts?

        Robert Muir added a comment -

        Simon, it's kind of hard to split the issues when there is nothing in the openrelevance svn!

        Though this might look large/overkill for one collection, maybe I should have done two to illustrate better? The patch is mostly the "basic stuff" you speak of: license files, build.xml files, ...

        Marvin Humphrey added a comment -

        > maybe I should have done two

        IMHO: Commit this one, do a second one, plan to refactor out common code.
        Three JIRA issues.

        Robert Muir added a comment -

        > IMHO: Commit this one, do a second one, plan to refactor out common code.

        Marvin, I would like this too. I would like to do a Persian one next, just to make sure everything is groovy with Unicode.

        Simon Willnauer added a comment -

        > IMHO: Commit this one, do a second one, plan to refactor out common code.

        You are right, let's get it going! I will commit this soon.

        simon

        Simon Willnauer added a comment -

        BAAAH! I have SVN problems:
        svn: Server sent unexpected return value (403 Forbidden) in response to CHECKOUT request for '/repos/asf/!svn/ver/783110/lucene/openrelevance/trunk'

        That is what I get when I want to commit it... Ideas?

        simon

        Robert Muir added a comment -

        > That is what I get when I want to commit it... Ideas?

        Are you using https?

        Simon Willnauer added a comment -

        > Are you using https?

        Yep!

        Robert Muir added a comment -

        > Yep!

        Then my next guess would be that it is a legitimate permissions problem in svn.

        Grant Ingersoll added a comment -

        Can you get to the SVN via a browser: https://svn.apache.org/repos/asf/lucene/openrelevance/

        Your user name is simonw, right? Are you on the EU mirror? Can you try the US one?

        Simon Willnauer added a comment (edited) -

        > Can you get to the SVN via a browser: https://svn.apache.org/repos/asf/lucene/openrelevance/

        Yes I can!

        > Your user name is simonw, right?

        Right.

        > Are you on the EU mirror? Can you try the US one?

        I'm on EU. I will try to do the commit on the US one once I have a stable internet connection (sitting on a train right now).

        Simon Willnauer added a comment -

        Grant, I tried it on both US and EU. I always get the same stupid error.
        I googled a bit and found that the URL in the authz file might be slightly wrong (upper/lower case issues). Are you able to check this?

        simon

        Simon Willnauer added a comment -

        Committed in revision 881953.

        Thank you, Robert! We eventually fixed the SVN issue.

        Robert Muir added a comment -

        Hey, this is good to see!

        I will try to do a Persian one tonight, then we can rewrite/refactor/redesign everything.

        Robert Muir added a comment -

        Simon, when you get a chance, can you set svn:eol-style to native in svn? Here is the list of files that need it:
        M LICENSE.txt
        M common-build.xml
        M src\java\org\apache\or\util\TrecQrel.java
        M src\java\org\apache\or\util\TrecDocumentWriter.java
        M src\java\org\apache\or\util\TrecTopicWriter.java
        M src\java\org\apache\or\util\TrecDocument.java
        M src\java\org\apache\or\util\TrecTopic.java
        M src\java\org\apache\or\util\TrecQrelWriter.java
        M FILEFORMATS.txt
        M build.xml
        M collections\tempo\src\java\org\apache\or\collections\tempo\TempoQrelConverter.java
        M collections\tempo\src\java\org\apache\or\collections\tempo\TempoCorpusConverter.java
        M collections\tempo\src\java\org\apache\or\collections\tempo\TempoTopicConverter.java
        M collections\tempo\build.xml
        M collections\collections-build.xml


          People

          Assignee: Simon Willnauer
          Reporter: Robert Muir