Lucene - Core / LUCENE-4590

WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/benchmark
    • Labels: None

      Description

      It may be convenient to split Wikipedia's line file into two separate files: one for category pages and one for non-category pages.
      The original line file can be split afterwards with grep or a similar tool,
      but it is more efficient to perform the split in advance, while the line file is being written.
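
For illustration, the after-the-fact split mentioned above might look like the following minimal sketch. It assumes a tab-separated line file whose first field is the page title, and that category pages have titles starting with `Category:`. The class and method names are hypothetical and are not part of the benchmark module.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits line-file records into category and non-category lists. */
final class LineFileSplitter {
  /** Assumed title prefix of Wikipedia category pages. */
  static final String CATEGORY_PREFIX = "Category:";

  /** True if the first tab-separated field (assumed to be the title) names a category page. */
  static boolean isCategoryLine(String line) {
    int tab = line.indexOf('\t');
    String title = tab < 0 ? line : line.substring(0, tab);
    return title.startsWith(CATEGORY_PREFIX);
  }

  /** Returns a two-element list: index 0 holds non-category lines, index 1 holds category lines. */
  static List<List<String>> split(List<String> lines) {
    List<String> pages = new ArrayList<>();
    List<String> categories = new ArrayList<>();
    for (String line : lines) {
      (isCategoryLine(line) ? categories : pages).add(line);
    }
    List<List<String>> result = new ArrayList<>();
    result.add(pages);
    result.add(categories);
    return result;
  }
}
```

Doing the same test while the line file is first written avoids a second pass over a very large file, which is the efficiency argument made in the description.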

        Activity

        Shai Erera added a comment -

        Do you think perhaps that EnwikiContentSource should let the caller know whether the returned DocData represents a content page or a category page? In general, for text indexing benchmarks, I don't think that indexing the category pages adds much value, because they are very short and will rarely come back as a result for any query. Rather, their content seems to represent Wikipedia's taxonomy.

        So maybe, if someone wants to generate a line file from the pages only, EnwikiContentSource can support filtering out category pages entirely. That will allow the flexibility that I think you are trying to achieve:

        • If discardCategories = true, WriteLineDocTask will write just the content pages. Otherwise, it will write both.
        • WriteEnwikiLineDocTask will detect the page type from a specialized DocData and know easily to which file to write it.
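
A minimal sketch of the filtering idea in the first bullet, under stated assumptions: `FilteringSource` is a hypothetical stand-in for a content source, documents are reduced to their titles, and a category page is recognized by a title starting with `Category:`. It is not the EnwikiContentSource API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a content source that can drop category pages. */
final class FilteringSource {
  private final List<String> titles;
  private final boolean discardCategories;

  FilteringSource(List<String> titles, boolean discardCategories) {
    this.titles = titles;
    this.discardCategories = discardCategories;
  }

  /** Returns the titles this source would hand to the writing task. */
  List<String> emitted() {
    List<String> out = new ArrayList<>();
    for (String title : titles) {
      // Assumption: category pages are identified by their title prefix.
      boolean isCategory = title.startsWith("Category:");
      if (discardCategories && isCategory) {
        continue; // skip category pages entirely
      }
      out.add(title);
    }
    return out;
  }
}
```

With the flag set, a generic writing task downstream needs no knowledge of Wikipedia's structure, because it only ever sees content pages.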
        Doron Cohen added a comment -

        "Do you think perhaps that EnwikiContentSource should let the caller know whether the returned DocData represents a content page or category page?"

        That's what I planned at the start, but I decided to leave WriteLineDoc intact because it is general, that is, not aware of the unique structure of Wikipedia data, where some of the pages represent categories.

        "So maybe, if someone wants to generate a line file from the pages only... flexibility that I think you are trying to achieve..."

        Actually I am after the two files... These category pages are (unique) taxonomy node names, but without the taxonomy structure, which can be deduced from the (parent) categories of the category pages. Having the category pages in a separate file can be useful for deducing that taxonomy.
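
The deduction described above could be sketched as follows. This assumes that the body of a category page is raw wikitext in which parent categories appear as `[[Category:Name]]` or `[[Category:Name|sort key]]` links. The class name `ParentCategories` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts parent category names from a category page body. */
final class ParentCategories {
  // Matches [[Category:Name]] and [[Category:Name|sort key]]; group 1 is the name.
  private static final Pattern LINK =
      Pattern.compile("\\[\\[Category:([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]");

  /** Returns the parent category names in the order they appear in the body. */
  static List<String> parentsOf(String body) {
    List<String> parents = new ArrayList<>();
    Matcher m = LINK.matcher(body);
    while (m.find()) {
      parents.add(m.group(1).trim());
    }
    return parents;
  }
}
```

Applying `parentsOf` to every record of the categories file yields child-to-parent edges from which a taxonomy could be assembled.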

        Shai Erera added a comment -

        "That's what I planned at the start, but decided to leave WriteLineDoc intact because it is general, that is, not aware of the unique structure of Wikipedia data, where some of the pages represent categories."

        I think that you misunderstood me, or I wasn't clear enough. WriteLineDoc would not change; EnwikiContentSource would. If someone is interested in creating a line file over all Wikipedia pages, they would put in their .alg file something like content.source=EnwikiContentSource and enwiki.source.exclude.categories=false; otherwise, enwiki.source.exclude.categories=true. WriteLineDocTask would still write the DocData that the source returns.

        EnwikiContentSource will return either DocData or CategoryDocData, or a single object EnwikiDocData with an extra boolean isCategory. WriteLineDoc will still read just the DocData fields it knows about. WriteEnwikiLineDoc will write the DocData to the relevant file, per isCategory.
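
A minimal sketch of the single-object variant, assuming simplified stand-in types: `SimpleDocData`, `SimpleEnwikiDocData`, and `TwoFileWriter` are hypothetical names, only a title is modeled, and the two output files are represented by in-memory lists. It is not the code in the patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the benchmark's DocData; only a title is modeled here. */
class SimpleDocData {
  final String title;

  SimpleDocData(String title) {
    this.title = title;
  }
}

/** Hypothetical subclass carrying the extra isCategory flag. */
final class SimpleEnwikiDocData extends SimpleDocData {
  final boolean isCategory;

  SimpleEnwikiDocData(String title, boolean isCategory) {
    super(title);
    this.isCategory = isCategory;
  }
}

/** Routes each document to one of two in-memory "files" according to isCategory. */
final class TwoFileWriter {
  final List<String> pagesFile = new ArrayList<>();
  final List<String> categoriesFile = new ArrayList<>();

  void write(SimpleDocData doc) {
    // A plain SimpleDocData has no flag and is treated as a content page.
    boolean category =
        doc instanceof SimpleEnwikiDocData && ((SimpleEnwikiDocData) doc).isCategory;
    (category ? categoriesFile : pagesFile).add(doc.title);
  }
}
```

A generic writer that only reads the base-class fields keeps working unchanged, which is the point of putting the flag in a subclass.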

        "Actually I am after the two files"

        I know. I am not proposing anything different; I am just discussing how the code could be designed to achieve that and, as a bonus, allow someone to exclude the category pages from regular benchmarks.

        Doron Cohen added a comment -

        Now I see what you mean. Spooky, it is as if you were looking into the patch I did not post here... How did you know I chose not to modify EnwikiContentSource?

        I agree that if someone wishes to index just the non-category pages, the new WriteEnwikiLineDoc would create the category pages file needlessly. Also, if indexing is conducted straight away, not through a line file first, categories will be indexed. But then anyone could check the title and decide not to index those docs. So I see the advantage; I am just not tempted to add this at the moment, but it can be added later.

        Doron Cohen added a comment -

        Patch with the new task and a test.

        Commit Tag Bot added a comment -

        [trunk commit] Doron Cohen
        http://svn.apache.org/viewvc?view=revision&revision=1418852

        LUCENE-4590: Added WriteEnwikiLineDocTask.

        Commit Tag Bot added a comment -

        [branch_4x commit] Doron Cohen
        http://svn.apache.org/viewvc?view=revision&revision=1418955

        LUCENE-4590: Merge from trunk: Add WriteEnwikiLineDoc which writes Wikipedia category pages to a separate file.

        Doron Cohen added a comment -

        Done.

        Doron Cohen added a comment -

        Reopening the issue to make the method that computes the categories file name, categoriesLineFile(), public, so that the naming can easily be modified in the future without breaking application logic.
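
To illustrate why a public naming method helps, here is a minimal sketch. The `categories-` prefix and the class name `CategoriesFileNaming` are assumptions made for this example; the actual naming rule lives in the committed task.

```java
import java.io.File;

/** Hypothetical illustration of deriving the categories file from the main line file. */
final class CategoriesFileNaming {
  /** Places the categories file next to the line file; the "categories-" prefix is assumed. */
  static File categoriesLineFile(File lineFile) {
    File dir = lineFile.getParentFile();
    String name = "categories-" + lineFile.getName();
    return dir == null ? new File(name) : new File(dir, name);
  }
}
```

An application that calls such a method, instead of hard-coding the derived name, keeps working if the naming rule changes later.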

        Commit Tag Bot added a comment -

        [trunk commit] Doron Cohen
        http://svn.apache.org/viewvc?view=revision&revision=1419317

        LUCENE-4590: WriteEnwikiLineDoc "trailing change": make categoriesLineFile(File) public.

        Commit Tag Bot added a comment -

        [branch_4x commit] Doron Cohen
        http://svn.apache.org/viewvc?view=revision&revision=1419323

        LUCENE-4590: WriteEnwikiLineDoc "trailing change": make categoriesLineFile(File) public.

        Doron Cohen added a comment -

        done.


          People

          • Assignee: Doron Cohen
          • Reporter: Doron Cohen
          • Votes: 0
          • Watchers: 2
