Details

      Description

      Currently

      • The CSV example has 10 documents.
      • The JSON example has 4 documents.
      • The XML example has 32 documents.

      1. All the example formats should contain the same documents, and therefore an equal number of documents.
      2. The data set should be slightly more comprehensive.

      1. film.csv
        25 kB
        Varun Thacker
      2. film.json
        60 kB
        Varun Thacker
      3. film.xml
        52 kB
        Varun Thacker
      4. freebase_film_dump.py
        4 kB
        Varun Thacker
      5. freebase_film_dump.py
        4 kB
        Varun Thacker
      6. freebase_film_dump.py
        3 kB
        Varun Thacker
      7. freebase_film_dump.py
        3 kB
        Varun Thacker
      8. freebase_film_dump.py
        3 kB
        Varun Thacker
      9. freebase_film_dump.py
        4 kB
        Varun Thacker
      10. freebase_film_dump.py
        3 kB
        Varun Thacker
      11. LICENSE.txt
        0.3 kB
        Varun Thacker
      12. README.txt
        2 kB
        Varun Thacker
      13. README.txt
        2 kB
        Varun Thacker
      14. SOLR-6127.patch
        187 kB
        Varun Thacker

        Activity

        varunthacker Varun Thacker added a comment -

        I thought Freebase would be a good place to get data from.

        Uwe Schindler - Would using the data from Freebase ( https://developers.google.com/freebase/faq#rules_for_using_data ) be a licensing issue?

        If that's not a concern, here is a script which fetches 200 rows of film data ( http://www.freebase.com/film ) and dumps it into JSON, XML and CSV.

        The number of documents can be adjusted. You need to put in your API key for it to run.

        Any opinions on whether this is a good idea?
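A minimal sketch of the "same documents in every format" idea, not the attached script itself. The two records, the `|` separator used for multiple values in CSV, and the helper names `to_json`, `to_csv` and `to_xml` are all assumptions made for illustration; no Freebase call is made.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical records standing in for rows fetched from the data source.
FILMS = [
    {"id": "film-1", "name": "Example Film One", "genre": ["Drama", "Comedy"]},
    {"id": "film-2", "name": "Example Film Two", "genre": ["Thriller"]},
]


def to_json(films):
    """Serialize the records as a JSON array of documents."""
    return json.dumps(films, indent=2)


def to_csv(films, multi_sep="|"):
    """Serialize the records as CSV; multiple genres are joined with multi_sep (an assumption)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id", "name", "genre"])
    for film in films:
        writer.writerow([film["id"], film["name"], multi_sep.join(film["genre"])])
    return buf.getvalue()


def to_xml(films):
    """Serialize the records as <add><doc><field name="...">, one field element per value."""
    add = ET.Element("add")
    for film in films:
        doc = ET.SubElement(add, "doc")
        for key, value in film.items():
            values = value if isinstance(value, list) else [value]
            for v in values:
                ET.SubElement(doc, "field", name=key).text = v
    return ET.tostring(add, encoding="unicode")
```

Because all three writers consume the same list, the outputs cannot drift apart in document count, which is the first goal in the description.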

        iorixxx Ahmet Arslan added a comment -

        I tried to run it with

        Python 3.4.0 (v3.4.0:04f714765c13, Mar 15 2014, 23:02:41) 
        [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
        Type "help", "copyright", "credits" or "license" for more information.
        

        it complains:

        Traceback (most recent call last):
          File "freebase_film_dump.py", line 5, in <module>
            import cStringIO
        ImportError: No module named 'cStringIO'
        

        I see that in the example documents the XML files have license headers, while the JSON and CSV files don't. Does this script add license headers to the XML?

        varunthacker Varun Thacker added a comment -

        I used Python 2.7.

        You might need to modify the script to run on Python 3.x - http://stackoverflow.com/questions/11914472/stringio-in-python3

        Yes indeed, the current exampledocs don't have the license in the JSON and CSV files, while the XML files do. I guess we should fix that in this issue as well.
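One common way to make an import like the one in the traceback work on both interpreters is a guarded import. This is only a sketch of that pattern, not the change later made to the attached script; the CSV header row written below is a made-up example.

```python
import csv

try:
    from cStringIO import StringIO  # exists on Python 2 only
except ImportError:
    from io import StringIO  # Python 3 location of the in-memory text buffer

# Write one hypothetical header row into the buffer to show it behaves the same way.
buf = StringIO()
csv.writer(buf, lineterminator="\n").writerow(["id", "name"])
header = buf.getvalue()
```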

        iorixxx Ahmet Arslan added a comment -

        I get exceptions with

        Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05) 

        too. Sorry, I am Python-ignorant; maybe it is a good idea to use the same (or a compatible) version that can run smokeTestRelease.py?

        steve_rowe Steve Rowe added a comment -

        Uwe Schindler - Would using the data from freebase ( https://developers.google.com/freebase/faq#rules_for_using_data ) be a licensing issue?

        Apache releases may contain material licensed under CC-A, which AFAICT is the same thing as CC-BY, under which Freebase licenses everything except for full images - see http://www.apache.org/legal/resolved.html#category-a - Category A includes CC-A 2.5 and 3.0.

        If that's not a concern, here is a script which fetches 200 rows of film data ( http://www.freebase.com/film ) and dumps it into JSON, XML and CSV.
        The number of documents can be adjusted. You need to put in your API key for it to run.
        Any opinions on whether this is a good idea?

        +1

        I get exceptions with

        Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05) 

        too. Sorry, I am Python-ignorant; maybe it is a good idea to use the same (or a compatible) version that can run smokeTestRelease.py?

        +1

        varunthacker Varun Thacker added a comment -

        Updated patch with the Apache License. Also I attached the outputs in all 3 formats.

        Once you put your developer key on line 24 of the script, you should be able to run it without any exceptions. If you do run into any, post the stack trace and I will fix it.

        You can get your key from https://code.google.com/apis/console/

        I will soon start working on updating the solrconfig and schema files.

        thetaphi Uwe Schindler added a comment -

        Hi,
        I think the license of the data is fine (CC-BY, previously known as CC-A), so we can include the files with the distribution. In any case we have to add the attribution in our NOTICE.txt file (Solr part). We should also add a license header to the files (CC header). I am not sure if JSON and CSV support this, but XML certainly does.

        thetaphi Uwe Schindler added a comment -

        Sorry, I am Python-ignorant; maybe it is a good idea to use the same (or a compatible) version that can run smokeTestRelease.py?

        All our tools require Python 3.3. Python 2.7 is no longer used by Lucene (in most cases, I think regenerating MOMAN automaton may need 2.7?).

        varunthacker Varun Thacker added a comment -

        Updated the script to work with Python 3.x. Once you put in your API_KEY you should be able to generate data in all 3 formats.

        iorixxx Ahmet Arslan added a comment -

        Hi Varun, with your latest Python script I generated film.xml, film.csv and film.json successfully. Here are some observations:

        • In the XML, genre is a single value separated by percent signs. I think this should be a multivalued field?
        • The generated film.xml does not have a license header. I thought it would have one, no?
        • The type field has the value "/film/film" for all docs. Is this expected?
        varunthacker Varun Thacker added a comment -

        In the XML, genre is a single value separated by percent signs. I think this should be a multivalued field?

        Fixed. Thanks!

        The generated film.xml does not have a license header. I thought it would have one, no?

        Added the license header.

        The type field has the value "/film/film" for all docs. Is this expected?

        Yes, all docs will have type = "/film/film", as that's the Freebase category type we are fetching the data from.
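The genre fix can be pictured as splitting the percent-separated string and emitting one field element per value. This is a sketch under assumptions, not the code in the attached script: the helper name `genre_fields` and the sample value "Drama%Comedy" are made up.

```python
import xml.etree.ElementTree as ET


def genre_fields(doc, raw_genre, sep="%"):
    """Append one <field name="genre"> element to doc for each non-empty value in raw_genre."""
    for value in raw_genre.split(sep):
        value = value.strip()
        if value:
            ET.SubElement(doc, "field", name="genre").text = value
    return doc


# Hypothetical input in the percent-separated form described in the observation.
doc = genre_fields(ET.Element("doc"), "Drama%Comedy")
xml_text = ET.tostring(doc, encoding="unicode")
```

Repeating the field element is how a document carries several values for one field in the XML update format, so the field can then be treated as multivalued.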

        thetaphi Uwe Schindler added a comment -

        Added the license header

        I think this should be a CC-BY license header, not ASF.

        varunthacker Varun Thacker added a comment -

        The XML output adds the Creative Commons Attribution 2.5 header instead of the ASF license.

        varunthacker Varun Thacker added a comment -

        You need to put in your API key to run the script. It runs with Python 3.

        I created a README which helps with getting started: loading the data and starting to search.

        The license for the data is in the LICENSE.txt file. I have not attached the generated output in any format in this patch.

        A couple of points I noted while creating the README -

        1. I am assuming that our new default will be schemaless mode, which means we can use the managed schema to index the documents.
        2. Can we change the /select handler to default to JSON with indent on?

        I feel that putting nested documents in a separate example is a better approach. We should not complicate the experience for new users who don't care about such data.

        varunthacker Varun Thacker added a comment -
        • Updated the readme
        • Added a film artificially in the script to play nice with schemaless mode.
        varunthacker Varun Thacker added a comment -

        I think we could do the following -

        1. Replace all the data in the exampledocs folder with the film.json|xml|csv files.
        2. Put the Python script in the dev-tools folder so that we can use it if we want to update the data in the future.
        3. Drop the LICENSE.txt file into the exampledocs folder?

        On the website I can see this place which would need to be updated -
        "Indexing Solr XML", "Indexing JSON", "Indexing CSV (Comma/Column Separated Values)" - http://lucene.apache.org/solr/quickstart.html

        Maybe also update the "Searching" section on the quickstart page? We could use the material in the README.txt uploaded here.

        Oh, we will also have to update the schema in the "sample_techproducts_configs" configset and the browse handler in solrconfig for the new data.

        varunthacker Varun Thacker added a comment -

        The patch does a few things:

        1. Removes all current exampledocs files
        2. Adds film.xml, film.json, film.csv and the license file
        3. Adds exampledocs_generator.py to the dev-tools folder
        4. Modifies schema.xml appropriately

        Now we need to decide whether to rename the techproducts configset to film.

        jira-bot ASF subversion and git services added a comment -

        Commit 1647918 from Erik Hatcher in branch 'dev/trunk'
        [ https://svn.apache.org/r1647918 ]

        SOLR-6127: Improve example docs, using films data

        ehatcher Erik Hatcher added a comment -

        Made a first commit of this, to trunk. Made some adjustments, like renaming the generated files to plural (films instead of film). This works well with the steps from the included README.txt.

        Porting to 5x is a consideration, but for now we'll proceed with this on trunk and work on migrating to films instead of techproducts.

        ehatcher Erik Hatcher added a comment -

        Chris Hostetter (Unused) - thoughts on this for 5x? I can see the rationale for keeping techproducts in 5x (there are examples like spatial that aren't implemented anywhere else), but any objections to this data at least being added to 5x? And if it is added to 5x what needs to be done to dot the i's with the Ref Guide or anything else?

        ehatcher Erik Hatcher added a comment -

        One issue with adding films.xml is that posting *.xml now picks it up, but the expectation from the techproducts tutorials/examples is that the only XML files there are techproducts. It caught me off guard: I'm used to seeing 32 documents, and now I had a thousand-something.

        Hmmm, maybe the films data should go into a separate sub-directory?
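A small sketch of why a sub-directory helps: a non-recursive `*.xml` pattern stops matching a file once it lives one level down. The directory layout and the file name `monitor.xml` below are stand-ins created in a temporary directory, not the real example tree.

```python
import glob
import os
import tempfile


def xml_matches(directory):
    """Return the base names matched by a non-recursive *.xml pattern in directory."""
    return sorted(os.path.basename(p) for p in glob.glob(os.path.join(directory, "*.xml")))


with tempfile.TemporaryDirectory() as exampledocs:
    # Stand-in files: one techproducts-style document and the films data side by side.
    for name in ("monitor.xml", "films.xml"):
        open(os.path.join(exampledocs, name), "w").close()
    flat = xml_matches(exampledocs)

    # Move the films data into its own sub-directory and match again.
    films_dir = os.path.join(exampledocs, "films")
    os.mkdir(films_dir)
    os.rename(os.path.join(exampledocs, "films.xml"), os.path.join(films_dir, "films.xml"))
    separated = xml_matches(exampledocs)
```

Here `flat` contains both files, while `separated` contains only the techproducts-style one.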

        varunthacker Varun Thacker added a comment -

        +1. Let's put it in a sub-directory if we are not removing/improving techproducts for 5.x

        jira-bot ASF subversion and git services added a comment -

        Commit 1648540 from Erik Hatcher in branch 'dev/trunk'
        [ https://svn.apache.org/r1648540 ]

        SOLR-6127: move films example (data) to its own subdirectory

        ehatcher Erik Hatcher added a comment -

        I moved it up as a first-class citizen under example (so that it can have its own config/view later if needed).

        jira-bot ASF subversion and git services added a comment -

        Commit 1649355 from Erik Hatcher in branch 'dev/trunk'
        [ https://svn.apache.org/r1649355 ]

        SOLR-6127: fix paths in README

        jira-bot ASF subversion and git services added a comment -

        Commit 1649356 from Erik Hatcher in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1649356 ]

        SOLR-6127: add films data to 5x

        ehatcher Erik Hatcher added a comment -

        Committed to both 5x and trunk. This will eventually warrant updates to tutorials and other documentation, but we can close this issue now.

        jira-bot ASF subversion and git services added a comment -

        Commit 1649376 from Erik Hatcher in branch 'dev/trunk'
        [ https://svn.apache.org/r1649376 ]

        SOLR-6127: Fix reference to previously renamed script

        jira-bot ASF subversion and git services added a comment -

        Commit 1649377 from Erik Hatcher in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1649377 ]

        SOLR-6127: Fix reference to previously renamed script (merged from trunk r1649376)

        jira-bot ASF subversion and git services added a comment -

        Commit 1649523 from Erik Hatcher in branch 'dev/trunk'
        [ https://svn.apache.org/r1649523 ]

        SOLR-6127: README improvements

        jira-bot ASF subversion and git services added a comment -

        Commit 1649525 from Erik Hatcher in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1649525 ]

        SOLR-6127: README improvements (merged from trunk r1649523)

        jira-bot ASF subversion and git services added a comment -

        Commit 1650688 from Erik Hatcher in branch 'dev/trunk'
        [ https://svn.apache.org/r1650688 ]

        SOLR-6127: More improvements to the films example: remove fake document, README steps polished

        jira-bot ASF subversion and git services added a comment -

        Commit 1650689 from Erik Hatcher in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1650689 ]

        SOLR-6127: More improvements to the films example: remove fake document, README steps polished (merged from trunk r1650688)

        anshumg Anshum Gupta added a comment -

        Bulk close after 5.0 release.


          People

          • Assignee:
            ehatcher Erik Hatcher
            Reporter:
            varunthacker Varun Thacker
          • Votes:
            2
            Watchers:
            8
