SOLR-1690

JSONKeyValueTokenizerFactory -- JSON Tokenizer

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      Sometimes it is nice to group structured data into a single field.

      This (rough) patch takes JSON input and indexes tokens based on the key/value pairs in the JSON.

      schema.xml
      <!-- JSON Field Type -->
          <fieldtype name="json" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
            <analyzer type="index">
              <tokenizer class="solr.JSONKeyValueTokenizerFactory" keepArray="true" hierarchicalKey="false"/>
              <filter class="solr.TrimFilterFactory"/>
              <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
            <analyzer type="query">
              <tokenizer class="solr.KeywordTokenizerFactory"/>
              <filter class="solr.TrimFilterFactory" />
              <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
          </fieldtype>
      

      Given text:

       { "hello": "world", "rank":5 }
      

      indexed as two tokens:

      term position      1            2
      term text          hello:world  rank:5
      term type          word         word
      source start,end   12,17        27,28
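
The token output above can be imitated outside Solr with a minimal sketch. This is not the code from the attached patch: the function name `key_value_tokens` is hypothetical, only flat objects with scalar values are handled, and the `keepArray` and `hierarchicalKey` options are not modelled because the issue does not describe their behaviour. The trim and lowercase steps stand in for the two filters in the index analyzer.

```python
import json


def key_value_tokens(text):
    """Return 'key:value' tokens for a flat JSON object (illustrative only)."""
    obj = json.loads(text)
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    tokens = []
    for key, value in obj.items():
        if isinstance(value, (dict, list)):
            # Assumption: nested values are out of scope for this sketch.
            raise ValueError("nested values are not handled here")
        # Strings are used as-is; numbers, booleans and null use their JSON form.
        rendered = value if isinstance(value, str) else json.dumps(value)
        # Mimic TrimFilterFactory followed by LowerCaseFilterFactory.
        tokens.append(f"{key}:{rendered}".strip().lower())
    return tokens


print(key_value_tokens('{ "hello": "world", "rank":5 }'))
# ['hello:world', 'rank:5']
```
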
      1. noggit-1.0-A1.jar
        21 kB
        Ryan McKinley
      2. SOLR-1690-JSONKeyValueTokenizerFactory.patch
        7 kB
        Ryan McKinley

        Activity

        Ryan McKinley added a comment -

        Here is a simple JSON tokenizer. No tests, but a good place to start if you are looking to do something similar.

        Ryan McKinley added a comment -

        This tokenizer uses noggit http://svn.apache.org/repos/asf/labs/noggit/

        Hoss Man added a comment -

        FWIW: I'm finding it hard to imagine use cases where this would be useful, so I have no feedback or suggestions on the patch or its usage.

        Ryan McKinley added a comment -

        I have been using it to have structured data stored in a single field. Kind of like a less cryptic version of:
        http://wiki.apache.org/solr/UserTagDesign

        I'm not sure it belongs in /trunk, but wanted to post it here so that others could use it if they want...

        Hayder Marzouk added a comment -

        Hi Ryan,
        Great solution. It's what I am looking for.
        Can you attach the full code, please?

        Prashant Saraswat added a comment -

        @Ryan McKinley: Many thanks for attaching the patch here. It is most useful.

        @Hoss Man: Consider this use case. Take your favorite e-commerce site (say newegg.com, ebay.com, etc.). Notice that it has some kind of category hierarchy. Each category has category attributes (say Brand) with category-sensitive possible values (Apple/Samsung for cell phones and Sharp/Samsung for HDTVs). In these cases the number of category-specific attributes is in the tens of thousands. It is not realistically possible to create a schema in which every category-specific attribute is mapped to a Solr field. However, you can store the category-specific attributes per category as a JSON string.

        Now, you do need to filter by category-specific attributes. Say you are searching for HDTVs and you only want to see those manufactured by Samsung. As is, Solr will not allow you to search in a field which looks like this:

        {"name":"Brand", "value":"Samsung"}

        Something like fq=categoryattribute:"name":"brand","value":"samsung" (properly escaped) doesn't work.

        Enter the awesome JSON tokenizer written by Ryan McKinley. This allows the same field to be indexed as JSON, and
        something like fq=categoryattribute:"name:brand" AND categoryattribute:"value:Samsung" works.

        Happy to provide more information if needed. Also happy to take the slap if I'm missing something obvious here.
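
To illustrate why the second filter query in the comment above can match, here is a rough simulation of the two analyzer chains from the field type in the description. It is a sketch under assumptions, not Solr code: `index_tokens`, `query_token` and `matches` are hypothetical helpers, values are assumed to be flat scalars, and matching is reduced to exact token lookup.

```python
import json


def index_tokens(doc_json):
    """Index side: one 'key:value' token per pair, trimmed and lowercased."""
    obj = json.loads(doc_json)
    return {f"{key}:{value}".strip().lower() for key, value in obj.items()}


def query_token(phrase):
    """Query side: the keyword tokenizer keeps the whole phrase as one token,
    which is then trimmed and lowercased."""
    return phrase.strip().lower()


def matches(doc_json, phrases):
    """True when every query phrase equals some indexed token (AND semantics)."""
    indexed = index_tokens(doc_json)
    return all(query_token(phrase) in indexed for phrase in phrases)


doc = '{"name":"Brand", "value":"Samsung"}'
print(matches(doc, ["name:brand", "value:Samsung"]))  # True
print(matches(doc, ["name:brand", "value:Sharp"]))    # False
```

Because both sides lowercase their tokens, `value:Samsung` in the query and `"Samsung"` in the stored JSON meet as `value:samsung`.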


          People

          • Assignee: Unassigned
          • Reporter: Ryan McKinley
          • Votes: 1
          • Watchers: 1
