Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 1.3
    • Fix Version/s: None
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      An Analyzer that produces a TokenStream from XML input containing a marshalled TokenStream. The patch also contains a static TokenStream XML marshaller.

      I kind of pulled this out of my pocket without testing it in a real environment, in order to get some comments on the solution before I add it to my project. So consider it a beta patch.

      It uses the JSR173 XMLStream (StAX) API, which is included in Java 1.6 and is available for Java 1.5 as a separate download from https://sjsxp.dev.java.net/

      XSD:

      <?xml version="1.0" encoding="UTF-8"?>
      <xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified"
                 xmlns:xs="http://www.w3.org/2001/XMLSchema">
          <xs:element name="tokens" type="tokensType"/>
          <xs:complexType name="tokensType">
              <xs:sequence>
                  <xs:element type="tokenType" name="token" maxOccurs="unbounded"/>
              </xs:sequence>
          </xs:complexType>
          <xs:complexType name="tokenType">
              <xs:sequence>
                  <xs:element type="xs:int" name="positionIncrement" maxOccurs="1"/>
                  <xs:element type="xs:string" name="term" minOccurs="1" maxOccurs="1"/>
                  <xs:element type="xs:string" name="type" maxOccurs="1"/>
                  <xs:element type="xs:int" name="startOffset" maxOccurs="1"/>
                  <xs:element type="xs:int" name="endOffset" maxOccurs="1"/>
                  <xs:element type="xs:int" name="flags" maxOccurs="1"/>
                  <xs:element type="payloadType" name="payload" maxOccurs="1"/>
              </xs:sequence>
          </xs:complexType>
          <xs:complexType name="payloadType">
              <xs:choice maxOccurs="1" minOccurs="1">
                  <xs:element type="bytesType" name="bytes"/>
                  <xs:element type="xs:string" name="hex"/>
                  <xs:element type="xs:string" name="base64"/>
              </xs:choice>
          </xs:complexType>
          <xs:complexType name="bytesType">
              <xs:sequence>
                  <xs:element type="xs:byte" name="byte" maxOccurs="unbounded" minOccurs="1"/>
              </xs:sequence>
          </xs:complexType>
      </xs:schema>
      

      Even though the XSD defines several ways to encode a Payload, only <hex> is currently supported.
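
For illustration, decoding and encoding the <hex> payload form could look roughly like the sketch below. It assumes two hex digits per byte with no separators; the class name HexPayload is hypothetical and not taken from the patch.

```java
/** Hypothetical helper (not from the patch): converts between a hex payload string and bytes. */
final class HexPayload {
    private HexPayload() {}

    /** Decodes a string of hex digit pairs into the bytes they represent. */
    static byte[] decode(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits: " + hex);
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex string: " + hex);
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    /** Encodes bytes as lowercase hex, two digits per byte. */
    static String encode(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }
}
```

With this sketch, HexPayload.decode("fffefd") yields the three bytes 0xff, 0xfe, 0xfd, and HexPayload.encode turns them back into the same string.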

      Example XML:

      <tokens>
        <token>
          <positionIncrement>1</positionIncrement>
          <term>term</term>
          <type>type</type>
          <startOffset>0</startOffset>
          <endOffset>3</endOffset>
          <flags>65535</flags>
          <payload><hex>fffefd</hex></payload>
        </token>
      </tokens>
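
A document like the example above could be produced on the client with the JSR173 writer API. The following is only a sketch under assumptions: the class and method names (TokenXmlWriter, writeToken, exampleDocument) are hypothetical, the payload is passed as a ready-made hex string, and the actual marshaller in the patch may be structured differently.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical marshaller sketch (not the patch's code): writes <token> elements with StAX. */
final class TokenXmlWriter {
    private TokenXmlWriter() {}

    /** Writes one <token> element with its children in the order the XSD declares them. */
    static void writeToken(XMLStreamWriter w, int positionIncrement, String term, String type,
                           int startOffset, int endOffset, int flags, String hexPayload)
            throws XMLStreamException {
        w.writeStartElement("token");
        writeLeaf(w, "positionIncrement", Integer.toString(positionIncrement));
        writeLeaf(w, "term", term);
        writeLeaf(w, "type", type);
        writeLeaf(w, "startOffset", Integer.toString(startOffset));
        writeLeaf(w, "endOffset", Integer.toString(endOffset));
        writeLeaf(w, "flags", Integer.toString(flags));
        w.writeStartElement("payload");
        writeLeaf(w, "hex", hexPayload);
        w.writeEndElement(); // </payload>
        w.writeEndElement(); // </token>
    }

    /** Writes <name>text</name>; writeCharacters escapes markup characters in the text. */
    private static void writeLeaf(XMLStreamWriter w, String name, String text)
            throws XMLStreamException {
        w.writeStartElement(name);
        w.writeCharacters(text);
        w.writeEndElement();
    }

    /** Marshals the single token of the example document into a string. */
    static String exampleDocument() {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("tokens");
            writeToken(w, 1, "term", "type", 0, 3, 65535, "fffefd");
            w.writeEndElement(); // </tokens>
            w.flush();
            w.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling TokenXmlWriter.exampleDocument() returns the example document without indentation; a real client would call writeToken once per token of its own TokenStream.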
      
      1. SOLR-1020.txt
        17 kB
        Karl Wettin

        Activity

        Erick Erickson made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Erick Erickson added a comment -

        SPRING_CLEANING_2013 we can reopen if necessary.

        Erick Erickson made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Won't Fix [ 2 ]
        Erick Erickson added a comment -

        SPRING_CLEANING_2013 we can reopen if necessary.

        Karl Wettin added a comment - edited

        Shalin wrote: "Karl, would it make sense to use the NamedList format instead of a custom XML one? That way, you can use most of the existing parsing code."

        I don't know, would it?

        Shalin wrote: "Thoughts?"

        The reason I chose JSR173 is that it allows for unmarshalling one token at a time rather than all at once. That is, I want to reuse the token instance in the TokenStream the Analyzer produces rather than unmarshal all of the data up front. My first thought was to parse the XML using a lexer, but some simple tests showed that the overhead of JSR173 was very small compared to jflex. I am, however, considering jflex for the binary format.
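
The one-token-at-a-time unmarshalling described above can be sketched with a JSR173 pull parser as follows. This is an illustration under assumptions, not the code in the patch: XmlToken is a hypothetical stand-in for Lucene's Token, StreamingTokenReader is a hypothetical name, and only the <hex> payload variant is read.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical stand-in for Lucene's Token; the fields mirror the XSD. */
final class XmlToken {
    int positionIncrement;
    String term;
    String type;
    int startOffset;
    int endOffset;
    int flags;
    String hexPayload;

    /** Resets all fields so the instance can be reused for the next token. */
    void clear() {
        positionIncrement = 1;
        term = null;
        type = null;
        startOffset = 0;
        endOffset = 0;
        flags = 0;
        hexPayload = null;
    }
}

/** Pulls one <token> at a time from the XML and refills a single shared XmlToken. */
final class StreamingTokenReader {
    private final XMLStreamReader xml;
    private final XmlToken reusable = new XmlToken();

    StreamingTokenReader(String document) {
        try {
            xml = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(document));
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Returns the shared instance filled with the next token, or null at end of input. */
    XmlToken next() {
        try {
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.START_ELEMENT
                        && "token".equals(xml.getLocalName())) {
                    return readToken();
                }
            }
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Consumes events up to the closing </token>, copying known children into the token. */
    private XmlToken readToken() throws XMLStreamException {
        reusable.clear();
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.END_ELEMENT && "token".equals(xml.getLocalName())) {
                break;
            }
            if (event != XMLStreamConstants.START_ELEMENT) {
                continue;
            }
            switch (xml.getLocalName()) {
                case "positionIncrement": reusable.positionIncrement = intText(); break;
                case "term": reusable.term = xml.getElementText(); break;
                case "type": reusable.type = xml.getElementText(); break;
                case "startOffset": reusable.startOffset = intText(); break;
                case "endOffset": reusable.endOffset = intText(); break;
                case "flags": reusable.flags = intText(); break;
                case "hex": reusable.hexPayload = xml.getElementText().trim(); break;
                default: break; // <payload> wrapper and unsupported payload variants
            }
        }
        return reusable;
    }

    private int intText() throws XMLStreamException {
        return Integer.parseInt(xml.getElementText().trim());
    }
}
```

Each call to next() overwrites the same XmlToken, so a caller has to copy any values it wants to keep before asking for the following token. That is the trade-off of reusing the instance instead of unmarshalling everything at once.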

        I came up with this patch because I have a rather elaborate tokenization scheme using ShingleMatrixFilter. My current solution is to pass a base64-encoded serialized object as the field value and use a custom Analyzer that assembles and tokenizes the entity object passed down in the field value. However, the tokenization is rather expensive (especially during the initial bulk import of my zillions of documents), so I'd rather do this on my clients, as I've got plenty of those but only one Solr.

        Shalin Shekhar Mangar added a comment -

        Karl, would it make sense to use the NamedList format instead of a custom XML one? That way, you can use most of the existing parsing code. Thoughts?

        Karl Wettin added a comment -

        I forgot to mention that I'm also looking at a binary solution for SolrJ.

        Karl Wettin made changes -
        Field Original Value New Value
        Attachment SOLR-1020.txt [ 12400234 ]
        Karl Wettin created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            Karl Wettin
          • Votes:
            0
            Watchers:
            2
