JENA-1505: add function apf:strIndexSplit

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ARQ

    Description

      We use Tarql to convert some company CSV data to RDF.
      We had cases of multiple values in a single field (e.g. aliases), which we handle with apf:strSplit.

      But now we've hit another case: several multi-value fields arranged in parallel arrays.
      Each CSV row is a Joint Venture (?jvId, ?jvName), and three newline-separated parallel arrays describe the participant companies: ?coIds, ?coNames, ?coIndustries.
      If we use several apf:strSplit calls in one query, they produce a Cartesian product that mixes up all company ids, names and industries.
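
The difference between independent splits and position-matched splits can be shown outside SPARQL. The following is only an illustrative Python sketch; the sample values (c1, Acme, mining, ...) are made up and do not come from the real data.

```python
from itertools import product

# Hypothetical newline-separated parallel arrays from one CSV row.
co_ids = "c1\nc2"
co_names = "Acme\nGlobex"
co_industries = "mining\nenergy"

ids, names, industries = (s.split("\n") for s in (co_ids, co_names, co_industries))

# Independent splits (what several apf:strSplit calls amount to):
# every id is combined with every name and every industry.
cartesian = list(product(ids, names, industries))

# What is actually wanted: values matched up by their position.
aligned = list(zip(ids, names, industries))

print(len(cartesian))  # 8 combinations for 2 companies
print(aligned)         # [('c1', 'Acme', 'mining'), ('c2', 'Globex', 'energy')]
```

With n companies per row, the independent splits yield n^3 combinations instead of the n rows that are wanted.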

      Tarql allows multiple CONSTRUCT queries in one script, and "the triples generated by previous CONSTRUCT clauses can be queried in subsequent WHERE clauses to retrieve additional data". So my idea is to split each column in a separate CONSTRUCT, attach the values to temporary nodes, and reassemble them in a final CONSTRUCT.

      But we can't do this with apf:strSplit, since it loses the index (ordering) of the individual values.
      We need a new Jena ARQ function, e.g. with a signature like this, where ? indicates unbound and $ indicates bound:

      (?index ?value) apf:strIndexSplit ($string $separator)
      Splits $string on regex $separator and produces a number of binding pairs
      where ?index is bound to a sequential number (starting from 1)
      and ?value is bound to the consecutive string part that is split off.
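
The intended semantics can be sketched in a few lines of Python. This is only a model of the proposed behaviour, not Jena code; the name str_index_split is hypothetical and mirrors the proposed property function.

```python
import re

def str_index_split(string, separator):
    """Model of the proposed (?index ?value) apf:strIndexSplit ($string $separator).

    Splits `string` on the regex `separator` and returns (index, value)
    pairs, with the index starting at 1.
    """
    return list(enumerate(re.split(separator, string), start=1))

pairs = str_index_split("c1\nc2\nc3", r"\n")
print(pairs)  # [(1, 'c1'), (2, 'c2'), (3, 'c3')]
```

Each pair corresponds to one solution binding of ?index and ?value.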
      

      Then we could hack the problem with something like this:

      construct { # get first multiValue field
       ?ROW tmp:coIds [tmp:index ?INDEX; tmp:value ?VALUE]
      } where {
       bind(uri(concat("urn:tmp:",str(?ROWNUM))) as ?ROW)
       (?INDEX ?VALUE) apf:strIndexSplit (?coIds "\\n")
      }
      
      construct { # get second multiValue field
       ?ROW tmp:coNames [tmp:index ?INDEX; tmp:value ?VALUE]
      } where {
       bind(uri(concat("urn:tmp:",str(?ROWNUM))) as ?ROW)
       (?INDEX ?VALUE) apf:strIndexSplit (?coNames "\\n")
      }
      
      construct { # get third multiValue field
       ?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?VALUE]
      } where {
       bind(uri(concat("urn:tmp:",str(?ROWNUM))) as ?ROW)
       (?INDEX ?VALUE) apf:strIndexSplit (?coIndustries "\\n")
      }
      
      construct { # make JV node
       ?JV ex:id ?jvId; ex:name ?jvName.
      } where {
       bind(uri(concat("jv/",?jvId)) as ?JV)
      }
      
      construct { # make Company node and relation
       ?CO ex:id ?coId; ex:name ?coName; ex:industry ?INDUSTRY.
       ?JV ex:hasParticipant ?CO
      } where {
       bind(uri(concat("jv/",?jvId)) as ?JV)
       bind(uri(concat("urn:tmp:",str(?ROWNUM))) as ?ROW)
                 ?ROW tmp:coIds        [tmp:index ?INDEX; tmp:value ?coId]
       optional {?ROW tmp:coNames      [tmp:index ?INDEX; tmp:value ?coName]}
       optional {?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?coIndustry]}
       bind(uri(concat("company/",?coId)) as ?CO)
       bind(uri(concat("industry/",?coIndustry)) as ?INDUSTRY)
      }
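
The reassembly step in the last CONSTRUCT is a join on the index, with names and industries optional. A rough Python model of that join is below, assuming 1-based indexes as proposed above; the helper names and sample values are hypothetical.

```python
import re

def indexed(field):
    """Map 1-based position -> value for a newline-separated field."""
    return dict(enumerate(re.split(r"\n", field), start=1))

def participants(co_ids, co_names, co_industries):
    """Join the three parallel arrays on their index.

    The id is required; name and industry are optional, like the
    OPTIONAL blocks in the query, and come back as None when missing.
    """
    ids, names, inds = indexed(co_ids), indexed(co_names), indexed(co_industries)
    return [(ids[i], names.get(i), inds.get(i)) for i in sorted(ids)]

rows = participants("c1\nc2\nc3", "Acme\nGlobex\nInitech", "mining\nenergy")
print(rows)
# [('c1', 'Acme', 'mining'), ('c2', 'Globex', 'energy'), ('c3', 'Initech', None)]
```

A company whose industry is missing still gets its id and name, which is what the optional patterns are meant to achieve.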
      

       


          People

            Assignee: Unassigned
            Reporter: Vladimir Alexiev (vladimir.alexiev)
            Votes: 0
            Watchers: 2
