Jackrabbit Oak / OAK-8421

Add oak-run option to dump extracted text for all binaries

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.14.0
    • Fix Version/s: None
    • Component/s: indexing, oak-run
    • Labels: None

    Description

      When oak-run is used to dump the extracted text of binary properties, the "generate" step skips inlined binaries and does not write them to the output CSV file. Because the "extract" and "populate" steps both work from this CSV, the extracted text of those binaries is missing from the dump.

      It would be nice to add an option to the "generate" step that tells oak-run to include inlined binaries in the CSV as well. For this to work, the "extract" step would also need the node store parameter, so that it can get the text from the node store when the binary is inlined.
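
      To make the proposal concrete, here is a minimal Python sketch of the intended behavior. It is a toy model, not oak-run code: the Binary class, the include_inlined option, the CSV columns, and the dict-based stores are all assumptions made up for illustration, and decoding bytes stands in for real text extraction.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class Binary:
    """Toy stand-in for a binary property; not an Oak class."""
    path: str                # path of the node holding the binary
    blob_id: Optional[str]   # data store id, or None when the binary is inlined
    data: bytes

    @property
    def inlined(self) -> bool:
        return self.blob_id is None


def generate(binaries, include_inlined=False):
    """Return CSV text listing the binaries the later steps should process.

    include_inlined is the hypothetical new option; the column layout is
    invented for this sketch and is not the real oak-run CSV format.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["blobId", "length", "path", "inlined"])
    for b in binaries:
        if b.inlined and not include_inlined:
            continue  # behavior described in this issue: inlined binaries are skipped
        writer.writerow([b.blob_id or "", len(b.data), b.path, str(b.inlined).lower()])
    return out.getvalue()


def extract(csv_text, data_store, node_store):
    """Map each path in the CSV to its text.

    Inlined binaries are read from node_store (keyed by path); all others
    are read from data_store (keyed by blob id).
    """
    texts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["inlined"] == "true":
            raw = node_store[row["path"]]
        else:
            raw = data_store[row["blobId"]]
        texts[row["path"]] = raw.decode("utf-8")  # placeholder for real text extraction
    return texts
```

      In this model, calling generate() without the option drops the inlined entry, while generate(..., include_inlined=True) followed by extract() with both stores yields text for every binary. That is the reason the "extract" step would need access to the node store.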

      I'm not sure about the "populate" step; it might need this too. It tries to get the text directly from the index, so this depends on whether the extracted text of inlined binaries is also stored in the index. I would assume it is, in which case the "populate" step would not need to be modified.

      The oak-run documentation would also need to be updated, specifically this page: https://jackrabbit.apache.org/oak/docs/query/pre-extract-text.html

          People

            Assignee: Unassigned
            Reporter: Matt Ryan (mattvryan)