Details

    • Type: Bug
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: ManifoldCF 1.3
    • Fix Version/s: ManifoldCF 1.4
    • Component/s: Web connector
    • Labels:
      None

      Description

      Although the file and sharedDrive connectors set a file name, the web connector currently does not. Could the web connector set a file name, as in the following?

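      RepositoryDocument rd = new RepositoryDocument();
      rd.setFileName(fileName);          // fileName: the file name taken from the URL
      rd.setBinary(inputStream, length); // inputStream, length: the fetched content and its size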

        Activity

        Shinichiro Abe added a comment -

        Thanks, Karl, for reviewing.
        Committed to trunk as r1519533.

        Karl Wright added a comment -

        Looks good to me. Please commit.

        Shinichiro Abe added a comment -

        Here is a patch. I think it follows the Wget conventions.

        I checked the code against the URLs below:

        url:http://server
        filename:(nothing)
        url:http://server/
        filename:index.html
        url:http://server?id=1
        filename:(nothing)
        url:http://server/?id=1
        filename:index.html?id=1
        url:http://server/dir/test?id=1
        filename:test?id=1
        url:http://server/dir/test/?id=1
        filename:index.html?id=1
        url:http://server/dir/test?id=1&page=1
        filename:test?id=1&page=1
        url:http://server/dir/test/?id=1&page=1
        filename:index.html?id=1&page=1
        url:http://server/dir/test/aaa.txt
        filename:aaa.txt
        url:http://server/dir/test/bbb
        filename:bbb
        url:http://server/dir/test/bbb/
        filename:index.html
        url:http://server/dir/test.html?id=4
        filename:test.html?id=4
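
As an illustration only, the mapping listed above can be reproduced with a few lines of Java. This is a minimal sketch, not the committed patch: the class name `WgetStyleFileName` and the method `fromUrl` are hypothetical, and the rules are inferred solely from the URL/filename pairs in this comment.

```java
import java.net.URI;

/** Hypothetical helper (not part of ManifoldCF) that mirrors the URL/filename pairs listed above. */
final class WgetStyleFileName {
    private WgetStyleFileName() {}

    /** Returns the derived file name, or null when the URL has no path at all. */
    static String fromUrl(String url) {
        URI uri = URI.create(url);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            // e.g. http://server or http://server?id=1 -> (nothing)
            return null;
        }
        // Everything after the last '/' is the candidate name.
        String name = path.substring(path.lastIndexOf('/') + 1);
        if (name.isEmpty()) {
            // The path ends with '/', so fall back to index.html.
            name = "index.html";
        }
        // Keep the query string as part of the name, as in the pairs above.
        String query = uri.getRawQuery();
        return query == null ? name : name + "?" + query;
    }
}
```

Calling `WgetStyleFileName.fromUrl("http://server/dir/test/?id=1")` yields `index.html?id=1` under these assumed rules, while a URL with no path yields `null`, standing in for "(nothing)".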
        

        Does this patch look OK? Please review it.
        Thanks in advance!

        Karl Wright added a comment -

        I have seen URLs like this:

        http://server/test.html?id=4

        So using the last part of the path is what makes the most sense. (test.html in this case).
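
A minimal Java sketch of the "last part of the path" idea, for illustration only. The class name `LastPathSegment` and the method `of` are hypothetical and do not come from ManifoldCF. The query string is dropped because `URI.getPath()` does not include it.

```java
import java.net.URI;

/** Hypothetical helper: returns the last path segment of a URL, ignoring the query string. */
final class LastPathSegment {
    private LastPathSegment() {}

    /** Returns the last path segment, or null when the path is empty or ends with '/'. */
    static String of(String url) {
        String path = URI.create(url).getPath();
        if (path == null || path.isEmpty()) {
            return null;
        }
        String segment = path.substring(path.lastIndexOf('/') + 1);
        return segment.isEmpty() ? null : segment;
    }
}
```

With this sketch, `LastPathSegment.of("http://server/test.html?id=4")` returns `test.html`.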

        Shinichiro Abe added a comment -

        Looking at the HDFS repository connector, the Wget-convention logic is applied to the "uri" metadata, not to setFileName().
        HDFS repository connector:
        data.setFileName(fileStatus.getPath().getName()); <-- this sets a file name, not a path, so I don't think rd.setFileName() should include a path.

        I hope setFileName() stays consistent across all our connectors as a file name that does not include a path
        (perhaps all setFileName() calls are consistent currently).

        For example:
        http://server/test?id=1 <-- I don't want to call rd.setFileName()
        http://server/test/?id=1 <-- I don't want to call rd.setFileName()
        http://server/test/?id=1&page=1 <-- I don't want to call rd.setFileName()
        http://server/test/aaa.txt <-- I want to call rd.setFileName()
        http://server/test/bbb <-- How do I handle this?

        Do you have a better approach?
        Thanks.

        Karl Wright added a comment -

        First, if you implement this in the web connector, it should not be limited to files with the specified extensions. But regardless, I don't think there is any requirement that the value passed to rd.setFileName() not include a path. Since Wget conventions are used in other connectors (notably the file system connector and HDFS connector), I think that if you are going to convert a URL to a file name, you should use that code.

        Shinichiro Abe added a comment -

        Sorry. rd.setMimeType() is already called by WebConnector.

        I implemented file name extraction on the Solr side, as shown below.
        I decode the URL and treat the last path segment as a file only if it contains a known extension.
        Is it useful?

          // cmd is the update command supplied to the script.
          var doc = cmd.solrDoc;
          var id = doc.getFieldValue("id");

          // Decode the URL and take the last path segment as the candidate file name.
          var decodedId = decodeURI(id);
          var paths = decodedId.split('/');
          var file = paths[paths.length - 1];

          var extensionArray = ['xml','json','csv','pdf','doc','docx','ppt','pptx','xls','xlsx','odt','odp','ods','ott','otp','ots','rtf','htm','html','txt','log'];

          // Set the field only when the segment contains one of the known extensions.
          var isInsert = false;
          for (var i = 0; i < extensionArray.length; i++) {
            if (file.indexOf("." + extensionArray[i]) != -1) {
              isInsert = true;
              break;
            }
          }

          if (isInsert) {
            doc.setField("filename_s", file);
          }
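
The script above matches an extension anywhere in the last segment, so a name such as `data.csv.bak` would also pass. A stricter check, sketched here in Java on the assumption that the same test were done on the connector side, compares only the suffix of the last path segment. The class name `ExtensionFilter` and the method `fileNameOrNull` are hypothetical. The extension list is copied from the script.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: accepts a file name only when the last path segment ends in a known extension. */
final class ExtensionFilter {
    private static final Set<String> EXTENSIONS = new HashSet<>(Arrays.asList(
        "xml", "json", "csv", "pdf", "doc", "docx", "ppt", "pptx", "xls", "xlsx",
        "odt", "odp", "ods", "ott", "otp", "ots", "rtf", "htm", "html", "txt", "log"));

    private ExtensionFilter() {}

    /** Returns the last path segment if its extension is in the list, otherwise null. */
    static String fileNameOrNull(String url) {
        // getPath() excludes the query string and decodes percent-escapes.
        String path = URI.create(url).getPath();
        if (path == null) {
            return null;
        }
        String segment = path.substring(path.lastIndexOf('/') + 1);
        int dot = segment.lastIndexOf('.');
        if (dot <= 0 || dot == segment.length() - 1) {
            return null; // no extension, or nothing before/after the dot
        }
        String extension = segment.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXTENSIONS.contains(extension) ? segment : null;
    }
}
```

Unlike the script, this sketch also drops the query string, so `http://server/test.html?id=4` gives `test.html`.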
        
        Karl Wright added a comment -

        I believe rd.setMimeType() is already called by WebConnector. So is rd.setURL(). The question is whether we should extract a filename even though there isn't one in an http request.

        One way to create a file name would be to make a wget-style file name. That would probably be most consistent with our other connectors.

        Shinichiro Abe added a comment -

        I think we can get the file name from the last path segment of the URL, although this is not always important.
        On enterprise intranet sites there are attached and linked binary files (pdf, xls, doc, etc.).
        After these files are crawled and posted to Solr, I think a user may want to search for documents
        by file name, not only by terms in the title or content.
        If we can get the mime type as well as the file name, we can do filtered searching and faceting on the Solr side.
        This is why I'd like to get the file name (and mime type).

        Karl Wright added a comment -

        How do you propose to get the file name from the url?
        Also, where is this important? What output connector are you using?

        In general, I think creating metadata such as the file name is OK even when the repository doesn't have one, but I'd like to understand how it is meant to be used first.


          People

          • Assignee:
            Shinichiro Abe
            Reporter:
            Shinichiro Abe
          • Votes:
            0
            Watchers:
            2
