  1. OpenNLP
  2. OPENNLP-1357

Use CharSequence to allow for memory management


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9.4
    • Fix Version/s: 2.1.1
    • Component/s: Sentence Detector
    • Labels: None

    Description

      Most of the classes in OpenNLP require their inputs to be a String, StringBuffer, or char[]. This means that all of the data has to be loaded into memory.

      Many of these cases (String and StringBuffer args) could be replaced with a single method that accepts CharSequence as a parameter.

      For example, in DefaultEndOfSentenceScanner:

       

      public List<Integer> getPositions(CharSequence s) {
        List<Integer> l = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
          char c = s.charAt(i);
          if (eosCharacters.contains(c)) {
            l.add(i);
          }
        }
        return l;
      }
      

      This would let users manage the memory overhead for large data sets by supplying their own CharSequence implementation. In some cases it would also avoid temporary conversions to char buffers.
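
      As a rough illustration of the memory point above, the sketch below scans a memory-mapped file through a CharBuffer view instead of first reading it into a String. It assumes the file is encoded as UTF-16BE without a BOM, so the mapped bytes can be viewed directly as chars. The class MappedScanSketch, its helper methods, and the end-of-sentence character set are hypothetical and are not part of OpenNLP.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch for illustration; not OpenNLP code.
final class MappedScanSketch {

  // Assumed end-of-sentence characters (illustrative only).
  private static final Set<Character> EOS_CHARACTERS = Set.of('.', '!', '?');

  // Same shape as the getPositions(CharSequence) method proposed above.
  static List<Integer> getPositions(CharSequence s) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < s.length(); i++) {
      if (EOS_CHARACTERS.contains(s.charAt(i))) {
        positions.add(i);
      }
    }
    return positions;
  }

  // Maps a UTF-16BE file read-only and exposes it as a CharSequence.
  // The operating system pages the data in on demand, so the text is
  // never copied into a String on the Java heap.
  static CharBuffer mapUtf16be(Path file) {
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
      // A mapping remains valid after its channel has been closed.
      return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
          .asCharBuffer(); // ByteBuffer defaults to big-endian byte order
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Writes the text to a temporary UTF-16BE file, then scans the mapped view.
  static List<Integer> scanSample(String text) {
    try {
      Path file = Files.createTempFile("sentences", ".utf16");
      file.toFile().deleteOnExit();
      Files.write(file, text.getBytes(StandardCharsets.UTF_16BE));
      return getPositions(mapUtf16be(file));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(scanSample("Hi. Ok? Go!")); // prints [2, 6, 10]
  }
}
```

      The same getPositions method accepts a plain String unchanged, so existing callers would not need to be modified.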

      Some code, such as SDContextGenerator, already uses CharSequence. However, SentenceDetectorME performs an unnecessary conversion to a StringBuffer: the sb is never modified, SDContextGenerator.getContext takes a CharSequence as an argument, and String is a CharSequence.

       

      public Span[] sentPosDetect(String s) {
        sentProbs.clear();
        StringBuffer sb = new StringBuffer(s);
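
      A minimal sketch of why the copy is avoidable: since String implements CharSequence, it can be passed directly to a method that takes a CharSequence and only reads from it. The getContext method below is a simplified, hypothetical stand-in and does not reproduce the real SDContextGenerator signature or behavior; the class NoCopySketch is likewise invented for illustration.

```java
// Hypothetical stand-ins for illustration; not the OpenNLP classes.
final class NoCopySketch {

  // Stand-in for a context generator that only reads its input:
  // returns up to two characters on either side of the given position.
  static String getContext(CharSequence text, int position) {
    int start = Math.max(0, position - 2);
    int end = Math.min(text.length(), position + 3);
    return text.subSequence(start, end).toString();
  }

  // Before: copies the whole input into a StringBuffer that is never modified.
  static String withCopy(String s, int position) {
    StringBuffer sb = new StringBuffer(s);
    return getContext(sb, position);
  }

  // After: passes the String itself, with no intermediate copy.
  static String withoutCopy(String s, int position) {
    return getContext(s, position);
  }

  public static void main(String[] args) {
    String s = "One. Two.";
    // Both variants read the same characters, so the results match.
    System.out.println(withCopy(s, 3).equals(withoutCopy(s, 3))); // prints true
  }
}
```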

       

      I can create one or more pull requests for the above if you think it is useful.

       


          People

            Assignee: Martin Wiesner (mawiesne)
            Reporter: Paul Austin (paul.austin@automutatio.com)
            Votes: 0
            Watchers: 2
