Spark / SPARK-9062

Change output type of Tokenizer to Array(String, true)

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0
    • Component/s: ML
    • Labels: None

    Description

      Currently the output type of Tokenizer is Array(String, false), which is not compatible with Word2Vec and other transformers, since their input type is Array(String, true). A Seq[String] returned from a udf is treated as Array(String, true) by default.
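
The mismatch can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark code: `ArrayOfString`, `validate_input`, and `is_compatible` are hypothetical names standing in for Spark's array type and for a downstream input check, and the sketch assumes that check is a strict equality comparison on the type, including the null-containment flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayOfString:
    """Toy stand-in (hypothetical) for Array(String, containsNull)."""
    contains_null: bool


def validate_input(actual: ArrayOfString, expected: ArrayOfString) -> None:
    # Assumption: the downstream transformer compares types strictly,
    # so the contains_null flag must match as well.
    if actual != expected:
        raise TypeError(f"Input type must be {expected} but got {actual}")


def is_compatible(actual: ArrayOfString, expected: ArrayOfString) -> bool:
    """Return True if `actual` passes the strict input check."""
    try:
        validate_input(actual, expected)
        return True
    except TypeError:
        return False


# Output type described in this issue, before and after the change.
TOKENIZER_OUTPUT_OLD = ArrayOfString(contains_null=False)
TOKENIZER_OUTPUT_NEW = ArrayOfString(contains_null=True)

# Input type expected by Word2Vec, as stated in the description.
WORD2VEC_INPUT = ArrayOfString(contains_null=True)

print(is_compatible(TOKENIZER_OUTPUT_OLD, WORD2VEC_INPUT))
print(is_compatible(TOKENIZER_OUTPUT_NEW, WORD2VEC_INPUT))
```

Under that assumption, only the second pairing passes, which is why the issue proposes changing the Tokenizer output type rather than relaxing every consumer.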

      I'm also thinking that, for nullable columns, Tokenizer should perhaps return Array(null) when the input value is null.
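
The null-handling idea above could look roughly like the following sketch. It is plain Python rather than Spark code, `tokenize` is a hypothetical helper, and lowercasing plus whitespace splitting is a simplifying assumption about the tokenization rule. Note that an array holding a null element only fits an array type that allows nulls, i.e. Array(String, true).

```python
from typing import List, Optional


def tokenize(text: Optional[str]) -> List[Optional[str]]:
    """Hypothetical tokenizer: split on whitespace, map null input to [None]."""
    if text is None:
        # Proposed behaviour from the description: Array(null) for a null input.
        return [None]
    # Simplified rule (assumption): lowercase, then split on whitespace.
    return text.lower().split()


print(tokenize("Hello Spark ML"))
print(tokenize(None))
```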

            People

              Assignee: yuhao yang
              Reporter: yuhao yang
              Shepherd: Joseph K. Bradley
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: