  1. Spark
  2. SPARK-48685

PySpark MinHashLSH when used with CountVectorizer doesn't meet requirements

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.1
    • Fix Version/s: 3.5.1
    • Component/s: ML
    • Labels: None

    Description

      I'm running into an issue with the MinHashLSH model: it complains that some rows contain only zero values, even though I filter out such rows before using the model. Here is sample PySpark code that demonstrates the problem:

      ```python
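      from pyspark.ml.feature import CountVectorizer, MinHashLSH, NGram, Tokenizer
      from pyspark.ml.linalg import SparseVector
      from pyspark.sql import functions as F, types


      @F.udf(returnType=types.BooleanType())
      def is_non_zero_vector(vector: SparseVector) -> bool:
          """
          Returns True if the vector has at least one non-zero element.
          """
          return vector.numNonzeros() > 0
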
      df_text = df.select("id", "text")

      tokenizer = Tokenizer(inputCol="text", outputCol="text_tokens")
      df_text = tokenizer.transform(df_text).select("id", "text_tokens")

      ngram = NGram(inputCol="text_tokens", outputCol="text_ngrams", n=self.min_hash_lsh_ngram_size)
      df_text = ngram.transform(df_text).select("id", "text_ngrams")

      count_vectorizer = CountVectorizer(inputCol="text_ngrams", outputCol="text_count_vector").fit(df_text)
      df_text = count_vectorizer.transform(df_text).select("id", "text_count_vector")

      # Keep only the non-zero vectors
      df_text = df_text.filter(is_non_zero_vector(F.col("text_count_vector")))

      min_hash_lsh = MinHashLSH(
          inputCol="text_count_vector",
          outputCol="text_hashes",
          seed=self.min_hash_lsh_num_hash_tables,
          numHashTables=self.min_hash_lsh_num_hash_tables,
      ).fit(df_text)
      df_text = min_hash_lsh.transform(df_text)

      # Calculate the distance between all pairs of vectors and keep only the pairs with a distance > 0 (that are duplicates)
      pairs_df = min_hash_lsh.approxSimilarityJoin(df_text, df_text, 0.6, distCol="vector_distance")
      pairs_df = pairs_df.filter("vector_distance != 0")
      ```

      I've also analyzed the DataFrame, and there are in fact no rows without at least one non-zero entry.
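
For readers unfamiliar with why an all-zero row matters here, the following is a minimal pure-Python sketch of the general MinHash idea, not Spark's actual implementation. Each hash table takes the minimum of a hash over the indices of the non-zero entries, so a vector with no non-zero entries has nothing to take a minimum over. The names `HASH_PRIME`, `make_hash_functions`, and `min_hash`, the prime value, and the error message are all assumptions made for illustration.

```python
import random

# Illustrative large prime for the hash family (an assumption, not taken from the report).
HASH_PRIME = 2038074743


def make_hash_functions(num_tables, seed):
    """Draw one random (a, b) coefficient pair per hash table."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, HASH_PRIME), rng.randrange(0, HASH_PRIME))
        for _ in range(num_tables)
    ]


def min_hash(active_indices, coefficients):
    """Return one min-hash value per (a, b) pair for the given non-zero indices."""
    if not active_indices:
        # With no non-zero entries there is no minimum to compute.
        raise ValueError("vector has no non-zero entries")
    return [
        min(((1 + i) * a + b) % HASH_PRIME for i in active_indices)
        for a, b in coefficients
    ]


coefficients = make_hash_functions(num_tables=3, seed=42)
print(min_hash([0, 4, 7], coefficients))  # one integer per hash table

try:
    min_hash([], coefficients)
except ValueError as err:
    print(f"rejected: {err}")
```

Under this reading, the error in the report would mean that at least one vector reaching the hashing step had no non-zero entries, which is what the filter in the sample code is meant to rule out.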

          People

            Assignee: Unassigned
            Reporter: Etienne Soulard-Geoffrion (etiennecl)
            Votes: 0
            Watchers: 1
