This article is contributed. See the original author and article here.
When you are using Pandas in a notebook you may notice there was a change in the Arrow IPC format from 0.15.1 onwards.
You can find more information here: https://arrow.apache.org/blog/2019/10/06/0.15.0-release/
So this a small post how to work around this:
The customer was using the example available here:
Specifically this one:
import pandas as pd from pyspark.sql.functions import col, pandas_udf from pyspark.sql.types import LongType # Declare the function and create the UDF def multiply_func(a, b): return a * b multiply = pandas_udf(multiply_func, returnType=LongType()) # The function for a pandas_udf should be able to execute with local Pandas data x = pd.Series([1, 2, 3]) print(multiply_func(x, x)) # Create a Spark DataFrame, 'spark' is an existing SparkSession df = spark.createDataFrame(pd.DataFrame(x, columns=["x"])) # Execute function as a Spark vectorized UDF df.select(multiply(col("x"), col("x"))).show()
When tried to run in synapse it failed with the error:
Py4JJavaError : An error occurred while calling o205.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 15, c671bd6ddc35b7487900238907316, executor 1): java.lang.IllegalArgumentException at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
Synapse Spark uses the library documented here:
The workaround to use the the legacy Panda version is:
1) Add this to your code:
import os os.environ ['ARROW_PRE_0_15_IPC_FORMAT=']='1'
and for that example, specific replace show per display like:
#Instead of: df.select(multiply(col("x"), col("x"))).show() #use display(df.select(multiply(col("x"), col("x"))))
You can also replace the whl files:
That is it!
Brought to you by Dr. Ware, Microsoft Office 365 Silver Partner, Charleston SC.