RAPIDS Accelerator for Apache Spark ML Library Integration

There are cases where you may want to get access to the raw data on the GPU, preferably without copying it. One use case for this is exporting the data to an ML framework after doing feature extraction. To do this we provide a simple Scala utility com.nvidia.spark.rapids.ColumnarRdd that can be used to convert a DataFrame to an RDD[ai.rapids.cudf.Table]. Each Table will have the same schema as the DataFrame passed in.

Table is not a typical thing in an RDD so special care needs to be taken when working with it. By default, it is not serializable so repartitioning the RDD or any other operator that involves a shuffle will not work. This is because it is relatively expensive to serialize and deserialize GPU data using a conventional Spark shuffle. In addition, most of the memory associated with the Table is on the GPU itself. So, each Table must be closed when it is no longer needed to avoid running out of GPU memory. By convention, it is the responsibility of the one consuming the data to close it when they no longer need it.

val df = spark.sql("""select my_column from my_table""")
val rdd: RDD[Table] = ColumnarRdd(df)
// Compute the max of the first column
val maxValue = rdd.map(table => {
  val max = table.getColumn(0).max().getLong
  // Close the table to avoid leaks
  table.close()
  max
}).max()

RMM

You may need to disable RMM caching when exporting data to an ML library as that library will likely want to use all of the GPU’s memory and if it is not aware of RMM it will not have access to any of the memory that RMM is holding.