Accelerators in Apache Spark offer significant speedups for ETL, ML, and DL applications. In this deep dive, we give an overview of accelerator-aware task scheduling, columnar data processing support, fractional scheduling, and stage-level resource scheduling and configuration. We then dive into the RAPIDS plugin for Apache Spark 3.x, which enables applications to take advantage of GPU acceleration with no code changes. We explain how the Catalyst optimizer's physical plan is modified for GPU-aware scheduling. The talk touches on how the plugin uses the RAPIDS libraries cudf, cuio, and rmm to run tasks on the GPU. The shuffle plugin was also optimized to take advantage of the GPU using UCX, a unified communication framework that addresses GPU memory both intra-node and inter-node. We outline a roadmap for further optimizations that take advantage of RDMA and GPUDirect Storage. Industry-standard benchmarks and runs on production datasets will be shared.
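As a rough illustration of the fractional scheduling and "no code change" points in the abstract, here is a minimal sketch in plain Python. The configuration keys are standard Spark 3.x and RAPIDS Accelerator settings, but the values are illustrative assumptions, and the helpers `concurrent_tasks_per_executor` and `to_submit_args` are hypothetical names made up for this example, not part of Spark or the plugin. The sketch assumes the number of tasks an executor runs at once is bounded by its scarcest resource.

```python
from fractions import Fraction

# Standard Spark 3.x / RAPIDS Accelerator keys; the values are assumptions
# chosen for illustration, not recommendations from the talk.
conf = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",  # enables the RAPIDS plugin
    "spark.executor.cores": "4",
    "spark.task.cpus": "1",
    "spark.executor.resource.gpu.amount": "1",      # one GPU per executor
    "spark.task.resource.gpu.amount": "0.25",       # each task claims 1/4 of a GPU
}


def concurrent_tasks_per_executor(conf):
    """Hypothetical helper: task slots are limited by the scarcest resource."""
    cpu_slots = int(conf["spark.executor.cores"]) // int(conf["spark.task.cpus"])
    gpu_slots = int(
        Fraction(conf["spark.executor.resource.gpu.amount"])
        / Fraction(conf["spark.task.resource.gpu.amount"])
    )
    return min(cpu_slots, gpu_slots)


def to_submit_args(conf):
    """Hypothetical helper: render the settings as spark-submit arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    print(concurrent_tasks_per_executor(conf))
    print(" ".join(to_submit_args(conf)))
```

Because everything is passed as configuration, the application code itself stays unchanged. Under these assumed values, raising `spark.executor.cores` alone would not add task slots, since the GPU fraction becomes the limiting resource.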
===
Original video: https://www.youtube.com/watch?v=4MI_LYah900&feature=emb_title
Downloaded by http://huffduff-video.snarfed.org/ on Sun Aug 2 19:32:43 2020