The Koalas project implements the pandas DataFrame API on top of Apache Spark, making data scientists more productive when working with very large datasets. Spark is the de facto standard for large-scale data processing, while pandas is the de facto standard single-node DataFrame implementation in Python. If you are already familiar with pandas, you can use Spark right away with no learning curve. Koalas also lets you keep a single codebase that runs on pandas for testing with smaller datasets and on Spark for larger, distributed datasets.

The Koalas open-source project has progressed significantly. At launch, Koalas covered roughly 10%–20% of the pandas API. Thanks to community contributions across a series of frequent releases, coverage increased rapidly and is now close to 80% in Koalas 1.0. The latest advancements in Koalas 1.0 include better pandas API coverage, the Spark accessor, better type hint support, broader plotting support, more comprehensive support for in-place updates, and better handling of missing values (NaN and NA).
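As a rough illustration of the single-codebase idea, the sketch below writes one function against the DataFrame API and runs it on a small pandas DataFrame. The function names (`summarize_sales`, `summarize_sales_on_spark`), the column names, and the sample data are hypothetical and chosen only for this example. The Spark variant assumes the `koalas` package and a working Spark installation, so it is defined but not called here.

```python
import pandas as pd


def summarize_sales(df):
    """Total amount per region.

    Uses only groupby, sum, and sort_index, which both pandas and
    Koalas provide, so the same function accepts either kind of DataFrame.
    """
    return df.groupby("region")["amount"].sum().sort_index()


# Small, made-up dataset for a quick local test with plain pandas.
pdf = pd.DataFrame(
    {
        "region": ["east", "west", "east", "west"],
        "amount": [10, 20, 30, 40],
    }
)
local_totals = summarize_sales(pdf)
print(local_totals)


def summarize_sales_on_spark(pandas_df):
    """Run the same logic on Spark through Koalas.

    Assumption: the koalas package and Spark are installed. The import
    is kept inside the function so the pandas-only path above still
    works without them.
    """
    import databricks.koalas as ks

    kdf = ks.from_pandas(pandas_df)          # distribute the data as a Koalas DataFrame
    return summarize_sales(kdf).to_pandas()  # collect the small result back to pandas
```

In an environment with Spark available, calling `summarize_sales_on_spark(pdf)` should give the same totals as the pandas run; only the object passed to `summarize_sales` changes, not the analysis code.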

Access the repository here:
