Data Lab for Apache Spark™ FAQ

Overview

What is Apache Spark™?

Apache Spark™ is an open-source unified analytics engine designed for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark™ offers high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.

How does Apache Spark™ work?

Apache Spark™ processes data in memory, which allows it to perform tasks up to 100 times faster than traditional disk-based processing frameworks like Hadoop MapReduce. It uses Resilient Distributed Datasets (RDDs) to distribute data across the nodes of a cluster and run operations on it in parallel.
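
For illustration, here is a minimal, generic PySpark sketch (not specific to Data Lab) showing how an RDD partitions data across the cluster and runs a transformation in parallel:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Partition one million numbers into 8 slices and square them in parallel;
# the map and the sum both run across the cluster's executors, in memory.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
total = rdd.map(lambda x: x * x).sum()
print(total)

spark.stop()
```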

What workloads is Data Lab for Apache Spark™ suited for?

Data Lab for Apache Spark™ supports a range of workloads, including:

  • Complex analytics
  • Machine learning tasks
  • High-speed operations on large datasets

It offers scalable CPU and GPU Instances with flexible node limits and robust Apache Spark™ library support.
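As a generic illustration of the analytics side, the following PySpark sketch aggregates a large Parquet dataset; the bucket path and the `timestamp` and `user_id` columns are placeholders, not part of the Data Lab product:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("analytics-demo").getOrCreate()

# Hypothetical dataset: replace the path and columns with your own.
events = spark.read.parquet("s3a://my-bucket/events/")

daily = (
    events.groupBy(F.to_date("timestamp").alias("day"))
    .agg(
        F.count("*").alias("events"),
        F.approx_count_distinct("user_id").alias("unique_users"),
    )
    .orderBy("day")
)
daily.show()
```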

Offering and availability

What notebook is included with Dedicated Data Labs?

The service provides a JupyterLab notebook running on a dedicated CPU Instance, fully integrated with the Apache Spark™ cluster for seamless data processing and calculations.
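
As a rough sketch of what running code in such a notebook looks like (the exact session bootstrapping depends on the Data Lab environment, so treat this as generic PySpark rather than the precise setup):

```python
from pyspark.sql import SparkSession

# In a notebook attached to a Spark cluster, a session is typically
# already configured; getOrCreate() reuses it if one exists.
spark = SparkSession.builder.appName("notebook-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
df.groupBy("label").count().show()  # computed on the cluster's worker nodes
```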

Pricing and billing

How am I billed for Data Lab for Apache Spark™?

Data Lab for Apache Spark™ is billed based on the following factors:

  • The main node configuration selected.
  • The worker node configuration selected, and the number of worker nodes in the cluster.
  • The persistent volume size provisioned.
  • Whether a notebook is provisioned.

Compatibility and integration

Can I run a Data Lab for Apache Spark™ using GPUs?

Yes, you can run your cluster on either CPUs or GPUs. Scaleway leverages NVIDIA's RAPIDS Accelerator for Apache Spark™, an open-source suite of software libraries and APIs for executing end-to-end data science and analytics pipelines entirely on GPUs. This can significantly accelerate data processing tasks compared to CPU-based processing.
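
For context, this is how the RAPIDS Accelerator is typically enabled in open-source Spark; a managed Data Lab cluster presumably preconfigures this for you, so the snippet is purely illustrative. The plugin class and config keys come from the RAPIDS Accelerator documentation:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gpu-demo")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    .getOrCreate()
)

# Supported DataFrame/SQL operations now run on the GPU; anything the
# plugin cannot handle falls back to the CPU automatically.
spark.range(100_000_000).selectExpr("sum(id)").show()
```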

Can I connect a separate notebook environment to the Data Lab?

Yes, you can connect an external notebook environment to your cluster via Private Networks.

Refer to the dedicated documentation for comprehensive information on how to connect to a Data Lab for Apache Spark™ cluster over Private Networks.
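
As a purely hypothetical sketch, if the cluster exposes a Spark Connect endpoint on the Private Network, a remote PySpark (3.4+) notebook could attach like this; the hostname and port are placeholders, and the dedicated documentation describes the actual procedure:

```python
from pyspark.sql import SparkSession

# Placeholder endpoint: replace with the address from your Private
# Network setup. Requires a Spark Connect-enabled PySpark client.
spark = (
    SparkSession.builder
    .remote("sc://my-datalab.internal:15002")
    .getOrCreate()
)

spark.range(10).count()  # executed on the remote cluster
```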

Usage and management

Can I upscale or downscale a Data Lab for Apache Spark™?

Yes, you can upscale a Data Lab cluster to distribute your workloads across more worker nodes for faster processing. You can also scale it down to zero to reduce costs, while retaining your configuration and context.

You can still access the notebook of a Data Lab cluster with zero worker nodes, but you cannot perform any calculations. You can resume the activity of your cluster by provisioning at least one worker node.
