
Data Lab for Apache Spark™ - Concepts

Apache Spark™ cluster

An Apache Spark™ cluster is an orchestrated set of machines across which distributed, large-scale data processing is carried out. In the case of Scaleway Data Lab, the Apache Spark™ cluster is a Kubernetes cluster, with Apache Spark™ installed in each Pod. For more details, check out the Apache Spark™ documentation.

Data Lab

A Data Lab is a project setup that combines a notebook and an Apache Spark™ cluster for data analysis and experimentation. It includes the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights.

Data Lab for Apache Spark™

A Data Lab for Apache Spark™ is a Data Lab whose Apache Spark™ cluster is distributed across multiple worker nodes, accelerating the processing of large datasets and delivering actionable insights faster.

Executor

An Apache Spark™ executor is a process that runs on a worker node in the cluster. Its primary responsibility is to execute the tasks assigned to it by the Apache Spark™ driver program. Executors play a crucial role in processing large-scale datasets, as they can be scaled up or down depending on the workload.
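
As an illustration, executor resources can be requested when the Spark session is created. The configuration keys below are standard Apache Spark™ settings; the application name and the values are placeholders to adapt to your own cluster.

```python
from pyspark.sql import SparkSession

# Illustrative executor sizing; the exact values depend on your cluster resources.
spark = (
    SparkSession.builder
    .appName("executor-sizing-example")
    .config("spark.executor.instances", "4")   # number of executor processes
    .config("spark.executor.cores", "2")       # CPU cores per executor
    .config("spark.executor.memory", "4g")     # memory per executor
    .getOrCreate()
)

# Each task of this job is dispatched by the driver to one of the executors.
print(spark.range(1_000_000).selectExpr("sum(id)").collect())
```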

Fixture

A fixture is a set of pre-defined data used for testing purposes. In the context of Data Lab for Apache Spark™, fixtures are essential for ensuring that your code is properly tested and validated before being deployed to production.
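
A minimal sketch of a fixture in PySpark: a small, hand-written dataset used to validate a transformation before it runs on production data. The names and values are purely illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fixture-example").getOrCreate()

# A small, hand-written fixture standing in for production data.
fixture_rows = [
    ("alice", 34),
    ("bob", 29),
    ("carol", 41),
]
people = spark.createDataFrame(fixture_rows, schema=["name", "age"])

# Validate the transformation against the fixture before running it on real data.
adults = people.filter(people.age >= 30)
assert adults.count() == 2
```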

GPU

GPUs (Graphics Processing Units) allow Apache Spark™ to accelerate computations for tasks that involve large-scale parallel processing, such as machine learning and certain data analytics workloads, significantly reducing processing time for massive datasets and for data preparation for AI models.
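
As a sketch only, the settings below come from the NVIDIA RAPIDS Accelerator for Apache Spark™ and require its plugin jar to be available on the cluster; a managed Data Lab may already configure GPU support for you, in which case no extra configuration is needed.

```python
from pyspark.sql import SparkSession

# Sketch: settings from the NVIDIA RAPIDS Accelerator for Apache Spark.
# The plugin jar must be present on the cluster for this to start.
spark = (
    SparkSession.builder
    .appName("gpu-example")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  # enable the RAPIDS SQL plugin
    .config("spark.rapids.sql.enabled", "true")             # offload supported SQL operators to the GPU
    .getOrCreate()
)

# Supported operators in this aggregation can then run on the GPU.
spark.range(10_000_000).selectExpr("id % 100 AS bucket").groupBy("bucket").count().show(5)
```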

JupyterLab

JupyterLab is a web-based platform that enables interactive computing, allowing you to work with notebooks, code, and data in one place. It builds upon the classic Jupyter Notebook by offering a more flexible and integrated user interface, making it easier to handle various file formats and interactive components.

Main node

The main node in an Apache Spark™ cluster is the driver node, which coordinates the execution of the Spark™ application by transforming code into tasks, scheduling them, and managing communication with the cluster.
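
The split between the driver and the executors is easiest to see through Spark's lazy evaluation: transformations only build an execution plan on the driver, and an action makes the driver turn that plan into tasks and schedule them on the executors. A minimal sketch (names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-example").getOrCreate()

# Transformations are only recorded by the driver; nothing runs on the cluster yet.
df = spark.range(1_000_000)
doubled = df.selectExpr("id * 2 AS doubled")

# An action makes the driver turn the plan into tasks, schedule them on executors,
# and collect the result back.
print(doubled.count())
```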

Notebook

A notebook for an Apache Spark™ cluster is an interactive, web-based tool that allows users to write and execute code, visualize data, and share results in a collaborative environment. It connects to an Apache Spark™ cluster to run large-scale data processing tasks directly from the notebook interface, making it easier to develop and test data workflows.

Adding a notebook to your cluster requires 1 GB of storage.

Persistent volume

A Persistent Volume (PV) is a cluster-wide storage resource that ensures data persistence beyond the lifecycle of individual Pods. Persistent volumes abstract the underlying storage details, allowing administrators to use various storage solutions.

Apache Spark™ executors require storage space for various operations, particularly to shuffle data during wide operations such as sorting, grouping, and aggregation. Wide operations are transformations that require data from different partitions to be combined, often resulting in data movement across the cluster. During the map phase, executors write data to shuffle storage, which is then read by reducers.

A properly sized persistent volume ensures smooth execution of your workload.
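
For example, a groupBy aggregation is a wide operation: rows sharing the same key must be brought together, so executors write intermediate shuffle files to their storage. A sketch with illustrative sizes:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle-example").getOrCreate()

df = spark.range(5_000_000).withColumn("bucket", F.col("id") % 1000)

# groupBy is a wide operation: data is shuffled across the cluster,
# and executors write intermediate shuffle files to their storage.
result = df.groupBy("bucket").agg(F.sum("id").alias("total"))
result.show(5)
```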

Session

A session starts when a cell is run within the notebook. This process involves establishing a connection between your notebook and the cluster. As a result, the necessary resources are allocated to support the execution of your code, ensuring seamless interaction with the cluster.
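
In a typical PySpark notebook, this connection is represented by a SparkSession; depending on the environment, it may already be created for you. A minimal sketch (the application name is illustrative):

```python
from pyspark.sql import SparkSession

# Typically the first cell run in a notebook: it creates (or reuses) the session
# that connects the notebook to the cluster and triggers resource allocation.
spark = SparkSession.builder.appName("my-notebook-session").getOrCreate()

print(spark.version)               # confirm the session is up
print(spark.sparkContext.master)   # where the session is connected
```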

Transaction

An SQL transaction is a sequence of one or more SQL operations (such as queries, inserts, updates, or deletions) executed as a single unit of work. These transactions ensure data integrity and consistency, following the ACID properties: Atomicity, Consistency, Isolation, and Durability, meaning all operations within a transaction either complete successfully or none of them take effect. An SQL transaction can be rolled back in case of an error.
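
A generic illustration of these semantics, using SQLite from the Python standard library rather than any specific Data Lab SQL engine (table and values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    # Both updates form a single unit of work: a transfer of 30 from alice to bob.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()      # make both changes durable together
except Exception:
    conn.rollback()    # on error, neither change takes effect
```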

Worker nodes

Worker nodes are high-end machines built specifically for intensive computations. They feature powerful CPUs/GPUs and substantial RAM, and are designed to handle massive datasets with ease. In the context of Data Lab, worker nodes play a crucial role in processing large-scale datasets and performing distributed computations.
