Data Lab for Apache Spark™ - Quickstart
Data Lab for Apache Spark™ is a product designed to assist data scientists and data engineers in performing calculations on a remotely managed Apache Spark infrastructure.
It is composed of the following:
- Cluster: An Apache Spark cluster powered by a Kubernetes architecture.
- Notebook: A JupyterLab service operating on a dedicated node type.
Scaleway provides dedicated node types for both the notebook and the cluster. The cluster nodes are high-end machines built for intensive computations, featuring powerful CPUs or GPUs and substantial RAM.
The notebook, although capable of performing some local computations, primarily serves as a web interface for interacting with the Apache Spark cluster.
Before you start
To complete the actions presented below, you must have:
- A Scaleway account logged into the console
- Owner status or IAM permissions allowing you to perform actions in the intended Organization
- Optionally, an Object Storage bucket
How to create a Data Lab for Apache Spark™ cluster
- Click Data Lab under Data & Analytics on the side menu.
- Click Create Data Lab cluster. The creation wizard displays.
- Complete the following steps in the wizard:
  - Choose an Apache Spark version from the drop-down menu.
  - Select a worker node configuration. For this procedure, we recommend selecting a CPU rather than a GPU.
  - Enter the desired number of worker nodes.
  - Enter a name for your Data Lab.
  - Optionally, add a description and/or tags for your Data Lab.
  - Verify the estimated cost.
- Click Create Data Lab cluster to finish. You are directed to the Data Lab cluster overview page.
How to connect to your Data Lab
- Click Data Lab under Data & Analytics on the side menu. The Data Lab for Apache Spark™ page displays.
- Click the name of the Data Lab cluster you want to connect to. The cluster Overview page displays.
- Click Open Notebook in the Notebook section. You are directed to the notebook login page.
- Enter your API secret key when prompted for a password, then click Log in. You are directed to the notebook home screen.
How to run the demo file
Each Data Lab for Apache Spark™ comes with a default DatalabDemo.ipynb demonstration file for testing purposes. This file contains a preconfigured notebook environment that requires no modification to run.
Execute the cells in order to perform pre-determined operations on a dummy data set.
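If you want a preview of what the demo does before opening it, the sketch below shows the general kind of pre-determined operations such a notebook runs on a dummy data set: build a small DataFrame, inspect its schema, and aggregate it. It is illustrative only and does not reproduce the actual contents of DatalabDemo.ipynb; the column names and values are made up, and spark refers to the session object provided by the PySpark kernel.

    # Illustrative sketch only, not the contents of DatalabDemo.ipynb.
    # Assumes `spark` is the SparkSession provided by the PySpark kernel.
    from pyspark.sql import functions as F

    # A dummy data set: hypothetical names, ages, and amounts
    rows = [("alice", 34, 120.5), ("bob", 29, 80.0), ("carol", 41, 210.3)]
    df = spark.createDataFrame(rows, ["name", "age", "amount"])

    df.printSchema()

    # A simple aggregation executed on the cluster
    df.groupBy().agg(
        F.avg("age").alias("avg_age"),
        F.sum("amount").alias("total_amount"),
    ).show()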
How to set up a new Data Lab environment
- From the notebook Launcher tab, select PySpark under Notebook.
- In a new cell, copy and paste the code below and replace the placeholders with your API access key, secret key, and the endpoint of your Object Storage bucket to set up the Apache Spark session:

    %%configure -f
    {
        "name": "My Spark",
        "conf": {
            "spark.hadoop.fs.s3a.access.key": "your-api-access-key",
            "spark.hadoop.fs.s3a.secret.key": "your-api-secret-key",
            "spark.hadoop.fs.s3a.endpoint": "your-bucket-endpoint"
        }
    }
- In a new cell below, copy and paste the following command to initialize the Apache Spark session:

    from pyspark.sql.types import StructType, StructField, LongType, DoubleType, StringType
- Execute the two cells you just created. Once initialized, the Spark session information displays.

You can now execute commands that will run on the resources defined when creating the Data Lab for Apache Spark™ cluster.
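For example, a typical follow-up command reads data from the Object Storage bucket you pointed the s3a connector at in the %%configure cell. The snippet below is a minimal sketch with placeholder bucket and object paths (your-bucket-name, path/to/data.csv); adapt them to your own data.

    # Minimal sketch: read from and write to Object Storage through the s3a
    # connector configured above. Bucket and object paths are placeholders.
    df = spark.read.csv("s3a://your-bucket-name/path/to/data.csv", header=True, inferSchema=True)

    df.show(5)          # executed on the cluster's worker nodes
    print(df.count())   # row count computed remotely

    # Write the results back to the bucket as Parquet (path is a placeholder)
    df.write.mode("overwrite").parquet("s3a://your-bucket-name/output/data_parquet")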
How to delete a Data Lab for Apache Spark™
- From the Overview tab of your Data Lab cluster, click the Settings tab, then select Delete cluster.
- Enter DELETE in the confirmation pop-up to confirm your action.
- Click Delete Data Lab cluster.