Data Lab for Apache Spark™ - Quickstart
Data Lab for Apache Spark™ is a product designed to assist data scientists and data engineers in performing calculations on a remotely managed Apache Spark infrastructure.
It is composed of the following:
- Cluster: An Apache Spark cluster powered by a Kubernetes architecture.
- Notebook: A JupyterLab service operating on a dedicated node type.
Scaleway provides dedicated node types for both the notebook and the cluster. The cluster nodes are high-end machines built for intensive computations, featuring powerful CPUs or GPUs and substantial RAM.
The notebook, although capable of performing some local computations, primarily serves as a web interface for interacting with the Apache Spark cluster.
Before you start
To complete the actions presented below, you must have:
- A Scaleway account logged into the console
- Owner status or IAM permissions allowing you to perform actions in the intended Organization
- Optionally, an Object Storage bucket
How to create a Data Lab for Apache Spark™ cluster
- Click Data Lab under Data & Analytics on the side menu.
- Click Create Data Lab cluster. The creation wizard displays.
- Complete the following steps in the wizard:
  - Choose an Apache Spark version from the drop-down menu.
  - Select a worker node configuration. For this procedure, we recommend selecting a CPU rather than a GPU.
  - Enter the desired number of worker nodes.
  - Enter a name for your Data Lab.
  - Optionally, add a description and/or tags for your Data Lab.
  - Verify the estimated cost.
- Click Create Data Lab cluster to finish. You are directed to the Data Lab cluster overview page.
How to connect to your Data Lab
- Click Data Lab under Data & Analytics on the side menu. The Data Lab for Apache Spark™ page displays.
- Click the name of the Data Lab cluster you want to connect to. The cluster Overview page displays.
- Click Open Notebook in the Notebook section. You are directed to the notebook login page.
- Enter your API secret key when prompted for a password, then click Log in. You are directed to the notebook home screen.
How to run the demo file
Each Data Lab for Apache Spark™ comes with a default DatalabDemo.ipynb demonstration file for testing purposes. This file contains a preconfigured notebook environment that requires no modification to run.
Execute the cells in order to perform pre-determined operations on a dummy data set.
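If you want a preview of what the demo does before opening it, the sketch below shows the general kind of pre-determined operations such a notebook runs on a dummy data set: build a small DataFrame, inspect its schema, and aggregate it. It is illustrative only and does not reproduce the actual contents of DatalabDemo.ipynb; the column names and values are made up, and spark refers to the session object provided by the PySpark kernel.

    # Illustrative sketch only, not the contents of DatalabDemo.ipynb.
    # Assumes `spark` is the SparkSession provided by the PySpark kernel.
    from pyspark.sql import functions as F

    # A dummy data set: hypothetical names, ages, and amounts
    rows = [("alice", 34, 120.5), ("bob", 29, 80.0), ("carol", 41, 210.3)]
    df = spark.createDataFrame(rows, ["name", "age", "amount"])

    df.printSchema()

    # A simple aggregation executed on the cluster
    df.groupBy().agg(
        F.avg("age").alias("avg_age"),
        F.sum("amount").alias("total_amount"),
    ).show()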
How to set up a new Data Lab environment
- From the notebook Launcher tab, select PySpark under Notebook.
- In a new cell, copy and paste the code below and replace the placeholders with your API access key, secret key, and the endpoint of your Object Storage bucket to set up the Apache Spark session:

    %%configure -f
    {
        "name": "My Spark",
        "conf": {
            "spark.hadoop.fs.s3a.access.key": "your-api-access-key",
            "spark.hadoop.fs.s3a.secret.key": "your-api-secret-key",
            "spark.hadoop.fs.s3a.endpoint": "your-bucket-endpoint"
        }
    }
- In a new cell below, copy and paste the following command to initialize the Apache Spark session:

    from pyspark.sql.types import StructType, StructField, LongType, DoubleType, StringType
- Execute the two cells you just created. Once initialized, the Spark session information displays.

You can now execute commands that will run on the resources defined when creating the Data Lab for Apache Spark™ cluster.
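For example, a typical follow-up command reads data from the Object Storage bucket you pointed the s3a connector at in the %%configure cell. The snippet below is a minimal sketch with placeholder bucket and object paths (your-bucket-name, path/to/data.csv); adapt them to your own data.

    # Minimal sketch: read from and write to Object Storage through the s3a
    # connector configured above. Bucket and object paths are placeholders.
    df = spark.read.csv("s3a://your-bucket-name/path/to/data.csv", header=True, inferSchema=True)

    df.show(5)          # executed on the cluster's worker nodes
    print(df.count())   # row count computed remotely

    # Write the results back to the bucket as Parquet (path is a placeholder)
    df.write.mode("overwrite").parquet("s3a://your-bucket-name/output/data_parquet")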
How to delete a Data Lab for Apache Spark™
- From the Overview tab of your Data Lab cluster, click the Settings tab, then select Delete cluster.
- Enter DELETE in the confirmation pop-up to confirm your action.
- Click Delete Data Lab cluster.