How to use Private Networks with your Data Lab cluster
Private Networks allow your Data Lab for Apache Spark™ cluster to communicate in an isolated and secure network without needing to be connected to the public internet.
At the moment, Data Lab clusters can only be attached to a Private Network during their creation, and cannot be detached and reattached to another Private Network afterward.
For full information about Scaleway Private Networks and VPC, see our dedicated documentation and best practices guide.
Before you start
To complete the actions presented below, you must have:
- A Scaleway account logged into the console
- Owner status or IAM permissions allowing you to perform actions in the intended Organization
- Created a Private Network
- Created an Ubuntu Instance attached to a Private Network
How to use a cluster through a Private Network
Setting up your Instance
- Run the command below from the shell of your Instance to install the required dependencies:

    sudo apt update
    sudo apt install -y \
      build-essential curl git \
      libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev \
      libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev \
      openjdk-17-jre-headless tmux

- Run the command below to install pyenv:

    curl https://pyenv.run | bash

- Run the commands below to add pyenv to your Bash configuration:

    echo 'export PATH="$HOME/.pyenv/bin:$PATH"' >> ~/.bashrc
    echo 'eval "$(pyenv init -)"' >> ~/.bashrc
    echo 'eval "$(pyenv virtualenv-init -)"' >> ~/.bashrc

- Run the command below to reload your shell:

    exec $SHELL

- Run the commands below to install Python 3.13 and activate a virtual environment:

    pyenv install 3.13.0
    pyenv virtualenv 3.13.0 jupyter-spark-3.13
    pyenv activate jupyter-spark-3.13

- Run the commands below to install JupyterLab and PySpark inside the virtual environment:

    pip install --upgrade pip
    pip install jupyterlab pyspark

- Run the command below to generate a configuration file for your JupyterLab:

    jupyter lab --generate-config

- Open the configuration file you just created:

    nano ~/.jupyter/jupyter_lab_config.py

- Set the following parameters:

    # if running as root:
    c.ServerApp.allow_root = True
    c.ServerApp.port = 8888
    # optional authentication token:
    # c.ServerApp.token = "your-super-secure-password"

- Run the command below to start JupyterLab:

    jupyter lab

- In a new terminal on your local machine, open an SSH tunnel to your Instance to reach JupyterLab. The public IP can be found in the Overview tab of your Instance:

    ssh -L 8888:127.0.0.1:8888 <user>@<instance-public-ip>

- Access http://localhost:8888, then enter the token generated when you executed the jupyter lab command.
You now have access to your Data Lab for Apache Spark™ cluster via a Private Network, using a JupyterLab notebook deployed on an Instance.
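Optionally, before connecting to the cluster in the next section, you can confirm that PySpark itself works inside your JupyterLab environment by running a purely local session in a notebook cell. This is a minimal sketch that does not involve the Data Lab cluster; the application name is arbitrary and chosen for illustration only:

    from pyspark.sql import SparkSession

    # Local-only session: runs entirely on the Instance, no cluster involved.
    local_spark = (
        SparkSession.builder
        .appName("local-pyspark-check")  # arbitrary name, for illustration
        .master("local[*]")
        .getOrCreate()
    )

    local_spark.range(5).show()
    local_spark.stop()  # stop it before creating the cluster-backed session below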
Running a sample workload over Private Networks
- In a new Jupyter notebook file, add the code below to a new cell:

    from pyspark.sql import SparkSession

    MASTER_URL = "<SPARK_MASTER_ENDPOINT>"  # "spark://master-datalab-[...]:7077" format
    DRIVER_HOST = "<INSTANCE_PN_IP>"        # "XX.XX.XX.XX" format

    spark = (
        SparkSession.builder
        .appName("jupyter-from-vpc-instance-test")
        .master(MASTER_URL)
        # make sure executors can talk back to this driver
        .config("spark.driver.host", DRIVER_HOST)
        .config("spark.driver.bindAddress", "0.0.0.0")
        .config("spark.driver.port", "7078")
        .config("spark.blockManager.port", "7079")
        .config("spark.ui.port", "4040")
        .getOrCreate()
    )

    spark.range(10).show()

- Replace the placeholders with the appropriate values:
  - <SPARK_MASTER_ENDPOINT> can be found in the Overview tab of your cluster, under Private endpoint in the Network section.
  - <INSTANCE_PN_IP> can be found in the Private Networks tab of your Instance. Make sure to only copy the IP, and not the /22 part.
- Run the cell.
Your notebook hosted on an Instance is ready to be used over Private Networks.
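To go one step further, you can run a slightly larger, distributed aggregation in a follow-up cell to confirm that work is actually executed by the cluster's executors. The sketch below reuses the spark session created above; the row count and column names are arbitrary and only meant for illustration:

    from pyspark.sql import functions as F

    # Distributed aggregation executed by the cluster's executors.
    df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

    (df.groupBy("bucket")
       .agg(F.count("*").alias("rows"), F.avg("id").alias("avg_id"))
       .orderBy("bucket")
       .show())

While the job runs, its stages also appear in the Apache Spark™ UI of your cluster.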
Running an application over Private Networks using spark-submit
- Run the commands below from the shell of your Instance to install the required dependencies and check the Java version:

    sudo apt update
    sudo apt install -y openjdk-17-jdk curl wget tar
    java -version

- Run the commands below to download Apache Spark™:

    cd ~
    wget https://archive.apache.org/dist/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz

- Run the commands below to extract the archive:

    sudo mkdir -p /opt/spark
    sudo tar -xzf spark-4.0.0-bin-hadoop3.tgz -C /opt/spark --strip-components=1

- Run the commands below to add Apache Spark™ to your Bash configuration and reload your bash session:

    echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
    echo 'export PATH="$SPARK_HOME/bin:$PATH"' >> ~/.bashrc
    source ~/.bashrc

- Install Python 3.13 if you have not done so yet, then set the environment variables below:

    export PYSPARK_PYTHON=$(which python)  # should point to Python 3.13
    export PYSPARK_DRIVER_PYTHON=$(which python)

- Run the command below to execute spark-submit and calculate pi over 100 iterations. Do not forget to replace the placeholders with the appropriate values:

    spark-submit \
      --master spark://<SPARK_MASTER_ENDPOINT>:7077 \
      --deploy-mode client \
      --conf spark.driver.port=7078 \
      --conf spark.blockManager.port=7079 \
      --conf spark.driver.host=<INSTANCE_PN_IP> \
      $SPARK_HOME/examples/src/main/python/pi.py 100

- Access the Apache Spark™ UI of your cluster. The list of completed applications displays. From here, you can inspect the jobs previously started using spark-submit.
You have successfully run workloads on your cluster from an Instance over a Private Network.
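The same spark-submit invocation also works for your own PySpark applications, not just the bundled pi.py example. Below is a minimal, hypothetical script (saved for instance as ~/my_app.py, a name chosen for illustration); submit it by replacing the $SPARK_HOME/examples/src/main/python/pi.py 100 argument with the path to this file:

    # my_app.py - a minimal PySpark application submitted with spark-submit (illustrative).
    from pyspark.sql import SparkSession, functions as F

    # spark-submit supplies the master URL, driver host, and ports on the command line,
    # so the script only needs to create (or reuse) the session.
    spark = SparkSession.builder.appName("my-app-over-private-network").getOrCreate()

    # Small distributed aggregation, executed by the cluster's executors.
    df = spark.range(100_000).withColumn("parity", F.col("id") % 2)
    df.groupBy("parity").count().show()

    spark.stop()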