How to use Private Networks with your Data Lab cluster

Private Networks allow your Data Lab for Apache Spark™ cluster to communicate in an isolated and secure network without needing to be connected to the public internet.

At the moment, Data Lab clusters can only be attached to a Private Network during their creation; they cannot be detached or reattached to another Private Network afterward.

For full information about Scaleway Private Networks and VPC, see our dedicated documentation and best practices guide.

Before you start

To complete the actions presented below, you must have:

  • A Scaleway account logged into the console
  • A Data Lab for Apache Spark™ cluster attached to a Private Network
  • An Instance attached to the same Private Network as your cluster
  • An SSH key registered on your account, allowing you to connect to your Instance

How to use a cluster through a Private Network

Setting up your Instance

  1. Connect to your Instance via SSH.

  2. Run the commands below from the shell of your Instance to install the required dependencies:

    sudo apt update
    sudo apt install -y \
      build-essential curl git \
      libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev \
      libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev \
      openjdk-17-jre-headless tmux
  3. Run the command below to install pyenv:

    curl https://pyenv.run | bash
  4. Run the commands below to add pyenv to your Bash configuration:

    echo 'export PATH="$HOME/.pyenv/bin:$PATH"' >> ~/.bashrc
    echo 'eval "$(pyenv init -)"' >> ~/.bashrc
    echo 'eval "$(pyenv virtualenv-init -)"' >> ~/.bashrc
  5. Run the command below to reload your shell:

    exec $SHELL
  6. Run the commands below to install Python 3.13, then create and activate a virtual environment:

    pyenv install 3.13.0
    pyenv virtualenv 3.13.0 jupyter-spark-3.13
    pyenv activate jupyter-spark-3.13
    Note

    Your Instance's Python version must match the Python 3.13 version used by the cluster's workers. If you encounter an error due to a mismatch between the worker and driver Python versions, run the following command to list the available 3.13 patch versions, then reinstall using the exact one that matches your cluster:

    pyenv install -l | grep 3.13
  7. Run the commands below to install JupyterLab and PySpark inside the virtual environment:

    pip install --upgrade pip
    pip install jupyterlab pyspark
  8. Run the command below to generate a configuration file for your JupyterLab:

    jupyter lab --generate-config
  9. Open the configuration file you just created:

    nano ~/.jupyter/jupyter_lab_config.py
  10. Set the following parameters:

    # if running as root:
    c.ServerApp.allow_root = True
    c.ServerApp.port = 8888
    # optional authentication token:
    # c.ServerApp.token = "your-super-secure-password"
  11. Run the command below to start JupyterLab:

    jupyter lab
  12. In a new terminal on your local machine, open an SSH tunnel to your Instance to forward JupyterLab's port. The Instance's public IP can be found in the Overview tab of your Instance:

    ssh -L 8888:127.0.0.1:8888 <user>@<instance-public-ip>
    Note

    Make sure you have allowed root connections in your configuration file if you log in as the root user.

  13. Access http://localhost:8888 in your browser, then enter the token displayed in the output of the jupyter lab command (or the one set in your configuration file).

You now have access to your Data Lab for Apache Spark™ cluster via a Private Network, using a JupyterLab notebook deployed on an Instance.
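
Tip

Since tmux is installed with the dependencies above, you can optionally start JupyterLab inside a tmux session so that it keeps running after you disconnect from the Instance. A minimal sketch:

    tmux new -s jupyter                 # start a named tmux session
    pyenv activate jupyter-spark-3.13   # activate the virtual environment inside the session
    jupyter lab                         # start JupyterLab

Detach from the session with Ctrl+B then D, and reattach later with tmux attach -t jupyter.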

Running a sample workload over Private Networks

  1. In a new Jupyter notebook file, add the code below to a new cell:

    from pyspark.sql import SparkSession

    MASTER_URL = "<SPARK_MASTER_ENDPOINT>"  # "spark://master-datalab-[...]:7077" format
    DRIVER_HOST = "<INSTANCE_PN_IP>"  # "XX.XX.XX.XX" format

    spark = (
        SparkSession.builder
        .appName("jupyter-from-vpc-instance-test")
        .master(MASTER_URL)
        # make sure executors can talk back to this driver
        .config("spark.driver.host", DRIVER_HOST)
        .config("spark.driver.bindAddress", "0.0.0.0")
        .config("spark.driver.port", "7078")
        .config("spark.blockManager.port", "7079")
        .config("spark.ui.port", "4040")
        .getOrCreate()
    )

    spark.range(10).show()
  2. Replace the placeholders with the appropriate values:

    • <SPARK_MASTER_ENDPOINT> can be found in the Overview tab of your cluster, under Private endpoint in the Network section.
    • <INSTANCE_PN_IP> can be found in the Private Networks tab of your Instance. Make sure to only copy the IP, and not the /22 part.
  3. Run the cell.

Your notebook hosted on an Instance is ready to be used over Private Networks.
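
Optionally, you can run a slightly larger computation to confirm that work is distributed across the cluster, and stop the session to release its executors once you are done. A minimal follow-up cell, reusing the spark session created above:

    # Run a small aggregation across the cluster's executors
    df = spark.range(1_000_000).selectExpr("id % 10 AS bucket")
    df.groupBy("bucket").count().orderBy("bucket").show()

    # Stop the session to release the cluster's resources
    spark.stop()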

Running an application over Private Networks using spark-submit

  1. Connect to your Instance via SSH.

  2. Run the commands below from the shell of your Instance to install the required dependencies:

    sudo apt update
    sudo apt install -y openjdk-17-jdk curl wget tar
    java -version
  3. Run the commands below to download Apache Spark™:

    cd ~
    wget https://archive.apache.org/dist/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz
  4. Run the commands below to extract the archive:

    sudo mkdir -p /opt/spark
    sudo tar -xzf spark-4.0.0-bin-hadoop3.tgz -C /opt/spark --strip-components=1
  5. Run the commands below to add Apache Spark™ to your Bash configuration, and reload your Bash session:

    echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
    echo 'export PATH="$SPARK_HOME/bin:$PATH"' >> ~/.bashrc
    source ~/.bashrc
  6. Install Python 3.13 if you have not done so already, then set the environment variables below:

    export PYSPARK_PYTHON=$(which python)  # should point to Python 3.13
    export PYSPARK_DRIVER_PYTHON=$(which python)
  7. Run the command below to execute spark-submit with the bundled pi.py example, which estimates pi using 100 partitions. Do not forget to replace the placeholders with the appropriate values:

    spark-submit \
      --master spark://<SPARK_MASTER_ENDPOINT>:7077 \
      --deploy-mode client \
      --conf spark.driver.port=7078 \
      --conf spark.blockManager.port=7079 \
      --conf spark.driver.host=<INSTANCE_PN_IP> \
      $SPARK_HOME/examples/src/main/python/pi.py 100
    Note
    • <SPARK_MASTER_ENDPOINT> can be found in the Overview tab of your cluster, under Private endpoint in the Network section.
    • <INSTANCE_PN_IP> can be found in the Private Networks tab of your Instance. Make sure to only copy the IP, and not the /22 part.
  8. Access the Apache Spark™ UI of your cluster. The list of completed applications displays. From here, you can inspect the jobs previously started using spark-submit.

You have successfully run workloads on your cluster from an Instance over a Private Network.
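
To submit an application of your own instead of the bundled example, point spark-submit at your script. A minimal sketch, using a hypothetical test application saved as ~/my_app.py on the Instance:

    # Content of a hypothetical test application, saved as ~/my_app.py
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("my-app-over-pn").getOrCreate()
    print(spark.range(1000).count())  # count the rows of a generated DataFrame
    spark.stop()

Submit it with the same Private Network configuration as the example above:

    spark-submit \
      --master spark://<SPARK_MASTER_ENDPOINT>:7077 \
      --deploy-mode client \
      --conf spark.driver.port=7078 \
      --conf spark.blockManager.port=7079 \
      --conf spark.driver.host=<INSTANCE_PN_IP> \
      ~/my_app.py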
