Data Lab for Apache Spark™ FAQ

Overview

What is Apache Spark™?

Apache Spark™ is an open-source unified analytics engine designed for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark™ offers high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.

How does Apache Spark™ work?

Apache Spark™ processes data in memory, which allows it to perform tasks up to 100 times faster than traditional disk-based processing frameworks like Hadoop MapReduce. It uses Resilient Distributed Datasets (RDDs) to distribute data across the nodes of a cluster and run operations on it in parallel.
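
For illustration, here is a minimal, generic PySpark sketch (not specific to Data Lab) showing how an RDD partitions data across the cluster and runs a transformation in parallel:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Partition one million numbers into 8 slices and square them in parallel;
# the map and the sum both run across the cluster's executors, in memory.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
total = rdd.map(lambda x: x * x).sum()
print(total)

spark.stop()
```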

What workloads is Data Lab for Apache Spark™ suited for?

Data Lab for Apache Spark™ supports a range of workloads, including:

  • Complex analytics
  • Machine learning tasks
  • High-speed operations on large datasets

It offers scalable CPU and GPU Instances with flexible node limits and robust Apache Spark™ library support.
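As a generic illustration of the analytics side, the following PySpark sketch aggregates a large Parquet dataset; the bucket path and the `timestamp` and `user_id` columns are placeholders, not part of the Data Lab product:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("analytics-demo").getOrCreate()

# Hypothetical dataset: replace the path and columns with your own.
events = spark.read.parquet("s3a://my-bucket/events/")

daily = (
    events.groupBy(F.to_date("timestamp").alias("day"))
    .agg(
        F.count("*").alias("events"),
        F.approx_count_distinct("user_id").alias("unique_users"),
    )
    .orderBy("day")
)
daily.show()
```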

Offering and availability

What notebook is included with Dedicated Data Labs?

The service provides a JupyterLab notebook running on a dedicated CPU Instance, fully integrated with the Apache Spark™ cluster for seamless data processing and calculations.
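
As a rough sketch of what running code in such a notebook looks like (the exact session bootstrapping depends on the Data Lab environment, so treat this as generic PySpark rather than the precise setup):

```python
from pyspark.sql import SparkSession

# In a notebook attached to a Spark cluster, a session is typically
# already configured; getOrCreate() reuses it if one exists.
spark = SparkSession.builder.appName("notebook-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
df.groupBy("label").count().show()  # computed on the cluster's worker nodes
```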

Pricing and billing

How am I billed for Data Lab for Apache Spark™?

Data Lab for Apache Spark™ is billed based on the following factors:

  • The main node configuration selected.
  • The worker node configuration selected, and the number of worker nodes in the cluster.
  • The persistent volume size provisioned.
  • Whether a notebook is provisioned.

Compatibility and integration

Can I run a Data Lab for Apache Spark™ using GPUs?

Yes, you can run your cluster on either CPUs or GPUs. Scaleway leverages NVIDIA's RAPIDS Accelerator for Apache Spark™, an open-source suite of software libraries and APIs for executing end-to-end data science and analytics pipelines entirely on GPUs. This can significantly accelerate data processing tasks compared to CPU-based processing.
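
For context, this is how the RAPIDS Accelerator is typically enabled in open-source Spark; a managed Data Lab cluster presumably preconfigures this for you, so the snippet is purely illustrative. The plugin class and config keys come from the RAPIDS Accelerator documentation:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gpu-demo")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    .getOrCreate()
)

# Supported DataFrame/SQL operations now run on the GPU; anything the
# plugin cannot handle falls back to the CPU automatically.
spark.range(100_000_000).selectExpr("sum(id)").show()
```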

Can I connect a separate notebook environment to the Data Lab?

Yes, you can connect an external notebook environment to your cluster via Private Networks.

Refer to the dedicated documentation for comprehensive information on how to connect to a Data Lab for Apache Spark™ cluster over Private Networks.
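
As a purely hypothetical sketch, if the cluster exposes a Spark Connect endpoint on the Private Network, a remote PySpark (3.4+) notebook could attach like this; the hostname and port are placeholders, and the dedicated documentation describes the actual procedure:

```python
from pyspark.sql import SparkSession

# Placeholder endpoint: replace with the address from your Private
# Network setup. Requires a Spark Connect-enabled PySpark client.
spark = (
    SparkSession.builder
    .remote("sc://my-datalab.internal:15002")
    .getOrCreate()
)

spark.range(10).count()  # executed on the remote cluster
```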

Usage and management

Can I upscale or downscale a Data Lab for Apache Spark™?

Yes, you can upscale a Data Lab cluster to distribute your workloads across more worker nodes for faster processing. You can also scale it down to zero to reduce costs, while retaining your configuration and context.

You can still access the notebook of a Data Lab cluster with zero worker nodes, but you cannot perform any calculations. You can resume the activity of your cluster by provisioning at least one worker node.
