This repository contains a set of configurations for creating PySpark and Python development environments in VSCode using Docker containers. PySpark development supports two strategies: running Spark in local mode or in standalone mode.
Environment | Configurations | Spark Mode |
---|---|---|
Python | python-devenv-wks | - |
Pyspark | pyspark-devenv-local-wks | Local |
Pyspark | pyspark-devenv-standalone-wks | Standalone |
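From the application's point of view, the two Spark modes differ mainly in the master URL a session connects to. The following is a minimal sketch of that difference, not this repository's actual configuration: the helper `master_url` and the host name `spark-master` are hypothetical, and 7077 is only the usual default port of a Spark standalone master.

```python
def master_url(mode: str, cores: int = 2,
               host: str = "spark-master", port: int = 7077) -> str:
    """Return a Spark master URL for the given mode.

    `host` and `port` are illustrative assumptions, not values taken
    from this repository.
    """
    if mode == "local":
        # Local mode: driver and executors run inside one JVM,
        # using `cores` worker threads.
        return f"local[{cores}]"
    if mode == "standalone":
        # Standalone mode: connect to a separately running Spark master.
        return f"spark://{host}:{port}"
    raise ValueError(f"unknown Spark mode: {mode!r}")
```

The returned string could then be passed to `SparkSession.builder.master(...)` when a session is created by hand.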
To get started, follow these steps:
- Install Visual Studio Code (VSCode).
- Install and configure Docker for your operating system.
- Install the Remote Development extension pack.
- Download the devcontainer files (zip archive) for the desired VSCode development environment and unzip them to a location of your choice.
- Configure the environment variables, if needed (see section Environment variables):
  - Rename the file .env_template inside the .devcontainer folder to .env.
  - Add or modify the environment variables with the values that suit you best.
- Start VSCode.
- Open View → Command Palette and search for Remote-Containers: Open Folder in Container....
- Select Remote-Containers: Open Folder in Container...
- Select the folder pyspark-devenv-{spark mode}-wks or python-devenv-wks that contains the .devcontainer folder.
- Let the process run until the installation is complete.
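The environment-file step above can also be scripted. The sketch below is one possible approach under stated assumptions: it assumes the template uses plain `KEY=VALUE` lines, it copies `.env_template` to `.env` rather than renaming it (so the template is kept), and the helper name `create_env_file` is hypothetical.

```python
import shutil
from pathlib import Path
from typing import Dict, Optional


def create_env_file(workspace: Path,
                    overrides: Optional[Dict[str, str]] = None) -> Path:
    """Create .devcontainer/.env from .env_template and apply overrides."""
    devcontainer = Path(workspace) / ".devcontainer"
    env_file = devcontainer / ".env"
    # Copy instead of rename so the original template stays available.
    shutil.copyfile(devcontainer / ".env_template", env_file)

    remaining = dict(overrides or {})
    updated = []
    for line in env_file.read_text().splitlines():
        key = line.split("=", 1)[0].strip()
        is_assignment = "=" in line and not line.lstrip().startswith("#")
        if is_assignment and key in remaining:
            # Replace the value of a variable already present in the template.
            updated.append(f"{key}={remaining.pop(key)}")
        else:
            updated.append(line)
    # Append variables that the template did not define.
    updated.extend(f"{key}={value}" for key, value in remaining.items())
    env_file.write_text("\n".join(updated) + "\n")
    return env_file
```

For example, `create_env_file(Path("pyspark-devenv-local-wks"), {"JUPYTER_PORT": "9999"})` would write a `.env` file with the Jupyter port changed and every other line of the template left as it was.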
Variable | Description | Default Value |
---|---|---|
JUPYTER_PORT | Port used to access the Jupyter environment | 8888 |
GIT_EMAIL | Email used to configure git | default |
GIT_USERNAME | Username used to configure git | default |
DISABLE_JUPYTER | Set to 1 to disable the Jupyter environment | 0 |
JUPYTER_ALLOW_ORIGIN | Origin address allowed to access your Jupyter server | 0.0.0.0 |
JUPYTER_PASSWORD | Hashed password used to access your Jupyter server | hashed string of "devuser" |
TAG | Tag of the Docker image for the development environment (see tags below) | latest |
Variable | Description | Default Value |
---|---|---|
JUPYTER_SPARk_MEMORY | Amount of memory that the Spark instance used by Jupyter may consume | 2g |
JUPYTER_SPARK_CORES | Number of CPU cores that the Spark instance used by Jupyter may use | 2 |
SPARK_EXECUTOR_MEMORY | Amount of memory that a Spark executor may consume | 2g |
Variable | Description | Default Value |
---|---|---|
SPARK_WORKER_CORES | Number of CPU cores that a Spark worker can use | 2 |
SPARK_WORKER_MEMORY | Amount of memory that a Spark worker can use | 4g |
HISTORY_CLEANER_INTERVAL | How often the filesystem job history cleaner checks for files to delete | 1d |
HISTORY_MAX_AGE | History files older than this value are deleted when the filesystem history cleaner runs | 7d |
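The last two variables in the table above read like the Spark history server cleaner settings `spark.history.fs.cleaner.interval` and `spark.history.fs.cleaner.maxAge`, whose Spark defaults are also `1d` and `7d`. The sketch below shows how such a mapping could look. That this repository wires the variables this way is an assumption, and the helper `history_cleaner_conf` is hypothetical.

```python
import os
from typing import Dict, Mapping, Optional

# Default values taken from the table above.
DEFAULTS = {"HISTORY_CLEANER_INTERVAL": "1d", "HISTORY_MAX_AGE": "7d"}


def history_cleaner_conf(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Translate the environment variables into Spark history server properties.

    The mapping is an assumption about how the container uses the variables.
    """
    env = os.environ if env is None else env
    return {
        # The cleaner only runs when it is explicitly enabled.
        "spark.history.fs.cleaner.enabled": "true",
        "spark.history.fs.cleaner.interval": env.get(
            "HISTORY_CLEANER_INTERVAL", DEFAULTS["HISTORY_CLEANER_INTERVAL"]),
        "spark.history.fs.cleaner.maxAge": env.get(
            "HISTORY_MAX_AGE", DEFAULTS["HISTORY_MAX_AGE"]),
    }
```

Each key/value pair corresponds to one line of the form `key value` in a `spark-defaults.conf` file.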
python-devenv-wks |
---|
3.10 |
3.8 |
pyspark-devenv-local-wks/pyspark-devenv-standalone-wks | Python version | Spark version |
---|---|---|
3.10-3.4.0 (latest) | 3.10 | 3.4.0 |
3.10-3.3.2 | 3.10 | 3.3.2 |
3.8-3.2.1 | 3.8 | 3.2.1 |
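The PySpark tags in the table above follow the pattern `{python version}-{spark version}`. A small sketch of splitting such a tag is shown below. The names `ImageTag` and `parse_tag` are hypothetical, and the alias `latest` is deliberately rejected because it carries no version information.

```python
from typing import NamedTuple


class ImageTag(NamedTuple):
    python_version: str
    spark_version: str


def parse_tag(tag: str) -> ImageTag:
    """Split a PySpark image tag such as '3.10-3.4.0' into its two versions."""
    python_version, sep, spark_version = tag.partition("-")
    if not sep or not python_version or not spark_version:
        raise ValueError(f"not a '<python>-<spark>' tag: {tag!r}")
    return ImageTag(python_version, spark_version)
```

With the tags listed above, `parse_tag("3.8-3.2.1")` yields Python `3.8` and Spark `3.2.1`, which is the value that would go into the `TAG` variable of the `.env` file.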