Add generate config docs
Wh1isper committed Jul 28, 2023
1 parent 9c008bb commit eff30cb
Showing 5 changed files with 180 additions and 8 deletions.
7 changes: 7 additions & 0 deletions .pre-commit-config.yaml
@@ -1,4 +1,11 @@
repos:
  - repo: local
    hooks:
      - id: generate-config
        name: Generate config
        entry: python ./generate_config_docs.py
        language: system

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
14 changes: 14 additions & 0 deletions config-template.md
@@ -0,0 +1,14 @@
# Spark Configuration

The exact explanations and defaults for Spark configuration can be found [here](https://spark.apache.org/docs/latest/configuration.html); `None` means the Spark native default is used.

# Configure a PySpark Session via environment variables

> Generated by [generate_config_docs.py](./generate_config_docs.py)
> Run `python ./generate_config_docs.py` to update this file
{docs}

# TIPS

S3 secrets, tokens (and other credentials) only need to be configured on the `Driver` or `Connect Server`; configuration on the `Connect Client` has no effect.
90 changes: 88 additions & 2 deletions config.md
@@ -1,4 +1,90 @@
It will be ready as soon as possible after Release 0.1.0; until then, you can refer to the source code file [sparglim/config/configer.py](sparglim/config/configer.py)
# Spark Configuration

The exact explanations and defaults for Spark configuration can be found [here](https://spark.apache.org/docs/latest/configuration.html); `None` means the Spark native default is used.

TODO: Generate available envs for configuring the Spark session
# Configure a PySpark Session via environment variables

> Generated by [generate_config_docs.py](./generate_config_docs.py)
> Run `python ./generate_config_docs.py` to update this file
Source code: sparglim/config/configer.py

Available environment variables for SparkEnvConfiger:

Default config:

- `SPAGLIM_APP_NAME`: `spark.app.name`, default: `Sparglim`.
- `SPAGLIM_DEPLOY_MODE`: `spark.submit.deployMode`, default: `client`.
- `SPARGLIM_SCHEDULER_MODE`: `spark.scheduler.mode`, default: `FAIR`.
- `SPARGLIM_UI_PORT`: `spark.ui.port`, default: `None`.
- `S3_ACCESS_KEY` or `AWS_ACCESS_KEY_ID`: `spark.hadoop.fs.s3a.access.key`, default: `None`.
- `S3_SECRET_KEY` or `AWS_SECRET_ACCESS_KEY`: `spark.hadoop.fs.s3a.secret.key`, default: `None`.
- `S3_ENTRY_POINT`: `spark.hadoop.fs.s3a.endpoint`, default: `None`.
- `S3_ENTRY_POINT_REGION` or `AWS_DEFAULT_REGION`: `spark.hadoop.fs.s3a.endpoint.region`, default: `None`.
- `S3_PATH_STYLE_ACCESS`: `spark.hadoop.fs.s3a.path.style.access`, default: `None`.
- `S3_MAGIC_COMMITTER`: `spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled`, default: `None`.
- `SPARGIM_KERBEROS_KEYTAB`: `spark.kerberos.keytab`, default: `None`.
- `SPARGIM_KERBEROS_PRINCIPAL`: `spark.kerberos.principal`, default: `None`.
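
A sketch of how these defaults get picked up, assuming the env-var names above and that `SparkEnvConfiger` applies `default_config_mapper` during `initialize()` (called from `__init__`):

```python
import os

# Placeholders: set env vars before constructing the configer.
os.environ["SPAGLIM_APP_NAME"] = "my-app"  # -> spark.app.name
os.environ["SPARGLIM_UI_PORT"] = "4040"    # -> spark.ui.port

from sparglim.config.configer import SparkEnvConfiger

configer = SparkEnvConfiger()  # initialize() reads the mapping above
```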

`config_basic()` can configure the following:

- `SPAGLIM_APP_NAME`: `spark.app.name`, default: `Sparglim`.
- `SPAGLIM_DEPLOY_MODE`: `spark.submit.deployMode`, default: `client`.
- `SPARGLIM_SCHEDULER_MODE`: `spark.scheduler.mode`, default: `FAIR`.
- `SPARGLIM_UI_PORT`: `spark.ui.port`, default: `None`.

`config_s3()` can configure the following:

- `S3_ACCESS_KEY` or `AWS_ACCESS_KEY_ID`: `spark.hadoop.fs.s3a.access.key`, default: `None`.
- `S3_SECRET_KEY` or `AWS_SECRET_ACCESS_KEY`: `spark.hadoop.fs.s3a.secret.key`, default: `None`.
- `S3_ENTRY_POINT`: `spark.hadoop.fs.s3a.endpoint`, default: `None`.
- `S3_ENTRY_POINT_REGION` or `AWS_DEFAULT_REGION`: `spark.hadoop.fs.s3a.endpoint.region`, default: `None`.
- `S3_PATH_STYLE_ACCESS`: `spark.hadoop.fs.s3a.path.style.access`, default: `None`.
- `S3_MAGIC_COMMITTER`: `spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled`, default: `None`.
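
A minimal S3 sketch, assuming a MinIO-style endpoint (all values below are placeholders). Per the source, an explicit `custom_config` dict passed to `config_s3()` overrides env-derived entries:

```python
import os

os.environ["S3_ACCESS_KEY"] = "minioadmin"          # placeholder credentials
os.environ["S3_SECRET_KEY"] = "minioadmin"
os.environ["S3_ENTRY_POINT"] = "http://minio:9000"  # placeholder endpoint
os.environ["S3_PATH_STYLE_ACCESS"] = "true"

from sparglim.config.configer import SparkEnvConfiger

configer = SparkEnvConfiger().config_s3(
    # Explicit entries take precedence over env-derived ones.
    {"spark.hadoop.fs.s3a.connection.ssl.enabled": "false"}
)
```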

`config_kerberos()` can configure the following:

- `SPARGIM_KERBEROS_KEYTAB`: `spark.kerberos.keytab`, default: `None`.
- `SPARGIM_KERBEROS_PRINCIPAL`: `spark.kerberos.principal`, default: `None`.
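
Each `config_*()` method also accepts an optional `custom_config` dict, so values can be passed directly instead of via env vars — a sketch with placeholder keytab and principal:

```python
from sparglim.config.configer import SparkEnvConfiger

configer = SparkEnvConfiger().config_kerberos(
    {
        "spark.kerberos.keytab": "/etc/security/keytabs/spark.keytab",  # placeholder path
        "spark.kerberos.principal": "spark/host@EXAMPLE.COM",           # placeholder principal
    }
)
```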

`config_local()` can configure the following:

- `SPARGLIM_MASTER`: `spark.master`, default: `local[*]`.
- `SPARGLIM_LOCAL_MEMORY`: `spark.driver.memory`, default: `512m`.

`config_connect_client()` can configure the following:

- `SPARGLIM_REMOTE`: `spark.remote`, default: `sc://localhost:15002`.
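
A sketch of pointing a Connect client at a remote server (the address is a placeholder):

```python
import os

os.environ["SPARGLIM_REMOTE"] = "sc://spark-connect.example.com:15002"  # placeholder address

from sparglim.config.configer import SparkEnvConfiger

configer = SparkEnvConfiger().config_connect_client()  # sets spark.remote from the env
```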

`config_connect_server()` can configure the following:

- `SPARGLIM_CONNECT_SERVER_PORT`: `spark.connect.grpc.binding.port`, default: `None`.
- `SPARGLIM_CONNECT_GRPC_ARROW_MAXBS`: `spark.connect.grpc.arrow.maxBatchSize`, default: `None`.
- `SPARGLIM_CONNECT_GRPC_MAXIM`: `spark.connect.grpc.maxInboundMessageSize`, default: `None`.

`config_k8s()` can configure the following:

- `SPARGLIM_MASTER`: `spark.master`, default: `k8s://https://kubernetes.default.svc`.
- `SPARGLIM_K8S_NAMESPACE`: `spark.kubernetes.namespace`, default: `None`.
- `SPARGLIM_K8S_IMAGE`: `spark.kubernetes.container.image`, default: `wh1isper/spark-executor:3.4.1`.
- `SPARGLIM_K8S_IMAGE_PULL_SECRETS`: `spark.kubernetes.container.image.pullSecrets`, default: `None`.
- `SPARGLIM_K8S_IMAGE_PULL_POLICY`: `spark.kubernetes.container.image.pullPolicy`, default: `IfNotPresent`.
- `SPARK_EXECUTOR_NUMS`: `spark.executor.instances`, default: `3`.
- `SPARGLIM_K8S_EXECUTOR_LABEL_LIST`: `spark.kubernetes.executor.label.*`, default: `sparglim-executor`. A string separated by `,` will be converted into multiple entries.
- `SPARGLIM_K8S_EXECUTOR_ANNOTATION_LIST`: `spark.kubernetes.executor.annotation.*`, default: `sparglim-executor`. A string separated by `,` will be converted into multiple entries.
- `SPARGLIM_DRIVER_HOST`: `spark.driver.host`, default: `None`.
- `SPARGLIM_DRIVER_BINDADDRESS`: `spark.driver.bindAddress`, default: `0.0.0.0`.
- `SPARGLIM_DRIVER_POD_NAME`: `spark.kubernetes.driver.pod.name`, default: `None`.
- `SPARGLIM_K8S_EXECUTOR_REQUEST_CORES`: `spark.kubernetes.executor.cores`, default: `None`.
- `SPARGLIM_K8S_EXECUTOR_LIMIT_CORES`: `spark.kubernetes.executor.limit.cores`, default: `None`.
- `SPARGLIM_EXECUTOR_REQUEST_MEMORY`: `spark.executor.memory`, default: `512m`.
- `SPARGLIM_EXECUTOR_LIMIT_MEMORY`: `spark.executor.memoryOverhead`, default: `None`.
- `SPARGLIM_K8S_GPU_VENDOR`: `spark.executor.resource.gpu.vendor`, default: `nvidia.com`.
- `SPARGLIM_K8S_GPU_DISCOVERY_SCRIPT`: `spark.executor.resource.gpu.discoveryScript`, default: `/opt/spark/examples/src/main/scripts/getGpusResources.sh`.
- `SPARGLIM_K8S_GPU_AMOUNT`: `spark.executor.resource.gpu.amount`, default: `None`.
- `SPARGLIM_RAPIDS_SQL_ENABLED`: `spark.rapids.sql.enabled`, default: `None`.
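
The `config_*()` methods return `self` and therefore chain. A sketch of a Kubernetes deployment combined with S3 access — the namespace and label values are placeholders, and the exact expansion of the `_LIST` variables is an assumption based on the note above:

```python
import os

os.environ["SPARGLIM_K8S_NAMESPACE"] = "spark"  # placeholder namespace
os.environ["SPARK_EXECUTOR_NUMS"] = "5"
# Assumed `,`-separated format per the *_LIST note above.
os.environ["SPARGLIM_K8S_EXECUTOR_LABEL_LIST"] = "sparglim-executor,batch"

from sparglim.config.configer import SparkEnvConfiger

configer = SparkEnvConfiger().config_k8s().config_s3()  # each call returns self
```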


# TIPS

S3 secrets, tokens (and other credentials) only need to be configured on the `Driver` or `Connect Server`; configuration on the `Connect Client` has no effect.
55 changes: 55 additions & 0 deletions generate_config_docs.py
@@ -0,0 +1,55 @@
#!/usr/bin/env python
import os
from pathlib import Path

from sparglim.config.configer import SparkEnvConfiger

_HERE = Path(os.path.abspath(__file__)).parent
TEMPLATE_PATH = _HERE / "config-template.md"
OUTPUT_PATH = _HERE / "config.md"


def _generate_env_config_docs(config: dict) -> str:
    # config maps each spark config key to an (env var(s), default) pair.
    docs = ""
    for spark_config, (env, default) in config.items():
        annotations = ""
        if isinstance(env, list):
            env = "` or `".join(env)
        if env.endswith("_LIST"):
            spark_config = spark_config.replace("list", "*")
            annotations = " A string separated by `,` will be converted into multiple entries."
        docs += f"- `{env}`: `{spark_config}`, default: `{default}`.{annotations}\n"
    return docs


def generate_docs(target_configer_cls) -> str:
    docs = ""

    source_code_path = target_configer_cls.__module__.replace(".", "/") + ".py"
    docs += f"Source code: {source_code_path}\n\n"

    docs += f"Available environment variables for {target_configer_cls.__name__}:\n\n"

    docs += "Default config:\n\n"
    docs += _generate_env_config_docs(target_configer_cls.default_config_mapper)

    # Every private dict attribute on the class (e.g. _s3) backs a config_*() method.
    items = target_configer_cls.__dict__.items()
    config_map = {k: v for k, v in items if k.startswith("_") and isinstance(v, dict)}
    for config_suffix, config in config_map.items():
        docs += f"\n`config{config_suffix}()` can configure the following:\n\n"
        docs += _generate_env_config_docs(config)

    return docs


def generate_from_template(docs: str):
    print(f"Generating docs... {TEMPLATE_PATH.as_posix()} -> {OUTPUT_PATH.as_posix()}")
    print(docs)
    template = TEMPLATE_PATH.read_text()
    template = template.format(docs=docs)
    OUTPUT_PATH.write_text(template)


if __name__ == "__main__":
    docs = generate_docs(SparkEnvConfiger)
    generate_from_template(docs)
22 changes: 16 additions & 6 deletions sparglim/config/configer.py
@@ -125,14 +125,13 @@ class SparkEnvConfiger:
        # Not sure when it will be supported
        # Welcome PR or discussion here!
    }
    default_config_mapper = {
        **_basic,
        **_s3,
        **_kerberos,
    }

    def __init__(self) -> None:
        self.default_config_mapper: ConfigEnvMapper = {
            **self._basic,
            **self._s3,
            **self._kerberos,
        }

        self._config: Config
        self.initialize()

@@ -191,6 +190,17 @@ def config_s3(self, custom_config: Optional[Dict[str, Any]] = None) -> SparkEnvC
        )
        return self

    def config_kerberos(self, custom_config: Optional[Dict[str, Any]] = None) -> SparkEnvConfiger:
        if not custom_config:
            custom_config = dict()
        self.config(
            {
                **self._config_from_env(self._kerberos),
                **custom_config,
            }
        )
        return self

    def config_local(self, custom_config: Optional[Dict[str, Any]] = None) -> SparkEnvConfiger:
        logger.info(f"Config master: local mode")
        if not custom_config:
