Config

kubr.config.runner.RunnerConfig

Bases: BaseModel

RunnerConfig is the configuration for the runner.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| experiment | ExperimentConfig | Experiment configuration. | required |
| init_container | Optional[ContainerConfig] | Init container configuration. | None |
| container | ContainerConfig | Container configuration. | required |
| resources | ResourceConfig | Resource configuration. | required |
| type | JobType | Job type. | JobType.torchrun |
| backend | JobBackend | Job backend. | JobBackend.Volcano |
| code | Optional[CodePersistenceConfig] | Code persistence configuration. | None |
| data | Optional[DataConfig] | Data configuration. | None |

Source code in kubr/config/runner.py
class RunnerConfig(BaseModel):
    """RunnerConfig is the configuration for the runner.

    Args:
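        experiment (ExperimentConfig): Experiment configuration.
        init_container (Optional[ContainerConfig], optional): Init container configuration. Defaults to None.
        container (ContainerConfig): Container configuration.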
        resources (ResourceConfig): Resource configuration.
        type (JobType, optional): Job type. Defaults to JobType.torchrun.
        backend (JobBackend, optional): Job backend. Defaults to JobBackend.Volcano.
        code (Optional[CodePersistenceConfig], optional): Code persistence configuration. Defaults to None.
        data (Optional[DataConfig], optional): Data configuration. Defaults to None.

    """

    experiment: ExperimentConfig
    init_container: Optional[ContainerConfig] = None
    container: ContainerConfig
    resources: ResourceConfig
    type: JobType = JobType.torchrun
    backend: JobBackend = JobBackend.Volcano
    code: Optional[CodePersistenceConfig] = None
    data: Optional[DataConfig] = None
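
As an illustration of how the nested models compose, the sketch below builds a `RunnerConfig` from a plain dictionary such as one loaded from a YAML or JSON file. It is a minimal sketch, not the library's own loading code: the classes are local stand-ins that copy only the fields shown on this page so the snippet runs without kubr installed, the `type`, `backend`, `code`, `data`, `env` and `secrets` fields are omitted because their types are not shown here, and every value in `raw` is hypothetical.

```python
from typing import Literal, Optional, Union

import pydantic
from pydantic import BaseModel


# Local stand-ins mirroring the field definitions shown on this page.
# With kubr installed, the real classes live in kubr.config.runner.
class ExperimentConfig(BaseModel):
    name: str
    namespace: str
    queue: Optional[str] = "default"
    job_retries: int = 0
    worker_max_retries: int = 0


class ContainerConfig(BaseModel):
    # `env` and `secrets` are omitted: EnvVar and SecretMount are not shown here.
    image: str
    entrypoint: Optional[str] = None


class ResourceConfig(BaseModel):
    nodes: int = pydantic.Field(gt=0, default=1)
    cpu: int = 0
    memory: float = 0
    gpu: int = 0
    ib: Union[int, Literal["auto"]] = 0
    ib_device: str = "nvidia.com/hostdev"


class RunnerConfig(BaseModel):
    # `type`, `backend`, `code` and `data` are omitted in this stand-in.
    experiment: ExperimentConfig
    init_container: Optional[ContainerConfig] = None
    container: ContainerConfig
    resources: ResourceConfig


# Hypothetical values standing in for a parsed config file.
raw = {
    "experiment": {"name": "demo", "namespace": "ml"},
    "container": {"image": "example.com/train:latest", "entrypoint": "python train.py"},
    "resources": {"nodes": 2, "gpu": 8},
}

# Nested dictionaries are validated into the nested model classes.
config = RunnerConfig(**raw)
print(type(config.experiment).__name__)
print(config.experiment.queue, config.resources.nodes, config.init_container)
```

Only `experiment`, `container` and `resources` have to be supplied; the remaining fields fall back to their declared defaults.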

kubr.config.runner.ExperimentConfig

Bases: BaseModel

ExperimentConfig is the configuration for the experiment.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| name | str | Name of the experiment. | required |
| namespace | str | Namespace of the experiment. | required |
| queue | Optional[str] | Queue to submit the experiment to. | "default" |
| job_retries | int | Number of retries for the job. | 0 |
| worker_max_retries | int | Maximum number of retries for the task. | 0 |

Source code in kubr/config/runner.py
class ExperimentConfig(BaseModel):
    """ExperimentConfig is the configuration for the experiment.

    Args:
        name (str): Name of the experiment.
        namespace (str): Namespace of the experiment.
        queue (Optional[str], optional): Queue to submit the experiment to. Defaults to "default".
        job_retries (int, optional): Number of retries for the job. Defaults to 0.
        worker_max_retries (int, optional): Maximum number of retries for the task. Defaults to 0.
    """

    name: str
    namespace: str

    # args: List[str] = field(default_factory=list)
    # env: Dict[str, str] = field(default_factory=dict)

    queue: Optional[str] = "default"
    # priority_class: Optional[str] = None

    # TODO add tests for retries
    job_retries: int = 0
    worker_max_retries: int = 0
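
To make the split between required and defaulted fields concrete, here is a small sketch. It assumes a local copy of the class (field definitions taken from the source above, comments dropped) so that it runs without kubr installed; the experiment name and namespace are hypothetical.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


# Local stand-in with the same field definitions as the source above.
class ExperimentConfig(BaseModel):
    name: str
    namespace: str
    queue: Optional[str] = "default"
    job_retries: int = 0
    worker_max_retries: int = 0


# Only `name` and `namespace` are required; the rest use the field defaults.
exp = ExperimentConfig(name="demo", namespace="ml")
print(exp.queue, exp.job_retries, exp.worker_max_retries)

# Leaving out a required field is rejected at construction time.
try:
    ExperimentConfig(name="demo")
    missing_rejected = False
except ValidationError:
    missing_rejected = True
print(missing_rejected)
```

Note that the field definition gives `worker_max_retries` a default of 0, so retries have to be enabled explicitly.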

kubr.config.runner.ResourceConfig

Bases: BaseModel

ResourceConfig is the configuration for the resources.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| nodes | int | Number of replicas to run. Must be greater than 0. | 1 |
| cpu | int | Number of CPUs to request. | 0 |
| memory | float | Memory in GB to request. | 0 |
| gpu | int | Number of GPUs to request. | 0 |
| ib | Union[int, Literal['auto']] | Number of InfiniBand devices to request. | 0 |
| ib_device | str | Name of the InfiniBand device to request. | "nvidia.com/hostdev" |

Source code in kubr/config/runner.py
class ResourceConfig(BaseModel):
    """ResourceConfig is the configuration for the resources.

    Args:
        nodes (int, optional): Number of replicas to run. Defaults to 1.
        cpu (int, optional): Number of CPUs to request. Defaults to 0.
        memory (float, optional): Memory in GB to request. Defaults to 0.
        gpu (int, optional): Number of GPUs to request. Defaults to 0.
        ib (Union[int, Literal['auto']], optional): Number of InfiniBand devices to request. Defaults to 0.
        ib_device (str, optional): Name of the InfiniBand device to request. Defaults to "nvidia.com/hostdev".
    """

    # TODO [config][resources] add taints\tolerations\affinity
    nodes: int = pydantic.Field(gt=0, type=int, default=1)
    cpu: int = 0
    memory: float = 0
    gpu: int = 0
    # devices: Dict[str, float] = field(default_factory=dict)
    # capabilities: Dict[str, str] = field(default_factory=dict)
    ib: Union[int, Literal["auto"]] = 0
    ib_device: str = "nvidia.com/hostdev"
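
The two constrained fields, `nodes` (greater than 0) and `ib` (an integer or the literal string "auto"), can be exercised with a short sketch. It assumes a local copy of the class so that it runs without kubr installed; the `type=int` argument from the original `Field` call is left out of the stand-in to keep it minimal, and `is_valid` is a hypothetical helper written for this example.

```python
from typing import Literal, Union

import pydantic
from pydantic import BaseModel, ValidationError


# Local stand-in mirroring the field definitions above.
class ResourceConfig(BaseModel):
    nodes: int = pydantic.Field(gt=0, default=1)
    cpu: int = 0
    memory: float = 0
    gpu: int = 0
    ib: Union[int, Literal["auto"]] = 0
    ib_device: str = "nvidia.com/hostdev"


def is_valid(**kwargs) -> bool:
    """Hypothetical helper: report whether the arguments pass validation."""
    try:
        ResourceConfig(**kwargs)
        return True
    except ValidationError:
        return False


# An integer count or the literal string "auto" is accepted for `ib`.
res = ResourceConfig(nodes=2, gpu=8, ib="auto")
print(res.nodes, res.gpu, res.ib)

# `nodes` must be greater than 0, and `ib` accepts no string other than "auto".
print(is_valid(nodes=0))
print(is_valid(ib="all"))
```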

kubr.config.runner.DataConfig

Bases: BaseModel

DataConfig is the configuration for the data.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| volumes | Optional[List[VolumeMount]] | List of volumes to mount. | [] |

Source code in kubr/config/runner.py
class DataConfig(BaseModel):
    """DataConfig is the configuration for the data.

    Args:
        volumes (Optional[List[VolumeMount]], optional): List of volumes to mount. Defaults to [].
    """

    # pvcs: Optional[List[str]] = None
    volumes: Optional[List[VolumeMount]] = []
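
A literal `[]` default would be shared between instances in a plain Python class, but pydantic copies mutable field defaults for each model instance. The sketch below checks that behaviour; it assumes local copies of `DataConfig` and `VolumeMount` (both taken from the definitions on this page) so that it runs without kubr installed, and the volume name and path are hypothetical.

```python
from typing import List, Literal, Optional

from pydantic import BaseModel


# Local stand-ins mirroring the definitions on this page.
class VolumeMount(BaseModel):
    name: str
    type: Literal["hostPath"]
    mount_path: str


class DataConfig(BaseModel):
    volumes: Optional[List[VolumeMount]] = []


# Appending to one instance's list does not leak into a later instance.
first = DataConfig()
first.volumes.append(
    VolumeMount(name="datasets", type="hostPath", mount_path="/data")
)
second = DataConfig()
print(len(first.volumes), len(second.volumes))
```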

kubr.config.runner.ContainerConfig

Bases: BaseModel

ContainerConfig is the configuration for the container.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| image | str | Image to run. | required |
| entrypoint | Optional[str] | Entrypoint to run. | None |
| env | List[EnvVar] | Environment variables to pass to the entrypoint. | [] |
| secrets | List[SecretMount] | Secrets to pass to the entrypoint. | [] |

Source code in kubr/config/runner.py
class ContainerConfig(BaseModel):
    """ContainerConfig is the configuration for the container.

    Args:
        image (str): Image to run.
        entrypoint (Optional[str], optional): Entrypoint to run. Defaults to None.
        env (List[EnvVar], optional): Environment variables to pass to the entrypoint. Defaults to [].
        secrets (List[SecretMount], optional): Secrets to pass to the entrypoint. Defaults to [].
    """

    image: str
    entrypoint: Optional[str] = None
    env: List[EnvVar] = []
    secrets: List[SecretMount] = []

kubr.config.runner.VolumeMount

Bases: BaseModel

VolumeMount is the configuration for a volume mount.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| name | str | Name of the volume. | required |
| type | Literal['hostPath'] | Type of the volume. | required |
| mount_path | str | Mount path of the volume. | required |

Source code in kubr/config/runner.py
class VolumeMount(BaseModel):
    """VolumeMount is the configuration for a volume mount.

    Args:
        name (str): Name of the volume.
        type (Literal["hostPath"]): Type of the volume.
        mount_path (str): Mount path of the volume.
    """

    name: str
    type: Literal["hostPath"]
    mount_path: str
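
Because `type` is annotated as `Literal["hostPath"]`, any other volume type should fail validation. The following sketch illustrates this with a local copy of the class so that it runs without kubr installed; the volume names, paths and the "emptyDir" value are hypothetical inputs chosen for the example.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


# Local stand-in mirroring the definition above.
class VolumeMount(BaseModel):
    name: str
    type: Literal["hostPath"]
    mount_path: str


# "hostPath" is the only value the `type` field accepts.
mount = VolumeMount(name="datasets", type="hostPath", mount_path="/data")
print(mount.type, mount.mount_path)

# Any other string is rejected by the Literal annotation.
try:
    VolumeMount(name="scratch", type="emptyDir", mount_path="/scratch")
    other_type_accepted = True
except ValidationError:
    other_type_accepted = False
print(other_type_accepted)
```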