gcloud ml-engine jobs submit training <JOB> <USER_ARGS>

Submit an AI Platform training job

Arguments

JOB
    Name of the job.

USER_ARGS
    Additional user arguments to be forwarded to user code. The '--' argument must be specified between gcloud-specific args on the left and USER_ARGS on the right.
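For example, a submission separating gcloud flags from user arguments might look like the following sketch (the job name, trainer module, bucket, and user flags are all illustrative placeholders, not values from this reference):

```shell
# Hypothetical values throughout: my_job, trainer.task, trainer/, gs://my-bucket.
# Everything after the bare '--' is passed through unchanged to the user code.
gcloud ml-engine jobs submit training my_job \
  --module-name trainer.task \
  --package-path trainer/ \
  --staging-bucket gs://my-bucket \
  --region us-east1 \
  -- \
  --learning-rate 0.01 \
  --epochs 10
```

Here `--region us-east1` is consumed by gcloud, while `--learning-rate` and `--epochs` reach the training module untouched.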

Options

--account <ACCOUNT>
    Google Cloud Platform user account to use for invocation. Overrides the default *core/account* property value for this command invocation.

--async
    (DEPRECATED) Display information about the operation in progress, without waiting for the operation to complete. This behavior is enabled by default, so the flag can be omitted; use `--stream-logs` to run synchronously instead.

--billing-project <BILLING_PROJECT>
    The Google Cloud Platform project that will be charged quota for operations performed in gcloud. If you need to operate on one project, but need quota against a different project, you can use this flag to specify the billing project. If both `billing/quota_project` and `--billing-project` are specified, `--billing-project` takes precedence. Run `$ gcloud config set --help` to see more information about `billing/quota_project`.

--config <CONFIG>
    Path to the job configuration file. This file should be a YAML document (JSON also accepted) containing a Job resource as defined in the API (all fields are optional): https://cloud.google.com/ml/reference/rest/v1/projects.jobs

    EXAMPLES:

    JSON:

      {
        "jobId": "my_job",
        "labels": {
          "type": "prod",
          "owner": "alice"
        },
        "trainingInput": {
          "scaleTier": "BASIC",
          "packageUris": [
            "gs://my/package/path"
          ],
          "region": "us-east1"
        }
      }

    YAML:

      jobId: my_job
      labels:
        type: prod
        owner: alice
      trainingInput:
        scaleTier: BASIC
        packageUris:
        - gs://my/package/path
        region: us-east1

    If an option is specified both in the configuration file and via command-line arguments, the command-line arguments override the configuration file.

--configuration <CONFIGURATION>
    The configuration to use for this command invocation. For more information on how to use configurations, run: `gcloud topic configurations`. You can also use the CLOUDSDK_ACTIVE_CONFIG_NAME environment variable to set the equivalent of this flag for a terminal session.

--flags-file <YAML_FILE>
    A YAML or JSON file that specifies a *--flag*:*value* dictionary. Useful for specifying complex flag values with special characters that work with any command interpreter. Additionally, each *--flags-file* arg is replaced by its constituent flags. See $ gcloud topic flags-file for more information.

--flatten <KEY>
    Flatten _name_[] output resource slices in _KEY_ into separate records for each item in each slice. Multiple keys and slices may be specified. This also flattens keys for *--format* and *--filter*. For example, *--flatten=abc.def* flattens *abc.def[].ghi* references to *abc.def.ghi*. A resource record containing *abc.def[]* with N elements will expand to N records in the flattened output. This flag interacts with other flags that are applied in this order: *--flatten*, *--sort-by*, *--filter*, *--limit*.

--format <FORMAT>
    Set the format for printing command output resources. The default is a command-specific human-friendly output format. The supported formats are: `config`, `csv`, `default`, `diff`, `disable`, `flattened`, `get`, `json`, `list`, `multi`, `none`, `object`, `table`, `text`, `value`, `yaml`. For more details run $ gcloud topic formats.

--help
    Display detailed help.

--impersonate-service-account <SERVICE_ACCOUNT_EMAIL>
    For this gcloud invocation, all API requests will be made as the given service account instead of the currently selected account. This is done without needing to create, download, and activate a key for the account. In order to perform operations as the service account, your currently selected account must have an IAM role that includes the iam.serviceAccounts.getAccessToken permission for the service account. The roles/iam.serviceAccountTokenCreator role has this permission, or you may create a custom role. Overrides the default *auth/impersonate_service_account* property value for this command invocation.

--job-dir <JOB_DIR>
    Cloud Storage path in which to store training outputs and other data needed for training.

    This path will be passed to your TensorFlow program as the `--job-dir` command-line arg. The benefit of specifying this field is that AI Platform will validate the path for use in training. However, note that your training program will need to parse the provided `--job-dir` argument.

    If packages must be uploaded and `--staging-bucket` is not provided, this path will be used instead.

--kms-key <KMS_KEY>
    ID of the key or fully qualified identifier for the key.

--kms-keyring <KMS_KEYRING>
    The KMS keyring of the key.

--kms-location <KMS_LOCATION>
    The Cloud location for the key.

--kms-project <KMS_PROJECT>
    The Cloud project for the key.

--labels <KEY=VALUE>
    List of label KEY=VALUE pairs to add.

    Keys must start with a lowercase character and contain only hyphens (`-`), underscores (`_`), lowercase characters, and numbers. Values must contain only hyphens (`-`), underscores (`_`), lowercase characters, and numbers.

--log-http
    Log all HTTP server requests and responses to stderr. Overrides the default *core/log_http* property value for this command invocation.

--master-accelerator <MASTER_ACCELERATOR>
    Hardware accelerator config for the master worker. Must specify both the accelerator type (TYPE) for each server and the number of accelerators to attach to each server (COUNT).

    type: Type of the accelerator. Choices are nvidia-tesla-a100, nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-p4, nvidia-tesla-t4, nvidia-tesla-v100, tpu-v2, tpu-v2-pod, tpu-v3, tpu-v3-pod.

    count: Number of accelerators to attach to each machine running the job. Must be greater than 0.

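As a sketch, attaching GPUs to the master worker of a CUSTOM-tier job might look like this (the job name, machine type, accelerator type, and count are illustrative assumptions, not requirements):

```shell
# Illustrative: request two V100 GPUs on the master worker.
# CUSTOM tier requires --master-machine-type; the rest mirrors a normal submission.
gcloud ml-engine jobs submit training my_gpu_job \
  --scale-tier CUSTOM \
  --master-machine-type n1-standard-8 \
  --master-accelerator type=nvidia-tesla-v100,count=2 \
  --module-name trainer.task \
  --package-path trainer/ \
  --staging-bucket gs://my-bucket \
  --region us-east1
```

The same `type=...,count=...` syntax applies to `--worker-accelerator` and `--parameter-server-accelerator`.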
--master-image-uri <MASTER_IMAGE_URI>
    Docker image to run on each master worker. This image must be in Container Registry. Exactly one of `--master-image-uri` and `--runtime-version` must be specified.

--master-machine-type <MASTER_MACHINE_TYPE>
    Specifies the type of virtual machine to use for the training job's master worker.

    You must set this value when `--scale-tier` is set to `CUSTOM`.

--module-name <MODULE_NAME>
    Name of the module to run.

--package-path <PACKAGE_PATH>
    Path to a Python package to build. This should point to a *local* directory containing the Python source for the job. It will be built using *setuptools* (which must be installed) using its *parent* directory as context. If the parent directory contains a `setup.py` file, the build will use that; otherwise, it will use a simple built-in one.

--packages <PACKAGE>
    Path to Python archives used for training. These can be local paths (absolute or relative), in which case they will be uploaded to the Cloud Storage bucket given by `--staging-bucket`, or Cloud Storage URLs ('gs://bucket-name/path/to/package.tar.gz').

--parameter-server-accelerator <PARAMETER_SERVER_ACCELERATOR>
    Hardware accelerator config for the parameter servers. Must specify both the accelerator type (TYPE) for each server and the number of accelerators to attach to each server (COUNT).

    type: Type of the accelerator. Choices are nvidia-tesla-a100, nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-p4, nvidia-tesla-t4, nvidia-tesla-v100, tpu-v2, tpu-v2-pod, tpu-v3, tpu-v3-pod.

    count: Number of accelerators to attach to each machine running the job. Must be greater than 0.

--parameter-server-count <PARAMETER_SERVER_COUNT>
    Number of parameter servers to use for the training job.

--parameter-server-image-uri <PARAMETER_SERVER_IMAGE_URI>
    Docker image to run on each parameter server. This image must be in Container Registry. If not specified, the value of `--master-image-uri` is used.

--parameter-server-machine-type <PARAMETER_SERVER_MACHINE_TYPE>
    Type of virtual machine to use for the training job's parameter servers. This flag must be specified if any of the other arguments in this group are specified.

--project <PROJECT_ID>
    The Google Cloud Platform project ID to use for this invocation. If omitted, then the current project is assumed; the current project can be listed using `gcloud config list --format='text(core.project)'` and can be set using `gcloud config set project PROJECTID`.

    `--project` and its fallback `core/project` property play two roles in the invocation. It specifies the project of the resource to operate on. It also specifies the project for API enablement check, quota, and billing. To specify a different project for quota and billing, use `--billing-project` or the `billing/quota_project` property.

--python-version <PYTHON_VERSION>
    Version of Python used during training. Choices are 3.7, 3.5, and 2.7. This value must be compatible with the chosen runtime version for the job:

    * 3.7 is compatible with runtime versions 1.15 and later.
    * 3.5 is compatible with runtime versions 1.4 through 1.14.
    * 2.7 is compatible with runtime versions 1.15 and earlier.

--quiet
    Disable all interactive prompts when running gcloud commands. If input is required, defaults will be used, or an error will be raised. Overrides the default core/disable_prompts property value for this command invocation. This is equivalent to setting the environment variable `CLOUDSDK_CORE_DISABLE_PROMPTS` to 1.

--region <REGION>
    Region of the machine learning training job to submit. If not specified, you may be prompted to select a region.

    To avoid prompting when this flag is omitted, you can set the compute/region property:

      $ gcloud config set compute/region REGION

    A list of regions can be fetched by running:

      $ gcloud compute regions list

    To unset the property, run:

      $ gcloud config unset compute/region

    Alternatively, the region can be stored in the environment variable CLOUDSDK_COMPUTE_REGION.

--runtime-version <RUNTIME_VERSION>
    AI Platform runtime version for this job. Must be specified unless `--master-image-uri` is specified instead. It is defined in the documentation along with the list of supported versions: https://cloud.google.com/ai-platform/prediction/docs/runtime-version-list

--scale-tier <SCALE_TIER>
    Specify the machine types, the number of replicas for workers, and parameter servers. _SCALE_TIER_ must be one of:

    basic: Single worker instance. This tier is suitable for learning how to use AI Platform, and for experimenting with new models using small datasets.

    basic-gpu: Single worker instance with a GPU.

    basic-tpu: Single worker instance with a Cloud TPU.

    custom: The CUSTOM tier is not a set tier, but rather enables you to use your own cluster specification. When you use this tier, set values to configure your processing cluster according to these guidelines (using the `--config` flag):

    * You _must_ set `TrainingInput.masterType` to specify the type of machine to use for your master node. This is the only required setting.
    * You _may_ set `TrainingInput.workerCount` to specify the number of workers to use. If you specify one or more workers, you _must_ also set `TrainingInput.workerType` to specify the type of machine to use for your worker nodes.
    * You _may_ set `TrainingInput.parameterServerCount` to specify the number of parameter servers to use. If you specify one or more parameter servers, you _must_ also set `TrainingInput.parameterServerType` to specify the type of machine to use for your parameter servers.

    Note that all of your workers must use the same machine type, which can be different from your parameter server type and master type. Your parameter servers must likewise use the same machine type, which can be different from your worker type and master type.

    premium-1: Large number of workers with many parameter servers.

    standard-1: Many workers and a few parameter servers.

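The CUSTOM-tier guidelines above can be sketched as a `--config` file; the machine types and replica counts below are illustrative assumptions, not recommended values:

```shell
# Illustrative CUSTOM-tier cluster spec written to a local config file,
# then passed via --config (fields correspond to TrainingInput in the API).
cat > config.yaml <<'EOF'
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highmem-8
  workerType: n1-highmem-4
  workerCount: 4
  parameterServerType: n1-standard-4
  parameterServerCount: 2
EOF

gcloud ml-engine jobs submit training my_custom_job \
  --config config.yaml \
  --module-name trainer.task \
  --package-path trainer/ \
  --staging-bucket gs://my-bucket \
  --region us-east1
```

Note that `masterType` is the only required field for CUSTOM; `workerType` and `parameterServerType` become required only because the corresponding counts are nonzero.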
--service-account <SERVICE_ACCOUNT>
    The email address of a service account to use when running the training application. You must have the `iam.serviceAccounts.actAs` permission for the specified service account. In addition, the AI Platform Training Google-managed service account must have the `roles/iam.serviceAccountAdmin` role for the specified service account. [Learn more about configuring a service account.](/ai-platform/training/docs/custom-service-account) If not specified, the AI Platform Training Google-managed service account is used by default.

--staging-bucket <STAGING_BUCKET>
    Bucket in which to stage training archives.

    Required only if a file upload is necessary (that is, other flags include local paths) and no other flags implicitly specify an upload path.

--stream-logs
    Block until job completion and stream the logs while the job runs.

    Note that even if command execution is halted, the job will still run until cancelled with:

      $ gcloud ai-platform jobs cancel JOB_ID

--trace-token <TRACE_TOKEN>
    Token used to route traces of service requests for investigation of issues. Overrides the default *core/trace_token* property value for this command invocation.

--use-chief-in-tf-config <USE_CHIEF_IN_TF_CONFIG>
    Use the "chief" role in the cluster instead of "master". This is required for TensorFlow 2.0 and newer versions. Unlike the "master" node, the "chief" node does not run evaluation.

--user-output-enabled
    Print user-intended output to the console. Overrides the default *core/user_output_enabled* property value for this command invocation. Use *--no-user-output-enabled* to disable.

--verbosity <VERBOSITY>
    Override the default verbosity for this command. Overrides the default *core/verbosity* property value for this command invocation. _VERBOSITY_ must be one of: *debug*, *info*, *warning*, *error*, *critical*, *none*.

--worker-accelerator <WORKER_ACCELERATOR>
    Hardware accelerator config for the worker nodes. Must specify both the accelerator type (TYPE) for each server and the number of accelerators to attach to each server (COUNT).

    type: Type of the accelerator. Choices are nvidia-tesla-a100, nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-p4, nvidia-tesla-t4, nvidia-tesla-v100, tpu-v2, tpu-v2-pod, tpu-v3, tpu-v3-pod.

    count: Number of accelerators to attach to each machine running the job. Must be greater than 0.

--worker-count <WORKER_COUNT>
    Number of worker nodes to use for the training job.

--worker-image-uri <WORKER_IMAGE_URI>
    Docker image to run on each worker node. This image must be in Container Registry. If not specified, the value of `--master-image-uri` is used.

--worker-machine-type <WORKER_MACHINE_TYPE>
    Type of virtual machine to use for the training job's worker nodes.