--account <ACCOUNT> | Google Cloud Platform user account to use for invocation. Overrides the default *core/account* property value for this command invocation |
--autoscaling-policy <AUTOSCALING_POLICY> | ID of the autoscaling policy or fully qualified identifier for the autoscaling policy |
--billing-project <BILLING_PROJECT> | The Google Cloud Platform project that will be charged quota for operations performed in gcloud. If you need to operate on one project, but need quota against a different project, you can use this flag to specify the billing project. If both `billing/quota_project` and `--billing-project` are specified, `--billing-project` takes precedence. Run `$ gcloud config set --help` to see more information about `billing/quota_project` |
--bucket <BUCKET> | The Google Cloud Storage bucket to use by default to stage job
dependencies, miscellaneous config files, and job driver console output
when using this cluster |
--cluster-name <CLUSTER_NAME> | The name of the managed Dataproc cluster.
If unspecified, the workflow template ID will be used |
--configuration <CONFIGURATION> | The configuration to use for this command invocation. For more
information on how to use configurations, run:
`gcloud topic configurations`. You can also use the CLOUDSDK_ACTIVE_CONFIG_NAME environment
variable to set the equivalent of this flag for a terminal
session |
--enable-component-gateway | Enable access to the web UIs of selected components on the cluster
through the component gateway |
--enable-kerberos | Enable Kerberos on the cluster |
--flags-file <YAML_FILE> | A YAML or JSON file that specifies a *--flag*:*value* dictionary.
Useful for specifying complex flag values with special characters
that work with any command interpreter. Additionally, each
*--flags-file* arg is replaced by its constituent flags. See
`gcloud topic flags-file` for more information |
--flatten <KEY> | Flatten _name_[] output resource slices in _KEY_ into separate records
for each item in each slice. Multiple keys and slices may be specified.
This also flattens keys for *--format* and *--filter*. For example,
*--flatten=abc.def* flattens *abc.def[].ghi* references to
*abc.def.ghi*. A resource record containing *abc.def[]* with N elements
will expand to N records in the flattened output. This flag interacts
with other flags that are applied in this order: *--flatten*,
*--sort-by*, *--filter*, *--limit* |
--format <FORMAT> | Set the format for printing command output resources. The default is a
command-specific human-friendly output format. The supported formats
are: `config`, `csv`, `default`, `diff`, `disable`, `flattened`, `get`, `json`, `list`, `multi`, `none`, `object`, `table`, `text`, `value`, `yaml`. For more details, run `gcloud topic formats` |
--help | Display detailed help |
--image <IMAGE> | The custom image used to create the cluster. It can be the image name, the image URI, or the image family URI, which selects the latest image from the family |
--image-version <VERSION> | The image version to use for the cluster. Defaults to the latest version |
--impersonate-service-account <SERVICE_ACCOUNT_EMAIL> | For this gcloud invocation, all API requests will be made as the given service account instead of the currently selected account. This is done without needing to create, download, and activate a key for the account. In order to perform operations as the service account, your currently selected account must have an IAM role that includes the iam.serviceAccounts.getAccessToken permission for the service account. The roles/iam.serviceAccountTokenCreator role has this permission or you may create a custom role. Overrides the default *auth/impersonate_service_account* property value for this command invocation |
--initialization-action-timeout <TIMEOUT> | The maximum duration of each initialization action. See `gcloud topic datetimes` for information on duration formats |
--initialization-actions <CLOUD_STORAGE_URI> | A list of Google Cloud Storage URIs of executables to run on each node in the cluster |
--kerberos-config-file <KERBEROS_CONFIG_FILE> | Path to a YAML (or JSON) file containing the configuration for Kerberos on the
cluster. If you pass `-` as the value of the flag the file content will be read
from stdin.
+
The YAML file is formatted as follows:
+
```
# Optional. Flag to indicate whether to Kerberize the cluster.
# The default value is true.
enable_kerberos: true

# Required. The Google Cloud Storage URI of a KMS encrypted file
# containing the root principal password.
root_principal_password_uri: gs://bucket/password.encrypted

# Required. The URI of the KMS key used to encrypt various
# sensitive files.
kms_key_uri:
  projects/myproject/locations/global/keyRings/mykeyring/cryptoKeys/my-key

# Configuration of SSL encryption. If specified, all sub-fields
# are required. Otherwise, Dataproc will provide a self-signed
# certificate and generate the passwords.
ssl:
  # Optional. The Google Cloud Storage URI of the keystore file.
  keystore_uri: gs://bucket/keystore.jks

  # Optional. The Google Cloud Storage URI of a KMS encrypted
  # file containing the password to the keystore.
  keystore_password_uri: gs://bucket/keystore_password.encrypted

  # Optional. The Google Cloud Storage URI of a KMS encrypted
  # file containing the password to the user provided key.
  key_password_uri: gs://bucket/key_password.encrypted

  # Optional. The Google Cloud Storage URI of the truststore
  # file.
  truststore_uri: gs://bucket/truststore.jks

  # Optional. The Google Cloud Storage URI of a KMS encrypted
  # file containing the password to the user provided
  # truststore.
  truststore_password_uri:
    gs://bucket/truststore_password.encrypted

# Configuration of cross realm trust.
cross_realm_trust:
  # Optional. The remote realm the Dataproc on-cluster KDC will
  # trust, should the user enable cross realm trust.
  realm: REMOTE.REALM

  # Optional. The KDC (IP or hostname) for the remote trusted
  # realm in a cross realm trust relationship.
  kdc: kdc.remote.realm

  # Optional. The admin server (IP or hostname) for the remote
  # trusted realm in a cross realm trust relationship.
  admin_server: admin-server.remote.realm

  # Optional. The Google Cloud Storage URI of a KMS encrypted
  # file containing the shared password between the on-cluster
  # Kerberos realm and the remote trusted realm, in a cross
  # realm trust relationship.
  shared_password_uri:
    gs://bucket/cross-realm.password.encrypted

# Optional. The Google Cloud Storage URI of a KMS encrypted file
# containing the master key of the KDC database.
kdc_db_key_uri: gs://bucket/kdc_db_key.encrypted

# Optional. The lifetime of the ticket granting ticket, in
# hours. If not specified, or user specifies 0, then default
# value 10 will be used.
tgt_lifetime_hours: 1

# Optional. The name of the Kerberos realm. If not specified,
# the uppercased domain name of the cluster will be used.
realm: REALM.NAME
``` |
--kerberos-kms-key <KERBEROS_KMS_KEY> | ID of the key or fully qualified identifier for the key |
--kerberos-kms-key-keyring <KERBEROS_KMS_KEY_KEYRING> | The KMS keyring of the key |
--kerberos-kms-key-location <KERBEROS_KMS_KEY_LOCATION> | The Cloud location for the key |
--kerberos-kms-key-project <KERBEROS_KMS_KEY_PROJECT> | The Cloud project for the key |
--kerberos-root-principal-password-uri <KERBEROS_ROOT_PRINCIPAL_PASSWORD_URI> | Google Cloud Storage URI of a KMS encrypted file containing the root
principal password. Must be a Cloud Storage URL beginning with 'gs://' |
--labels <KEY=VALUE> | List of label KEY=VALUE pairs to add.
+
Keys must start with a lowercase character and contain only hyphens (`-`), underscores (`_`), lowercase characters, and numbers. Values must contain only hyphens (`-`), underscores (`_`), lowercase characters, and numbers |
--log-http | Log all HTTP server requests and responses to stderr. Overrides the default *core/log_http* property value for this command invocation |
--master-accelerator <type=TYPE,[count=COUNT]> | Attaches accelerators (e.g. GPUs) to the master
instance(s).
+
*type*: The specific type of accelerator to attach to the instances
(e.g. nvidia-tesla-k80 for NVIDIA Tesla K80).
Use `gcloud compute accelerator-types list` to learn about all available
accelerator types.
+
*count*: The number of accelerators to attach to each
of the instances. The default value is 1 |
--master-boot-disk-size <MASTER_BOOT_DISK_SIZE> | The size of the boot disk. The value must be a
whole number followed by a size unit of `KB` for kilobyte, `MB`
for megabyte, `GB` for gigabyte, or `TB` for terabyte. For example,
`10GB` will produce a 10 gigabyte disk. The minimum size a boot disk
can have is 10 GB. Disk size must be a multiple of 1 GB |
--master-boot-disk-type <MASTER_BOOT_DISK_TYPE> | The type of the boot disk. The value must be `pd-standard` or
`pd-ssd` |
--master-machine-type <MASTER_MACHINE_TYPE> | The type of machine to use for the master. Defaults to server-specified |
--master-min-cpu-platform <PLATFORM> | When specified, the VM will be scheduled on a host with the specified CPU
architecture or a newer one. To list the available CPU platforms in a given
zone, run:
+
$ gcloud compute zones describe ZONE
+
CPU platform selection is available only in selected zones; zones that
allow CPU platform selection will have an `availableCpuPlatforms` field
that contains the list of available CPU platforms for that zone.
+
You can find more information online:
https://cloud.google.com/compute/docs/instances/specify-min-cpu-platform |
--metadata <KEY=VALUE> | Metadata to be made available to the guest operating system running on the instances |
--network <NETWORK> | The Compute Engine network that the VM instances of the cluster will be
part of. This is mutually exclusive with --subnet. If neither is
specified, this defaults to the "default" network |
--no-address | If provided, the instances in the cluster will not be assigned external
IP addresses.
+
If omitted, the instances in the cluster will each be assigned an
ephemeral external IP address.
+
Note: Dataproc VMs need access to the Dataproc API. This can be achieved
without external IP addresses using Private Google Access
(https://cloud.google.com/compute/docs/private-google-access) |
--node-group <NODE_GROUP> | The name of the sole-tenant node group to create the cluster on. Can be
a short name ("node-group-name") or in the format
"projects/{project-id}/zones/{zone}/nodeGroups/{node-group-name}" |
--num-master-local-ssds <NUM_MASTER_LOCAL_SSDS> | The number of local SSDs to attach to the master in a cluster |
--num-masters <NUM_MASTERS> | The number of master nodes in the cluster.
+
Number of Masters | Cluster Mode
--- | ---
1 | Standard
3 | High Availability |
--num-secondary-worker-local-ssds <NUM_SECONDARY_WORKER_LOCAL_SSDS> | The number of local SSDs to attach to each secondary worker in
a cluster |
--num-secondary-workers <NUM_SECONDARY_WORKERS> | The number of secondary worker nodes in the cluster |
--num-worker-local-ssds <NUM_WORKER_LOCAL_SSDS> | The number of local SSDs to attach to each worker in a cluster |
--num-workers <NUM_WORKERS> | The number of worker nodes in the cluster. Defaults to server-specified |
--optional-components <COMPONENT> | List of optional components to be installed on cluster machines.
+
The following page documents the optional components that can be
installed:
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/optional-components |
--private-ipv6-google-access-type <PRIVATE_IPV6_GOOGLE_ACCESS_TYPE> | The private IPv6 Google access type for the cluster. _PRIVATE_IPV6_GOOGLE_ACCESS_TYPE_ must be one of: *inherit-subnetwork*, *outbound*, *bidirectional* |
--project <PROJECT_ID> | The Google Cloud Platform project ID to use for this invocation. If
omitted, then the current project is assumed; the current project can
be listed using `gcloud config list --format='text(core.project)'`
and can be set using `gcloud config set project PROJECTID`.
+
`--project` and its fallback `core/project` property play two roles
in the invocation. They specify the project of the resource to
operate on. They also specify the project for the API enablement check,
quota, and billing. To specify a different project for quota and
billing, use `--billing-project` or the `billing/quota_project` property |
--properties <PREFIX:PROPERTY=VALUE> | Specifies configuration properties for installed packages, such as Hadoop
and Spark.
+
Properties are mapped to configuration files by specifying a prefix, such as
"core:io.serializations". The following are supported prefixes and their
mappings:
+
Prefix | File | Purpose of file
--- | --- | ---
capacity-scheduler | capacity-scheduler.xml | Hadoop YARN Capacity Scheduler configuration
core | core-site.xml | Hadoop general configuration
distcp | distcp-default.xml | Hadoop Distributed Copy configuration
hadoop-env | hadoop-env.sh | Hadoop specific environment variables
hdfs | hdfs-site.xml | Hadoop HDFS configuration
hive | hive-site.xml | Hive configuration
mapred | mapred-site.xml | Hadoop MapReduce configuration
mapred-env | mapred-env.sh | Hadoop MapReduce specific environment variables
pig | pig.properties | Pig configuration
spark | spark-defaults.conf | Spark configuration
spark-env | spark-env.sh | Spark specific environment variables
yarn | yarn-site.xml | Hadoop YARN configuration
yarn-env | yarn-env.sh | Hadoop YARN specific environment variables
+
See https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties
for more information |
--quiet | Disable all interactive prompts when running gcloud commands. If input
is required, defaults will be used, or an error will be raised.
Overrides the default core/disable_prompts property value for this
command invocation. This is equivalent to setting the environment
variable `CLOUDSDK_CORE_DISABLE_PROMPTS` to 1 |
--region <REGION> | Dataproc region for the template. Each Dataproc region constitutes an independent resource namespace constrained to deploying instances into Compute Engine zones inside the region. Overrides the default `dataproc/region` property value for this command invocation |
--reservation <RESERVATION> | The name of the reservation, required when `--reservation-affinity=specific` |
--reservation-affinity <RESERVATION_AFFINITY> | The type of reservation for the instance. _RESERVATION_AFFINITY_ must be one of: *any*, *none*, *specific* |
--scopes <SCOPE> | Specifies scopes for the node instances. Multiple SCOPEs can be specified,
separated by commas.
Examples:
+
$ {command} example-cluster --scopes https://www.googleapis.com/auth/bigtable.admin
+
$ {command} example-cluster --scopes sqlservice,bigquery
+
The following *minimum scopes* are necessary for the cluster to function
properly and are always added, even if not explicitly specified:
+
https://www.googleapis.com/auth/devstorage.read_write
https://www.googleapis.com/auth/logging.write
+
If the `--scopes` flag is not specified, the following *default scopes*
are also included:
+
https://www.googleapis.com/auth/bigquery
https://www.googleapis.com/auth/bigtable.admin.table
https://www.googleapis.com/auth/bigtable.data
https://www.googleapis.com/auth/devstorage.full_control
+
To enable all scopes, use the `cloud-platform` scope.
+
SCOPE can be either the full URI of the scope or an alias. *default* scopes are
assigned to all instances. Available aliases are:
+
Alias | URI
--- | ---
bigquery | https://www.googleapis.com/auth/bigquery
cloud-platform | https://www.googleapis.com/auth/cloud-platform
cloud-source-repos | https://www.googleapis.com/auth/source.full_control
cloud-source-repos-ro | https://www.googleapis.com/auth/source.read_only
compute-ro | https://www.googleapis.com/auth/compute.readonly
compute-rw | https://www.googleapis.com/auth/compute
datastore | https://www.googleapis.com/auth/datastore
default | https://www.googleapis.com/auth/devstorage.read_only
| https://www.googleapis.com/auth/logging.write
| https://www.googleapis.com/auth/monitoring.write
| https://www.googleapis.com/auth/pubsub
| https://www.googleapis.com/auth/service.management.readonly
| https://www.googleapis.com/auth/servicecontrol
| https://www.googleapis.com/auth/trace.append
gke-default | https://www.googleapis.com/auth/devstorage.read_only
| https://www.googleapis.com/auth/logging.write
| https://www.googleapis.com/auth/monitoring
| https://www.googleapis.com/auth/service.management.readonly
| https://www.googleapis.com/auth/servicecontrol
| https://www.googleapis.com/auth/trace.append
logging-write | https://www.googleapis.com/auth/logging.write
monitoring | https://www.googleapis.com/auth/monitoring
monitoring-read | https://www.googleapis.com/auth/monitoring.read
monitoring-write | https://www.googleapis.com/auth/monitoring.write
pubsub | https://www.googleapis.com/auth/pubsub
service-control | https://www.googleapis.com/auth/servicecontrol
service-management | https://www.googleapis.com/auth/service.management.readonly
sql (deprecated) | https://www.googleapis.com/auth/sqlservice
sql-admin | https://www.googleapis.com/auth/sqlservice.admin
storage-full | https://www.googleapis.com/auth/devstorage.full_control
storage-ro | https://www.googleapis.com/auth/devstorage.read_only
storage-rw | https://www.googleapis.com/auth/devstorage.read_write
taskqueue | https://www.googleapis.com/auth/taskqueue
trace | https://www.googleapis.com/auth/trace.append
userinfo-email | https://www.googleapis.com/auth/userinfo.email
+
DEPRECATION WARNING: The https://www.googleapis.com/auth/sqlservice account scope
and the `sql` alias do not provide SQL instance management capabilities and have
been deprecated. Use https://www.googleapis.com/auth/sqlservice.admin
or `sql-admin` to manage your Google SQL Service instances |
--secondary-worker-accelerator <type=TYPE,[count=COUNT]> | Attaches accelerators (e.g. GPUs) to the secondary-worker
instance(s).
+
*type*: The specific type of accelerator to attach to the instances
(e.g. nvidia-tesla-k80 for NVIDIA Tesla K80).
Use `gcloud compute accelerator-types list` to learn about all available
accelerator types.
+
*count*: The number of accelerators to attach to each
of the instances. The default value is 1 |
--secondary-worker-boot-disk-size <SECONDARY_WORKER_BOOT_DISK_SIZE> | The size of the boot disk. The value must be a
whole number followed by a size unit of `KB` for kilobyte, `MB`
for megabyte, `GB` for gigabyte, or `TB` for terabyte. For example,
`10GB` will produce a 10 gigabyte disk. The minimum size a boot disk
can have is 10 GB. Disk size must be a multiple of 1 GB |
--secondary-worker-boot-disk-type <SECONDARY_WORKER_BOOT_DISK_TYPE> | The type of the boot disk. The value must be `pd-standard` or
`pd-ssd` |
--secondary-worker-type <TYPE> | The type of the secondary worker group. _TYPE_ must be one of: *preemptible*, *non-preemptible* |
--service-account <SERVICE_ACCOUNT> | The Google Cloud IAM service account to be authenticated as |
--single-node | Create a single node cluster.
+
A single node cluster has all master and worker components.
It cannot have any separate worker nodes. If this flag is not
specified, a cluster with separate workers is created |
--subnet <SUBNET> | Specifies the subnet that the cluster will be part of. This is mutually
exclusive with --network |
--tags <TAG> | Specifies a list of tags to apply to the instance. These tags allow
network firewall rules and routes to be applied to specified VM instances.
See gcloud_compute_firewall-rules_create(1) for more details.
+
To read more about configuring network tags, read this guide:
https://cloud.google.com/vpc/docs/add-remove-network-tags
+
To list instances with their respective status and tags, run:
+
$ gcloud compute instances list --format='table(name,status,tags.list())'
+
To list instances tagged with a specific tag, `tag1`, run:
+
$ gcloud compute instances list --filter='tags:tag1' |
--temp-bucket <TEMP_BUCKET> | The Google Cloud Storage bucket to use by default to store
ephemeral cluster and jobs data, such as Spark and MapReduce history files |
--trace-token <TRACE_TOKEN> | Token used to route traces of service requests for investigation of issues. Overrides the default *core/trace_token* property value for this command invocation |
--user-output-enabled | Print user intended output to the console. Overrides the default *core/user_output_enabled* property value for this command invocation. Use *--no-user-output-enabled* to disable |
--verbosity <VERBOSITY> | Override the default verbosity for this command. Overrides the default *core/verbosity* property value for this command invocation. _VERBOSITY_ must be one of: *debug*, *info*, *warning*, *error*, *critical*, *none* |
--worker-accelerator <type=TYPE,[count=COUNT]> | Attaches accelerators (e.g. GPUs) to the worker
instance(s).
+
*type*: The specific type of accelerator to attach to the instances
(e.g. nvidia-tesla-k80 for NVIDIA Tesla K80).
Use `gcloud compute accelerator-types list` to learn about all available
accelerator types.
+
*count*: The number of accelerators to attach to each
of the instances. The default value is 1 |
--worker-boot-disk-size <WORKER_BOOT_DISK_SIZE> | The size of the boot disk. The value must be a
whole number followed by a size unit of `KB` for kilobyte, `MB`
for megabyte, `GB` for gigabyte, or `TB` for terabyte. For example,
`10GB` will produce a 10 gigabyte disk. The minimum size a boot disk
can have is 10 GB. Disk size must be a multiple of 1 GB |
--worker-boot-disk-type <WORKER_BOOT_DISK_TYPE> | The type of the boot disk. The value must be `pd-standard` or
`pd-ssd` |
--worker-machine-type <WORKER_MACHINE_TYPE> | The type of machine to use for workers. Defaults to server-specified |
--worker-min-cpu-platform <PLATFORM> | When specified, the VM will be scheduled on a host with the specified CPU
architecture or a newer one. To list the available CPU platforms in a given
zone, run:
+
$ gcloud compute zones describe ZONE
+
CPU platform selection is available only in selected zones; zones that
allow CPU platform selection will have an `availableCpuPlatforms` field
that contains the list of available CPU platforms for that zone.
+
You can find more information online:
https://cloud.google.com/compute/docs/instances/specify-min-cpu-platform |
--zone <ZONE> | The compute zone (e.g. us-central1-a) for the cluster. If empty
and --region is set to a value other than `global`, the server will
pick a zone in the region. Overrides the default *compute/zone* property value for this command invocation |
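
The following is a minimal dry-run sketch of how several of the flags above might be combined. It only prints a command line and never invokes gcloud. The command name `gcloud dataproc workflow-templates set-managed-cluster` and the template ID `my-template` are assumptions, since this reference does not name the command, and the region, cluster name, labels, and property values are placeholders. The two helper functions are hypothetical: `valid_label_key` is an ASCII-only approximation of the `--labels` key rule, and `valid_disk_size` checks only the "whole number plus unit" shape, not the 10 GB minimum.

```shell
#!/bin/sh
# Dry-run sketch: validate a few flag values, then print (not run) a command.
# Assumed, not stated in this reference: the command name and "my-template".

# Succeeds if $1 starts with a lowercase letter and then contains only
# lowercase letters, digits, hyphens, or underscores (ASCII approximation).
valid_label_key() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z][a-z0-9_-]*$'
}

# Succeeds if $1 is a whole number followed by KB, MB, GB, or TB.
# Does not enforce the 10 GB minimum or the 1 GB multiple rule.
valid_disk_size() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[0-9]+(KB|MB|GB|TB)$'
}

# Prints the assembled command on a single line.
build_command() {
  printf '%s' "gcloud dataproc workflow-templates set-managed-cluster my-template"
  printf ' %s' \
    "--region=us-central1" \
    "--cluster-name=example-cluster" \
    "--num-workers=2" \
    "--master-boot-disk-size=100GB" \
    "--labels=env=dev,team=data" \
    "--properties=spark:spark.executor.memory=4g"
  printf '\n'
}

# Check the placeholder values before printing the command.
for key in env team; do
  valid_label_key "$key" || echo "invalid label key: $key" >&2
done
valid_disk_size "100GB" || echo "invalid disk size: 100GB" >&2

build_command
```

Printing the command first makes it easy to review the flag values before running anything against a real project; the `--properties` value follows the `PREFIX:PROPERTY=VALUE` form described above, with the `spark` prefix mapping to `spark-defaults.conf`.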