Metrics
This document applies to Crossplane v2.1 and not to the latest release v2.2.
Crossplane produces Prometheus-style metrics for monitoring and alerting in your environment. These metrics help you identify and resolve potential issues and maintain the health and performance of your resources. This page explains the metrics Crossplane gathers. Note that it covers Crossplane-specific metrics only, not the standard Go runtime metrics.
To enable the export of metrics, set the `--set metrics.enabled=true` option when installing the Crossplane Helm chart.
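For example, with a standard Helm installation (the release name, namespace, and repository alias below follow the common install instructions and may differ in your environment):

```shell
# Install or upgrade Crossplane with metrics enabled.
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update
helm upgrade --install crossplane crossplane-stable/crossplane \
  --namespace crossplane-system --create-namespace \
  --set metrics.enabled=true
```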
With metrics enabled, Prometheus annotations on the Crossplane pod expose the metrics.
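The exact annotations come from the Helm chart; a typical set, assuming the default metrics port 8080, looks like this (verify against your deployment):

```yaml
# Standard Prometheus scrape annotations on the Crossplane pod.
# Port and path are the chart defaults; confirm them in your cluster.
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
```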
Crossplane core metrics
The Crossplane pod emits these metrics.
| Metric Name | Description |
|---|---|
| | Total number of RunFunctionRequests sent |
| | Total number of RunFunctionResponses received |
| | Histogram of RunFunctionResponse latency (seconds) |
| | Total number of RunFunctionResponse cache hits |
| | Total number of RunFunctionResponse cache misses |
| | Total number of RunFunctionResponse cache errors |
| | Total number of RunFunctionResponse cache writes |
| | Total number of RunFunctionResponse cache deletes |
| | Total number of RunFunctionResponse bytes written to cache |
| | Total number of RunFunctionResponse bytes deleted from cache |
| | Histogram of cache read latency (seconds) |
| | Histogram of cache write latency (seconds) |
| | Total number of controllers started |
| | Total number of controllers stopped |
| | Total number of watches started |
| | Total number of watches stopped |
Circuit breaker metrics
The circuit breaker prevents reconciliation thrashing by monitoring and rate-limiting watch events per Composite Resource (XR). Crossplane core emits these metrics to help you identify and respond to excessive reconciliation activity.
| Metric Name | Description |
|---|---|
| `circuit_breaker_opens_total` | Number of times the XR circuit breaker transitioned from closed to open |
| `circuit_breaker_closes_total` | Number of times the XR circuit breaker transitioned from open to closed |
| `circuit_breaker_events_total` | Number of XR watch events handled by the circuit breaker, labeled by outcome |
All circuit breaker metrics include a `controller` label formatted as `composite/<plural>.<group>` (for example, `composite/xpostgresqlinstances.example.com`), providing visibility per XRD without creating high cardinality from individual XR instances.
circuit_breaker_opens_total
Tracks when a circuit breaker transitions from closed to open state. An increase indicates an XR is receiving excessive watch events and has triggered throttling.
Use this metric to:
- Alert on XRs experiencing reconciliation thrashing
- Identify which XRD types are prone to excessive watch events
- Track the frequency of circuit breaker activations
Example PromQL queries:
```promql
# Rate of circuit breaker opens over 5 minutes
rate(circuit_breaker_opens_total[5m])

# Count of circuit breaker opens by controller
sum by (controller) (circuit_breaker_opens_total)
```
circuit_breaker_closes_total
Tracks when a circuit breaker transitions from open to closed state. This indicates an XR has recovered from excessive watch events and returned to normal operation.
Use this metric to:
- Monitor recovery from reconciliation thrashing
- Verify circuit breakers are closing after cooldown periods
- Track circuit breaker lifecycle
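As a starting point, comparing opens against closes over the same window shows whether breakers are recovering; a persistently positive result suggests a breaker is stuck open:

```promql
# Circuit breaker opens not yet matched by a close in the last hour
sum by (controller) (increase(circuit_breaker_opens_total[1h]))
-
sum by (controller) (increase(circuit_breaker_closes_total[1h]))
```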
circuit_breaker_events_total
Tracks all watch events processed by the circuit breaker, labeled by result:
- `Allowed`: Normal operation when the circuit is closed; events proceed to reconciliation
- `Dropped`: Events blocked when the circuit is fully open; indicates active throttling
- `Halfopen_allowed`: Limited probe events when the circuit is half-open; the circuit is testing for recovery
Use this metric to:
- Track the volume of watch events per XR type
- Detect when the circuit drops events (active throttling)
- Alert on high dropped event rates indicating potential issues
- Understand reconciliation pressure on specific controllers
Example PromQL queries:
```promql
# Rate of dropped events (active throttling), aggregated per controller
sum by (controller) (
  rate(circuit_breaker_events_total{result="Dropped"}[5m])
)

# Percentage of events being dropped
sum by (controller) (rate(circuit_breaker_events_total{result="Dropped"}[5m]))
/
sum by (controller) (rate(circuit_breaker_events_total[5m])) * 100

# Number of replicas per controller currently dropping events
count by (controller) (
  rate(circuit_breaker_events_total{result="Dropped"}[5m]) > 0
)

# Estimated number of circuit breaker opens over 5 minutes
sum by (controller) (
  increase(circuit_breaker_opens_total[5m])
)

# Alert condition: controllers under high watch pressure (severe overload)
sum by (controller) (
  rate(circuit_breaker_events_total{result="Dropped"}[5m])
) > 1
```
Recommended alerts:
```yaml
# Alert when circuit breaker is consistently dropping events
- alert: CircuitBreakerDropRatioHigh
  expr: |
    (
      sum by (controller)(rate(circuit_breaker_events_total{result="Dropped"}[5m]))
      /
      sum by (controller)(rate(circuit_breaker_events_total[5m]))
    ) > 0.2
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High circuit breaker drop ratio for {{ $labels.controller }}"
    description: "More than 20% of events are being dropped by the circuit breaker for {{ $labels.controller }}, indicating sustained overload."

# Alert when circuit breaker opens frequently
- alert: CircuitBreakerFrequentOpens
  expr: |
    sum by (controller) (
      rate(circuit_breaker_opens_total[5m])
    ) * 3600 > 6
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Frequent circuit breaker opens for {{ $labels.controller }}"
    description: "Circuit breaker for {{ $labels.controller }} is opening more than 6 times per hour, indicating reconciliation thrashing."
```
For more information on the circuit breaker feature and configuration, see Troubleshooting - Circuit breaker.
Provider metrics
Crossplane providers emit these metrics. All providers built with crossplane-runtime emit the crossplane_managed_resource_* metrics.
Providers expose metrics on the metrics port (default 8080). To scrape these metrics, configure a PodMonitor or add Prometheus annotations to the provider’s DeploymentRuntimeConfig.
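A minimal sketch of the annotation approach, assuming the standard Prometheus scrape annotations and the default port; the resource name `scrape-metrics` is illustrative, and the field paths follow the `DeploymentRuntimeConfig` API:

```yaml
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: scrape-metrics
spec:
  deploymentTemplate:
    spec:
      template:
        metadata:
          annotations:
            # Assumes Prometheus discovers targets via pod annotations.
            prometheus.io/scrape: "true"
            prometheus.io/port: "8080"
            prometheus.io/path: "/metrics"
```

Reference this from the provider with `spec.runtimeConfigRef.name` so the provider's Deployment picks up the annotations.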
| Metric Name | Description |
|---|---|
| `crossplane_managed_resource_exists` | The number of managed resources that exist |
| `crossplane_managed_resource_ready` | The number of managed resources in a Ready=True state |
| `crossplane_managed_resource_synced` | The number of managed resources in a Synced=True state |
| `crossplane_managed_resource_deletion_seconds` | The time it took to delete a managed resource |
| `crossplane_managed_resource_first_time_to_readiness_seconds` | The time it took for a managed resource to become ready for the first time after creation |
| `crossplane_managed_resource_first_time_to_reconcile_seconds` | The time it took for the controller to first detect a managed resource |
| `crossplane_managed_resource_drift_seconds` | Time elapsed since the last successful reconcile when the controller detects an out-of-sync resource |
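A simple health check built on these gauges, assuming the crossplane-runtime metric names `crossplane_managed_resource_ready` and `crossplane_managed_resource_exists` (verify the names your provider exposes on its `/metrics` endpoint):

```promql
# Fraction of managed resources that are Ready
sum(crossplane_managed_resource_ready)
/
sum(crossplane_managed_resource_exists)
```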
Upjet provider metrics
These metrics are only emitted by Upjet-based providers (such as provider-upjet-aws, provider-upjet-azure, provider-upjet-gcp).
| Metric Name | Description |
|---|---|
| | Measures in seconds how long a Cloud SDK call takes to complete |
| | The number of external API calls to cloud providers, with labels describing the endpoints and resources |
| | Measures in seconds how long reconciles for a resource are delayed beyond the configured poll periods |
| `upjet_resource_ttr` | Measures in seconds the time-to-readiness (TTR) for managed resources |
| `upjet_terraform_cli_duration` | Measures in seconds how long a Terraform CLI invocation takes to complete |
| `upjet_terraform_active_cli_invocations` | The number of active (running) Terraform CLI invocations |
| `upjet_terraform_running_processes` | The number of running Terraform CLI and Terraform provider processes |
Controller-runtime and Kubernetes client metrics
These metrics come from the controller-runtime framework and Kubernetes client libraries. Both Crossplane and providers emit these metrics.
| Metric Name | Description |
|---|---|
| `certwatcher_read_certificate_errors_total` | Total number of certificate read errors |
| `certwatcher_read_certificate_total` | Total number of certificate reads |
| `controller_runtime_active_workers` | Number of workers (threads processing jobs from the work queue) per controller |
| `controller_runtime_max_concurrent_reconciles` | Maximum number of concurrent reconciles per controller |
| `controller_runtime_reconcile_errors_total` | Total number of reconciliation errors per controller. A sharp or continuous rise in this metric indicates a problem. |
| `controller_runtime_reconcile_time_seconds` | Histogram of time per reconciliation per controller |
| `controller_runtime_reconcile_total` | Total number of reconciliations per controller |
| `controller_runtime_webhook_latency_seconds` | Histogram of the latency of processing admission requests |
| `controller_runtime_webhook_requests_in_flight` | Current number of admission requests being served |
| `controller_runtime_webhook_requests_total` | Total number of admission requests by HTTP status code |
| `rest_client_requests_total` | Number of HTTP requests, partitioned by status code, method, and host |
| `workqueue_adds_total` | Total number of adds handled by the workqueue |
| `workqueue_depth` | Current depth of the workqueue |
| `workqueue_longest_running_processor_seconds` | How long the longest running workqueue processor has been running |
| `workqueue_queue_duration_seconds` | Histogram of time an item stays in the workqueue before processing starts |
| `workqueue_retries_total` | Total number of retries handled by the workqueue |
| `workqueue_unfinished_work_seconds` | Seconds of work in progress not yet observed by work_duration. Large values suggest stuck threads. |
| `workqueue_work_duration_seconds` | Histogram of time to process an item from the workqueue, from start to completion |
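A few starting-point queries over these metrics; the names are controller-runtime defaults, and the `100`-item depth threshold is an illustrative value to tune for your workload:

```promql
# Reconcile error rate per controller over 5 minutes
sum by (controller) (rate(controller_runtime_reconcile_errors_total[5m]))

# Workqueues that are backing up (threshold is illustrative)
workqueue_depth > 100

# 99th percentile time an item waits in the queue before processing
histogram_quantile(0.99,
  sum by (name, le) (rate(workqueue_queue_duration_seconds_bucket[5m])))
```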