Prometheus Metric Types: A Quick Reference

This guide explains the four core Prometheus metric types, including common "patterns" that might seem counter-intuitive at first glance.

1. Counter

Definition: A cumulative metric that represents a single monotonically increasing counter. Its value can only increase or be reset to zero on restart.

Best for: "How many times has X happened?"
Examples: Total HTTP requests, total errors, bytes received.
Key Function: rate() (calculates the per-second rate of increase).
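
To make the idea behind rate() concrete, here is a minimal Python sketch of a per-second rate computed from raw counter samples, including reset handling. It is a simplification and not the actual PromQL implementation (the real rate() also extrapolates to the edges of the query window); the function name simple_rate and the sample data are made up for illustration.

```python
def simple_rate(samples):
    """Approximate per-second rate of increase for a counter.

    samples: list of (timestamp_seconds, value) pairs, ordered by time.
    A drop in value is treated as a counter reset (process restart).
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to measure an increase
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # Reset: the counter restarted from zero, so the whole
            # current value counts as new increase.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical scrape results; the counter resets between t=30 and t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(samples))  # increase of 100 over 60 s, about 1.67 per second
```

Because resets are absorbed this way, a restart does not show up as a negative rate, which is why counters are queried through rate() rather than read directly.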

2. Gauge

Definition: A metric that represents a single numerical value that can arbitrarily go up and down.

Best for: "What is the current state/level right now?"
Examples: Memory usage, temperature, number of concurrent requests.
Special Patterns:
- Metadata (build_info): Setting a Gauge to 1 with labels like version or commit to export process info.
- Timestamps (process_start_time_seconds): Storing a Unix timestamp to calculate uptime.
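
Both special patterns can be sketched in a few lines of plain Python. This is only an illustration: the helper names (render_info_metric, uptime_seconds) and the label values are hypothetical, and the rendering does not escape special characters in label values the way a real client library would.

```python
import time


def render_info_metric(name, labels):
    """Render an info-style gauge: the value is always 1, the labels carry the data."""
    label_str = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{name}{{{label_str}}} 1"


def uptime_seconds(process_start_time_seconds, now=None):
    """Uptime is the current time minus the stored start timestamp."""
    now = time.time() if now is None else now
    return now - process_start_time_seconds


# Hypothetical version and commit values.
print(render_info_metric("build_info", {"version": "1.2.3", "commit": "abc123"}))
print(uptime_seconds(1000.0, now=1060.0))  # 60.0
```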

3. Histogram

Definition: Samples observations (usually durations or sizes) and counts them in configurable "buckets."

Best for: "What is the distribution of my data?" (e.g., Latency).
Why use it: Averages hide outliers. Histograms allow you to calculate percentiles (P95, P99).

The Core Purpose: Distribution

If you have 100 web requests:

95 of them take 10ms (Lightning fast).
5 of them take 5,000ms (The app feels broken for these users).

If you look at the Average (Gauge), it says your latency is 259.5ms. That looks "okay," but it's a lie. It hides the fact that 5% of your users are having a terrible time.
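
The arithmetic for this 100-request example can be checked with a short Python sketch. The nearest_rank helper is a simple illustrative percentile definition, not something Prometheus itself provides.

```python
# 95 fast requests at 10 ms and 5 slow requests at 5,000 ms.
latencies_ms = [10] * 95 + [5000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)


def nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


print(mean_ms)                         # 259.5
print(nearest_rank(latencies_ms, 95))  # 10
print(nearest_rank(latencies_ms, 99))  # 5000
```

The mean matches neither group of users: nobody waited 259.5ms. The percentiles show both the typical experience and the painful tail.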

A Histogram breaks these 100 requests into "buckets" defined by upper bounds (e.g., ≤100ms, ≤500ms, ≤5s). This allows you to see the outliers that an average or a single gauge value would hide.

How it works in Prometheus

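When you define a Histogram, you define Buckets, each with an upper bound. Every time an event happens (like a function call finishing), you "observe" the duration. The client library then increments the counter for every bucket whose upper bound is greater than or equal to the duration. Buckets are cumulative, so a fast request also counts toward every larger bucket, and there is always a final +Inf bucket that catches everything. The Histogram also tracks the total count and the sum of all observations.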
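
As a rough illustration of the observe step, the following Python sketch keeps cumulative bucket counts the way they appear when a Histogram is scraped. The class name SketchHistogram and the bucket bounds are assumptions for this example, and real client libraries may store the counts differently internally.

```python
import math


class SketchHistogram:
    """Toy histogram with cumulative buckets, for illustration only."""

    def __init__(self, upper_bounds):
        # An implicit +Inf bucket catches every observation.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.total_sum = 0.0
        self.total_count = 0

    def observe(self, value):
        # Cumulative: increment every bucket whose upper bound covers the value.
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1
        self.total_sum += value
        self.total_count += 1


# Bucket bounds in seconds: 0.1, 0.5, 5.0 (plus the implicit +Inf).
h = SketchHistogram([0.1, 0.5, 5.0])
for _ in range(95):
    h.observe(0.010)  # 95 fast requests
for _ in range(5):
    h.observe(5.0)    # 5 slow requests

print(h.bucket_counts)  # [95, 95, 100, 100]
```

The last bucket always equals the total count, which is what makes percentile estimation from the bucket counts possible.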

Why is it useful? (The "Magic" of Percentiles)

The real power of a Histogram isn't just seeing the buckets; it's using the histogram_quantile function in PromQL. This allows you to ask questions like:

"What is the P95 latency?" (The time within which 95% of requests completed, estimated from the bucket counts).
"Is my latest deployment making the slow requests even slower?"

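The percentile calculation can be approximated in Python to show what happens to the bucket counts. This sketch follows the general approach of linear interpolation inside the bucket that contains the target rank; it is not the actual Prometheus source, the function name estimate_quantile is made up, and edge cases (such as q outside the open interval 0 to 1) are simply rejected.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    ending with a (math.inf, total_count) entry.
    """
    if not 0 < q < 1:
        raise ValueError("this sketch only handles 0 < q < 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            in_bucket = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Cumulative counts for 95 requests at 10 ms and 5 requests at 5 s.
buckets = [(0.1, 95), (0.5, 95), (5.0, 100), (math.inf, 100)]
print(estimate_quantile(0.95, buckets))  # about 0.1 seconds
print(estimate_quantile(0.99, buckets))  # about 4.1 seconds
```

Note that the P99 estimate is about 4.1s even though the slow requests actually took 5s. The result is only as precise as the bucket boundaries, so bucket bounds are worth choosing around the latencies you care about.
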
4. Summary

Definition: Similar to a Histogram, a Summary samples observations. While it also provides a total count and a sum of all observations, it calculates configurable quantiles over a sliding time window on the client side.

Best for: When you need accurate percentiles but cannot perform the calculation on the Prometheus server.
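Downside: You cannot meaningfully aggregate the precomputed quantiles of Summaries from multiple instances, because averaging percentiles does not give a valid percentile (only the count and sum can be added up). Histograms are usually preferred for distributed systems.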
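
A small Python sketch shows why per-instance percentiles cannot simply be averaged. The two instances and their latencies are invented for illustration, and nearest_rank is a simple percentile helper, not a Prometheus function.

```python
def nearest_rank(values, pct):
    """Nearest-rank percentile over raw observations."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


# Hypothetical latencies in ms from two instances of the same service.
instance_a = [10] * 100               # healthy instance
instance_b = [10] * 80 + [5000] * 20  # struggling instance

p95_a = nearest_rank(instance_a, 95)
p95_b = nearest_rank(instance_b, 95)

averaged_p95 = (p95_a + p95_b) / 2
true_p95 = nearest_rank(instance_a + instance_b, 95)

print(p95_a, p95_b)   # 10 5000
print(averaged_p95)   # 2505.0
print(true_p95)       # 5000
```

The averaged value of 2505ms corresponds to no real request and understates the true fleet-wide P95. Histogram buckets avoid this problem because bucket counts from different instances can be summed before the percentile is estimated.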

Metric Comparison Matrix

Metric Type	Value Behavior	Real-world Analogy	Primary Use Case
Counter	Only increases	A car's Odometer	Total events over time
Gauge	Up and down	A car's Speedometer	Current snapshots/levels
Histogram	Cumulative buckets	Race finish times (sub 10m, sub 15m)	Latency & SLA monitoring
Summary	Sliding quantiles	Performance reviews	Client-side percentiles

Visualizing Latency with Histograms

Histograms allow you to see "Heatmaps," which show you how your application's performance changes over time across all users, rather than just a single average line.
