This guide explains the four core Prometheus metric types, including common "patterns" that might seem counter-intuitive at first glance.
Definition (Counter): A cumulative metric that represents a single monotonically increasing value. It can only increase, or be reset to zero when the process restarts.
- Best for: "How many times has X happened?"
- Examples: Total HTTP requests, total errors, bytes received.
- Key Function: `rate()` (calculates the per-second rate of increase).
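
To make the "per-second rate of increase" idea concrete, here is a minimal sketch in plain Python. It is not the PromQL implementation: the real `rate()` also extrapolates to the edges of the query window, which is left out here. The function names and the sample values are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix_timestamp_seconds, counter_value)


def counter_increase(samples: List[Sample]) -> float:
    """Total increase across the samples, treating any drop as a reset to zero."""
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # The counter restarted from zero, so everything it shows now is new.
            increase += curr
    return increase


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


# Hypothetical scrapes every 10 s; the process restarts before t=30.
scrapes = [(0, 100), (10, 120), (20, 140), (30, 20), (40, 40), (50, 60)]
print(simple_rate(scrapes))  # 2.0
```

The drop from 140 to 20 is counted as an increase of 20 rather than a decrease of 120, which is why a restart does not produce a negative rate.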
Definition (Gauge): A metric that represents a single numerical value that can arbitrarily go up and down.
- Best for: "What is the current state/level right now?"
- Examples: Memory usage, temperature, number of concurrent requests.
- Special Patterns:
  - Metadata (`build_info`): Setting a Gauge to `1` with labels like `version` or `commit` to export process info.
  - Timestamps (`process_start_time_seconds`): Storing a Unix timestamp to calculate uptime.
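
The two special patterns can be sketched in a few lines of plain Python. This is an illustration only, not a client library: `render_info_gauge` and `uptime_seconds` are hypothetical helpers, the label values are made up, and label escaping is ignored.

```python
import time
from typing import Dict, Optional


def render_info_gauge(name: str, labels: Dict[str, str]) -> str:
    """Render one sample line in the text exposition format with a constant value of 1."""
    label_text = ",".join(f'{key}="{value}"' for key, value in labels.items())
    return f"{name}{{{label_text}}} 1"


def uptime_seconds(start_time: float, now: Optional[float] = None) -> float:
    """Uptime is the current time minus the stored start timestamp."""
    current = time.time() if now is None else now
    return current - start_time


# Hypothetical label values, for illustration only.
print(render_info_gauge("build_info", {"version": "1.2.3", "commit": "abc123"}))
# build_info{version="1.2.3",commit="abc123"} 1

print(uptime_seconds(start_time=1_700_000_000.0, now=1_700_000_360.0))  # 360.0
```

In both cases the gauge value itself is boring (a constant `1`, or a timestamp that never changes); the useful information lives in the labels or in the subtraction done at query time.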
Definition (Histogram): Samples observations (usually durations or sizes) and counts them in configurable "buckets."
- Best for: "What is the distribution of my data?" (e.g., Latency).
- Why use it: Averages hide outliers. Histograms allow you to calculate percentiles (P95, P99).
If you have 100 web requests:
- 95 of them take 10ms (Lightning fast).
- 5 of them take 5,000ms (The app feels broken for these users).
If you look at the Average (Gauge), it says your latency is 259.5ms. That looks "okay," but it's a lie. It hides the fact that 5% of your users are having a terrible time.
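
The arithmetic behind this example can be checked with a short Python sketch. The nearest-rank percentile helper is one common definition chosen for illustration; it is not how Prometheus computes quantiles.

```python
from typing import List


def nearest_rank_percentile(values: List[float], percent: int) -> float:
    """Smallest value such that at least `percent` percent of the values are <= it."""
    ordered = sorted(values)
    rank = max(1, (percent * len(ordered) + 99) // 100)  # integer ceil(percent * n / 100)
    return ordered[rank - 1]


# 95 fast requests and 5 very slow ones, in milliseconds.
latencies_ms = [10] * 95 + [5000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(mean_ms)                                    # 259.5
print(nearest_rank_percentile(latencies_ms, 50))  # 10
print(nearest_rank_percentile(latencies_ms, 95))  # 10
print(nearest_rank_percentile(latencies_ms, 99))  # 5000
```

No request actually took anywhere near 259.5ms. The median shows what most users see, and the P99 exposes the slow tail that the average blurs away.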
A Histogram breaks these 100 requests into "buckets" (e.g., ≤100ms, ≤500ms, ≤5s). This allows you to see the outliers that an average or a single gauge value would hide.
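When you define a Histogram, you define Buckets. Every time an event happens (like a function call finishing), you "observe" the duration. The client library then increments the counter of every bucket whose upper bound is greater than or equal to the duration (buckets are cumulative), and Prometheus collects those counters on its next scrape.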
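
The cumulative bucket behavior can be sketched as a toy class in plain Python. `SketchHistogram` is a hypothetical name and this is not the real client library; the bucket bounds (in seconds) are assumptions chosen to match the 100-request example.

```python
import math
from typing import Dict, List


class SketchHistogram:
    """Toy cumulative histogram; not the real client library."""

    def __init__(self, upper_bounds: List[float]) -> None:
        # An implicit +Inf bucket catches every observation.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts: Dict[float, int] = {b: 0 for b in self.upper_bounds}
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # Cumulative: bump every bucket whose upper bound covers the value.
        for bound in self.upper_bounds:
            if value <= bound:
                self.bucket_counts[bound] += 1
        self.sum += value
        self.count += 1


hist = SketchHistogram([0.1, 0.5, 5.0])
for _ in range(95):
    hist.observe(0.010)  # 10 ms
for _ in range(5):
    hist.observe(5.0)    # 5,000 ms

print(hist.bucket_counts)  # {0.1: 95, 0.5: 95, 5.0: 100, inf: 100}
print(hist.count)          # 100
```

Because the buckets are cumulative, a fast request is counted in all four buckets, while a slow one only appears in the last two.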
The real power of a Histogram isn't just seeing the buckets; it's using the `histogram_quantile()` function in PromQL. This allows you to ask questions like:
- "What is the P95 latency?" (The latency that 95% of requests stayed at or below).
- "Is my latest deployment making the slow requests even slower?"
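
To show roughly how a quantile can be estimated from bucket counts alone, here is a simplified Python sketch of linear interpolation inside a bucket. It follows the general idea behind `histogram_quantile()` but is not the server's implementation; `estimate_quantile` is a hypothetical name and the bucket data is the 100-request example with assumed bounds in seconds.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Bounds must be ascending and the last bucket must be the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, count_below = 0.0, 0
    for index, (upper_bound, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[index - 1][0]
            in_bucket = cumulative - count_below
            # Assume observations are spread evenly inside the bucket.
            return lower_bound + (upper_bound - lower_bound) * (rank - count_below) / in_bucket
        lower_bound, count_below = upper_bound, cumulative
    return math.nan


buckets = [(0.1, 95), (0.5, 95), (5.0, 100), (math.inf, 100)]
print(estimate_quantile(0.99, buckets))  # approximately 4.1 (seconds)
```

The estimate of about 4.1s differs from the true 5s because the histogram only knows that five requests landed somewhere between 0.5s and 5s. Tighter buckets around the latencies you care about give tighter estimates.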
Definition (Summary): Similar to a Histogram, a Summary samples observations. While it also provides a total count and a sum of all observations, it calculates configurable quantiles over a sliding time window on the client side.
- Best for: When you need accurate percentiles but cannot perform the calculation on the Prometheus server.
- Downside: You cannot meaningfully aggregate Summary quantiles across multiple instances, because averaging percentiles does not produce a valid percentile (Histograms are usually preferred for distributed systems).
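
The aggregation problem is easy to reproduce with made-up numbers. The sketch below uses a nearest-rank percentile as a stand-in for the quantile each instance would report; the two instances and their latencies are hypothetical.

```python
from typing import List


def nearest_rank_percentile(values: List[float], percent: int) -> float:
    """Smallest value such that at least `percent` percent of the values are <= it."""
    ordered = sorted(values)
    rank = max(1, (percent * len(ordered) + 99) // 100)  # integer ceil(percent * n / 100)
    return ordered[rank - 1]


# Hypothetical latencies (ms) seen by two instances of the same service.
instance_a = [10] * 100
instance_b = [10] * 90 + [5000] * 10

p95_a = nearest_rank_percentile(instance_a, 95)
p95_b = nearest_rank_percentile(instance_b, 95)

averaged_p95 = (p95_a + p95_b) / 2
true_p95 = nearest_rank_percentile(instance_a + instance_b, 95)

print(p95_a, p95_b)   # 10 5000
print(averaged_p95)   # 2505.0
print(true_p95)       # 10
```

Averaging the two per-instance P95 values gives 2505ms, while the P95 over all 200 requests is 10ms. Histogram buckets avoid this because bucket counts from different instances can simply be summed before the quantile is estimated.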
| Metric Type | Value Behavior | Real-world Analogy | Primary Use Case |
|---|---|---|---|
| Counter | Only increases | A car's Odometer | Total events over time |
| Gauge | Up and down | A car's Speedometer | Current snapshots/levels |
| Histogram | Cumulative buckets | Race finish times (sub 10m, sub 15m) | Latency & SLA monitoring |
| Summary | Sliding quantiles | Performance reviews | Client-side percentiles |
Histograms allow you to see "Heatmaps," which show you how your application's performance changes over time across all users, rather than just a single average line.