Node.js observability: structured logs, health checks, and metrics in production

How to set up JSON logs, HTTP health checks, and Prometheus metrics in Node.js applications so you actually know what happens in production.

9 min read

By Guara Cloud Editorial

Tested with Node.js 20 / Docker / pino 9.x / prom-client 15.x

You have opened production logs trying to figure out why the API returned 500, only to find a console.log("error here") with no timestamp, no context, and no stack trace. This happens to teams of every size. Observability is not about pretty dashboards. It is about answering “what happened?” without having to reproduce the bug locally.

This tutorial covers the three pieces that actually make a difference day to day: structured logs with pino, honest health checks, and Prometheus metrics for alerting. No paid APM. Just what a regular Node.js application needs so you can sleep at night.

Quick answer

To get observability in a production Node.js application, add three things: JSON-formatted logs with pino (searchable on Guara Cloud), a /health endpoint that checks the database and dependencies (instead of always returning 200), and Prometheus metrics with prom-client to track latency, throughput, and errors. Guara Cloud collects stdout logs automatically and can scrape metrics from the /metrics endpoint.

Key takeaways

  • Replace console.log with pino. JSON logs let you filter by request ID, status code, and duration directly on the platform.
  • Your health check needs to test real dependencies. A /health that only returns {"status":"ok"} will not detect a database outage.
  • Latency histograms are more useful than plain counters. The p99 percentile shows where users actually suffer.
  • Add a requestId to every request log. Without correlation across logs, debugging a specific error becomes archaeology.

When this applies

Any Node.js API in production with real users. It does not matter if it is Express, Fastify, NestJS, or Hono. If the application receives traffic and you need to respond to incidents, these three components (logs, health checks, metrics) are the bare minimum.

When this does not apply

If the project is a batch script that runs and exits, the metrics and health check parts lose their purpose. For microservice architectures with more than 10 services and thousands of requests per second, you will probably want distributed tracing with OpenTelemetry, which goes beyond what this tutorial covers.

Before you start

  • Node.js 20 installed locally
  • An existing Node.js API (Express, Fastify, or similar)
  • Docker working
  • A Guara Cloud account for deployment

1. Structured logging with pino

console.log works in development. In production, it generates unstructured text that nobody can filter. Pino fixes this with structured JSON and high performance (it is one of the fastest loggers in the Node ecosystem).

Install pino and the HTTP middleware:

npm install pino pino-http

Set up the logger in src/logger.js:

import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  base: { service: 'my-api' },
  timestamp: pino.stdTimeFunctions.isoTime,
});

Now integrate with Express in src/app.js:

import express from 'express';
import pinoHttp from 'pino-http';
import { logger } from './logger.js';
import { randomUUID } from 'crypto';

const app = express();

app.use(pinoHttp({
  logger,
  genReqId: () => randomUUID(),
  customProps: () => ({ env: process.env.NODE_ENV }),
  customSuccessMessage: (req, res) => `${req.method} ${req.url} ${res.statusCode}`,
  customErrorMessage: (req, res, err) => `${req.method} ${req.url} ${res.statusCode} ${err.message}`,
}));

// Every log now includes the request ID automatically
app.get('/orders', async (req, res) => {
  req.log.info({ filters: req.query }, 'Fetching orders');
  // ...
});

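Each request log line now has: ISO timestamp, level, request ID, method, URL, status code, and duration. pino-http writes the ID under req.id and the duration under responseTime (in milliseconds), and lines logged through req.log carry the same ID. In the Guara Cloud logs, you search for that ID (for example requestId:"abc-123") and see the full journey of that request.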

2. Health checks that actually test something

Most health checks I see in production do this:

app.get('/health', (req, res) => {
  res.json({ status: 'ok' });
});

That tells you the process is running. It does not tell you whether the database is reachable, whether the external API is responding, or whether memory usage is under control. An honest health check tests the dependencies:

import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

app.get('/health', async (req, res) => {
  const checks = {};
  let healthy = true;

  // Check database
  try {
    const start = Date.now();
    await pool.query('SELECT 1');
    checks.database = { status: 'up', latency: Date.now() - start };
  } catch (err) {
    checks.database = { status: 'down', error: err.message };
    healthy = false;
  }

  // Check external service (if applicable)
  try {
    const resp = await fetch(process.env.PAYMENT_API_URL + '/ping', {
      signal: AbortSignal.timeout(3000),
    });
    checks.paymentApi = { status: resp.ok ? 'up' : 'degraded' };
  } catch {
    checks.paymentApi = { status: 'down' };
    healthy = false;
  }

  checks.uptime = process.uptime();
  checks.memory = process.memoryUsage.rss();

  res.status(healthy ? 200 : 503).json(checks);
});

Guara Cloud uses this endpoint to decide if the container is healthy. If it returns 503, the platform restarts the service and only sends traffic when it comes back. That is much better than users getting errors and you finding out hours later.

One important detail: the health check runs every 30 seconds by default. Heavy queries in it will overload the database. Keep the checks light (SELECT 1, not SELECT with JOINs).
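A hung dependency is as bad for a health check as a heavy query: if the database call never returns, /health never answers either. One way to guard against that is to give every check a deadline and run the checks in parallel. This is a minimal sketch; withTimeout and runChecks are hypothetical helper names, and the 2-second default is an assumption, not a platform requirement.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
// The underlying operation is not cancelled, only abandoned.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run every check concurrently; `checks` maps a name to an async function.
async function runChecks(checks, timeoutMs = 2000) {
  const names = Object.keys(checks);
  const settled = await Promise.allSettled(
    names.map(async (name) => {
      const start = Date.now();
      await withTimeout(checks[name](), timeoutMs, name);
      return Date.now() - start;
    }),
  );

  const results = {};
  let healthy = true;
  settled.forEach((outcome, i) => {
    if (outcome.status === 'fulfilled') {
      results[names[i]] = { status: 'up', latencyMs: outcome.value };
    } else {
      results[names[i]] = { status: 'down', error: outcome.reason.message };
      healthy = false;
    }
  });
  return { healthy, results };
}

// Usage inside the /health handler:
// const { healthy, results } = await runChecks({ database: () => pool.query('SELECT 1') });
// res.status(healthy ? 200 : 503).json(results);
```

Running the checks in parallel keeps the total response time close to the slowest check instead of the sum of all of them.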

Environment variables for observability

Name             Value
LOG_LEVEL        info (use debug only in staging)
NODE_ENV         production
DATABASE_URL     postgres://user:***@host/db
METRICS_ENABLED  true

3. Prometheus metrics with prom-client

Metrics serve two purposes: understanding usage patterns (traffic spikes at 2pm, latency gets worse after deploys) and setting up automatic alerts. prom-client is the standard library for exposing metrics in Prometheus format.

npm install prom-client

Set up in src/metrics.js:

import client from 'prom-client';

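// Collect default Node.js metrics (CPU, memory, event loop lag, GC).
// No prefix here: prom-client already names them nodejs_* and process_*,
// and a prefix would turn nodejs_eventloop_lag_seconds into nodejs_nodejs_eventloop_lag_seconds.
client.collectDefaultMetrics();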

export const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests by method, route, and status',
  labelNames: ['method', 'route', 'status'],
});

export const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

export const dbQueryDuration = new client.Histogram({
  name: 'db_query_duration_seconds',
  help: 'Database query duration in seconds',
  labelNames: ['operation', 'table'],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
});
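The dbQueryDuration histogram only helps if something feeds it. A small wrapper keeps the timing code out of your queries. This is a sketch, not part of prom-client: timed is a hypothetical helper that works with any object exposing the startTimer(labels) method that prom-client histograms provide.

```javascript
// Time an async operation and record its duration on a histogram.
// `histogram` is expected to expose startTimer(labels), as prom-client's Histogram does.
async function timed(histogram, labels, fn) {
  const end = histogram.startTimer(labels);
  try {
    return await fn();
  } finally {
    end(); // records the elapsed time even when fn throws
  }
}

// Usage (the table name and query are illustrative):
// const { rows } = await timed(
//   dbQueryDuration,
//   { operation: 'select', table: 'orders' },
//   () => pool.query('SELECT id FROM orders LIMIT 50'),
// );
```

Recording in the finally block means failed queries are measured too, which is often where the slow ones hide.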

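Record metrics on every request (Express middleware, registered before the routes):

import { httpRequestsTotal, httpRequestDuration } from './metrics.js';

app.use((req, res, next) => {
  // Start timing now; labels are attached when the response finishes,
  // because req.route is only set once Express has matched a route.
  const end = httpRequestDuration.startTimer();

  res.on('finish', () => {
    const labels = {
      method: req.method,
      // Use the route pattern (/orders/:id), never the raw path,
      // so unmatched URLs cannot create unbounded label values
      route: req.route?.path || 'unmatched',
      status: res.statusCode,
    };
    httpRequestsTotal.inc(labels);
    end(labels);
  });

  next();
});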
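One gap in a label built from req.route.path alone: routes defined on a mounted router (app.use('/api', router)) report only the router-relative pattern, so /api/orders/:id and /admin/orders/:id would share a label. A minimal sketch of a helper that handles this, assuming Express semantics for req.baseUrl and req.route; routeLabel is a hypothetical name, and routes declared with a regular expression fall into the 'unmatched' bucket here.

```javascript
// Return a bounded route label for an Express request.
function routeLabel(req) {
  if (req.route && typeof req.route.path === 'string') {
    // baseUrl is the mount point of the router, e.g. '/api' + '/orders/:id'
    return (req.baseUrl || '') + req.route.path;
  }
  // 404s, static files, and bot scans all collapse into one label value.
  return 'unmatched';
}

// Usage inside the 'finish' handler:
// const labels = { method: req.method, route: routeLabel(req), status: res.statusCode };
```

Every distinct label value creates a new time series, so a single fixed fallback keeps scanners probing random URLs from inflating your metrics.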

Expose at the /metrics endpoint:

import client from 'prom-client';

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

With this, you get metrics for:

  • http_request_duration_seconds_bucket: latency percentiles per route
  • http_requests_total: throughput by status (200, 404, 500)
  • nodejs_eventloop_lag_seconds: whether the event loop is blocked
  • nodejs_heap_size_used_bytes: memory usage

A useful alert based on these metrics: “fire if p99 latency exceeds 2 seconds for more than 5 minutes.” That catches degradation before users complain.
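It helps to know where that p99 number comes from. A histogram does not store individual request times, only cumulative counts per bucket, so the percentile is an estimate interpolated inside the bucket where the target rank falls. The sketch below mirrors the idea behind Prometheus' histogram_quantile in simplified form (no counter resets, q between 0 and 1, at least one finite bucket); estimateQuantile and the sample counts are made up for illustration.

```javascript
// buckets: cumulative counts sorted by upper bound `le`, ending with +Infinity,
// the same shape as the *_bucket series a Prometheus histogram exposes.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);

  // The quantile falls in the +Inf bucket: the best available answer
  // is the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[i - 1].le;

  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  // Assume observations are spread evenly inside the bucket.
  return lower + (buckets[i].le - lower) * ((rank - below) / (buckets[i].count - below));
}

const sample = [
  { le: 0.1, count: 600 },
  { le: 0.5, count: 900 },
  { le: 1, count: 980 },
  { le: 2.5, count: 998 },
  { le: Infinity, count: 1000 },
];

console.log(estimateQuantile(0.5, sample).toFixed(3));  // 0.083
console.log(estimateQuantile(0.99, sample).toFixed(3)); // 1.833
```

With these counts, the p99 lands in the 1 to 2.5 second bucket and is reported as roughly 1.83 seconds. The estimate is only as precise as the bucket boundaries, which is why it pays to place buckets close to your alert thresholds.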

4. Protect internal endpoints

Metrics and health checks expose information about the system. In environments with sensitive data, /metrics can reveal table names, route patterns, and traffic volume. Restrict access:

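// Allow loopback callers (such as the Docker HEALTHCHECK) or requests carrying the secret header.
// Register this before the /metrics and /health routes.
const LOOPBACK = new Set(['127.0.0.1', '::1', '::ffff:127.0.0.1']);

app.use(['/metrics', '/health'], (req, res, next) => {
  const expected = process.env.INTERNAL_TOKEN;
  const token = req.headers['x-internal-token'];
  // Require a configured token: an unset INTERNAL_TOKEN must never match a missing header
  const authorized = Boolean(expected) && token === expected;
  if (!authorized && !LOOPBACK.has(req.ip)) {
    return res.status(403).end();
  }
  next();
});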

On Guara Cloud, metric scraping comes from the platform’s internal network. The INTERNAL_TOKEN ensures only authorized infrastructure accesses these endpoints.

5. Dockerfile and deployment

The Dockerfile does not need anything special for observability. The important thing is making sure logs go to stdout (not to a file) and the endpoints are reachable:

FROM node:20-alpine

WORKDIR /app

COPY package.json package-lock.json ./
RUN npm ci --omit=dev

COPY src/ ./src/

EXPOSE 3000

HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD wget -qO- http://localhost:3000/health || exit 1

CMD ["node", "src/server.js"]

Deploying to Guara Cloud

  1. Push the Dockerfile and code to your Git repository
  2. Create a new service on Guara Cloud pointing to the repository
  3. Set environment variables (LOG_LEVEL, DATABASE_URL, INTERNAL_TOKEN)
  4. Start the deploy and check the logs to confirm entries appear in JSON format
  5. Hit /health through the public URL to confirm dependency status is returned (send the x-internal-token header if you added the protection from section 4)

Deploy via CLI

guara deploy --name my-api --env LOG_LEVEL=info --env METRICS_ENABLED=true

After the deploy, structured logs show up automatically in the Guara Cloud panel. You can filter by level:error, requestId, route, or duration.

What to do with all of this

Having logs, health checks, and metrics working is the starting point. The next step is creating alerts. Some examples I use and recommend:

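High error rate: if more than 5% of requests return 5xx in the last 5 minutes, notify. Uses sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05. The sum() on both sides matters: without it, PromQL only divides series with identical labels, so each 5xx series is divided by itself and the result is always 1.

Latency increasing: if p99 exceeds 1s for more than 10 minutes. Uses histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1, with a 10-minute for: duration on the alert rule. Add route to the by clause if you want one p99 per route instead of a single global value.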

Memory leak: if nodejs_heap_size_used_bytes grows consistently for 2 hours without releasing.
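To make the error-rate alert concrete, here is the same arithmetic done by hand on two snapshots of http_requests_total taken five minutes apart. This is only a sketch of the calculation: errorRatio and the sample numbers are made up, and unlike Prometheus' rate() it ignores counter resets after a restart.

```javascript
// Each snapshot maps a status code to the cumulative value of http_requests_total.
function errorRatio(before, after) {
  let total = 0;
  let errors = 0;
  for (const [status, value] of Object.entries(after)) {
    const delta = value - (before[status] ?? 0); // requests seen between the snapshots
    total += delta;
    if (status.startsWith('5')) errors += delta;
  }
  return total === 0 ? 0 : errors / total;
}

const fiveMinutesAgo = { 200: 1000, 404: 20, 500: 5 };
const now = { 200: 1900, 404: 40, 500: 85 };

const ratio = errorRatio(fiveMinutesAgo, now);
console.log(ratio > 0.05 ? 'alert' : 'ok'); // alert (80 of 1000 requests failed)
```

Counters only ever go up, so the alert is always about the difference between two points in time, never about the raw value.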

On Guara Cloud, logs are already collected automatically. You define filters and alerts through the interface, without setting up a separate ELK stack.

Common issues

Problem: Logs appear as plain text, not JSON
Solution: Verify you are using pino correctly. If you mix console.log with pino, console.log entries are not formatted. Replace every console call with the logger.

Problem: Health check returns 503 even though everything works
Solution: The health check timeout (3s) might be too short if the database is far away. Increase it to 5s or use a liveness endpoint that only checks the process, not the dependencies.

Problem: /metrics returns error 500
Solution: A common cause is the same metric name being registered twice, for example after a hot reload; prom-client rejects the duplicate with an "already been registered" error. Define each metric once at module level and import it wherever it is needed.

Problem: High cardinality in metrics (too many different labels)
Solution: Never use user values as labels (userId, email). Only use finite values: method, route, status. Routes with parameters (/:id) must be normalized.

Problem: Logs are too verbose and expensive to store
Solution: Use LOG_LEVEL=info in production and configure sampling for high-volume routes (health checks and status endpoints do not need to be logged).
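For the last item, pino-http can skip its automatic request log for selected requests through the autoLogging.ignore option. A minimal sketch of the predicate, assuming the two internal paths used in this tutorial; QUIET_PATHS and shouldSkipLog are hypothetical names.

```javascript
// Paths that are polled constantly and rarely worth a log line.
const QUIET_PATHS = new Set(['/health', '/metrics']);

// Returns true when the automatic request log should be skipped.
function shouldSkipLog(req) {
  const path = (req.url || '').split('?')[0]; // drop the query string
  return QUIET_PATHS.has(path);
}

// Usage with the logging setup from section 1:
// app.use(pinoHttp({ logger, autoLogging: { ignore: shouldSkipLog } }));
```

Keep in mind that this also silences failed requests on those paths, so make sure a failing health check is visible somewhere else, for example through the error-rate alert.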

Do I need OpenTelemetry for Node.js observability?

Not necessarily. OpenTelemetry is useful for distributed tracing, when a request passes through several services. If you have a monolith or a few services, pino + prom-client handles 90% of cases with far less complexity.

What is the difference between liveness and readiness probes?

Liveness checks if the process is alive (responds quickly, no dependencies). Readiness checks if it is ready to receive traffic (database connected, cache loaded). Use /health/readiness for readiness and /health/liveness for liveness.
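A minimal sketch of the split, written as plain Express-style handlers. The isReady callback is a hypothetical function you would supply; it should resolve to true only when the dependencies are usable.

```javascript
// Liveness: no dependencies, answers as long as the event loop is running.
function liveness(req, res) {
  res.status(200).json({ status: 'alive', uptime: process.uptime() });
}

// Readiness: builds a handler around an async readiness probe.
function readiness(isReady) {
  return async (req, res) => {
    let ready = false;
    try {
      ready = (await isReady()) === true;
    } catch {
      ready = false; // a throwing probe means "not ready", not a crash
    }
    res.status(ready ? 200 : 503).json({ status: ready ? 'ready' : 'not ready' });
  };
}

// Usage with Express:
// app.get('/health/liveness', liveness);
// app.get('/health/readiness', readiness(async () => {
//   await pool.query('SELECT 1');
//   return true;
// }));
```

The practical difference: a failing liveness probe is a reason to restart the container, while a failing readiness probe is only a reason to stop sending it traffic.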

Should I log the request body?

In production, no. Logs with bodies increase storage costs and can contain sensitive data (PII, tokens). Log only method, URL, status, and duration. If you need the body for debugging, use staging with LOG_LEVEL=debug.

How much overhead does pino add?

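Very little for most APIs. Pino is designed for low overhead and is considered one of the fastest loggers for Node.js: it does minimal work per log call on the main thread, and since version 7 its transports (formatting and shipping logs elsewhere) run in a worker thread. The per-request cost is usually negligible next to a database query, but benchmark under your own load instead of relying on a fixed number.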

Monitor your applications on Guara Cloud

Structured logs collected automatically, managed health checks, and accessible metrics. All on Brazilian infrastructure billed in local currency.
