Writing

Blog

Articles on performance testing, SRE, observability, and AI systems performance.

AI Performance

Measuring LLM Inference Performance: Latency, Throughput, and Cost

The metrics that actually matter for LLM serving — TTFT, TPOT, tokens/sec, and cost per request — how they trade off, and how to load-test an inference endpoint.

SRE

SLOs and Error Budgets: A Practical Guide for Performance Engineers

How to turn vague reliability goals into measurable SLIs, SLOs, and error budgets — and how that math directly governs release velocity and on-call load.

SRE

Chaos Engineering: Testing Reliability by Breaking Things on Purpose

What chaos engineering is, how to run a safe first experiment, and how it connects to error budgets and SLOs.

SRE

Capacity Planning with the Universal Scalability Law

How the Universal Scalability Law models contention and coherency penalties to predict where a system's throughput will actually peak and decline.

SRE

Writing Incident Response Runbooks That Actually Get Used

What makes an incident runbook useful under real pressure versus one that gets ignored, with a practical structure to follow.

SRE

On-Call Best Practices That Prevent Burnout

Practical on-call practices — rotation design, alert quality, and post-incident follow-up — that keep on-call sustainable rather than dreaded.

SRE

Building a Genuine Blameless Postmortem Culture

What separates a blameless postmortem culture that actually works from one that's blameless only in name, and how to build the former.

SRE

SRE vs DevOps vs Platform Engineering: What Actually Differs

A clear-eyed comparison of SRE, DevOps, and platform engineering as organizational approaches, and where the real differences (and overlaps) lie.

SRE

Toil Reduction: Identifying and Eliminating Operational Toil

What SRE means by 'toil,' how to identify it systematically, and a practical framework for deciding what to automate first.

SRE

Monitoring vs Observability: A Practical Distinction

What actually separates monitoring from observability beyond the buzzword, and why the distinction matters for debugging unknown failure modes.

SRE

Runbooks vs Playbooks: A Useful Distinction for Incident Response

The practical difference between an incident runbook and a playbook, and when each is the right tool to write and maintain.

SRE

SRE Team Topologies: Embedded, Centralized, and Hybrid Models

How SRE teams are typically organized — embedded, centralized, and hybrid models — and the trade-offs each makes between context and consistency.

AI Performance

Continuous Batching: How Modern LLM Servers Achieve High Throughput

How continuous batching differs from static batching, why it's central to the throughput advantage of vLLM and TGI, and what it costs individual requests.

AI Performance

Prompt Caching and KV Cache: Why Repeated Context Gets Cheaper

How prompt/KV caching reduces cost and latency for repeated context in LLM applications, and when it helps versus when it doesn't.

AI Performance

Benchmarking Vector Database Performance for RAG Systems

What actually matters when benchmarking a vector database for retrieval-augmented generation — recall, latency, and indexing trade-offs.

AI Performance

GPU Utilization for LLM Model Serving: What to Actually Measure

Why GPU utilization percentage alone is a misleading metric for LLM serving, and what to measure instead to understand real efficiency.

AI Performance

Quantization and Performance Trade-offs in LLM Serving

How model quantization (INT8, INT4, and similar) trades accuracy for lower latency, higher throughput, and memory savings, and how to evaluate the trade-off.

AI Performance

Optimizing RAG Pipeline Latency: Where the Time Actually Goes

A breakdown of where latency accumulates in a retrieval-augmented generation pipeline, and the highest-leverage places to optimize it.

AI Performance

Benchmarking Open-Source LLM Inference Servers: vLLM, TGI, and Ollama

A practical comparison framework for benchmarking vLLM, TGI, and Ollama, and what each is actually optimized for.

AI Performance

Load Testing LLM APIs: A Practical Guide

How to design a load test specifically for LLM APIs, covering realistic prompt distributions, streaming measurement, and concurrency sweeps.

AI Performance

Token Economics 101: Understanding LLM API Cost Structure

How LLM API pricing actually works — input vs output token pricing, why output costs more, and the practical levers for controlling cost.

Observability

OpenTelemetry for Performance Engineers: A Practical Start

A practical introduction to OpenTelemetry's traces, metrics, and logs, and how to instrument a service for meaningful performance analysis.

Observability

Prometheus and Grafana Basics for Performance Monitoring

How Prometheus's pull-based metrics model and PromQL work, and how to build Grafana dashboards that actually answer performance questions.

Observability

The RED Method: Rate, Errors, Duration for Service Monitoring

How the RED method gives a simple, consistent framework for monitoring any request-driven service, and how it complements the USE method.

Observability

Distributed Tracing Explained: Spans, Context, and Sampling

How distributed tracing actually works under the hood — spans, trace context propagation, and sampling strategies — explained from first principles.

Observability

Structured Logging Best Practices for Debuggable Systems

Why structured logging (key-value fields, not free text) matters for debugging at scale, and practical conventions worth adopting.

Observability

The USE Method: Utilization, Saturation, Errors for Resource Monitoring

How Brendan Gregg's USE method systematically checks system resources for performance bottlenecks, and how it pairs with the RED method.

Observability

APM Tool Comparison: Datadog, Dynatrace, and New Relic

A practical comparison of how Datadog, Dynatrace, and New Relic approach instrumentation, AI-assisted root-cause analysis, and pricing.

Observability

Building SLO Dashboards That Drive Real Decisions

How to design an SLO dashboard that actually informs the ship/freeze decisions error budgets are meant to enable, not just display pretty graphs.

Concepts

Little's Law for Performance Engineers, with Worked Examples

An intuitive explanation of Little's Law (L = λW), how to derive concurrency, throughput, or latency from the other two, and common misuses.

Concepts

Amdahl's Law for Performance Engineers

How Amdahl's Law quantifies the maximum speedup parallelization can achieve when part of a workload is inherently serial, with practical examples.

Concepts

Queueing Theory Basics for Performance Engineers

An accessible introduction to queueing theory concepts — utilization, queue length, and waiting time — and why systems get dramatically slower near full utilization.

Concepts

Why p99 Matters: Understanding Latency Percentiles

What latency percentiles actually mean, why averages systematically mislead, and the pitfalls of averaging or combining percentiles incorrectly.

Concepts

Concurrency vs Parallelism: A Clear Distinction

The genuine technical distinction between concurrency and parallelism, why it matters for performance reasoning, and common confusions.

Concepts

Garbage Collection Tuning Fundamentals

The core concepts behind garbage collector tuning — generational collection, pause times, and throughput trade-offs — applicable across JVM, .NET, and Go.

Concepts

Throughput vs Latency: Why You Usually Can't Maximize Both

Why throughput and latency often trade off against each other through batching, and how to decide where to sit on that trade-off curve.

Performance Testing

Setting Performance Budgets for Web Applications

How to set practical performance budgets (page weight, load time, Core Web Vitals) and enforce them in CI before they regress in production.

Performance Testing

Spike, Stress, and Soak Testing: Three Different Questions

How spike testing, stress testing, and soak testing each answer a different reliability question, and why a single load test can't cover all three.

Observability

Synthetic Monitoring vs Real User Monitoring (RUM)

How synthetic monitoring and real user monitoring complement each other for understanding production performance, and when to rely on each.

Performance Testing

How to Write a Performance Test Plan That Answers a Real Question

A practical template for a performance test plan that starts from a specific question, not a generic checklist of tools and metrics.

Performance Testing

A Pre-Launch Performance Testing Checklist

A practical checklist to run through before considering a performance testing effort complete and ready to inform a launch decision.

Performance Testing

Top Performance Testing Mistakes (and How to Avoid Them)

A roundup of the most common, costly performance testing mistakes across tools and teams, distilled into a practical avoidance guide.

Concepts

Understanding Apdex: Translating Latency into User Satisfaction

What the Apdex score actually measures, how to set its thresholds meaningfully, and its limitations as a single summary metric.

SRE

How to Calculate an Error Budget, Step by Step

A step-by-step walkthrough of calculating an error budget from an SLO, with worked examples at different reliability targets.

Concepts

What is DevPerfOps? Performance as a First-Class Citizen

DevPerfOps extends DevOps by embedding performance engineering across the entire delivery pipeline — shifting it left from a pre-release gate to a continuous, shared responsibility.
