Writing
Blog
Articles on performance testing, SRE, observability, and AI systems performance.
Measuring LLM Inference Performance: Latency, Throughput, and Cost
The metrics that actually matter for LLM serving — TTFT, TPOT, tokens/sec, and cost per request — how they trade off, and how to load-test an inference endpoint.
SLOs and Error Budgets: A Practical Guide for Performance Engineers
How to turn vague reliability goals into measurable SLIs, SLOs, and error budgets — and how that math directly governs release velocity and on-call load.
Chaos Engineering: Testing Reliability by Breaking Things on Purpose
What chaos engineering is, how to run a safe first experiment, and how it connects to error budgets and SLOs.
Capacity Planning with the Universal Scalability Law
How the Universal Scalability Law models contention and coherency penalties to predict where a system's throughput will actually peak and decline.
Writing Incident Response Runbooks That Actually Get Used
What makes an incident runbook useful under real pressure versus one that gets ignored, with a practical structure to follow.
On-Call Best Practices That Prevent Burnout
Practical on-call practices — rotation design, alert quality, and post-incident follow-up — that keep on-call sustainable rather than dreaded.
Building a Genuine Blameless Postmortem Culture
What separates a blameless postmortem culture that actually works from one that's blameless only in name, and how to build the former.
SRE vs DevOps vs Platform Engineering: What Actually Differs
A clear-eyed comparison of SRE, DevOps, and platform engineering as organizational approaches, and where the real differences (and overlaps) lie.
Toil Reduction: Identifying and Eliminating Operational Toil
What SRE means by 'toil,' how to identify it systematically, and a practical framework for deciding what to automate first.
Monitoring vs Observability: A Practical Distinction
What actually separates monitoring from observability beyond the buzzword, and why the distinction matters for debugging unknown failure modes.
Runbooks vs Playbooks: A Useful Distinction for Incident Response
The practical difference between an incident runbook and a playbook, and when each is the right tool to write and maintain.
SRE Team Topologies: Embedded, Centralized, and Hybrid Models
How SRE teams are typically organized — embedded, centralized, and hybrid models — and the trade-offs each makes between context and consistency.
Continuous Batching: How Modern LLM Servers Achieve High Throughput
How continuous batching differs from static batching, why it's central to vLLM and TGI's throughput advantage, and what it costs individual requests.
Prompt Caching and KV Cache: Why Repeated Context Gets Cheaper
How prompt/KV caching reduces cost and latency for repeated context in LLM applications, and when it actually helps and when it doesn't.
Benchmarking Vector Database Performance for RAG Systems
What actually matters when benchmarking a vector database for retrieval-augmented generation — recall, latency, and indexing trade-offs.
GPU Utilization for LLM Model Serving: What to Actually Measure
Why GPU utilization percentage alone is a misleading metric for LLM serving, and what to measure instead to understand real efficiency.
Quantization and Performance Trade-offs in LLM Serving
How model quantization (INT8, INT4, and similar) trades accuracy for latency, throughput, and memory savings, and how to evaluate the trade-off.
Optimizing RAG Pipeline Latency: Where the Time Actually Goes
A breakdown of where latency accumulates in a retrieval-augmented generation pipeline, and the highest-leverage places to optimize it.
Benchmarking Open-Source LLM Inference Servers: vLLM, TGI, and Ollama
A practical comparison framework for benchmarking vLLM, TGI, and Ollama, and what each is actually optimized for.
Load Testing LLM APIs: A Practical Guide
How to design a load test specifically for LLM APIs, covering realistic prompt distributions, streaming measurement, and concurrency sweeps.
Token Economics 101: Understanding LLM API Cost Structure
How LLM API pricing actually works — input vs output token pricing, why output costs more, and the practical levers for controlling cost.
OpenTelemetry for Performance Engineers: A Practical Start
A practical introduction to OpenTelemetry's traces, metrics, and logs, and how to instrument a service for meaningful performance analysis.
Prometheus and Grafana Basics for Performance Monitoring
How Prometheus's pull-based metrics model and PromQL work, and how to build Grafana dashboards that actually answer performance questions.
The RED Method: Rate, Errors, Duration for Service Monitoring
How the RED method gives a simple, consistent framework for monitoring any request-driven service, and how it complements the USE method.
Distributed Tracing Explained: Spans, Context, and Sampling
How distributed tracing actually works under the hood — spans, trace context propagation, and sampling strategies — explained from first principles.
Structured Logging Best Practices for Debuggable Systems
Why structured logging (key-value fields, not free text) matters for debugging at scale, and practical conventions worth adopting.
The USE Method: Utilization, Saturation, Errors for Resource Monitoring
How Brendan Gregg's USE method systematically checks system resources for performance bottlenecks, and how it pairs with the RED method.
APM Tool Comparison: Datadog, Dynatrace, and New Relic
A practical comparison of how Datadog, Dynatrace, and New Relic approach instrumentation, AI-assisted root-cause analysis, and pricing.
Building SLO Dashboards That Drive Real Decisions
How to design an SLO dashboard that actually informs the ship/freeze decisions error budgets are meant to enable, not just display pretty graphs.
Little's Law for Performance Engineers, with Worked Examples
An intuitive explanation of Little's Law (L = λW), how to derive concurrency, throughput, or latency from the other two, and common misuses.
Amdahl's Law for Performance Engineers
How Amdahl's Law quantifies the maximum speedup parallelization can achieve when part of a workload is inherently serial, with practical examples.
Queueing Theory Basics for Performance Engineers
An accessible introduction to queueing theory concepts — utilization, queue length, and waiting time — and why systems get dramatically slower near full utilization.
Why p99 Matters: Understanding Latency Percentiles
What latency percentiles actually mean, why averages systematically mislead, and the pitfalls of averaging or combining percentiles incorrectly.
Concurrency vs Parallelism: A Clear Distinction
The genuine technical distinction between concurrency and parallelism, why it matters for performance reasoning, and common confusions.
Garbage Collection Tuning Fundamentals
The core concepts behind garbage collector tuning — generational collection, pause times, and throughput trade-offs — applicable across JVM, .NET, and Go.
Throughput vs Latency: Why You Usually Can't Maximize Both
Why throughput and latency often trade off against each other through batching, and how to decide where to sit on that trade-off curve.
Setting Performance Budgets for Web Applications
How to set practical performance budgets (page weight, load time, Core Web Vitals) and enforce them in CI before they regress in production.
Spike, Stress, and Soak Testing: Three Different Questions
How spike testing, stress testing, and soak testing each answer a different reliability question, and why a single load test can't cover all three.
Synthetic Monitoring vs Real User Monitoring (RUM)
How synthetic monitoring and real user monitoring complement each other for understanding production performance, and when to rely on each.
How to Write a Performance Test Plan That Answers a Real Question
A practical template for a performance test plan that starts from a specific question, not a generic checklist of tools and metrics.
A Pre-Launch Performance Testing Checklist
A practical checklist to run through before considering a performance testing effort complete and ready to inform a launch decision.
Top Performance Testing Mistakes (and How to Avoid Them)
A roundup of the most common, costly performance testing mistakes across tools and teams, distilled into a practical avoidance guide.
Understanding Apdex: Translating Latency into User Satisfaction
What the Apdex score actually measures, how to set its thresholds meaningfully, and its limitations as a single summary metric.
How to Calculate an Error Budget, Step by Step
A step-by-step walkthrough of calculating an error budget from an SLO, with worked examples at different reliability targets.
What is DevPerfOps? Performance as a First-Class Citizen
DevPerfOps extends DevOps by embedding performance engineering across the entire delivery pipeline — shifting it left from a pre-release gate to a continuous, shared responsibility.