💸The answer is the expensive part: controlling output tokens

💸La respuesta es lo que más cuesta: controla el token de salida

En mi primer post miré una mitad de la moneda: lo que le mandamos a Copilot. Cómo adelgazar el contexto, cerrar hilos a tiempo, no pegar archivos enteros. Todo eso sigue valiendo y es por donde yo empezaría a limpiar. Pero, cuando te pones a mirar los números con calma, resulta que el grueso del gasto no está tanto en lo que tú envías como en lo que el modelo te responde (la otra mitad de la moneda). Y eso también cambia un poco la forma de pedir las cosas.

¿Por qué la respuesta cuesta más que la pregunta? No es un capricho de tarifa, hay una razón técnica. Nuestra pregunta, el modelo la lee entera de una sola pasada. Pero la respuesta no puede leerla: tiene que escribirla «palabra» a «palabra», y cada «palabra» le obliga a recorrer toda su red neuronal otra vez. Leer 1.000 tokens es una pasada; escribir 1.000 son por decirlo rápido y fácil 1.000 pasadas. Ahí está la diferencia de precio. Input vs Output vs Reasoning Tokens Cost – LLM Pricing Explained – DEV Community. Eso se traduce directo en el precio de los modelos que usa Copilot. Modelos y precios para GitHub Copilot – Documentación de GitHub

Modelo	Entrada ($/1M tokens)	Salida ($/1M tokens)	Ratio
GPT-5.5	$5,00	$30,00	6× más cara la salida
GPT-5.4	$2,50	$15,00	6× más cara la salida
GPT-5.4 mini	$0,75	$4,50	6× más cara la salida

Cada token que el modelo escribe nos cuesta unas 6 veces lo que cada token que mandamos. Si en la entrada el ahorro venía de quitar peso muerto, aquí el multiplicador juega mucho más fuerte.

Y hay una parte que se paga aunque no se ve: los reasoning tokens. Los modelos con razonamiento avanzado no van directo a la respuesta, antes generan un monólogo interno donde planifican, se autocorrigen y descartan caminos. Ese razonamiento no aparece en pantalla, pero se factura igual que la salida visible. Así puede quedar una petición que «parece» corta:

			
Lo que ves en pantalla:        500 tokens  ← lo visible
Razonamiento interno:        2.500 tokens  ← invisible, pero facturado como salida
──────────────────────────────────────────
Total facturado como salida: 3.000 tokens

Con esa idea en la cabeza, en este post te muestro dónde se esconde el gasto de salida en el día a día y qué hago yo para tenerlo a raya.

(Hábito) Copilot reescribiendo objetos enteros. Pasa a menudo: pides «añade validación a este procedure» y el Agente reescribe el codeunit completo en vez de tocar solo lo necesario. Lo que hago: ser quirúrgico en el prompt. «Modifica solo el procedure PostSalesLine: comprueba que Customer."No." no esté vacío antes del exit» genera cuatro líneas, no cientos.

Acción	Tokens de salida aprox.
Fichero de 200 líneas reescrito completo	~1.500 tokens
Solo las 10 líneas que cambian	~80 tokens

(Hábito) El modo Agente sin objetivo claro. El Agente es lo más potente de Copilot y también lo que más dispara el gasto si lo sueltas sin rumbo. La propia documentación lo define así: «In agent mode, Copilot determines which files to make changes to, offers code changes and terminal commands to complete the task, and iterates to remediate issues until the original task is complete». Esa palabra, iterates, es la clave: cada vuelta reenvía el contexto acumulado que crece y genera nueva salida. El coste no es lineal, sube con cada paso. Lo que hago: no entro al Agent de primeras. Si solo necesito entender algo o decidir, me quedo en Ask, que ni toca código. La regla es sencilla: Ask → Plan → Agent, y subes de escalón solo cuando el anterior se te queda corto, nunca al revés. Y si el Agente se va por las ramas, lo corto ya que un agente dando vueltas también consume tokens.

Pasos del agente	Coste relativo aproximado
1 paso	1×
3 pasos	~3,5×
6 pasos	~9×
10 pasos	~20×

(Hábito) Explicaciones que no pediste. El modelo tiende a contarte qué hizo, por qué, y qué alternativas valoró. Si ya sabías lo que pediste, eso es salida pura sin valor para ti. Unas pocas palabras en el prompt te ahorran cientos de tokens de vuelta:

En lugar de…	Escribe…
«Refactoriza esta función»	«Refactoriza esta función. Solo el código, sin explicaciones.»
«Corrige el bug»	«Corrige el bug. Dame solo el bloque corregido.»
«Genera los tests»	«Genera los tests. Sin comentarios inline.»

(Hábito) El esfuerzo de razonamiento mal calibrado. Esfuerzo alto en una tarea simple genera reasoning tokens que no necesitas (invisibles, pero facturados). Esfuerzo bajo en una difícil te da una respuesta floja y acabas pagando varias iteraciones. La idea es emparejarlo con la tarea:

Tipo de tarea	Esfuerzo recomendado
Explicar un bloque de código	Bajo
Buscar un bug concreto y conocido	Bajo–Medio
Diseño de arquitectura o comparar opciones	Medio–Alto
Bug difícil, algoritmo complejo	Alto

(Instrucción) Fija el formato y el tope de la salida. «Solo el diff», «máximo 15 líneas», «una tabla, sin prosa», «solo el bloque corregido». Acotar la forma de la respuesta pone un techo directo a lo más caro, y de paso el modelo no se va por las ramas. Es la diferencia entre dejar la salida abierta y decirle exactamente cuánto y cómo quieres que escriba.

(Instrucción) Indicar qué dejar fuera. Por defecto, el modelo rellena: explica lo que hizo, ofrece alternativas que nadie pidió, resume al final y lo más caro: reimprime código que no ha cambiado. Una frase corta apaga esos automatismos: «no expliques, sin alternativas, no reimprimas lo que no toques». Es un truco que ahorra salida, porque ataca los hábitos por defecto del modelo.

(Instrucción) Una petición, un objetivo. Encadenar «haz A, arregla B y de paso documenta C» en un mismo turno multiplica la salida: genera las tres cosas y arrastra el contexto de cada paso al siguiente. Pide una, la revisas, sigues. Por ejemplo, en vez de «crea el endpoint de login, añade sus tests, documéntalo en el README y revisa si el de registro tiene el mismo bug» (cuatro entregables largos, y si uno falla rehaces el bloque entero), mejor ve por pasos: «crea solo el endpoint de login», lo revisas, y luego «ahora los tests de ese endpoint». Cada respuesta es corta, la controlas, y no pagas por lo que aún no has validado.

(Instrucción) Dar el molde, no dejar que lo invente. Si lo que quieres se parece a algo que ya existe, dilo: «igual que el procedure InsertCustomer que ya tienes», «mismo formato que esta página». Así el modelo replica un patrón en vez de inventar uno desde cero. Ahorra salida y, sobre todo, evita el re-ask por divergencia: el coste que no ves venir.

(Instrucción) Pedir que pregunte antes de generar. Una línea «si te falta contexto, pregúntame antes de escribir código» y te ahorras el patrón más caro de todos: que suelte una respuesta larga y equivocada que acabas descartando. Eso es pagar dos rondas. Mejor una pregunta de entrada (barata) que una generación entera de salida (cara) que no sirve.

(Instrucción · una sola vez) Pon una instrucción permanente de concisión. Puesto una sola vez, cada turno arranca ya pidiendo poco, preguntando cuando duda en lugar de inventar, y frenando antes de un cambio masivo.

			
## Verbosity
- Respuestas cortas y directas por defecto; amplía solo si lo pido.
## Comunicación
- No adivines ni inventes: si dudas, pregunta. Una pregunta de más
  es mejor que una respuesta segura y equivocada.
## Seguridad
- Antes de tocar 10 ficheros o más, pide confirmación.

		

Así que la regla no es «minimiza la salida». Es: «empareja la verbosidad con el valor de la tarea». Una respuesta corta en una tarea simple es ahorro real. Una respuesta corta en una tarea difícil es economía falsa, porque la pagas dos veces.

Si recortas…	El riesgo es…	Síntoma de que recortaste de más
Longitud de respuesta	Respuesta incompleta → re-ask → pagas dos rondas	Tienes que pedir «continúa» o «¿y qué pasa con X?»
Diffs en vez de fichero completo	El modelo aplica el diff mal → depuras → sale más caro	Más tiempo arreglando que si lo hubiera reescrito
Esfuerzo de razonamiento	Respuesta superficial → retrabajo	No funciona al primer intento
Explicaciones	Entiendes menos → consultas extra más adelante	«¿Por qué lo hiciste así?» al turno siguiente

Si juntamos todo, los trucos del token de salida caen en dos grupos: los de hábito (lo que dejamos de hacer) y los de instrucción (lo que añadimos antes de enviar). Estos últimos son los que de verdad están en nuestras manos en cada prompt:

Truco	Tipo	Ahorro potencial	Cuándo aplica	Riesgo si abusas
Pide el cambio concreto, no el fichero	Hábito	Alto (~19×)	Cambios puntuales en código existente	Diff mal aplicado → retrabajo
Elige el modo más barato que resuelva (Ask → Plan → Agent)	Hábito	Muy alto	Siempre	Subir de modo tarde te cuesta una ronda
Corta el agente si da vueltas	Hábito	Variable	Agent con objetivo vago	—
Fija el formato, el tope y qué omitir (solo el diff, sin prosa, sin alternativas)	Instrucción	Medio–Alto	Cuando sabes qué quieres	Perder contexto útil
Una petición, un objetivo	Instrucción	Medio	Tareas que tiendes a encadenar.	Separar lo muy acoplado → idas y vueltas.
Dar el molde, no dejar que lo invente	Instrucción	Medio–Alto	Cuando ya existe un patrón parecido.	El ejemplo ocupa entrada (barata) — buen trato.
Pide que pregunte antes de generar si le falta contexto	Instrucción	Alto	Tareas ambiguas o con supuestos.	Un pequeño ida y vuelta extra.
Instrucción permanente de concisión (se pone una vez)	Instrucción	Alto y recurrente	Siempre — aplica en cada turno.	Demasiado seca para cuando sí quieres razonamiento.
Calibra el esfuerzo de razonamiento	Configuración	Medio	Modelos con razonamiento activado.	Respuesta superficial en tareas difíciles.

Hay algunos trucos más finos para apurar, varios de ellos apuntan justo a lo que no se ve: el razonamiento.

Lo que le pides	Qué ahorra	Qué puedes perder (el riesgo)
«No uses librerías externas, mantén el estilo del fichero» — acotar el terreno	Razonamiento gastado en caminos que el modelo iba a descartar igual, y salida que no encajaba.	Casi nada, salvo que te pases y bloquees una solución mejor.
«Soy senior, salta lo básico» — decir tu nivel	Párrafos de contexto que el lector ya domina.	Casi nada, salvo sobreestimarse y perder una pieza que no se conocía.
«Con que compile o pase este test, basta» — marcar cuándo está hecho	El razonamiento de explorar casos límite y la salida de cubrirlos.	Cobertura: algún caso raro que queda fuera de ese «hecho»
«Asume valores razonables y dímelos en una línea» — asumir y seguir	Una ronda entera de ida y vuelta.	Si el supuesto era erróneo, toca rehacer, mejor solo en tareas de bajo riesgo.
«Elige la mejor opción y aplícala, sin darme un menú» — decidir por ti	La salida de enumerar alternativas y el razonamiento de compararlas.	Visibilidad: pierdes ver opciones que quizá habrías vetado.

Controlar la entrada te da margen; controlar la salida te da el grueso del ahorro. Mira cómo tienes el Agent, cómo pides los cambios y cuánto razonamiento le exiges a cada tarea.

💸The answer is the expensive part: controlling output tokens

In my first post I looked at one side of the coin: what we send Copilot. How to slim down the context, close threads in time, and avoid pasting whole files. All of that still holds, and it’s where I’d start cleaning up. But when you sit down and look at the numbers calmly, it turns out the bulk of the spend isn’t so much in what you send as in what the model sends back (the other side of the coin). And that changes, a little, how you ask for things.

Why does the answer cost more than the question? It’s not a pricing whim; there’s a technical reason. The model reads our question whole, in a single pass. But it can’t read the answer: it has to write it «word» by «word» (token by token), and each one forces it to run through its entire neural network again. Reading 1,000 tokens is one pass; writing 1,000 is, to put it quickly and simply, 1,000 passes. That’s where the price gap comes from (see «Input vs Output vs Reasoning Tokens Cost – LLM Pricing Explained», DEV Community). And it maps straight onto the prices of the models Copilot uses (see «Modelos y precios para GitHub Copilot», Documentación de GitHub):

Model	Input ($/1M tokens)	Output ($/1M tokens)	Ratio
GPT-5.5	$5.00	$30.00	6× more expensive output
GPT-5.4	$2.50	$15.00	6× more expensive output
GPT-5.4 mini	$0.75	$4.50	6× more expensive output

Each token the model writes costs us about 6× what each token we send does. On the input side the saving came from removing dead weight; on the output side, every token you trim is worth about six times as much.
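To make the 6× ratio concrete, here is a minimal sketch that prices a single request using the figures from the table above. The request size (4,000 tokens in, 1,000 tokens out) and the function name `request_cost` are hypothetical, chosen only for illustration.

```python
# Prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "GPT-5.4": (2.50, 15.00),
    "GPT-5.4 mini": (0.75, 4.50),
}


def request_cost(model, input_tokens, output_tokens):
    """Return (input_cost, output_cost) in USD for one request."""
    price_in, price_out = PRICES[model]
    return (
        input_tokens * price_in / 1_000_000,
        output_tokens * price_out / 1_000_000,
    )


# Hypothetical request: 4,000 tokens sent, 1,000 tokens written back.
cost_in, cost_out = request_cost("GPT-5.4", 4_000, 1_000)
print(f"input: ${cost_in:.4f}  output: ${cost_out:.4f}")
```

Even with four times fewer tokens, the output side of this hypothetical request costs more than the input side.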

And there’s a part you pay for even though you don’t see it: the reasoning tokens. Models with advanced reasoning don’t go straight to the answer; first they generate an internal monologue where they plan, self-correct and discard paths. That reasoning doesn’t show up on screen, but it’s billed just like the visible output. So a request that «looks» short can end up like this:

			
What you see on screen:        500 tokens  ← the visible part
Internal reasoning:          2,500 tokens  ← invisible, but billed as output
──────────────────────────────────────────
Total billed as output:      3,000 tokens
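The same arithmetic in code, as a small sketch: it adds the invisible reasoning tokens to the visible ones and prices the total at the GPT-5.4 output rate from the table above. The choice of model and the function name `billed_output` are assumptions made for the example.

```python
OUTPUT_PRICE_PER_M = 15.00  # USD per 1M output tokens (GPT-5.4, from the table)


def billed_output(visible_tokens, reasoning_tokens):
    """Reasoning tokens are billed as output, so they simply add up."""
    return visible_tokens + reasoning_tokens


total = billed_output(visible_tokens=500, reasoning_tokens=2_500)
cost = total * OUTPUT_PRICE_PER_M / 1_000_000
visible_only_cost = 500 * OUTPUT_PRICE_PER_M / 1_000_000

print(f"billed output tokens: {total}")
print(f"cost: ${cost:.4f} (visible part alone: ${visible_only_cost:.4f})")
```

In this example the bill is six times what the visible 500 tokens alone would suggest.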

With that idea in mind, in this post I’ll show you where the output spend hides day to day and what I do to keep it in check.

(Habit) Copilot rewriting whole objects. It happens often: you ask «add validation to this procedure» and the Agent rewrites the entire codeunit instead of touching only what’s needed. What I do: be surgical in the prompt. «Modify only the PostSalesLine procedure: check that Customer."No." isn’t empty before the exit» generates four lines, not hundreds.

Action	Approx. output tokens
200-line file rewritten in full	~1,500 tokens
Only the 10 lines that change	~80 tokens
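A rough way to see the size gap is to compare a whole file against a unified diff of a one-line change. The sketch below uses Python’s standard `difflib`; the 200-line file is synthetic, and character counts are only a crude stand-in for tokens (both are assumptions for illustration, not measurements of Copilot’s output).

```python
import difflib

# Synthetic 200-line file; only line 100 changes.
original = [f"line {i}\n" for i in range(1, 201)]
modified = list(original)
modified[99] = "line 100 with validation\n"

# Unified diff with the default 3 lines of context around the change.
diff = list(difflib.unified_diff(original, modified, "before", "after"))

full_chars = sum(len(line) for line in modified)
diff_chars = sum(len(line) for line in diff)

print(f"full file: {len(modified)} lines, {full_chars} chars")
print(f"diff only: {len(diff)} lines, {diff_chars} chars")
```

The diff carries the same change in a small fraction of the text, which is the effect the «modify only this procedure» prompt is after.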

(Habit) Agent mode with no clear goal. The Agent is Copilot’s most powerful tool, and also the one that runs up the bill fastest if you set it loose with no direction. The docs themselves define it like this: «In agent mode, Copilot determines which files to make changes to, offers code changes and terminal commands to complete the task, and iterates to remediate issues until the original task is complete». That word, iterates, is the key: each turn resends the accumulated context, which keeps growing, and generates new output. So the cost grows faster than linearly: each step costs more than the one before. What I do: I don’t jump into the Agent first. If I only need to understand something or decide, I stay in Ask, which doesn’t even touch code. The rule is simple: Ask → Plan → Agent, and you move up a rung only when the previous one falls short, never the other way around. And if the Agent starts wandering, I cut it off, since an agent going in circles burns tokens too.

Agent steps	Approx. relative cost
1 step	1×
3 steps	~3.5×
6 steps	~9×
10 steps	~20×
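One simple way to get numbers of this shape is to assume that every step costs a bit more than the previous one, because it resends everything accumulated so far. The sketch below is a toy model under that assumption, not the method behind the table: the function name and the growth factor of 0.2 per step are hypothetical, picked only because they land close to the figures above.

```python
def relative_agent_cost(steps, context_growth=0.2):
    """Total cost of an agent run, relative to a single step.

    Assumption: step k costs 1 + context_growth * (k - 1), since it
    resends the context accumulated by the k - 1 steps before it.
    """
    return sum(1 + context_growth * (k - 1) for k in range(1, steps + 1))


for steps in (1, 3, 6, 10):
    print(f"{steps:>2} steps -> {relative_agent_cost(steps):.1f}x")
# Prints 1.0x, 3.6x, 9.0x and 19.0x
```

The total grows with the square of the number of steps, which is why cutting a wandering agent early saves more than its step count suggests.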

(Habit) Explanations you didn’t ask for. The model tends to tell you what it did, why, and what alternatives it weighed. If you already knew what you asked for, that’s pure output with no value to you. A few words in the prompt save you hundreds of tokens in return:

Instead of…	Write…
«Refactor this function»	«Refactor this function. Code only, no explanations.»
«Fix the bug»	«Fix the bug. Just give me the corrected block.»
«Generate the tests»	«Generate the tests. No inline comments.»

(Habit) Miscalibrated reasoning effort. High effort on a simple task generates reasoning tokens you don’t need (invisible, but billed). Low effort on a hard one gives you a weak answer and you end up paying for several iterations. The idea is to match it to the task:

Task type	Recommended effort
Explain a block of code	Low
Find a specific, known bug	Low–Medium
Architecture design or comparing options	Medium–High
Tough bug, complex algorithm	High

(Instruction) Set the output’s format and cap. «Diff only», «max 15 lines», «a table, no prose», «just the corrected block». Bounding the shape of the answer puts a direct ceiling on the most expensive part, and as a bonus the model doesn’t wander off. It’s the difference between leaving the output open and telling it exactly how much to write and in what form.

(Instruction) State what to leave out. By default, the model pads: it explains what it did, offers alternatives nobody asked for, sums up at the end and, most expensive of all, reprints code that hasn’t changed. A short sentence shuts off those reflexes: «no explanations, no alternatives, don’t reprint what you don’t touch». It saves output because it targets the model’s default habits directly.
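If you find yourself typing the same constraints again and again, a tiny helper can append them for you. This is only a sketch: `constrained_prompt` and its parameters are hypothetical names, and the wording of the appended sentences is one possible phrasing, not anything Copilot requires.

```python
def constrained_prompt(task, output_format=None, max_lines=None, omit=()):
    """Append format, cap and omission constraints to a task description."""
    parts = [task.rstrip(".") + "."]
    if output_format:
        parts.append(f"Output: {output_format}.")
    if max_lines is not None:
        parts.append(f"Max {max_lines} lines.")
    if omit:
        parts.append("Do not include: " + ", ".join(omit) + ".")
    return " ".join(parts)


prompt = constrained_prompt(
    "Fix the bug in PostSalesLine",
    output_format="diff only",
    max_lines=15,
    omit=("explanations", "alternatives", "unchanged code"),
)
print(prompt)
```

The extra sentences cost a few input tokens, which is the cheap side of the bill.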

(Instruction) One request, one objective. Chaining «do A, fix B and document C while you’re at it» in a single turn multiplies the output: it generates all three and drags each step’s context into the next. Ask for one, review it, move on. For example, instead of «create the login endpoint, add its tests, document it in the README and check whether the registration one has the same bug» (four long deliverables, and if one goes wrong you redo the whole block), go step by step: «create only the login endpoint», review it, then «now the tests for that endpoint». Each answer is short, you stay in control, and you don’t pay for what you haven’t validated yet.

(Instruction) Give it the mold, don’t let it invent one. If what you want resembles something that already exists, say so: «same as the InsertCustomer procedure you already have», «same format as this page». That way the model replicates a pattern instead of inventing one from scratch. It saves output and, above all, it avoids having to ask again because the result diverged from what you wanted: the cost you don’t see coming.

(Instruction) Ask it to ask before generating. One line, «if you’re missing context, ask me before writing code», and you spare yourself the most expensive pattern of all: the model pours out a long, wrong answer that you end up discarding. That’s paying for two rounds. Better one question up front (cheap) than a whole output generation (expensive) that turns out to be useless.

(Instruction · once and for all) Set a permanent conciseness instruction. Once it’s in place, every turn starts with the model already keeping answers short, asking when in doubt instead of inventing, and pausing before a massive change.

			
## Verbosity
- Keep responses short and direct by default.
- Expand only when explicitly asked or when the topic genuinely requires it.
## Communication Style
- Never guess, infer, or hallucinate. If unsure, ask — one extra
  question is always better than a confidently wrong answer.
## Safety
- Before changes affecting 10+ files, ask for confirmation first.

		

So the rule isn’t «minimize output». It’s: «match the verbosity to the value of the task». A short answer on a simple task is real savings. A short answer on a hard task is false economy, because you pay for it twice.

If you cut…	The risk is…	Sign you’ve cut too much
Response length	Incomplete answer → re-ask → you pay two rounds	You have to type «continue» or «what about X?»
Diffs instead of the full file	The model applies the diff wrong → you debug → it gets pricier	More time fixing than if it had rewritten it
Reasoning effort	Shallow answer → rework	It doesn’t work on the first try
Explanations	You understand less → extra questions later	«Why did you do it this way?» the next turn
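The «you pay twice» effect can be put into numbers with a simple expected-cost estimate: the tokens of the first answer plus the chance of a re-ask times the tokens of the second round. Every figure below (answer sizes and re-ask probabilities) is hypothetical and only illustrates the trade-off.

```python
def expected_output_tokens(answer_tokens, p_reask, redo_tokens):
    """First answer plus the expected cost of having to ask again."""
    return answer_tokens + p_reask * redo_tokens


# Hypothetical simple task: a terse answer rarely needs a second round.
simple_terse = expected_output_tokens(300, p_reask=0.05, redo_tokens=800)
simple_full = expected_output_tokens(800, p_reask=0.05, redo_tokens=800)

# Hypothetical hard task: a terse answer usually forces a re-ask.
hard_terse = expected_output_tokens(300, p_reask=0.80, redo_tokens=800)
hard_full = expected_output_tokens(800, p_reask=0.05, redo_tokens=800)

print(f"simple task: terse {simple_terse:.0f} vs full {simple_full:.0f}")
print(f"hard task:   terse {hard_terse:.0f} vs full {hard_full:.0f}")
```

With these assumed numbers, the terse answer wins on the simple task and loses on the hard one, which is the rule in a nutshell.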

Putting it all together, the output-token tricks fall into two main groups: habits (what we stop doing) and instructions (what we add before sending), plus one configuration setting. The instructions are the ones truly in our hands on every prompt:

Trick	Type	Potential saving	When it applies	Risk if you overdo it
Ask for the specific change, not the file	Habit	High (~19×)	Pinpoint changes in existing code	Diff applied wrong → rework
Pick the cheapest mode that works (Ask → Plan → Agent)	Habit	Very high	Always	Moving up too late costs you a round
Cut the agent off if it wanders	Habit	Variable	Agent with a vague goal	—
Set the format, the cap and what to omit (diff only, no prose, no alternatives)	Instruction	Medium–High	When you know what you want	Losing useful context
One request, one objective	Instruction	Medium	Tasks you tend to chain	Splitting tightly-coupled work → back-and-forth
Give it a mold, don’t let it invent one	Instruction	Medium–High	When a similar pattern already exists	The example uses input (cheap) — a good trade
Ask it to ask before generating if it’s missing context	Instruction	High	Ambiguous tasks or ones with assumptions	A small extra back-and-forth
Permanent conciseness instruction (set once)	Instruction	High and recurring	Always — applies every turn	Too terse for when you do want reasoning
Calibrate reasoning effort	Configuration	Medium	Models with reasoning on	Shallow answer on hard tasks

There are a few finer tricks for squeezing out a bit more, several aimed right at what you can’t see: the reasoning.

What you ask for	What it saves	What you might lose (the risk)
«No external libraries, keep the file’s style» — bound the field	Reasoning spent on paths the model was going to discard anyway, and output that didn’t fit	Almost nothing, unless you over-constrain and block a better solution
«I’m senior, skip the basics» — state your level	Paragraphs of context the reader already knows	Almost nothing, unless you overestimate yourself and miss a piece you didn’t know
«As long as it compiles or passes this test, that’s enough» — mark when it’s done	The reasoning of exploring edge cases and the output of covering them	Coverage: some rare case that falls outside that «done»
«Assume reasonable values and tell me in one line» — assume and go	A whole round of back-and-forth	If the assumption was wrong, you redo it — best only for low-risk tasks
«Pick the best option and apply it, don’t give me a menu» — decide for me	The output of listing alternatives and the reasoning of comparing them	Visibility: you lose sight of options you might have vetoed

Controlling input gives you margin; controlling output gives you the bulk of the savings. Look at how you’ve got the Agent set up, how you ask for changes, and how much reasoning you demand of each task.

Más información / More information:

Gerardo Rentería Blog
