/IN-fur-ens/
The process of running a trained AI model to generate output — every time ChatGPT responds, that's inference. It's what you pay for in API pricing.
Inference is the act of using a trained AI model to produce output. Training teaches the model; inference is the model applying what it learned. Every ChatGPT response, every Claude answer, every Midjourney image is an inference operation.
This matters because inference has real costs: compute time, GPU usage, and API charges. When you hear 'inference cost,' that's the per-query expense of running the model. When companies talk about 'inference speed,' they mean how fast the model generates responses. And when someone says 'run inference locally,' they mean executing the model on your own hardware instead of through a cloud API.
Understanding inference helps you make architectural decisions: should you call a large model once or a small model ten times? Should you cache common responses to reduce inference calls? Should you run a smaller model locally for privacy?
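One of those decisions, caching, can be sketched in a few lines. This is a minimal illustration, not a production pattern: `run_model` is a hypothetical stand-in for a real API or local model call, and the counter just makes the savings visible.

```python
from functools import lru_cache

# Hypothetical stand-in for a real model call (cloud API or local model).
# Each invocation represents one paid inference operation.
call_count = 0

def run_model(prompt: str) -> str:
    global call_count
    call_count += 1
    return f"response to: {prompt}"

@lru_cache(maxsize=1024)
def cached_inference(prompt: str) -> str:
    # Identical prompts are answered from the cache,
    # so repeated queries cost zero additional inference.
    return run_model(prompt)

cached_inference("What is inference?")
cached_inference("What is inference?")  # cache hit: no new inference call
```

Real systems use the same idea with persistent or semantic caches (matching similar prompts, not just identical ones), but the cost logic is unchanged: every cache hit is an inference call you don't pay for.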
When discussing AI costs, performance, deployment architecture, or the difference between training and using a model.
Inference cost is the operating expense of AI. Understanding it lets you design systems that are powerful AND affordable.
Infer = to deduce. The model infers your answer from its training — that process is inference.
A Mac app that coaches your AI vocabulary daily