Microsoft announced three new first‑party MAI models this week, covering transcription, voice generation, and image creation. All three are available through Microsoft Foundry and the MAI Playground.
The transcription model (MAI‑Transcribe‑1) focuses on accuracy across a broad set of languages while running faster and cheaper than the usual options. The voice model (MAI‑Voice‑1) generates natural speech from very small samples, producing a full minute of audio in about a second with unusually efficient GPU use. If you want to check it out, try it in Copilot Audio Expressions.
MAI‑Image‑2 also improves image generation speed across Copilot and Foundry, delivering roughly twice the performance while keeping quality in line with previous models. Just ask Copilot (web or Windows) to generate an image and it will use MAI‑Image‑2 where available.
Microsoft is also pricing these models well below the usual market rates. Transcription at thirty‑six cents per hour is roughly a 40 to 60 percent savings compared to the typical dollar‑per‑hour services. Voice generation at twenty‑two dollars per million characters comes in at about half the cost of most high‑quality TTS models. Image output at thirty‑three dollars per million tokens is often 70 percent cheaper than comparable offerings from the major providers. The MAI lineup is clearly positioned as the lower‑cost option.
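To make those comparisons concrete, here is a minimal sketch of the savings math. The MAI prices come from the figures above; the competitor baselines (a $0.60–$0.90 per hour transcription range and a $44 per million characters TTS price) are hypothetical round numbers chosen to illustrate the stated percentages, not quotes from any provider.

```python
def savings_pct(mai_price: float, competitor_price: float) -> float:
    """Percent saved by choosing the MAI price over a competitor price."""
    return round((competitor_price - mai_price) / competitor_price * 100, 1)

# Transcription: $0.36/hour vs. an assumed $0.60-$0.90/hour competitor range
print(savings_pct(0.36, 0.60))  # 40.0
print(savings_pct(0.36, 0.90))  # 60.0

# Voice: $22 per million characters vs. an assumed $44 baseline
print(savings_pct(22, 44))      # 50.0
```

The same function applies to the image figure: a 70 percent saving at $33 per million tokens implies a comparable offering priced around $110 per million tokens.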
What stands out is not any single capability, but the shift in direction. Microsoft is building more of its own stack rather than betting everything on OpenAI. That shift, I assume, has deeper implications for cost, direction, and long‑term strategy. Even more significantly, each model was built by a small team of about ten people and tuned for efficiency, which seems to be the through‑line of this entire effort, suggesting that high‑quality models no longer require massive research groups.
As a note, I do work at Microsoft, but I am not part of the team that develops these models.
