
New AI inference models available now on Gcore

  • November 17, 2025
  • 2 min read

We’ve expanded our Application Catalog with a new set of high-performance models across embeddings, text-to-speech, multimodal LLMs, and safety. All models are live today via Everywhere Inference and Everywhere AI, and are ready to deploy in just 3 clicks with no infrastructure management and no setup overhead.

This update brings stronger retrieval accuracy, more expressive voice generation, real-time audio-native LLMs, and enterprise-grade safety controls. Whether you’re building search pipelines, conversational agents, IVR systems, or production-scale AI applications, these additions give you more flexibility to optimize for quality, latency, and cost.

Text embeddings (5 new models)

High-quality embeddings are the backbone of any AI that needs to find, rank, or understand information, including RAG, semantic search, personalization, recommendations, and clustering. This new set of embedding models improves retrieval precision, cross-lingual reach, and overall RAG quality; a minimal usage sketch follows the model list below.

  • Alibaba-NLP/gte-Qwen2-7B-instruct: High-quality instruction-tuned embeddings for retrieval, reranking, and semantic search across broad domains. Ideal for RAG pipelines that need strong generalization.
  • BAAI/bge-m3: Multilingual, multi-function embeddings built for search, clustering, and cross-lingual retrieval. A great fit for global applications and multi-language knowledge bases.
  • intfloat/e5-mistral-7b-instruct: E5-style instruction-following embeddings optimized for retrieval tasks, question-answer matching, and ranking. Strong performance on RAG evaluation benchmarks.
  • Qwen/Qwen3-Embedding-4B: A cost-efficient, versatile embedding model delivering balanced performance for large-scale retrieval workloads.
  • Qwen/Qwen3-Embedding-8B: A higher-capacity sibling offering premium embedding quality for challenging retrieval, reranking, and high-accuracy semantic search.
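
If your deployment exposes an OpenAI-compatible endpoint, an embedding call can look like the minimal sketch below. The base_url and API key are placeholders, and OpenAI compatibility is an assumption to verify against the endpoint details shown for your deployment in the Gcore Customer Portal.

    # Minimal sketch, assuming the deployed model exposes an
    # OpenAI-compatible /v1/embeddings endpoint. base_url and api_key
    # are placeholders; copy the real values from your deployment.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://your-deployment.example/v1",  # placeholder
        api_key="YOUR_API_KEY",                         # placeholder
    )

    docs = [
        "Everywhere Inference deploys models close to your users.",
        "Embeddings power RAG, semantic search, and recommendations.",
    ]
    resp = client.embeddings.create(model="BAAI/bge-m3", input=docs)
    vectors = [item.embedding for item in resp.data]
    print(len(vectors), len(vectors[0]))  # document count, embedding dimension

From here, the vectors go into whatever vector store your RAG stack uses; similarity between query and document vectors drives retrieval.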

Text-to-speech (2 new models)

Voice is becoming a first-class interface. These new TTS models make agents sound more natural, reduce robotic cadence, and improve clarity, especially in high-volume workflows like support, IVR, media generation, and automation. A synthesis sketch follows the list below.

  • microsoft/VibeVoice-1.5B: Neural TTS with natural prosody, expressive cadence, and fast synthesis, built for interactive applications where latency matters.
  • ResembleAI/chatterbox: Production-ready TTS capable of expressive, characterful speech. Ideal for agents, IVR, content workflows, and automated voice experiences.
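
As a rough sketch, a synthesis request against an OpenAI-compatible /v1/audio/speech endpoint might look like the following. The endpoint shape, voice name, and credentials are all assumptions; check your deployment's details in the portal before relying on them.

    # Minimal sketch, assuming an OpenAI-compatible text-to-speech
    # endpoint. The voice id is hypothetical; list the voices your
    # deployment actually supports before picking one.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://your-deployment.example/v1",  # placeholder
        api_key="YOUR_API_KEY",                         # placeholder
    )

    resp = client.audio.speech.create(
        model="ResembleAI/chatterbox",
        voice="default",  # hypothetical voice id
        input="Your order has shipped and should arrive on Thursday.",
    )
    resp.write_to_file("reply.mp3")  # save the synthesized audio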

Text + audio LLMs (2 new models)

These new multimodal LLMs accept both text and audio input, enabling real-time voice agents, transcription intelligence, and interactive multimodal applications. They collapse the separate ASR → LLM stages of a traditional voice pipeline into a single call; a request sketch follows the list below.

  • mistralai/Voxtral-Mini-3B-2507: A lightweight speech-and-text LLM for real-time voice agents. Accepts both text and audio inputs and is optimized for low-latency scenarios.
  • mistralai/Voxtral-Small-24B: A mid-size Voxtral variant offering higher-quality multimodal reasoning and richer conversational speech. Suitable for advanced voice assistants, transcription workflows, and audio-aware applications.
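
A single multimodal call replaces the ASR → LLM hop. The sketch below follows the OpenAI-style input_audio content-part convention; the exact request schema depends on how the model is served, so treat it as an assumption to verify, along with the placeholder base_url and key.

    # Minimal sketch: sending text plus audio to a Voxtral deployment in
    # one chat-completion call. Assumes the endpoint accepts OpenAI-style
    # "input_audio" content parts; base_url and api_key are placeholders.
    import base64

    from openai import OpenAI

    client = OpenAI(
        base_url="https://your-deployment.example/v1",  # placeholder
        api_key="YOUR_API_KEY",                         # placeholder
    )

    with open("caller_question.wav", "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="mistralai/Voxtral-Mini-3B-2507",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Summarize the caller's request in one sentence."},
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)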

Safety models (3 new models)

As enterprises deploy AI into production, safety is non-negotiable. These models provide classification, risk detection, and output transformation to help organizations stay compliant; a screening sketch follows the list below.

  • openai/gpt-oss-safeguard-120b: A high-capacity safety model supporting policy classification, risk detection, and output guidance. Built for enterprise-grade moderation systems.
  • openai/gpt-oss-safeguard-20b: A lighter, faster safeguard variant designed to power low-latency moderation pipelines without sacrificing accuracy.
  • Qwen/Qwen3Guard-Gen-8B: A guardrail model specialized in detecting unsafe content and transforming or steering outputs toward compliant responses.
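
The gpt-oss-safeguard models are designed to classify content against a written policy you supply. The sketch below passes a policy as the system message and the content to screen as the user message; the policy wording and verdict handling are illustrative assumptions, not a fixed format, and the endpoint details are placeholders as above.

    # Minimal sketch: screening a user message against a custom policy
    # with a safeguard model. The policy text and response handling are
    # illustrative; adapt both to your moderation pipeline.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://your-deployment.example/v1",  # placeholder
        api_key="YOUR_API_KEY",                         # placeholder
    )

    policy = (
        "Classify the user message as ALLOW or FLAG. FLAG any request "
        "for instructions that facilitate harm; otherwise ALLOW."
    )
    user_message = "How do I reset my account password?"

    resp = client.chat.completions.create(
        model="openai/gpt-oss-safeguard-20b",
        messages=[
            {"role": "system", "content": policy},
            {"role": "user", "content": user_message},
        ],
    )
    print(resp.choices[0].message.content)  # e.g. "ALLOW" plus rationale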

Deploy the latest models in 3 clicks and 10 seconds

All models are available today via Gcore Everywhere AI and Gcore Everywhere Inference. Deploy publicly or privately, whichever fits your architecture.

You get:

  • Global low-latency routing
  • Predictable cost and usage visibility
  • Zero infrastructure management
  • Instant scaling to production workloads

Open the Gcore Customer Portal, choose a model, and deploy in just 3 clicks.

