This article presents a selection of AI models recommended for different types of tasks. The rankings are intended to help you identify which models are best suited to particular workloads.
| Topic | Benchmark | Description | Rank | Model | Provider in Globant Enterprise AI |
|---|---|---|---|---|---|
| Coding | SWE-Bench Verified | Evaluates LLMs on real-world software engineering tasks: bug fixing, code generation, and multi-file edits on GitHub repositories. Score = % of issues resolved autonomously. | 1 | claude-sonnet-4-5-20250929 | Anthropic, AWS Bedrock, Vertex AI |
| | | | 2 | claude-opus-4-5-20251101 | Anthropic, AWS Bedrock, Vertex AI |
| | | | 3 | gpt-5.2-2025-12-11 | OpenAI, Azure AI Foundry |
| | | | 4 | gpt-5.1 | OpenAI |
| | | | 5 | gemini-3-pro-preview | Google Vertex AI |
| Agentic | Agentic Benchmarks | Measures autonomous multi-step task execution, including tool use, planning, long-horizon reasoning, and sequential decision-making across complex workflows. | 1 | grok-4-1-fast-reasoning | xAI |
| | | | 2 | claude-opus-4-6 | Anthropic, AWS Bedrock, Vertex AI |
| | | | 3 | claude-sonnet-4-6 | Anthropic, AWS Bedrock, Vertex AI |
| | | | 4 | gemini-3.1-pro-preview | Google Vertex AI |
| | | | 5 | moonshotai-kimi-k2-thinking | OpenRouter |
| Multilingual | MMMLU | Multilingual Massive Multitask Language Understanding benchmark that tests knowledge and reasoning across 57 subjects in multiple languages. Higher = better multilingual comprehension. | 1 | gemini-3-pro-preview | Google Vertex AI |
| | | | 2 | claude-opus-4-5-20251101 | Anthropic, AWS Bedrock, Vertex AI |
| | | | 3 | claude-opus-4-1-20250805 | Anthropic, AWS Bedrock, Vertex AI |
| | | | 4 | gemini-2.5-pro | Google Vertex AI |
| | | | 5 | claude-sonnet-4-5-20250929 | Anthropic, AWS Bedrock, Vertex AI |
| Reasoning | GPQA Diamond | Graduate-Level Google-Proof Q&A. Expert-level questions in biology, chemistry, and physics, designed to test deep scientific reasoning. Score = % correct. | 1 | gpt-5.2-2025-12-11 | OpenAI, Azure AI Foundry |
| | | | 2 | gemini-3-pro-preview | Google Vertex AI |
| | | | 3 | gpt-5.1 | OpenAI |
| | | | 4 | grok-4 | xAI |
| | | | 5 | claude-opus-4-5-20251101 | Anthropic, AWS Bedrock, Vertex AI |
| Math | AIME 2025 | A set of challenging high-school mathematics competition problems that require multi-step algebraic and logical reasoning. | 1 | gemini-3-pro-preview | Google Vertex AI |
| | | | 2 | gpt-5.2-2025-12-11 | OpenAI, Azure AI Foundry |
| | | | 3 | moonshotai-kimi-k2-thinking | OpenRouter |
| | | | 4 | o3 | OpenAI, Azure |
| | | | 5 | openai-gpt-oss-20b-maas | Google Vertex AI |
| Visual Reasoning | ARC-AGI 2 | Abstraction and Reasoning Corpus for AGI. Tests visual pattern recognition and abstract reasoning on novel tasks never seen during training. | 1 | claude-opus-4-5-20251101 | Anthropic, AWS Bedrock, Vertex AI |
| | | | 2 | gpt-5.2-2025-12-11 | OpenAI, Azure AI Foundry |
| | | | 3 | gemini-3-pro-preview | Google Vertex AI |
| | | | 4 | gpt-5.1 | OpenAI |
| | | | 5 | gpt-5 | OpenAI, Azure |
| Best Overall | Humanity's Last Exam | A collection of the hardest questions across all academic disciplines, designed to be unsolvable by current AI. Tests overall frontier intelligence. | 1 | gemini-3-pro-preview | Google Vertex AI |
| | | | 2 | moonshotai-kimi-k2-thinking | OpenRouter |
| | | | 3 | gpt-5 | OpenAI, Azure |
| | | | 4 | grok-4 | xAI |
| | | | 5 | gemini-2.5-pro | Google Vertex AI |
| Fastest | Speed (tokens/sec) | Measures inference throughput in tokens per second. Higher = faster response generation. Critical for real-time and high-volume applications. | 1 | llama-4-scout-17b-16e-instruct | Cerebras |
| | | | 2 | llama-3.3-70b | Cerebras |
| | | | 3 | llama3.1-8b | Cerebras |
| | | | 4 | openai-gpt-oss-20b | Groq, AWS Bedrock, Vertex AI |
| | | | 5 | gemini-2.0-flash | Google Vertex AI |
| Largest Context | Context Window | Maximum number of tokens a model can process in a single prompt plus response. Larger = better for long documents, codebases, and multi-turn agents. | 1 | grok-4-fast-non-reasoning | xAI |
| | | | 2 | grok-4-fast-reasoning | xAI |
| | | | 3 | grok-4-1-fast-non-reasoning | xAI |
| | | | 4 | qwen3-coder | OpenRouter |
| | | | 5 | gemini-3-pro-preview | Google Vertex AI |
| Cheapest | Cost (per 1M tokens) | Input + output pricing per million tokens. Lower = more economical for large-scale deployments. | 1 | moonshotai/kimi-k2:free | OpenRouter |
| | | | 2 | openai-gpt-oss-20b | Groq, AWS Bedrock, Vertex AI |
| | | | 3 | openai-gpt-oss-120b | Groq, AWS Bedrock, Vertex AI |
| | | | 4 | gemini-2.0-flash-lite | Google Vertex AI |
| | | | 5 | moonshotai-kimi-k2-thinking | OpenRouter |
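
To make the Fastest and Cheapest metrics concrete, the following minimal sketch shows how a per-1M-token price and a tokens-per-second figure might translate into the cost and generation time of a single request. The function names and all numeric values in the usage note are hypothetical illustrations; they are not prices or throughput figures for any model listed in the table.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Estimate the cost of one request from per-1M-token prices.

    Prices are expressed per one million tokens, so each token count
    is scaled by 1,000,000 before the two parts are added together.
    """
    input_cost = input_tokens * input_price_per_1m / 1_000_000
    output_cost = output_tokens * output_price_per_1m / 1_000_000
    return input_cost + output_cost


def estimate_generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Estimate how long it takes to generate a response at a given throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second
```

For example, with hypothetical prices of 0.10 (input) and 0.40 (output) per 1M tokens, `estimate_cost(200_000, 50_000, 0.10, 0.40)` comes to roughly 0.04, and `estimate_generation_seconds(1_000, 500.0)` gives 2.0 seconds. This ignores factors such as network latency and time to first token, so treat it as a rough comparison aid only.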
Note: As of Version 2026-02, you can ask Iris about recommended models by topic when you are creating a new Agent.
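
For scripted selection, the rank 1 entries of the table can also be encoded as a simple mapping. The sketch below is only an illustration and not part of any Globant Enterprise AI API: the names `TOP_MODEL_BY_TOPIC` and `top_model` are hypothetical, and the model identifiers are copied verbatim from the table.

```python
# Rank 1 model for each topic, copied from the table in this article.
# The mapping name and the helper below are hypothetical.
TOP_MODEL_BY_TOPIC = {
    "Coding": "claude-sonnet-4-5-20250929",
    "Agentic": "grok-4-1-fast-reasoning",
    "Multilingual": "gemini-3-pro-preview",
    "Reasoning": "gpt-5.2-2025-12-11",
    "Math": "gemini-3-pro-preview",
    "Visual Reasoning": "claude-opus-4-5-20251101",
    "Best Overall": "gemini-3-pro-preview",
    "Fastest": "llama-4-scout-17b-16e-instruct",
    "Largest Context": "grok-4-fast-non-reasoning",
    "Cheapest": "moonshotai/kimi-k2:free",
}


def top_model(topic: str) -> str:
    """Return the rank 1 model for a topic, ignoring case and outer whitespace."""
    wanted = topic.strip().casefold()
    for name, model in TOP_MODEL_BY_TOPIC.items():
        if name.casefold() == wanted:
            return model
    known = ", ".join(TOP_MODEL_BY_TOPIC)
    raise KeyError(f"Unknown topic {topic!r}; expected one of: {known}")
```

Calling `top_model("coding")` returns `claude-sonnet-4-5-20250929`. Because rankings change between versions, a mapping like this would need to be refreshed whenever the table is updated.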