How much does a typical project cost?

It depends on scope, but for reference: a single automation starts at USD $300, an AI agent at USD $800, a complete CRM integration at USD $2,500, and a SaaS product at USD $8,000. In the diagnosis (a fixed-scope service whose fee is credited toward the implementation) I give you a tight range for your specific case.

Does the diagnosis have a cost?

The 30-minute intro call is free. If we move forward, the AI & Automation Diagnosis is a paid, fixed-price service (1-2 weeks) that maps your processes and quantifies the savings in money. Its fee is fully credited toward the implementation if you decide to proceed.

How long to see results?

The first automations are usually running in 1-2 weeks, a basic AI agent in 2-3 weeks, and a complete SaaS project in 8 to 16 weeks. I work in short sprints so you see progress with each delivery.

What if I already have tools? Do you replace or integrate?

I integrate whenever possible. The idea is not for you to change everything, but for the tools you already have to work better and talk to each other. I only recommend replacing a tool when it costs you more than it contributes.

Do you work alone or with a team?

I personally lead each project and, depending on size, add developers, designers or data experts from the Oryzo team and my network. You always have a single point of contact: me.

Why you and not an agency?

With an agency you usually talk to a salesperson, get advice from a consultant, and have someone else do the implementation. With me you work directly with the person who diagnoses, designs and builds: a CTO with 15+ years of experience who takes on your project as a technical partner, not just another ticket. Fewer middlemen, more context and faster decisions.

Do I need technical knowledge to maintain it?

No. I deliver everything documented, with simple dashboards and training for your team. If you later prefer continuous support, I offer monthly maintenance plans.

Do you handle projects outside Chile?

Yes. I work with clients throughout LatAm and with Spanish-speaking clients in the USA. Everything is 100% remote, with meetings via Google Meet or Zoom and asynchronous communication via WhatsApp or Slack.

What happens after I apply?

You go through a 2-minute guided conversation with my agent. If your project is a fit, you immediately get the link to book your free 30-minute call on my calendar. If it's not the right time, I'll tell you honestly and leave you a recommendation. No spam, no aggressive sales calls.

ChatGPT vs Claude vs Gemini vs DeepSeek: Which LLM Should You Choose for Your Business Agent?

The Question I Get Asked Most

When a client decides to implement an AI agent in their business, they almost always arrive with the same question: should I use ChatGPT, Claude, Gemini, or DeepSeek?

The honest answer is that it depends. But "it depends" without context doesn't help anyone — so in this article I'm going to break it down concretely: what each model does well, where it falls short, how much it costs at scale, and how I've made this decision on real projects.

One warning before we start: this article is not a comparison of academic benchmarks. Scores on MMLU or HumanEval don't tell you whether a model will respond well when a WhatsApp customer writes "I want to check on the thing from Thursday" with no additional context. What matters here is performance under real production conditions — informal language, in English, with specific business instructions.

The Criteria That Actually Matter for a Business Agent

Before comparing models, it's worth defining what a business agent needs that a general-purpose model doesn't necessarily have:

System instruction following. The agent has a system prompt that defines who it is, what it can say, what it can't say, and how it should behave. A model that doesn't follow those instructions consistently is an unpredictable agent — and an unpredictable agent facing customers is a problem.

Quality in colloquial language. Your customers don't write in formal language. They use abbreviations, mix in slang, use regional expressions. A model trained primarily in formal text may understand the message, but the response it generates can sound artificial or off-tone.

Tool use (function calling). Useful agents don't just respond with text: they query databases, check availability, register information in CRMs. The ability to execute tools reliably — without inventing parameters or skipping steps — is critical.

Consistency at scale. In production, an agent may handle hundreds of simultaneous conversations. What works in 10 manual tests can fail on conversation 847 if the model isn't robust.

Cost per operation. API costs become relevant much sooner than most people expect. An agent handling 1,000 daily conversations averaging 10 messages can generate very different costs depending on the model chosen.
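To make the cost criterion concrete, here is a rough sketch of the arithmetic for the 1,000 conversations/day, 10 messages/conversation scenario. The prices are the approximate input-token figures quoted later in this article (lower bound for GPT-4o, upper bound for DeepSeek V3). The system prompt size (800 tokens) and message size (60 tokens) are assumptions chosen for illustration, as is the idea that every turn resends the system prompt plus the history so far. Output tokens are ignored, so real bills would be higher.

```python
# Approximate USD prices per million input tokens, as quoted in this article.
# Prices change often; treat these as placeholders, not current list prices.
PRICE_PER_M_INPUT_USD = {
    "gpt-4o": 5.00,
    "claude-sonnet": 3.00,
    "gemini-flash": 0.075,
    "deepseek-v3": 0.27,
}


def monthly_input_cost(
    price_per_m,
    conversations_per_day=1_000,
    messages_per_conversation=10,
    system_prompt_tokens=800,  # assumption for illustration
    tokens_per_message=60,     # assumption for illustration
    days=30,
):
    """Estimate monthly input-token cost in USD.

    Assumes each turn sends the system prompt plus every message so far,
    which is how most chat-style APIs are used.
    """
    tokens_per_conversation = 0
    for turn in range(1, messages_per_conversation + 1):
        # On turn N the model reads the system prompt plus N messages.
        tokens_per_conversation += system_prompt_tokens + turn * tokens_per_message
    monthly_tokens = tokens_per_conversation * conversations_per_day * days
    return monthly_tokens / 1_000_000 * price_per_m


if __name__ == "__main__":
    for model, price in PRICE_PER_M_INPUT_USD.items():
        print(f"{model}: ${monthly_input_cost(price):,.2f} per month")
```

Under these assumptions each conversation consumes 11,300 input tokens, or 339 million per month. That works out to about $1,695 per month at the GPT-4o price, $1,017 at the Claude Sonnet price, about $92 for DeepSeek V3 and about $25 for Gemini Flash. Most of that volume is the system prompt being resent on every turn, which is why prompt length has such a large effect on the bill.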

GPT-4o (OpenAI)

The reference standard.

GPT-4o is the model most projects start with, and for good reason: it has the most mature integration ecosystem, the most complete documentation, and it's the one most developers know. If you hire someone to build your agent, there's a higher chance they know GPT-4o than any other model.

Where it shines:

Following complex instructions with many conditions
Natural responses in open-ended conversations
Reliable and predictable function calling
Native integration with the OpenAI ecosystem (Assistants API, vector stores, threads)

Where it falls short:

It's the most expensive of the four in cost per token for production at scale
Quality in colloquial language is good but not perfect; it sometimes over-formalizes
The context window (128K tokens) is large but Claude surpasses it for processing long documents

Approximate cost: ~$5 USD per million input tokens and ~$15 USD per million output tokens for GPT-4o; GPT-4o mini costs a small fraction of that. At medium volumes (100K messages/month), monthly cost can range from $200 to $800 USD depending on prompt length.

Best for: Front-line customer service agents, lead qualification bots, internal team assistants. If you don't have a specific reason to use another model, GPT-4o is the safe choice.

Claude (Anthropic)

The best for precise instructions and long context.

Claude — especially Sonnet and Opus — stands out in one thing above all others: it follows instructions with a fidelity that other models rarely match. If the system prompt says "never offer discounts unless the customer asks first," Claude respects that instruction consistently even in long conversations and when users try to push it off-script.

This makes it especially valuable for business agents where out-of-bounds behavior has real consequences.

Where it shines:

Following detailed instructions and behavioral restrictions
Processing long documents (context up to 200K tokens)
Very high-quality responses in English, with natural tone and no over-formalization
Reasoning in ambiguous situations — when the customer says something that doesn't clearly fit any category, Claude handles ambiguity better than the others
Resistance to jailbreaks and user manipulation attempts to break character

Where it falls short:

Cost similar to or higher than GPT-4o on the most powerful models (Opus)
The integration ecosystem is younger; fewer third-party libraries
Can be more cautious than necessary in some commercial contexts

Approximate cost: Claude Sonnet (~$3/M input tokens) is competitive with GPT-4o. Claude Haiku (~$0.25/M tokens) is one of the cheapest options on the market for simple tasks.

Best for: Agents handling sensitive or complex conversations, contract or document processing, any case where consistency of behavior is non-negotiable. In my implementations of agents for healthcare settings — where tone, boundaries, and precision matter a lot — Claude is the first option I evaluate.

Gemini (Google)

The most affordable at scale and best integrated with Google.

Gemini comes in two versions that serve very different purposes: Gemini Pro competes directly with GPT-4o and Claude Sonnet in quality, while Gemini Flash is the fastest and cheapest option on the market for high-frequency tasks.

Gemini's structural advantage is its integration with the Google ecosystem: if your company already uses Google Workspace, Sheets, Drive, and Gmail, Gemini has native access to those contexts without needing additional connectors.

Where it shines:

Gemini Flash: extremely low latency (ideal for real-time responses) and very low cost
Native integration with Google Workspace and Google Cloud
Enormous context window in Gemini Pro (up to 1 million tokens on some models)
Good handling of formal business language

Where it falls short:

Function calling, while functional, isn't as predictable as GPT-4o or Claude in complex scenarios
Quality in colloquial conversational language is weaker than Claude
Historically, the API has been less stable than OpenAI's or Anthropic's in the first months after a new model launch

Approximate cost: Gemini Flash (~$0.075/M input tokens) is the cheapest model in this group for production. Gemini Pro is comparable in price to GPT-4o.

Best for: Very high volumes where cost per operation matters (notification agents, automatic classification, simple FAQ responses), teams already embedded in Google Workspace, tasks requiring access to very long documents.

DeepSeek

The economic surprise with an important caveat.

DeepSeek V3 and R1 changed the conversation about costs in the LLM market. At a cost that can be up to 20 times lower than GPT-4o for equivalent input tokens, and with reasoning capabilities that compete with the best Western models, it's impossible to ignore.

For tasks where quality matters less than cost and volume, DeepSeek is currently the most economically efficient option.

Where it shines:

Cost per token significantly lower than all others
Surprisingly high reasoning capabilities for its price
Good performance in technical and formal language
Open-source option available for self-deployment (complete data privacy)
R1 especially useful for tasks requiring step-by-step reasoning

Where it falls short:

Data privacy: DeepSeek is a Chinese company. For any agent that processes confidential customer information — medical data, financial data, contracts, PII — this is a risk that needs to be evaluated with legal and compliance criteria, not just technical ones
System instruction following is less consistent than Claude or GPT-4o in complex scenarios
Less mature tooling ecosystem and enterprise support
API latency can be higher during peak demand hours

Approximate cost: DeepSeek V3 can cost ~$0.07–0.27/M tokens depending on the endpoint, compared to $3–15 for equivalent models from OpenAI or Anthropic.

Best for: Internal tasks that don't handle sensitive customer data (ticket classification, internal summaries, draft generation), projects where cost is the primary constraint, or technical teams who want to deploy the open-source model on their own infrastructure for full control.

The Practical Decision Matrix

Criterion	GPT-4o	Claude Sonnet	Gemini Flash	DeepSeek V3
Colloquial language quality	★★★★	★★★★★	★★★★	★★★
Instruction following	★★★★	★★★★★	★★★	★★★
Reliable function calling	★★★★★	★★★★	★★★	★★★
Cost at scale	★★★	★★★	★★★★★	★★★★★
Ecosystem / integrations	★★★★★	★★★	★★★★	★★
Enterprise privacy	★★★★	★★★★★	★★★★	★★
Response speed	★★★★	★★★★	★★★★★	★★★

The Approach I Use in Production

In most projects I implement, I don't choose a single model for the entire agent. The architecture that works best in production is a multi-model approach based on task:

Claude Sonnet or GPT-4o for the main customer conversation: where quality, tone, and consistency of behavior are non-negotiable.
Gemini Flash or DeepSeek for high-frequency support tasks: classifying the inquiry type before passing it to the main model, generating summaries of long conversations, extracting structured data from free text.
Open-source model (Llama, DeepSeek) on the client's own infrastructure when data is sensitive and can't leave their server.

This hybrid approach reduces total cost by 40 to 60% compared to using the premium model for everything, without sacrificing quality at the customer-facing touchpoints.
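The task-based routing described above can be sketched in a few lines. This is a minimal illustration, not the code from any specific project: the tier names, the `Task` type and the task kinds are all hypothetical, and the real decision usually involves more signals than these two.

```python
from dataclasses import dataclass

# Hypothetical tier names; map each one to whichever concrete model you deploy.
PREMIUM = "premium"          # e.g. Claude Sonnet or GPT-4o
ECONOMY = "economy"          # e.g. Gemini Flash or DeepSeek
SELF_HOSTED = "self_hosted"  # e.g. an open-source model on the client's servers

# High-frequency support tasks that do not need the premium model.
SUPPORT_TASKS = {"classify_inquiry", "summarize_conversation", "extract_fields"}


@dataclass(frozen=True)
class Task:
    kind: str                            # "customer_reply" or one of SUPPORT_TASKS
    contains_sensitive_data: bool = False


def choose_tier(task: Task) -> str:
    """Pick a model tier for a task."""
    # Data that cannot leave the client's server overrides every other rule.
    if task.contains_sensitive_data:
        return SELF_HOSTED
    if task.kind in SUPPORT_TASKS:
        return ECONOMY
    # Customer-facing replies and unknown task kinds default to the premium tier.
    return PREMIUM


if __name__ == "__main__":
    print(choose_tier(Task("classify_inquiry")))
    print(choose_tier(Task("customer_reply")))
    print(choose_tier(Task("extract_fields", contains_sensitive_data=True)))
```

Defaulting unknown task kinds to the premium tier is a deliberate choice in this sketch: a misrouted task then costs a little more instead of degrading a customer-facing answer.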

One Last Thing Before Choosing Your Model

The model matters, but it's not the most important thing. I've seen agents built with GPT-4o that work terribly because the system prompt was poorly designed, and agents built with more modest models that work with surgical precision because someone thought through the conversation logic carefully.

The model is the engine. The agent design — the system instructions, context management, edge case handling, integration with real business data — is what determines whether the agent creates value or creates problems.
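One example of design work that sits outside the model is checking a tool call before executing it, so that an invented or missing parameter is caught instead of reaching your CRM or booking system. The sketch below assumes the model hands back a tool name and a JSON string of arguments. The `check_availability` tool and its parameters are hypothetical, and a production system would likely use a full JSON Schema validator instead of this hand-rolled check.

```python
import json

# Hypothetical tool definition: parameter names mapped to expected Python types.
TOOLS = {
    "check_availability": {
        "required": {"date": str, "service_id": str},
        "optional": {"staff_id": str},
    },
}


def validate_tool_call(name: str, raw_arguments: str) -> list[str]:
    """Return a list of problems; an empty list means the call can be executed."""
    spec = TOOLS.get(name)
    if spec is None:
        return [f"unknown tool: {name}"]
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return ["arguments are not valid JSON"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    problems = []
    allowed = {**spec["required"], **spec["optional"]}
    # Skipped steps usually show up as missing required parameters.
    for key in spec["required"]:
        if key not in args:
            problems.append(f"missing parameter: {key}")
    # Invented parameters show up as keys the tool never declared.
    for key, value in args.items():
        if key not in allowed:
            problems.append(f"unexpected parameter: {key}")
        elif not isinstance(value, allowed[key]):
            problems.append(f"wrong type for {key}")
    return problems


if __name__ == "__main__":
    print(validate_tool_call("check_availability", '{"date": "2025-01-09", "discount": 10}'))
```

When the list is not empty, the agent can send the problems back to the model and ask it to retry, or fall back to a human handoff, instead of executing a malformed call.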

If you're evaluating an agent and want a second opinion on what stack makes sense for your specific case, the 30-minute intro call is free. We review your case, your expected volume, and your privacy constraints, and I'll tell you exactly what model and architecture I'd recommend.
