LLMs in PHP: integrating language models into production systems without rewriting everything

Every team I have spoken to over the last two years has had the same conversation. Engineers want to add LLM features, the CTO says Python, and the platform team, which owns the PHP monolith with ten years of business logic, goes quiet. The argument is that ML tooling is Python-first, LLM SDKs are better in Python, and that is where the talent pool is.

That argument is mostly wrong, and teams acting on it spend six months building a Python microservice that calls their PHP monolith for business logic over HTTP, introducing a network boundary, two deployment pipelines, and a latency budget they did not plan for.

Here is what integrating LLMs into a production PHP system actually looks like: not a demo, but a system running under real traffic.

The PHP + LLM landscape

The PHP ecosystem has three credible options for integrating with LLMs. The first is direct HTTP to the API. OpenAI, Anthropic, and Mistral all expose REST APIs, and an HTTP client plus a JSON decoder is technically everything you need. I have used this for simple completions in systems where adding a dependency was harder than writing 40 lines of wrapper code.
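For a sense of scale, here is a minimal sketch of that kind of wrapper. It assumes OpenAI's Chat Completions endpoint and response shape; the function names, the placeholder model name, and the injectable transport are mine for illustration, not part of any SDK. Splitting payload building and response parsing from the transport means the logic can be exercised without touching the network.

```php
// Build the JSON body for a chat completion request.
// The model name is a placeholder; substitute whatever your provider offers.
function buildChatPayload(string $system, string $user, string $model = 'gpt-4o-mini'): string
{
    return json_encode([
        'model'    => $model,
        'messages' => [
            ['role' => 'system', 'content' => $system],
            ['role' => 'user',   'content' => $user],
        ],
    ], JSON_THROW_ON_ERROR);
}

// Extract the assistant text from an OpenAI-style response body.
function parseChatResponse(int $status, string $body): string
{
    if ($status < 200 || $status >= 300) {
        throw new RuntimeException("LLM API returned HTTP {$status}");
    }
    $data = json_decode($body, true, 512, JSON_THROW_ON_ERROR);
    $text = $data['choices'][0]['message']['content'] ?? null;
    if (!is_string($text)) {
        throw new RuntimeException('Unexpected response shape from LLM API');
    }
    return $text;
}

// Default transport built on the curl extension. Returns [status, body].
function curlTransport(string $apiKey, int $timeoutSeconds = 60): callable
{
    return function (string $payload) use ($apiKey, $timeoutSeconds): array {
        $ch = curl_init('https://api.openai.com/v1/chat/completions');
        curl_setopt_array($ch, [
            CURLOPT_POST           => true,
            CURLOPT_POSTFIELDS     => $payload,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => $timeoutSeconds,
            CURLOPT_HTTPHEADER     => [
                'Content-Type: application/json',
                'Authorization: Bearer ' . $apiKey,
            ],
        ]);
        $body   = curl_exec($ch);
        $status = (int) curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
        if ($body === false) {
            throw new RuntimeException('LLM API request failed: ' . curl_error($ch));
        }
        return [$status, $body];
    };
}

// Any callable that takes a payload and returns [status, body] can act as transport.
function complete(callable $transport, string $system, string $user): string
{
    [$status, $body] = $transport(buildChatPayload($system, $user));
    return parseChatResponse($status, $body);
}
```

In application code this would be called as complete(curlTransport($apiKey), $systemPrompt, $userText); in tests, the transport can be a closure returning a canned response.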

LLPhant is the most complete PHP library for production LLM work. It wraps OpenAI and Anthropic, handles streaming, implements RAG patterns, and supports function calling. It is the option I reach for now in new PHP projects.

The Symfony AI initiative is a first-party set of components and a bundle, released separately from the core framework rather than as part of a numbered Symfony release. It has proper dependency injection, event system integration, and respects framework conventions. If you are on Symfony, this is increasingly the right answer.

The production-ready benchmark for an LLM integration comes down to four questions: does it handle streaming correctly, does it support function calling, can you inject observability, and does it fail gracefully when the API returns a 500? LLPhant passes all four.
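To make the last criterion concrete, here is a rough sketch of what failing gracefully can look like if you are writing the wrapper yourself. This is not LLPhant's internals; LlmHttpException and callWithFallback are hypothetical names, and the attempt count and delays are placeholders. The idea is to retry only on 429 and 5xx responses, where the API produced no result that anything could have acted on, and to hand back a safe default once the attempts run out.

```php
// Thrown by a (hypothetical) client when the API answers with an error status.
final class LlmHttpException extends RuntimeException
{
    public function __construct(public readonly int $status)
    {
        parent::__construct("LLM API returned HTTP {$status}");
    }
}

/**
 * Run $call, retrying on 429 and 5xx with exponential backoff.
 * When every attempt fails, return the result of $fallback instead of throwing.
 * $sleep is injectable so the backoff can be observed in tests.
 */
function callWithFallback(
    callable $call,
    callable $fallback,
    int $maxAttempts = 3,
    int $baseDelayMs = 200,
    ?callable $sleep = null
): mixed {
    $sleep ??= static fn (int $ms) => usleep($ms * 1000);

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            return $call();
        } catch (LlmHttpException $e) {
            $retryable = $e->status === 429 || $e->status >= 500;
            if (!$retryable) {
                throw $e; // other 4xx errors mean the request is wrong; retrying will not help
            }
            if ($attempt < $maxAttempts) {
                $sleep($baseDelayMs * 2 ** ($attempt - 1)); // 200 ms, 400 ms, ...
            }
        }
    }

    return $fallback();
}
```

For a triage feature, the fallback might route the ticket to a human queue rather than guessing a department.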

What I got wrong in the first deployment

Our first LLM integration was a support ticket triage system. The model read incoming tickets and classified them by urgency and department. The PHP code was clean. The deployment was a disaster.

We did not account for API latency in the queue worker timeout. LLM calls averaged 3.2 seconds, which looked comfortable against a queue worker timeout of 30 seconds. The average hid the slow calls. Under burst load, with workers processing many tickets at once, some jobs hit the timeout and were retried, and we paid the API twice for the same ticket. The two calls could return different classifications, which broke downstream routing logic.

// What we had:
class TicketTriageJob implements ShouldQueue
{
    public $timeout = 30;  // default — we did not think about LLM latency

    public function handle(LLMClient $client): void
    {
        $classification = $client->classify($this->ticket->body);
        $this->ticket->update(['department' => $classification->department]);
    }
}

// What was needed:
use Illuminate\Contracts\Queue\ShouldBeUnique;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Support\Facades\DB;

class TicketTriageJob implements ShouldQueue, ShouldBeUnique
{
    public $timeout = 120;       // LLM call + processing overhead
    public $tries = 1;           // never retry: LLM calls are not idempotent
    public $uniqueFor = 3600;    // lock TTL in seconds; only honoured with ShouldBeUnique

    // One lock per ticket, so a duplicate dispatch for the same ticket is dropped
    public function uniqueId(): string
    {
        return (string) $this->ticket->id;
    }

    public function handle(LLMClient $client): void
    {
        if ($this->ticket->fresh()->triaged_at !== null) {
            return;  // already processed by a previous attempt
        }

        $classification = $client->classify($this->ticket->body);

        DB::transaction(function () use ($classification) {
            $this->ticket->update([
                'department'  => $classification->department,
                'priority'    => $classification->priority,
                'triaged_at'  => now(),
            ]);
        });
    }
}

The non-idempotency of LLM calls is something teams consistently underestimate. The model is not guaranteed to return the same output for the same input, and re-running a classification after a partial failure is not safe if downstream systems have already acted on the first result.
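One way to make a repeated call harmless, sketched under assumptions: derive a key from the inputs and store the first answer, so a second attempt reuses it instead of asking the model again. classificationKey and classifyOnce are hypothetical helpers, and the in-memory store stands in for a database row or cache entry that would need to be written atomically in a real system.

```php
// Derive a stable key from everything that should map to one stored answer.
// Including a prompt version means a prompt change produces a fresh key.
function classificationKey(int $ticketId, string $body, string $promptVersion): string
{
    return hash('sha256', $ticketId . "\0" . $promptVersion . "\0" . $body);
}

/**
 * Return the stored classification for $key, calling the model only if none exists.
 * Note: this check-then-write is not atomic. It illustrates the idea; a real
 * store should use a unique constraint or an atomic "set if absent" operation.
 */
function classifyOnce(string $key, callable $classify, ArrayAccess $store): array
{
    if (isset($store[$key])) {
        return $store[$key]; // a previous attempt already paid for this answer
    }

    $result = $classify();
    $store[$key] = $result;

    return $result;
}

// Usage with a fake classifier that would answer differently on a second call.
$store = new ArrayObject();
$calls = 0;
$classify = function () use (&$calls): array {
    $calls++;
    return ['department' => $calls === 1 ? 'billing' : 'support'];
};

$key    = classificationKey(42, 'I was charged twice', 'v1');
$first  = classifyOnce($key, $classify, $store);
$second = classifyOnce($key, $classify, $store); // served from the store
```

Both calls return the same classification and the model is invoked once, which is the property downstream routing depends on.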

RAG in production: the index is the product

Retrieval-augmented generation is where PHP LLM integrations get interesting and where the gap with Python shrinks to near zero. The heavy work (chunking documents, generating embeddings, building the vector index) happens at index time, not query time. At query time you are making one HTTP call to embed the question and running a similarity query against the database.

use LLPhant\Embeddings\EmbeddingGenerator\OpenAI\OpenAI3LargeEmbeddingGenerator;
use LLPhant\Embeddings\VectorStores\Doctrine\DoctrineVectorStore;

// Indexing (run once or on content updates)
// $entityManager and $splitter are assumed to be configured elsewhere
$generator  = new OpenAI3LargeEmbeddingGenerator();
$vectorStore = new DoctrineVectorStore($entityManager, DocumentChunk::class);

foreach ($documents as $doc) {
    $chunks = $splitter->splitDocument($doc, chunkSize: 512, overlap: 64);

    foreach ($chunks as $chunk) {
        $chunk->embedding = $generator->embedText($chunk->content);
    }

    $vectorStore->addDocuments($chunks);
}

// Query time (per user request)
$query     = $request->input('question');
$embedding = $generator->embedText($query);

// pgvector cosine similarity — single query, < 20ms on indexed data
$relevant  = $vectorStore->similaritySearch($embedding, maxResults: 5, minScore: 0.78);

$context   = implode("\n\n", array_map(fn($c) => $c->content, $relevant));

$answer = $llm->chat([
    ['role' => 'system',    'content' => "Answer using only the provided context.\n\n{$context}"],
    ['role' => 'user',      'content' => $query],
]);

The similarity threshold of 0.78 is not a default. It is tuned. Too low and you retrieve irrelevant context that confuses the model. Too high and you retrieve nothing. We ran 200 sample queries against holdout answers and measured recall at different thresholds before deploying. 0.78 was the point where recall was stable and hallucinations dropped to an acceptable level.
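The sweep itself is simple. The sketch below shows the shape of it with made-up labelled data rather than our real evaluation set; evaluateThreshold is a hypothetical helper, and I have added precision alongside recall because the two pull in opposite directions as the threshold rises.

```php
/**
 * Recall and precision of retrieval at a given similarity threshold.
 *
 * @param array<int, array<int, array{score: float, relevant: bool}>> $queries
 *        One entry per query; each holds candidate chunks with their
 *        similarity score and a human relevance label.
 * @return array{recall: float, precision: float}
 */
function evaluateThreshold(array $queries, float $threshold): array
{
    $relevantTotal = 0;      // relevant chunks that exist
    $retrieved = 0;          // chunks at or above the threshold
    $relevantRetrieved = 0;  // chunks that are both

    foreach ($queries as $candidates) {
        foreach ($candidates as $c) {
            $kept = $c['score'] >= $threshold;
            $relevantTotal     += $c['relevant'] ? 1 : 0;
            $retrieved         += $kept ? 1 : 0;
            $relevantRetrieved += ($kept && $c['relevant']) ? 1 : 0;
        }
    }

    $ratio = static fn (int $num, int $den): float => $den > 0 ? $num / $den : 0.0;

    return [
        'recall'    => $ratio($relevantRetrieved, $relevantTotal),
        'precision' => $ratio($relevantRetrieved, $retrieved),
    ];
}

// Invented labels for two queries, purely for illustration.
$queries = [
    [
        ['score' => 0.91, 'relevant' => true],
        ['score' => 0.80, 'relevant' => true],
        ['score' => 0.74, 'relevant' => false],
    ],
    [
        ['score' => 0.85, 'relevant' => true],
        ['score' => 0.79, 'relevant' => false],
        ['score' => 0.60, 'relevant' => false],
    ],
];

foreach ([0.70, 0.78, 0.86] as $t) {
    $m = evaluateThreshold($queries, $t);
    printf("threshold %.2f  recall %.2f  precision %.2f\n", $t, $m['recall'], $m['precision']);
}
```

With real data you would sweep a finer grid and pick the threshold just below the point where recall starts to fall.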

Function calling: where PHP fits better than expected

Function calling (the model deciding to invoke a tool and returning structured arguments) is the core mechanism that makes LLM agents practical. PHP is well-suited here because the "tools" are usually existing domain logic: fetch a customer, check order status, run a calculation. You already have that code.

$tools = [
    Tool::create('get_order_status')
        ->description('Returns the current status and ETA for a given order ID')
        ->parameter('order_id', 'string', 'Order UUID', required: true),

    Tool::create('calculate_refund')
        ->description('Calculates refund amount based on order ID and reason')
        ->parameter('order_id', 'string', required: true)
        ->parameter('reason', 'string', 'cancellation | defect | not_received', required: true),
];

$response = $llm->chat($messages, tools: $tools);

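// The model may return a tool call instead of text
$steps = 0;

while ($response->hasToolCalls()) {
    // Guard against a model that never stops requesting tools
    if (++$steps > 5) {
        throw new RuntimeException('Tool loop exceeded 5 steps');
    }

    // OpenAI-style chat APIs expect the assistant turn that requested the tools
    // to precede the tool results. assistantMessage() is a hypothetical accessor;
    // use whatever your client exposes for this.
    $messages[] = $response->assistantMessage();

    foreach ($response->toolCalls() as $call) {
        $result = match ($call->name) {
            'get_order_status'  => $orderService->getStatus($call->arguments['order_id']),
            'calculate_refund'  => $refundCalculator->calculate(
                $call->arguments['order_id'],
                $call->arguments['reason']
            ),
            default => throw new UnknownToolException($call->name),
        };

        // Feed the tool result back into the conversation
        $messages[] = ['role' => 'tool', 'tool_call_id' => $call->id, 'content' => json_encode($result)];
    }

    $response = $llm->chat($messages, tools: $tools);
}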

The while loop handles multi-step tool use. In practice most production agents make 1-3 tool calls per conversation turn. More than that and latency becomes the dominant UX problem.
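Arguments produced by the model are untrusted input, so it can help to validate them before they reach domain code. The sketch below is one possible shape, not a library API: ToolRegistry is a hypothetical class that checks required arguments and, unlike the loop above, reports unknown tools and failures back to the model as JSON instead of throwing, which gives the model a chance to correct itself on the next step.

```php
final class ToolRegistry
{
    /** @var array<string, array{required: string[], handler: callable}> */
    private array $tools = [];

    /** @param string[] $required names of arguments the handler cannot work without */
    public function register(string $name, array $required, callable $handler): void
    {
        $this->tools[$name] = ['required' => $required, 'handler' => $handler];
    }

    /**
     * Execute a tool call and always return a JSON string for the model.
     * Unknown tools, missing arguments, and handler exceptions become error payloads.
     */
    public function dispatch(string $name, array $arguments): string
    {
        if (!isset($this->tools[$name])) {
            return json_encode(['error' => "unknown tool: {$name}"]);
        }

        $tool    = $this->tools[$name];
        $missing = array_values(array_diff($tool['required'], array_keys($arguments)));
        if ($missing !== []) {
            return json_encode(['error' => 'missing arguments', 'fields' => $missing]);
        }

        try {
            return json_encode(['result' => ($tool['handler'])($arguments)], JSON_THROW_ON_ERROR);
        } catch (Throwable $e) {
            return json_encode(['error' => $e->getMessage()]);
        }
    }
}

// Usage with a stub handler standing in for the real order service.
$registry = new ToolRegistry();
$registry->register(
    'get_order_status',
    ['order_id'],
    fn (array $args): array => ['status' => 'shipped', 'order_id' => $args['order_id']]
);

$ok      = $registry->dispatch('get_order_status', ['order_id' => 'abc']);
$invalid = $registry->dispatch('get_order_status', []);
$unknown = $registry->dispatch('delete_everything', []);
```

Whether to throw or to report the error back is a judgment call. Throwing is safer when an unknown tool name suggests something is badly wrong; reporting back tends to work better when the model just dropped an argument.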

The observability you actually need

I track three metrics for every LLM integration: token consumption, parse error rate, and queue depth. Token consumption per endpoint matters because LLM costs scale with tokens, not requests. One endpoint passing a 10,000-token system prompt on every call will dominate your API bill within days.

// After every LLM call
$this->metrics->increment('llm.tokens.prompt',     $response->usage()->promptTokens);
$this->metrics->increment('llm.tokens.completion', $response->usage()->completionTokens);
$this->metrics->timing('llm.latency_ms',           $response->latencyMs());
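To see which endpoint is driving spend, the same counts can be aggregated per endpoint. A minimal sketch, assuming you know which endpoint triggered each call; TokenLedger is a hypothetical class, and the prices in the usage example are placeholders, not any provider's actual rates.

```php
final class TokenLedger
{
    /** @var array<string, array{prompt: int, completion: int}> */
    private array $usage = [];

    public function record(string $endpoint, int $promptTokens, int $completionTokens): void
    {
        $this->usage[$endpoint] ??= ['prompt' => 0, 'completion' => 0];
        $this->usage[$endpoint]['prompt']     += $promptTokens;
        $this->usage[$endpoint]['completion'] += $completionTokens;
    }

    /**
     * Estimated spend per endpoint, most expensive first.
     * Prices are per million tokens and must come from your provider's price list.
     *
     * @return array<string, float>
     */
    public function costByEndpoint(float $promptPricePerMillion, float $completionPricePerMillion): array
    {
        $costs = [];
        foreach ($this->usage as $endpoint => $u) {
            $costs[$endpoint] = (
                $u['prompt'] * $promptPricePerMillion
                + $u['completion'] * $completionPricePerMillion
            ) / 1_000_000;
        }
        arsort($costs); // sort by cost descending, keeping endpoint names as keys

        return $costs;
    }
}

// Usage: two calls carrying a 10,000-token prompt against one lighter call.
$ledger = new TokenLedger();
$ledger->record('/support/triage', 10_000, 50);
$ledger->record('/support/triage', 10_000, 60);
$ledger->record('/search/answer', 1_200, 300);

$costs = $ledger->costByEndpoint(3.0, 15.0); // placeholder prices per million tokens
```

Even in this toy example the endpoint with the large system prompt comes out on top, despite producing far fewer completion tokens.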

For structured outputs (classifications, function calls, JSON extraction) the model will occasionally return malformed output. Track the parse error rate. If it exceeds 2%, your prompt is degrading, the model was silently updated, or your input distribution shifted.
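A sketch of how the parse error rate might be measured, assuming the structured output is a JSON object with a known set of required keys. StructuredOutputParser is a hypothetical class; in production the counters would go to your metrics backend rather than live in memory.

```php
final class StructuredOutputParser
{
    private int $total = 0;
    private int $failed = 0;

    /** @param string[] $requiredKeys keys every valid response must contain */
    public function __construct(private array $requiredKeys)
    {
    }

    /** Returns the decoded array, or null when the output is malformed. */
    public function parse(string $raw): ?array
    {
        $this->total++;

        $data  = json_decode($raw, true);
        $valid = is_array($data)
            && array_diff($this->requiredKeys, array_keys($data)) === [];

        if (!$valid) {
            $this->failed++;
            return null;
        }

        return $data;
    }

    /** Share of parse attempts that failed, between 0.0 and 1.0. */
    public function errorRate(): float
    {
        return $this->total > 0 ? $this->failed / $this->total : 0.0;
    }
}

// Usage: one valid response, one with prose around the JSON, one missing a key.
$parser = new StructuredOutputParser(['department', 'priority']);
$good   = $parser->parse('{"department":"billing","priority":"high"}');
$prose  = $parser->parse('Sure! Here is the JSON you asked for: {...}');
$short  = $parser->parse('{"department":"billing"}');
$rate   = $parser->errorRate();
```

Counting a response with a missing key as a parse failure is deliberate: syntactically valid JSON that downstream code cannot use is still a failure from the system's point of view.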

If you are doing LLM work in background jobs, queue depth is a leading indicator of whether your worker count is keeping up with request volume. Watch it before it becomes a page.
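For a first estimate of worker count, Little's law is enough: the average number of jobs in flight equals the arrival rate multiplied by the mean time per job. The sketch below applies it with some headroom; workersNeeded is a hypothetical helper, and the traffic figure in the usage line is invented for illustration.

```php
/**
 * Estimate how many workers keep a queue from growing.
 *
 * @param float $jobsPerSecond     average arrival rate
 * @param float $meanSecondsPerJob average time a worker spends on one job
 * @param float $targetUtilization fraction of time workers should be busy (0 < u <= 1);
 *                                 lower values leave headroom for bursts
 */
function workersNeeded(float $jobsPerSecond, float $meanSecondsPerJob, float $targetUtilization = 0.7): int
{
    if ($targetUtilization <= 0.0 || $targetUtilization > 1.0) {
        throw new InvalidArgumentException('targetUtilization must be in (0, 1]');
    }

    return (int) ceil($jobsPerSecond * $meanSecondsPerJob / $targetUtilization);
}

// 5 tickets per second (invented) at a 3.2 second average LLM call
$workers = workersNeeded(5.0, 3.2);
```

If queue depth keeps climbing with that many workers running, the mean job time has probably drifted, and the latency metric will show it.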

The rewrite question, answered honestly

Is Python better for LLM work? For pure ML research, training, and fine-tuning, yes, without question. For building LLM-augmented features on an existing PHP system, the gap is smaller than the migration cost in almost every case I have evaluated.

The question is not "which language is better for LLMs" but "where does the business logic live that the LLM needs to operate on?" If it lives in a PHP system with ten years of domain modelling, you will not duplicate that in six months in a new Python service. You will end up with a thin Python wrapper calling your PHP API, and you will have paid the full price of a rewrite without gaining anything LLPhant could not have done directly from PHP.