Problems with the current LLMs

Last Update: 2025-12-11

In my free time I recently developed scriptschnell, a TUI for generating code via LLMs. Building it gave me some insight into what current LLMs are lacking.

Speed

After using Cerebras and, to a lesser extent, Groq (Groq with a q), the speed that OpenAI's gpt-5.1-codex and gpt-5.1-codex-mini models provide feels lacking.

Code generation involves a lot of looking up current implementation details, creating or changing files, and then validating the result. All of this consumes a lot of tokens (in my experience, just searching through the codebase takes at least 20k tokens).

Some code generation applications like Windsurf use a significantly faster model to speed up some tasks like exploring the codebase.
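A minimal sketch of this idea, routing read-only exploration to a fast model and reserving the strong model for edits and validation. The model names and task kinds are purely illustrative, not any real API:

```go
package main

import "fmt"

// TaskKind classifies what the agent is about to ask a model to do.
type TaskKind int

const (
	Explore TaskKind = iota // read-only codebase search
	Edit                    // file creation or modification
	Review                  // validating the result
)

// pickModel routes cheap, high-volume exploration to a fast model
// and keeps the slow, strong model for edits and review.
// Both model names are placeholders.
func pickModel(k TaskKind) string {
	switch k {
	case Explore:
		return "fast-model" // e.g. a Cerebras- or Groq-hosted model
	default:
		return "strong-model" // e.g. gpt-5.1-codex
	}
}

func main() {
	fmt.Println(pickModel(Explore)) // fast-model
	fmt.Println(pickModel(Edit))    // strong-model
}
```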

Tool calls

You can tell that many models are trained on specific tool calls. E.g. when prompting Kimi K2 Instruct with just "ls", it tries to call a tool named shell even though no such tool exists.
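One workaround I find useful is to tolerate these trained-in names instead of failing the turn. A minimal sketch, where the registered tool name and the alias table are illustrative assumptions:

```go
package main

import "fmt"

// registered holds the tools the agent actually exposes
// (the name run_command is illustrative).
var registered = map[string]bool{
	"run_command": true,
}

// aliases maps tool names that models like to hallucinate
// onto the tool the agent really provides.
var aliases = map[string]string{
	"shell": "run_command",
	"bash":  "run_command",
}

// resolveTool maps a model-emitted tool name to a registered tool,
// following aliases; ok is false when nothing matches.
func resolveTool(name string) (string, bool) {
	if registered[name] {
		return name, true
	}
	if target, ok := aliases[name]; ok && registered[target] {
		return target, true
	}
	return "", false
}

func main() {
	tool, ok := resolveTool("shell")
	fmt.Println(tool, ok) // run_command true
}
```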

More specifically, it took a lot of trial and error to get models to write more complex programs for my golang sandbox tool call (e.g. building an application and extracting the key errors with a summarize method).
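To give an idea of what I mean by summarizing: a minimal sketch of the kind of helper I wanted the models to write inside the sandbox, compressing noisy `go build` output down to the first few actual error lines. The function name and error heuristic are my own illustrative assumptions:

```go
package main

import (
	"fmt"
	"strings"
)

// summarizeBuildOutput keeps at most max compiler error lines from
// raw `go build` output, dropping package banners and noise, so the
// result stays small enough for a model's context window.
func summarizeBuildOutput(out string, max int) []string {
	var errs []string
	for _, line := range strings.Split(out, "\n") {
		line = strings.TrimSpace(line)
		// go compiler errors look like "file.go:12:3: message"
		if strings.Contains(line, ".go:") {
			errs = append(errs, line)
			if len(errs) == max {
				break
			}
		}
	}
	return errs
}

func main() {
	out := "# example.com/app\n" +
		"main.go:10:2: undefined: helper\n" +
		"main.go:22:5: cannot use x (variable of type int) as string value\n"
	for _, e := range summarizeBuildOutput(out, 2) {
		fmt.Println(e)
	}
}
```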

Built-in assumptions

Model performance currently seems best when relying on assumptions baked in during the training process. This is especially problematic when using external libraries newer than the model's training knowledge cut-off date.

Context window woes

The current context windows of state-of-the-art models are at least 128k tokens.

But once part of it is taken up by the system prompt and tool call descriptions, and then by the investigation of the codebase, the context window turns out to be too small.

The biggest problem here is that model performance seems to drop off a cliff when using large parts of the given context window.
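Given that cliff, one mitigation is to compact well before the window fills up rather than at the last moment. A minimal sketch; the window size matches the figure above, but the 60% threshold is an illustrative assumption, not a measured constant:

```go
package main

import "fmt"

const (
	contextWindow = 128_000 // tokens, as per the state-of-the-art minimum above
	compactAt     = 0.6     // compact at 60% usage, before the observed drop-off
)

// shouldCompact reports whether the session has crossed the compaction
// threshold, leaving headroom before quality degrades.
func shouldCompact(usedTokens int) bool {
	return float64(usedTokens) >= compactAt*float64(contextWindow)
}

func main() {
	fmt.Println(shouldCompact(50_000)) // false
	fmt.Println(shouldCompact(90_000)) // true
}
```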

E.g. Claude Opus 4.5, the latest, greatest and priciest model at the time of writing, seems to love ignoring the codebase style once 3/4 of its context window is used up.

Compacting the context window to allow seamless, effectively endless sessions is also hard to steer to success. Carrying information about code style and previously solved problems across the summarization boundary feels like a roll of the dice. And telling the model that the summary is only half the story also leads to failure, because the model then starts going through the codebase all over again, using up precious tokens.
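One way to take style notes out of that dice roll is to pin them outside the summarized region, so they survive every compaction verbatim instead of being rolled into a summary. A minimal sketch; the struct and field names are illustrative:

```go
package main

import "fmt"

// Session splits context into a pinned region (style rules, solved
// problems) that is never summarized, and ordinary history that gets
// replaced by a summary on compaction.
type Session struct {
	pinned  []string
	history []string
}

// Compact replaces the full history with a single summary turn,
// leaving the pinned notes untouched.
func (s *Session) Compact(summary string) {
	s.history = []string{summary}
}

// Prompt assembles the context sent to the model: pinned notes first,
// then whatever history (or summary) remains.
func (s *Session) Prompt() []string {
	return append(append([]string{}, s.pinned...), s.history...)
}

func main() {
	s := &Session{
		pinned:  []string{"style: tabs, table-driven tests"},
		history: []string{"turn 1", "turn 2", "turn 3"},
	}
	s.Compact("summary of turns 1-3")
	fmt.Println(s.Prompt())
}
```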

Vision

Vision still doesn't really work. Even simple tasks like recreating an HTML or SVG file from a screenshot make models seem incompetent (e.g. recreating a company logo as an SVG).

Closing words

I think we still have a long road ahead of us with the current state of transformer models.

When you create agents for specific tasks, you have to optimize your program for the specific model family.

We've come really far with the current LLM technology and I think a lot of "hacks" on top of the current implementations can still improve the current state significantly.
