Problems with the current LLMs

Last Update: 2025-12-11

I recently developed a TUI for generating code via LLMs in my free time, called scriptschnell. This gave me some insight into what is currently lacking in LLMs.

Speed

After using Cerebras and, to a lesser extent, Groq (Groq with a q), the speed that OpenAI's models gpt-5.1-codex and gpt-5.1-codex-mini provide is lacking.

Code generation requires a lot of looking up current implementation details, creating or changing files, and then validating the result. All of this requires a lot of tokens (in my experience, just searching through the codebase takes at least 20k tokens).

Some code generation applications like Windsurf use a significantly faster model to speed up tasks like exploring the codebase.
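The routing idea can be sketched in a few lines. This is a hypothetical illustration, not Windsurf's or scriptschnell's actual logic, and the model names are placeholders:

```go
package main

import "fmt"

// pickModel routes cheap, high-volume tasks (exploring or searching the
// codebase) to a fast model and reserves the slower, stronger model for
// the actual code edits. Task names and model names are assumptions.
func pickModel(task string) string {
	switch task {
	case "explore", "search", "summarize":
		return "fast-model" // e.g. something served by Cerebras or Groq
	default:
		return "strong-model" // e.g. gpt-5.1-codex
	}
}

func main() {
	fmt.Println(pickModel("explore")) // fast-model
	fmt.Println(pickModel("edit"))    // strong-model
}
```

The interesting design question is where to draw the line: exploration results produced by the fast model still end up in the strong model's context, so errors there propagate.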

Tool calls

You can tell that many models are trained to use specific tool calls. E.g. when prompting Kimi K2 Instruct with just "ls", it tries to use a tool call named shell even though that tool doesn't exist.
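One workaround is to tolerate these trained-in tool names instead of failing the call. A minimal sketch, assuming a hypothetical alias table (this is not scriptschnell's actual API):

```go
package main

import "fmt"

// toolAliases maps tool names that models tend to hallucinate onto the
// tools the agent actually exposes. Both sides are made-up examples.
var toolAliases = map[string]string{
	"shell":   "bash",
	"run":     "bash",
	"execute": "bash",
}

// resolveTool returns the real tool to dispatch to, or false if the
// agent should send an error message back to the model instead.
func resolveTool(name string, available map[string]bool) (string, bool) {
	if available[name] {
		return name, true
	}
	if alias, ok := toolAliases[name]; ok && available[alias] {
		return alias, true
	}
	return "", false
}

func main() {
	tools := map[string]bool{"bash": true, "read_file": true}
	name, ok := resolveTool("shell", tools)
	fmt.Println(name, ok) // bash true
}
```

Returning a clear "tool shell does not exist, available tools are: …" error to the model also works, but silently aliasing saves a round trip.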

More specifically, it took a lot of trial and error to get models to write more complex programs for my golang sandbox tool call (e.g. building an application and extracting key errors with a summarize method).
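The "extract key errors" step described above could look roughly like this: run the build, then keep only the first few compiler error lines instead of feeding the model the full output. A sketch under assumptions (the function name, line limit, and `.go:` heuristic are mine, not the actual sandbox implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// summarizeBuildOutput keeps at most maxErrors compiler error lines
// from raw `go build` output. Go compiler errors have the shape
// "file.go:line:col: message", so matching ".go:" is a cheap filter.
func summarizeBuildOutput(out string, maxErrors int) []string {
	var errs []string
	for _, line := range strings.Split(out, "\n") {
		if strings.Contains(line, ".go:") {
			errs = append(errs, line)
			if len(errs) == maxErrors {
				break
			}
		}
	}
	return errs
}

func main() {
	out := "# example.com/app\n" +
		"main.go:12:3: undefined: foo\n" +
		"main.go:20:5: cannot use x (variable of type int) as string value\n" +
		"note: module requires Go 1.22\n"
	for _, e := range summarizeBuildOutput(out, 5) {
		fmt.Println(e)
	}
}
```

In practice the raw output would come from something like `exec.Command("go", "build", "./...")` with `CombinedOutput()`; truncating it keeps the feedback loop cheap in tokens.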

Built-in assumptions

Model performance currently seems best when the model can rely on assumptions baked in during the training process. This is especially problematic when using external libraries newer than the training knowledge cutoff date.

Context window woes

The context windows of current state-of-the-art models are at least 128k tokens.

But once part of that is already taken up by the system prompt and tool call descriptions, and more by the investigation of the codebase, the context window is too small.

The biggest problem here is that model performance seems to drop off a cliff once large parts of the given context window are in use.

E.g. Claude Opus 4.5, the latest, greatest and priciest model at the time of writing, seems to love ignoring the codebase style once 3/4 of its context window is used up.

Compacting the context window to allow seemingly endless sessions is also hard to steer to success. Transmitting information about code style and problems already solved across the summarization boundary feels like a roll of the dice. And telling the model that the summary is only half the story somehow leads to failure, because the model then begins going through the codebase again, using up precious tokens.
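One way to make compaction less of a dice roll is to trigger it well before the window fills and to pin the notes that must survive every summarization pass. A minimal sketch, assuming invented struct fields and thresholds (not any agent's real implementation):

```go
package main

import "fmt"

// session tracks context usage plus a small set of pinned notes
// (code style rules, problems already solved) that every compaction
// prompt must carry across the summarization boundary verbatim.
type session struct {
	contextWindow int // total window in tokens
	usedTokens    int
	pinnedNotes   []string
}

// shouldCompact triggers at 75% usage, since quality seems to degrade
// well before the hard limit is reached. Integer math avoids floats.
func (s *session) shouldCompact() bool {
	return s.usedTokens*4 >= s.contextWindow*3
}

// compactionPrompt asks for a summary while forcing the pinned notes
// through unchanged, instead of hoping the summarizer keeps them.
func (s *session) compactionPrompt() string {
	return fmt.Sprintf(
		"Summarize the session so far. Repeat these notes verbatim: %v",
		s.pinnedNotes)
}

func main() {
	s := &session{
		contextWindow: 128000,
		usedTokens:    100000,
		pinnedNotes:   []string{"use tabs", "tests live next to the code"},
	}
	fmt.Println(s.shouldCompact()) // true
	fmt.Println(s.compactionPrompt())
}
```

This doesn't solve the "model re-explores the codebase anyway" failure mode, but pinning makes at least the code-style half deterministic.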

Vision

Vision still doesn't really work. Even simpler tasks like creating an HTML or SVG file from a screenshot make the models seem incompetent (e.g. recreating a company logo as an SVG).

Closing words

I think we still have a long road ahead of us with the current state of transformer models.

When you create agents for specific tasks, you have to optimize your program for the specific model family.

We have come really far with current LLM technology, and I think a lot of "hacks" on top of the current implementations can still improve the state of things significantly.
