AI Myths — Busted¶

Common misconceptions about AI, machine learning, and large language models. Each myth is addressed with what is actually happening under the hood, and why the distinction matters in practice.

"AI understands language"¶

Reality: Language models predict the next token based on statistical patterns learned from training data. There is no comprehension, no beliefs, no internal world model that we can verify. The output resembles understanding because the patterns are rich enough to cover most situations convincingly.

Why it matters: If you treat an LLM as a reasoning engine, you will be surprised when it confidently produces incorrect information. If you treat it as an extremely capable pattern-matching system, you will design better prompts, better systems, and more realistic safeguards around it.

"More parameters means a smarter model"¶

Reality: A smaller model trained on high-quality data with good techniques can significantly outperform a much larger model trained on poor data. The Chinchilla paper demonstrated that most large models at the time were over-parameterized relative to the amount of training data they received. Microsoft's Phi-2 (2.7B parameters) outperformed models many times its size on several benchmarks simply due to better data curation.

Why it matters: Default to the model that fits your task, latency, and cost requirements — not the largest one available. Bigger is not always better, and the gap is often narrower than expected.

"Neural networks are black boxes — you can't understand them"¶

Reality: Interpretability tools have matured significantly. Attention visualization shows which tokens a model focuses on. Probing classifiers reveal what information is encoded in hidden representations. Mechanistic interpretability research has identified specific circuits — such as induction heads — that perform identifiable functions. The picture is incomplete, but neural networks are far from opaque.

Why it matters: You can debug neural networks. Gradient analysis, activation patching, and attention maps are real tools with real diagnostic value. "It's a black box" is increasingly less accurate as an excuse for skipping analysis.

"AI will replace programmers"¶

Reality: AI has changed programming — it has not replaced it. Models handle boilerplate, autocomplete patterns, and draft initial implementations. Humans still design systems, make architectural decisions, evaluate correctness, manage edge cases, and handle everything the model gets wrong. The role has shifted from writing every line to reviewing, directing, and architecting. Engineers who use AI as a tool are more productive, not redundant.

Why it matters: The most valuable engineers going forward are those who combine domain expertise with the ability to work effectively alongside AI tools. Both skill sets together compound in value.

"You need a PhD in math to work in AI"¶

Reality: You need a working understanding of linear algebra, calculus, probability, and optimization — not graduate-level proofs. You need the intuition to understand what operations do and why they matter, not the ability to derive them from first principles. The foundational math required to build and fine-tune neural networks is accessible to anyone who has covered undergraduate-level coursework.

Why it matters: The gatekeeping around AI math is often overstated. If you can multiply matrices, compute derivatives, and reason about probability distributions, you have the mathematical foundation to build serious AI systems.

"GPT stands for General Purpose Technology"¶

Reality: GPT stands for Generative Pre-trained Transformer. Generative means it produces text as output. Pre-trained means the model was trained once on a large corpus before being adapted for specific applications. Transformer refers to the neural network architecture introduced in the 2017 paper "Attention Is All You Need."

Why it matters: Precision in terminology matters when evaluating vendor claims, reading research, and communicating with engineering teams. GPT is a specific architecture and training approach, not a general category.

"Temperature controls creativity"¶

Reality: Temperature is a mathematical scaling factor applied to output probabilities before token selection. A higher temperature flattens the probability distribution, making less likely tokens more probable — which increases variability. A lower temperature sharpens the distribution, pushing the model toward the most probable tokens. This is randomness control, not a measure of creative thinking or quality.

Why it matters: When output is too repetitive, raise temperature. When it is too incoherent or unpredictable, lower it. Framing it as a "creativity dial" leads to incorrect intuitions and misconfigured systems.

"Fine-tuning teaches the model new facts"¶

Reality: Fine-tuning adjusts how a model behaves, not what it knows. If information was not present in the pre-training data, fine-tuning will not reliably introduce it — the model may hallucinate plausible-sounding responses that are not grounded in the new material. Fine-tuning is effective for changing style, tone, output format, and task-specific behavioral patterns. For new factual knowledge, use retrieval-augmented generation (RAG).

Why it matters: Teams frequently choose fine-tuning when RAG is the right tool, and vice versa. Fine-tune to change how the model responds. Use RAG to give the model access to specific, accurate, updatable knowledge.

"A bigger context window means the model uses all of it equally"¶

Reality: Models exhibit degraded performance on long contexts. Research has documented the "lost in the middle" effect: models attend more strongly to the beginning and end of long prompts and less reliably to content in the middle. A 200K-token context window does not mean the model processes all 200K tokens with equal fidelity. Longer contexts also increase cost and latency.

Why it matters: Do not assume that stuffing a large context window with all available information is equivalent to careful retrieval. Targeted RAG with precise retrieval often outperforms brute-force context stuffing, even when the window technically fits the content.

"AI agents are autonomous"¶

Reality: Current AI agents operate in a structured loop: reason, select a tool, execute, observe the result, and repeat. The loop is defined by the engineer. The tools are built and maintained by the engineer. The guardrails are set by the engineer. The language model is the decision-making component inside that system — not the source of its autonomy. Agents do not have goals, plans, or self-direction in any meaningful sense.

Why it matters: When an agent fails, the failure is almost always in the loop design, tool definitions, or context management — not a mysterious property of the model. Treating agents as autonomous leads to under-engineered systems with insufficient guardrails.

"Transformers understand word order because of positional encoding"¶

Reality: Self-attention has no inherent sense of sequence. It treats input as a set of tokens, not an ordered list. Positional encoding is added to inject order information — different methods (sinusoidal, learned, RoPE, ALiBi) handle this differently. None of them give the model sequential understanding in the way recurrent architectures like LSTMs had it built in. Positional encoding is a practical workaround, and an active area of research.

Why it matters: This explains why very long sequences or tasks that depend on precise positional reasoning (like tracking position in a long document) can be challenging for transformers, even with large context windows.

"Pre-training is just reading the internet"¶

Reality: Pre-training is the process of learning to predict the next token across a massive text corpus. Through this deceptively simple objective, the model learns grammar, factual associations, code patterns, reasoning structures, and much more. However, it also learns biases, errors, and misinformation present in the data. The quality of pre-training data — how it is curated, filtered, and deduplicated — is one of the most significant differentiators between model families.

Why it matters: Garbage in, garbage out — at scale. Understanding that pre-training data quality is a primary driver of model quality is essential for evaluating model claims and making informed model selection decisions.

"RLHF aligns AI with human values"¶

Reality: RLHF aligns a model with the preferences of the specific humans who provided feedback — a necessarily limited, culturally situated, and internally inconsistent group. Those raters disagree with each other, carry biases, and cannot cover every possible situation. RLHF makes models more helpful and less harmful within the scope the feedback defined. It is one alignment technique among many, not a comprehensive solution to the alignment problem.

Why it matters: Do not conflate "RLHF-trained" with "aligned" in the deep sense. It is a training technique that moves behavior in a useful direction. The alignment problem remains an open research challenge.

"Embeddings capture meaning"¶

Reality: Embeddings capture statistical co-occurrence patterns. Words and phrases that appear in similar contexts during training end up with similar vector representations. This correlates strongly with semantic meaning — strongly enough to be highly useful — but the underlying mechanism is distributional statistics, not semantic comprehension. The famous "king − man + woman ≈ queen" relationship works because of distributional patterns in text, not because the model understands royalty or gender.

Why it matters: Embeddings are powerful for similarity search, clustering, and retrieval. Do not over-interpret what "similar vectors" means — it reflects distributional patterns, not semantic equivalence.

"Zero-shot means no training was involved"¶

Reality: Zero-shot means no task-specific examples are provided in the prompt at inference time. The model was still trained on billions of tokens. It generalizes from patterns learned during pre-training to handle new task formats it was not explicitly trained on. Few-shot means providing a small number of examples in the prompt. Neither term refers to a model that learned without training.

"AI models learn the way humans do"¶

Reality: Human learning is sample-efficient, continuous, and cross-domain. A child learns to recognize a dog from a handful of examples and generalizes instantly. Neural networks require millions of examples, have fixed weights after training, and generalize within (but often poorly outside) their training distribution. The analogy between machine learning and human learning is useful for intuition but breaks down quickly under scrutiny.

Why it matters: Anthropomorphizing models produces wrong expectations — about generalization, about learning from new information, about what "understanding" means. Better mental models lead to better system design.

"Scaling laws mean bigger is always better"¶

Reality: Scaling laws describe predictable, empirical relationships between compute, data volume, and model capability. They also show diminishing returns — doubling parameter count does not double performance. They assume data scales proportionally with compute. In practice, many of the most impactful improvements come from better data quality, better architectures, and better training methodologies — not raw scale alone.

Why it matters: A well-engineered smaller model often solves real problems more efficiently and cost-effectively than a larger one. Scale is a variable, not a strategy.

"Open source AI is the same as open weights"¶

Reality: Most models described as "open source" release only the model weights — not the training data, training code, evaluation infrastructure, or data pipeline. True open source AI, like the OLMo project, releases all of those artifacts. Open weights are valuable and enable local deployment and fine-tuning, but they represent a substantially more limited openness than the open source software community has traditionally understood the term to mean.

Why it matters: When evaluating models for compliance, reproducibility, or scientific transparency, the distinction matters significantly. Know what "open" actually means for each model you are assessing.

"Prompt engineering is not real engineering"¶

Reality: Prompt engineering is interface design at the boundary between human intent and model behavior. Doing it well requires understanding tokenization, context window constraints, how models handle ambiguity, output parsing, and failure modes. It is closer to API contract design than to casual conversation. The best prompt engineers apply systematic, testable, iterative methods — the same disciplines that define good software engineering.

Why it matters: Dismissing prompt engineering as non-technical leads to underinvesting in it. A well-designed prompt system is a durable engineering artifact, not a collection of lucky phrases.

"CNNs are obsolete — everything is transformers now"¶

Reality: Vision Transformers (ViT) outperform CNNs on many large-scale benchmarks, but CNNs remain widely deployed. They are faster at inference, better suited for edge and mobile devices, require less data to perform well, and have useful inductive biases like translation invariance and sensitivity to local spatial patterns. Many production computer vision systems still use CNNs or hybrid architectures that combine both approaches.

Why it matters: Architecture selection should be driven by your constraints — dataset size, inference latency, hardware, and task characteristics — not by what is currently generating the most research excitement.

"You need massive compute to build useful AI applications"¶

Reality: Massive compute is required to pre-train foundation models from scratch. It is not required to build useful AI applications. Fine-tuning, LoRA, and QLoRA allow meaningful model adaptation on a single consumer GPU. Many production AI applications require no training at all — just well-designed prompts and retrieval pipelines. The compute barrier applies to creating foundation models, not to leveraging them.

Why it matters: The barrier to entry for building AI-powered products is significantly lower than the barrier to training frontier models. These are different problems. Conflating them overstates the infrastructure requirements for most real-world applications.

"LLMs always give the same answer to the same question"¶

Reality: LLMs are stochastic by default. Given the same input, the model samples from a probability distribution at each token step, producing different outputs on different runs unless temperature is set to zero and sampling is disabled. Even with deterministic settings, minor differences in floating-point arithmetic across hardware can produce variation. Reproducibility in LLM systems requires deliberate configuration.

Why it matters: Non-determinism has significant implications for testing, evaluation, and production reliability. Regression tests that rely on exact string matching will be brittle. Evaluation pipelines must account for output variance.

"RAG and fine-tuning solve the same problem"¶

Reality: RAG and fine-tuning address fundamentally different problems. RAG gives a model access to specific, current, retrievable knowledge at inference time — facts, documents, internal data. Fine-tuning changes the model's learned behavior — its style, format, tone, and task-specific response patterns. They are complementary, not interchangeable. Using fine-tuning to inject factual knowledge is unreliable; using RAG to change behavioral style is inefficient.

Why it matters: Choosing the wrong tool wastes significant time and compute. The decision should be driven by whether you are trying to change what the model knows (RAG) or how it behaves (fine-tuning).

Last updated: April 2026 | Contributions welcome