
# Chapter 8: What Are the Different Kinds of AI?

### The “Family Tree” Chapter

**Q1: What’s the difference between “Narrow AI” and “General AI”?**

**A:** Think of Narrow AI as a world-class chef who can cook any dish perfectly but can’t drive a car, balance a checkbook, or hold a conversation about anything except food. They’re brilliant in their specialty and useless outside it.

That’s every AI that exists today. Narrow AI (also called “Weak AI”) is designed for one specific task. AlphaGo plays Go better than any human but couldn’t play chess if its life depended on it. Your email spam filter can spot phishing attempts but can’t write a poem. They’re geniuses in a cage.

**General AI** (or “Strong AI”) would be more like a human—able to learn and perform any intellectual task. It could beat you at chess, then write you a sonnet about your loss, then help you plan a vacation, then diagnose your medical symptoms. It would be flexible, adaptable, and broadly capable.

General AI doesn’t exist yet. It’s the holy grail of AI research, and estimates for when (or if) we’ll achieve it range from a few decades to never. Every AI you’ve ever interacted with is Narrow AI, no matter how impressive it seems.

**Q2: Is ChatGPT the same kind of AI as a self-driving car?**

**A:** No, they’re different species in the AI family. They share some distant relatives but have completely different specialties.

ChatGPT is built on a **Large Language Model (LLM)**. It was trained on text: massive amounts of human writing. Its entire world is words. It understands patterns in language, so it can generate text, answer questions, write code, and hold conversations. But it has no direct experience of the physical world. It doesn’t know what “red” looks like or what “heavy” feels like. It only knows how humans describe these things.

A self-driving car uses multiple AI systems working together: **computer vision** to “see” the road and obstacles, **sensor fusion** to combine data from cameras, radar, and lidar, and **decision-making algorithms** to plan routes and avoid collisions. It understands the physical world in real-time but can’t write you a poem about the experience.

One is a words expert. The other is a navigation and perception expert. Both are Narrow AI, just with different specializations.

**Q3: What is “Generative AI” and what can it generate?**

**A:** Generative AI is the branch of AI that creates new things rather than just analyzing or classifying existing things. It’s the difference between an art critic (who looks at paintings) and an artist (who paints new ones).

Most traditional AI was **discriminative**—it categorized, predicted, or made decisions. “Is this email spam? Will this customer buy? Is this X-ray showing cancer?”

Generative AI **creates**. It learns patterns from existing data and then generates new, original examples that follow those patterns. Depending on what it was trained on, it can generate:

* **Text** (ChatGPT, Claude) — articles, poems, code, conversations
* **Images** (DALL-E, Midjourney) — photorealistic pictures, paintings, designs
* **Music** — original compositions in any style
* **Video** — short clips and animations
* **Voice** — synthetic speech that sounds like a real person
* **3D models** — objects for games or virtual reality
* **Code** — working software in various programming languages

The “generative” part means it’s producing something new, not just recognizing something old. It’s creative in a statistical sort of way.
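To make “learn patterns, then generate” concrete, here is a minimal sketch of one of the simplest generative techniques: a word-pair (bigram) model. It is far cruder than the neural networks behind ChatGPT or DALL-E, and the tiny corpus and function names are made up for illustration, but the two phases are the same: study existing data, then sample something new from the learned patterns.

```python
import random
from collections import defaultdict

def train_bigram_model(text):
    """Record which words follow each word in the training text."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers

def generate(followers, start, length, seed=0):
    """Produce new text by repeatedly sampling a plausible next word."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    word = start
    output = [word]
    for _ in range(length - 1):
        options = followers.get(word)
        if not options:  # dead end: nothing ever followed this word
            break
        word = rng.choice(options)
        output.append(word)
    return " ".join(output)

# Hypothetical training text, chosen only to keep the example small.
corpus = "the cat sat on the mat and the dog sat on the rug"
model = train_bigram_model(corpus)
print(generate(model, start="the", length=8))
```

The output can be a sentence that never appeared in the corpus, yet every word pair in it did. That is what “creative in a statistical sort of way” looks like at toy scale.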

**Q4: What is “Computer Vision” and how does AI see?**

**A:** Computer vision is the field of AI that enables machines to interpret and understand the visual world. It’s not “seeing” the way humans see—it’s mathematically analyzing pixels.

Imagine you have a grid of millions of tiny colored squares (that’s a digital image). Computer vision algorithms analyze patterns in these squares: where light meets dark (edges), where colors cluster (objects), how patterns change over time (motion).

This allows AI to do things like:

* **Object detection:** “There’s a person, a car, and a stop sign in this image”
* **Face recognition:** “This face belongs to John, not Sarah”
* **Image classification:** “This is a photo of a beach, not a mountain”
* **Optical character recognition (OCR):** Turning a photo of text into actual digital text
* **Medical imaging analysis:** “This scan shows a possible tumor”

Computer vision doesn’t “see” meaning. It sees numbers and matches patterns. But those pattern-matching abilities now rival or exceed human performance in many specific tasks.
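As a rough illustration of “where light meets dark,” the sketch below scans a tiny grid of brightness values and marks the spots where neighboring pixels differ sharply. The image, the threshold, and the function name are assumptions chosen for readability. Real systems learn much richer filters, but this is the kind of arithmetic that sits at the bottom of the stack.

```python
# A tiny grayscale "image": each number is a pixel brightness (0 = black, 255 = white).
# The left half is dark and the right half is bright.
image = [
    [10, 12, 11, 240, 242, 241],
    [11, 10, 12, 241, 240, 243],
    [12, 11, 10, 242, 241, 240],
    [10, 11, 12, 240, 243, 242],
]

def horizontal_edges(pixels, threshold=100):
    """Mark positions where brightness jumps sharply between side-by-side pixels."""
    edges = []
    for row in pixels:
        edge_row = []
        for left, right in zip(row, row[1:]):
            edge_row.append(1 if abs(right - left) > threshold else 0)
        edges.append(edge_row)
    return edges

for row in horizontal_edges(image):
    print(row)
```

Each printed row has a single 1 at the boundary between the dark and bright halves. The program has no idea what the picture shows; it only found where the numbers change abruptly.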

**Q5: What is “Natural Language Processing” and how does AI understand me?**

**A:** Natural Language Processing (NLP) is the branch of AI that helps computers understand, interpret, and generate human language. It’s the bridge between how humans communicate (messy, ambiguous, context-dependent) and how computers prefer to communicate (precise, structured, unambiguous).

NLP involves multiple challenges:

* **Understanding:** Figuring out that “I’m feeling under the weather” means sick, not literally under weather
* **Context:** Knowing that “bank” means river side in one sentence and financial institution in another
* **Intent:** Recognizing that “Can you pass the salt?” is a request, not a question about your abilities
* **Sentiment:** Detecting that a customer review is angry versus satisfied
* **Generation:** Producing natural-sounding responses

When you talk to Siri, when your email suggests replies, when Google Translate converts Spanish to English, when your word processor checks grammar—that’s NLP in action.

Modern NLP, powered by large language models, has become so good that it often feels like understanding. But remember: it’s statistical pattern matching, not comprehension in the human sense.

**Q6: What are “Large Language Models” (LLMs) and why are they “large”?**

**A:** Large Language Models (LLMs) like GPT-4, Claude, and Llama are AI systems trained on enormous amounts of text to understand and generate human language. They’re the technology behind ChatGPT and similar chatbots.

They’re called “large” for two reasons:

First, they’re trained on **massive datasets**—think millions of books, billions of web pages, trillions of words. They’ve read more than any human could in a thousand lifetimes.

Second, they have **huge numbers of parameters** (the internal settings that get adjusted during training). Early language models had millions of parameters. Modern LLMs have hundreds of billions. Each parameter is like a tiny dial that fine-tunes how the model processes language. More parameters mean more capacity to capture nuance and complexity.
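To get a feel for what a parameter count means, here is a minimal sketch that tallies the dials in a small, fully connected neural network. The layer sizes are invented for illustration, and real LLMs use a more elaborate architecture, but the tally is the same kind of bookkeeping: every connection has a weight and every unit has a bias.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network."""
    total = 0
    for inputs, outputs in zip(layer_sizes, layer_sizes[1:]):
        total += inputs * outputs  # one weight per connection between layers
        total += outputs           # one bias per unit in the receiving layer
    return total

# Hypothetical network shapes, from toy-sized to merely small.
print(count_parameters([4, 8, 2]))
print(count_parameters([1000, 1000, 1000]))
```

Widening the layers from a handful of units to a thousand takes the count from dozens to millions, which hints at how quickly the numbers climb toward the billions quoted for modern models.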

What makes LLMs special is their versatility. Unlike older AI that needed separate models for translation, summarization, and question-answering, a single LLM can do all of these and more. They’re generalists of language.

The catch? They’re enormously expensive to train (tens or hundreds of millions of dollars) and require massive computing power. That’s why only a handful of companies have built them.

**Q7: What is “Speech Recognition” and how does it turn my voice into text?**

**A:** Speech recognition is the technology that converts spoken words into written text. It’s what lets you dictate to your phone, talk to smart speakers, or get automatic captions on videos.

The process happens in stages:

**First**, the system captures your audio and breaks it into tiny slices—milliseconds long. It analyzes the sound waves for patterns: frequencies, amplitudes, timings.

**Second**, it identifies which sounds (phonemes) are present. English has about 44 distinct sounds—the “b” in “bat,” the “ee” in “bee.” The AI gets really good at recognizing these from all the variations in human voices.

**Third**, it figures out how the sounds combine into words. This is where language models help. “I scream” and “ice cream” sound almost identical. The AI uses context to decide which one makes sense.

**Finally**, it outputs text.

Modern speech recognition is remarkably good, even with accents, background noise, and fast talking. But it still struggles with heavy accents, overlapping speech, and technical jargon it wasn’t trained on.
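The third stage, using context to choose between sound-alike phrases, can be sketched with simple word-pair counts. The counts below are invented for illustration, and production systems use far larger statistical or neural language models, but the idea is the same: prefer the transcription whose word sequence is more plausible.

```python
# Hypothetical counts of how often each word pair appeared in some training text.
PAIR_COUNTS = {
    ("eat", "ice"): 40,
    ("ice", "cream"): 120,
    ("eat", "i"): 1,
    ("i", "scream"): 15,
    ("when", "i"): 200,
    ("when", "ice"): 2,
}

def plausibility(words):
    """Multiply (count + 1) over adjacent word pairs; higher means more plausible."""
    total = 1
    for pair in zip(words, words[1:]):
        total *= PAIR_COUNTS.get(pair, 0) + 1  # +1 so unseen pairs do not zero out the score
    return total

def pick_transcription(candidates):
    """Choose the candidate word sequence with the highest plausibility score."""
    return max(candidates, key=plausibility)

print(pick_transcription([["eat", "ice", "cream"], ["eat", "i", "scream"]]))
print(pick_transcription([["when", "ice", "cream"], ["when", "i", "scream"]]))
```

The same sounds resolve to different words depending on what came before them, which is exactly the job the language model does in the pipeline.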

**Q8: What is “Predictive AI” and how does it guess the future?**

**A:** Predictive AI looks at past data to forecast future outcomes. It’s not magic fortune-telling—it’s sophisticated pattern recognition applied to “what happens next.”

Think of it like weather forecasting. Meteorologists don’t know for certain if it will rain tomorrow. But they look at thousands of past weather patterns, compare them to current conditions, and calculate probabilities: “There’s a 70% chance of rain.”

Predictive AI does the same thing with all kinds of data:

* **Retail:** “Based on past purchases and current trends, this customer is 80% likely to buy next month”
* **Healthcare:** “Given these symptoms and patient history, there’s a 60% probability of this diagnosis”
* **Manufacturing:** “This machine’s vibration patterns suggest it will fail in about two weeks”
* **Finance:** “This transaction has a 95% probability of being fraudulent”

The AI doesn’t “know” the future. It calculates probabilities based on patterns in historical data. It’s most reliable when the future resembles the past, and least reliable during unprecedented change.
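Here is a minimal sketch of that idea in the spirit of the weather example: look up past days that resembled today and report how often it rained afterward. The records, the similarity rule, and the function name are all assumptions for illustration. Real predictive models weigh many variables at once.

```python
# Hypothetical history: (humidity percent, whether it rained the next day).
history = [
    (85, True), (90, True), (40, False), (88, True), (35, False),
    (82, False), (92, True), (45, False), (87, True), (30, False),
]

def rain_probability(records, humidity, tolerance=10):
    """Estimate the chance of rain from past days with similar humidity."""
    similar = [rained for h, rained in records if abs(h - humidity) <= tolerance]
    if not similar:
        return None  # nothing comparable in the past, so no honest estimate
    return sum(similar) / len(similar)

print(rain_probability(history, humidity=86))
print(rain_probability(history, humidity=60))
```

The second call returns `None` because no past day looked like the one being asked about. That is the toy version of a model becoming unreliable when the future stops resembling the past.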

**Q9: What is “Recommendation AI” (like on Netflix or TikTok)?**

**A:** Recommendation AI is the engine that decides what you see next on streaming services, shopping sites, and social media. It’s perhaps the most influential AI in your daily life, constantly shaping what you watch, buy, and read.

These systems work through a combination of techniques:

**Collaborative filtering:** “People who liked what you liked also liked this.” The AI finds patterns across millions of users. If your viewing history matches Group A, it recommends what Group A enjoyed that you haven’t seen yet.

**Content-based filtering:** “You watched a lot of romantic comedies, so here are more romantic comedies.” The AI analyzes the properties of content you engaged with and finds similar items.

**Reinforcement learning:** “You watched this video all the way through—that’s positive reinforcement. You scrolled past this one—that’s negative.” The AI constantly updates based on your behavior.
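The first of these techniques, collaborative filtering, is simple enough to sketch in a few lines. The users, titles, and scoring rule below are made up for illustration. Production systems use far more data and more sophisticated math, but the logic of “people like you also liked this” is the same.

```python
# Hypothetical viewing histories: user -> set of titles they liked.
likes = {
    "you":   {"Space Saga", "Robot Heist", "Moon Base"},
    "ana":   {"Space Saga", "Robot Heist", "Star Pilots"},
    "ben":   {"Space Saga", "Moon Base", "Star Pilots", "Alien Garden"},
    "chloe": {"Baking Duel", "Garden Makeover"},
}

def similarity(a, b):
    """Jaccard similarity: shared titles divided by all titles either person liked."""
    return len(a & b) / len(a | b)

def recommend(user, likes):
    """Rank unseen titles by how similar their fans are to the given user."""
    scores = {}
    for other, titles in likes.items():
        if other == user:
            continue
        weight = similarity(likes[user], titles)
        if weight == 0:
            continue  # no overlap in taste, so this person's picks carry no signal
        for title in titles - likes[user]:
            scores[title] = scores.get(title, 0.0) + weight
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))

print(recommend("you", likes))
```

A title liked by two similar viewers outranks a title liked by one, and the picks of a viewer with no overlapping taste are ignored entirely.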

What makes TikTok’s recommendation so powerful is its speed and granularity. It learns from every second of your attention. Scroll past something in 0.5 seconds? That’s data. Watch something twice? That’s stronger data. Rewatch a specific part? Even stronger.

The goal is to maximize engagement by predicting exactly what will keep you watching. It’s brilliant technology with profound psychological effects.

**Q10: What is “Anomaly Detection” and how does it spot fraud?**

**A:** Anomaly detection is AI that identifies things that don’t fit the pattern—the odd ones out, the unusual events, the potential problems hiding in normal data.

Imagine you’re a security guard at a mall. You learn what “normal” looks like: people walking at a certain pace, stores opening and closing on schedule, usual crowd patterns. One day, someone starts running. You notice immediately because it’s different.

Anomaly detection works the same way. The AI learns what “normal” means for your data—typical spending patterns, usual network traffic, standard machine vibrations. Then it flags anything that deviates significantly.

This is how your credit card company knows to alert you when:

* You usually buy coffee near home, but suddenly there’s a $2,000 charge in another country
* You typically spend $50-100 on groceries, but today there’s a $500 charge
* You never make purchases at 3 AM, but suddenly there’s one

The AI isn’t told specific fraud rules. It learns your normal pattern and screams “something’s different!” when that pattern breaks.
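One simple way to turn “learn normal, flag different” into code is a z-score test: measure how many standard deviations a new value sits from the historical average. This is a minimal sketch with invented charge amounts and an assumed threshold of three standard deviations. Real fraud systems look at many signals at once (location, time, merchant), not just the amount.

```python
from statistics import mean, stdev

# Hypothetical history of one customer's grocery charges, in dollars.
past_charges = [52.0, 61.5, 48.0, 75.0, 66.0, 58.5, 70.0, 55.0]

def is_anomaly(amount, history, z_threshold=3.0):
    """Flag an amount that sits far outside the customer's usual range."""
    typical = mean(history)
    spread = stdev(history)
    z_score = abs(amount - typical) / spread
    return z_score > z_threshold

print(is_anomaly(64.0, past_charges))
print(is_anomaly(500.0, past_charges))
```

Nothing in the code says “$500 is fraud.” The threshold is relative to this customer’s own history, so the same charge could be perfectly normal for someone else.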

Beyond fraud, anomaly detection monitors factory equipment for early failure signs, spots network intrusions, identifies unusual medical test results, and finds manufacturing defects.

**Q11: What is “Robotics” and how is AI different from automation?**

**A:** Robotics is the field of designing and building physical machines that can interact with the world. When you add AI to robots, they stop being just programmable machines and start becoming adaptable agents.

Here’s the key difference:

**Traditional automation** (like a factory robot arm) does the same exact motion thousands of times. It’s programmed: “Move 30cm right, grip, rotate 90 degrees, release.” If you move the object 1cm left, the robot fails because it has no eyes and no adaptability.

**AI-powered robotics** can sense and adapt. A robot with computer vision can see that the object moved and adjust its grip. A robot with reinforcement learning can figure out how to pick up objects it’s never seen before. A robot with natural language understanding could respond to “grab the red one, not the blue one.”

Boston Dynamics’ robots that walk on rough terrain? That’s AI in action, constantly adjusting balance based on sensor feedback. A Roomba that maps your home and cleans efficiently? That’s AI-powered robotics.

The robot is the body. AI is the nervous system that lets it respond intelligently to a changing world.

**Q12: What are “Expert Systems” (the old kind of AI)?**

**A:** Expert systems were the dominant form of AI from the 1970s through the 1990s. They were based on a simple idea: capture human expertise as a set of rules and let a computer apply those rules consistently.

Imagine you want to diagnose diseases. You’d sit down with a dozen expert doctors and ask: “What rules do you use? If symptom A and symptom B, what’s the diagnosis? What exceptions exist?” You’d encode all this as “if-then” statements—potentially thousands of them.

The resulting system could then diagnose patients by following the same rules the doctors provided.

Expert systems worked well for narrow, well-understood domains where experts could articulate their reasoning. They were used for medical diagnosis, equipment troubleshooting, and financial planning.

But they had a fatal flaw: they couldn’t learn or adapt. If a new disease appeared, a human had to add new rules. They were brittle—stray outside their programmed knowledge, and they failed completely. They also couldn’t handle the kind of pattern recognition that modern machine learning excels at.
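A toy version shows both the appeal and the brittleness. The rules below are invented and wildly oversimplified (they are not medical guidance), and historical expert systems used dedicated rule engines rather than a few lines of Python, but the shape is the same: hand-written “if these conditions, then this conclusion.”

```python
# Hypothetical, oversimplified rules for illustration only.
# Each rule pairs a set of required symptoms with a conclusion.
RULES = [
    ({"fever", "cough", "body aches"}, "possible flu"),
    ({"sneezing", "runny nose"}, "possible common cold"),
    ({"itchy eyes", "sneezing"}, "possible allergy"),
]

def diagnose(symptoms):
    """Return every conclusion whose required symptoms are all present."""
    observed = set(symptoms)
    matches = [conclusion for required, conclusion in RULES if required <= observed]
    return matches or ["no rule applies"]

print(diagnose(["fever", "cough", "body aches"]))
print(diagnose(["sneezing", "runny nose", "itchy eyes"]))
print(diagnose(["loss of smell"]))
```

The last call is the fatal flaw in miniature: a symptom nobody wrote a rule for gets no useful answer until a human edits `RULES`.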

Today’s AI learns from data. Expert systems learned from human experts writing rules. It’s a fundamentally different approach, and modern AI has largely replaced them for complex tasks.

**Q13: What is “Computer Vision” vs. “Image Generation” (seeing vs. creating)?**

**A:** This is the difference between an art critic and an artist. Both understand art, but one analyzes while the other creates.

**Computer vision** is about understanding images. Give it a photo, and it tells you what’s there: “This is a beach scene with three people, an umbrella, and a dog.” It might also tell you where things are located, how they’re moving, or whether the image has been manipulated. Computer vision takes images as input and produces information as output.

**Image generation** (like DALL-E or Midjourney) is about creating images. Give it a description—“a photorealistic beach scene with three people and a golden retriever under a red umbrella at sunset”—and it produces a new image matching that description. It takes information as input and produces images as output.

Both rely on understanding visual patterns, but they work in opposite directions. Computer vision compresses images into understanding. Image generation expands understanding into images.

Some advanced systems combine both, like an AI that can edit photos based on text commands: “Make the sunset more dramatic” requires both understanding the current image and generating the modified version.

**Q14: What is “Multimodal AI” (AI that handles text, images, and sound)?**

**A:** Multimodal AI is artificial intelligence that can work with multiple types of data—text, images, audio, video—often all at once. It’s like having a single person who can read, see, and hear, rather than separate specialists for each sense.

Most early AI was unimodal—specialized in one thing. You had a text AI, a separate image AI, a separate audio AI. They couldn’t share understanding.

Multimodal AI changes this. A multimodal model can:

* Look at a photo and describe it in words
* Read a recipe and imagine what the dish should look like
* Watch a video and generate a transcript with speaker identification
* Hear a sound and generate an image of what might make that sound
* Answer questions about the content of images or videos

GPT-4 with vision capabilities is multimodal—it can “see” images you upload and reason about them. Google’s Gemini is built as multimodal from the ground up.

The magic is that these models develop a shared understanding across different senses. The concept of “dog” exists in the same internal representation whether it comes from the word “dog,” a photo of a dog, or the sound of barking.

This is closer to how humans experience the world—we don’t have separate brains for vision and language; it all integrates into a unified understanding.

**Q15: What is “Open Source AI” vs. “Closed Source AI”?**

**A:** This is the difference between a public library and a private collection. One is freely available for anyone to use, study, and build upon. The other is owned and controlled by a company.

**Open Source AI** means the model’s code and trained parameters are publicly released. Anyone can download it, run it on their own computer, modify it, and build applications with it. Examples include Meta’s Llama, Mistral’s models, and various models from research institutions.

**Benefits:** Transparency (you can inspect how it works), customization (you can fine-tune it for your needs), privacy (you run it locally, no data sent to companies), and independence (you’re not locked into one provider).

**Closed Source AI** means the model is kept secret by its creator. You access it through an API or web interface, but you can’t see inside or modify it. Examples include OpenAI’s GPT-4, Google’s Gemini, and Anthropic’s Claude.

**Benefits:** The company bears all the computing costs, handles safety filtering, and maintains the infrastructure. You get access to cutting-edge capabilities without technical expertise.

There’s tension between these approaches. Open source enables innovation and transparency but risks misuse. Closed source offers more control but concentrates power in a few companies. Both have important roles.

**Q16: What are “AI Agents” and can they act on my behalf?**

**A:** AI agents are systems that can independently perform tasks on your behalf, making decisions and taking actions without your moment-by-moment supervision. They’re not just answering questions—they’re doing things.

Think of a personal assistant. You don’t tell them “type each letter of this email.” You say “schedule a meeting with the team for next week.” They figure out the rest: check calendars, find available times, send invitations, book a room.

AI agents work similarly. You give them a goal, and they figure out the steps:

* “Book me a flight to Chicago for under $400 leaving Friday returning Monday” — the agent searches airlines, compares prices, checks your calendar, and makes the purchase
* “Research competitors for my business and email me a summary each morning” — the agent browses websites, reads news, compiles findings, and sends reports
* “Order groceries for the week based on my usual shopping list, but don’t buy things I already have” — the agent checks your inventory, compares prices, and places the order

Current AI agents are still primitive compared to humans. They struggle with multi-step tasks, get confused by unexpected situations, and can’t easily learn from mistakes. But they’re improving rapidly and represent a major shift from chatbots to do-bots.

**Q17: What is “Edge AI” (AI that runs on my phone, not in the cloud)?**

**A:** Edge AI is artificial intelligence that runs directly on your device—phone, laptop, smart speaker, camera—rather than sending data to the cloud for processing. The “edge” means the edge of the network, where your device lives.

Most early AI required cloud computing. You’d speak to your phone, it would send your voice to Google’s servers, they’d process it, and send back the answer. This worked but had drawbacks: it required an internet connection, introduced delays, and sent your private data to someone else’s computer.

Edge AI changes this. Your phone now has a tiny AI chip that can process language, recognize faces, and translate speech entirely on the device. Examples:

* Google’s Pixel phones transcribe voice recordings without internet
* iPhones recognize faces in your photos locally
* Smart cameras detect motion without uploading video
* Keyboard apps predict your next word without sending typing to servers

The benefits are huge: privacy (your data never leaves your device), speed (no internet lag), reliability (works offline), and cost (no cloud computing fees).

The trade-off is that edge AI is less powerful than cloud AI—your phone can’t match a data center full of supercomputers. But chips are getting better, and more AI is moving to the edge every year.

**Q18: What is “Diffusion” and how does it create images from noise?**

**A:** Diffusion is the technology behind modern AI image generators like DALL-E, Midjourney, and Stable Diffusion. It’s a fascinating process that works a bit like sculpting.

Imagine you have a block of marble. A sculptor removes material until a statue emerges. Diffusion works in a similar way: it starts with pure noise (like TV static) and gradually removes the noise until a clear image emerges.

The training process teaches the AI what “removing noise” looks like. Researchers take millions of real images and gradually add noise until they become pure static. They show the AI this process: “Here’s a photo of a cat. Here’s the same photo with a little noise. Here’s with more noise. Here’s pure static.”

The AI learns to reverse this—to look at a noisy image and predict what the less-noisy version should look like. It learns how to “denoise.”

When you give it a prompt like “a cat wearing a hat,” the generation process starts with random static. The AI repeatedly applies its denoising, each time making the image clearer and more aligned with “cat wearing a hat.” After many steps, the static resolves into your requested image.
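The two halves of this process can be sketched in plain Python. Treat this as a heavily simplified illustration, not how any real image generator is built: the “image” is just four numbers, the noise amount `beta` is an arbitrary choice, and the learned neural network is replaced by a stand-in (`oracle`) that already knows the answer, purely so the loop has something to call.

```python
import random

def add_noise(pixels, beta, rng):
    """One forward step: shrink the signal slightly and mix in Gaussian noise."""
    keep = (1.0 - beta) ** 0.5
    spread = beta ** 0.5
    return [keep * p + spread * rng.gauss(0.0, 1.0) for p in pixels]

def noising_schedule(pixels, steps, beta=0.1, seed=0):
    """Build training material: the clean image followed by noisier and noisier copies."""
    rng = random.Random(seed)
    frames = [list(pixels)]
    for _ in range(steps):
        frames.append(add_noise(frames[-1], beta, rng))
    return frames

def denoise_loop(start, predict_clean, steps):
    """Reverse process: repeatedly move a little toward the denoiser's guess."""
    current = list(start)
    for step in range(steps):
        guess = predict_clean(current)
        blend = 1.0 / (steps - step)  # cautious early moves, full trust on the last step
        current = [c + blend * (g - c) for c, g in zip(current, guess)]
    return current

clean = [1.0, -1.0, 1.0, -1.0]  # hypothetical four-pixel "image"
frames = noising_schedule(clean, steps=50)

# Stand-in for the trained network. A real model has to predict this from the
# noisy input and the text prompt; here we cheat and hand it the answer.
oracle = lambda noisy: clean

restored = denoise_loop(frames[-1], oracle, steps=50)
print([round(v, 3) for v in frames[-1]])
print([round(v, 3) for v in restored])
```

The interesting part in a real system is everything hidden inside `predict_clean`: that is where millions of training images and your prompt shape what the static turns into.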

It’s like starting with a cloudy sky and watching the clouds slowly form into a specific shape—except the AI is guiding the clouds based on your description.

**Q19: What are “Transformers” (the “T” in ChatGPT) and why were they revolutionary?**

**A:** Transformers are a type of neural network architecture introduced in a 2017 paper titled “Attention Is All You Need.” They transformed AI (pun intended) and became the foundation for virtually all modern language models.

Before transformers, most language models processed text sequentially, one word at a time, left to right. This was slow and missed connections between far-apart words. In a long sentence, the AI might forget the subject by the time it reached the verb.

Transformers introduced a mechanism called **“self-attention.”** This allows the model to look at all words in a sentence simultaneously and weigh their relationships. It can connect “they” to “the researchers” even if they’re paragraphs apart. It can understand that in “The animal didn’t cross the street because it was too tired,” “it” refers to the animal, but in “…because it was too narrow,” “it” refers to the street.
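Here is a minimal sketch of the arithmetic behind self-attention, using plain Python. It makes several simplifying assumptions: the two-number “embeddings” are hand-picked rather than learned, and the learned projections that real transformers apply to produce separate queries, keys, and values are left out, so each word vector plays all three roles.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product attention where queries, keys, and values are all `vectors`."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Compare this word against every word in the sentence, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The new representation is a weighted blend of all the word vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical embeddings: "it" is placed close to "animal" on purpose.
tokens = ["animal", "street", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

outputs, weights = self_attention(vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

In the last printed row, “it” puts more weight on “animal” than on “street.” A trained model arrives at weights like these on its own, and every word is compared with every other word in one pass rather than one at a time.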

The name “transformer” comes from how the architecture transforms sequences of words into rich internal representations that capture meaning and relationships.

This breakthrough enabled models to handle much longer texts, understand nuance better, and train more efficiently. Nearly every major language model you’ve heard of, including GPT, BERT, Gemini, and Claude, is built on the transformer architecture.

The “T” in ChatGPT stands for “Transformer”: GPT is short for “Generative Pre-trained Transformer.” It’s right there in the name.

**Q20: Which kind of AI is most like science fiction’s “thinking machines”?**

**A:** None of them—at least not yet. Science fiction imagined machines that think, feel, and understand like humans. Today’s AI, no matter how impressive, is fundamentally different.

The closest might be **Large Language Models** like ChatGPT. They can hold conversations, answer questions, write stories, and even seem to reason. They feel the most “human-like” in interaction. But it’s an illusion. They’re sophisticated pattern-matching engines, not thinking beings.

**Multimodal models** come closer to human-like perception because they integrate different senses. An AI that can see images, read text, and hear audio begins to approach something like unified understanding.

**AI agents** that take actions in the world feel more like autonomous entities. A system that can book your travel, manage your calendar, and negotiate with customer service seems almost like a digital person.

But the fundamental gap remains: today’s AI has no consciousness, no desires, no genuine understanding, no sense of self. It doesn’t know what it’s saying. It doesn’t care about outcomes. It’s a tool, no matter how eloquent.

Science fiction’s thinking machines remain fiction. What we have is something else—powerful, useful, transformative—but not artificial people. The day we create real machine consciousness, we’ll need a whole new chapter to describe it.


