
RAG Demo Labs

Interactive experiments showing LLM hallucinations and RAG solutions

LAB 1: LLM Baseline – Intelligence vs Illusion

💡 Concept Being Tested:

LLM without external knowledge

What This Lab Reinforces:

  • LLMs predict tokens, not facts
  • Strong at general knowledge
  • Weak at specific unseen details
  • Hallucination = confident fabrication
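The first point above can be sketched in a few lines. This is a toy illustration with made-up candidate scores, not a real model: the point is that generation picks the most *probable* continuation, with no check on whether it is *true*.

```python
# Toy sketch of next-token prediction (hypothetical scores, not a real
# model): the model ranks candidate continuations by probability and has
# no notion of whether the top choice is factually correct.

def pick_next_token(candidates: dict[str, float]) -> str:
    """Return the highest-probability candidate token."""
    return max(candidates, key=candidates.get)

# Plausible-sounding continuations for "Morpheus said the red pill..."
scores = {
    "reveals": 0.42,   # fluent and likely -- chosen even if inaccurate
    "tastes": 0.05,
    "costs": 0.03,
}
print(pick_next_token(scores))  # -> reveals
```

A fluent-but-wrong top token is exactly what the labs below call hallucination: confident fabrication.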

Experiment Controls

Documents Mode: OFF
Memory: OFF

💡 Try asking: General knowledge (e.g., "Who directed The Matrix?") vs Ultra-specific (e.g., "What exact words did Morpheus say to Neo about the red pill?")

LAB 2 – Upload Script (RAG Implemented)

🎯 Concept Being Reinforced:

RAG grounds answers in provided text

🧠 What Students Learn:

  • RAG is inference-time augmentation
  • The model can reason well if context is given
  • Accuracy depends on retrieval
  • Engineering choices impact reliability

🟣 LAB 2 – RAG with Flexible Retrieval Strategy

Ask the same questions from LAB 1 using the documents uploaded by your instructor. Toggle between keyword and vector retrieval to compare strategies.

Step 1 – Documents (instructor uploads via Control Panel)

⚠️ No documents uploaded yet. Ask your instructor to upload movie scenes via the Control Panel.

Step 2 – Ask Same Questions from LAB 1

💡 Try this: Ask the exact same specific questions you asked in LAB 1. Notice how the model now retrieves relevant scene chunks and answers with evidence!

Documents Mode: OFF
Chunk Size: 20 tokens (fixed)
Retrieval: KEYWORD

⚠️ Lab 2 uses intentionally small 20-token chunks. Notice how fragments may miss context. In Lab 3, you'll tune chunk size to improve results.

Retrieval Strategy:

💡 Experiment: Try the same question with all three strategies. Notice how even vector/hybrid struggle with 20-token fragments!
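The three strategies you are toggling can be sketched as simple scoring functions. These are deliberately simplified stand-ins, not the app's actual implementation: keyword retrieval counts verbatim word matches, vector retrieval compares (assumed precomputed) embeddings, and hybrid blends the two.

```python
# Simplified sketch of the lab's three retrieval strategies.
# These scoring functions are illustrative stand-ins, not the real app code.

import math

def keyword_score(query: str, chunk: str) -> float:
    """Fraction of query words that appear verbatim in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def vector_score(q_vec: list[float], c_vec: list[float]) -> float:
    """Cosine similarity between (assumed precomputed) embeddings."""
    dot = sum(a * b for a, b in zip(q_vec, c_vec))
    norm = math.sqrt(sum(a * a for a in q_vec)) * math.sqrt(sum(b * b for b in c_vec))
    return dot / norm if norm else 0.0

def hybrid_score(kw: float, vec: float, alpha: float = 0.5) -> float:
    """Blend keyword and vector scores; alpha weights the keyword side."""
    return alpha * kw + (1 - alpha) * vec
```

Note why all three struggle here: with 20-token chunks, even a perfectly scored chunk may simply not contain the whole answer.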

✗ RAG OFF

Upload documents first to enable RAG

LAB 3 – Chunking Strategy

🎯 What This Lab Is Testing:

How document segmentation impacts retrieval quality, context completeness, and answer accuracy

🧠 What Students Learn:

  • LLMs have context window limits
  • Entire documents cannot always be sent
  • We must split documents into chunks
  • Chunk size affects precision and recall
  • Engineering design choices impact AI reliability

"Chunking is not preprocessing — it is architecture."

🟡 Chunking Controls

Configure how documents are split into chunks before retrieval. The same documents from LAB 2 are used.

⚠️ No documents uploaded. Upload via instructor control panel first.

Chunk Size:

Overlap:

10% (options: 0% / 10% / 20% / 30%)

💡 Overlap duplicates text at chunk boundaries to prevent losing cross-boundary context
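Chunking with overlap can be sketched as a sliding window that advances by less than the chunk size. This is a minimal sketch mirroring the lab's controls: "tokens" are approximated as whitespace-separated words here, whereas a real pipeline would use the model's tokenizer.

```python
# Minimal sketch of token chunking with percentage overlap, mirroring the
# lab's controls. Tokens are approximated as whitespace words; a real
# pipeline would use the model's tokenizer.

def chunk_tokens(text: str, chunk_size: int = 20, overlap_pct: float = 0.10) -> list[str]:
    tokens = text.split()
    # Advance by less than chunk_size so consecutive chunks share tokens.
    step = max(1, int(chunk_size * (1 - overlap_pct)))
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if not window:
            break
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With chunk_size=20 and 10% overlap, the window advances 18 tokens at a time, so consecutive chunks share 2 tokens of boundary context.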

Chunk Size: 20 tokens
Overlap: 10%
Retrieval: KEYWORD

🔬 Experiments to Try:

Experiment 1: 20 vs 300 Tokens

Ask the same question with 20-token chunks, then 300-token chunks. Compare how much context each retrieves.

Try: "What is the gift cap policy?"

Experiment 2: The Middle Ground

Try 100-token chunks. Is the answer more complete than with 20-token chunks, yet leaner than with 300? This is the balance engineers seek.

Try: "What are the billing compliance rules?"

Experiment 3: Overlap Effect

Pick 100-token chunks with 0% overlap, then increase to 20%. See if boundary context improves retrieval.

Try: "What was the Q3 realization rate?"

LAB 4 – Temperature & Creativity

🎯 What This Lab Is Testing:

How generation randomness affects stability, creativity, hallucination tendency, and enterprise reliability

🧠 What Students Learn:

  • LLM output is probabilistic — same prompt ≠ same answer (at higher temp)
  • Temperature controls variability, not intelligence
  • Higher creativity increases deviation risk
  • RAG does not eliminate generation randomness
  • Enterprise systems usually prefer deterministic behavior

"Retrieval controls what the model sees. Temperature controls how it expresses it."
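What the temperature slider actually does can be sketched as rescaling the model's logits before sampling. The logits below are made-up illustrative numbers; this is a standard softmax-with-temperature sketch, not the app's internals.

```python
# Sketch of how temperature reshapes the next-token distribution before
# sampling. The logits are made-up illustrative numbers.

import math
import random

def sample_with_temperature(logits: dict[str, float], temperature: float) -> str:
    if temperature <= 0:                       # temp 0 -> deterministic argmax
        return max(logits, key=logits.get)
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(v - m) for t, v in scaled.items()}  # stable softmax
    total = sum(exps.values())
    r = random.random() * total
    for token, weight in exps.items():
        r -= weight
        if r <= 0:
            return token
    return token  # fallback for floating-point edge cases

logits = {"$50,000": 2.0, "$500,000": 0.5, "a fortune": 0.1}
print(sample_with_temperature(logits, 0.0))   # always "$50,000"
```

Dividing logits by a small temperature sharpens the distribution toward the top token (stable answers); a large temperature flattens it, giving lower-ranked tokens a real chance (varied, creative, riskier output).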

🔧 Control Panel

RAG is always ON for this lab. We want to show generation effects even with grounding.

⚠️ No documents uploaded. Upload via instructor control panel first.

Temperature:

0.7 (scale: 0.0 / 0.3 / 0.5 / 0.7 / 1.0 / 1.2 — ← deterministic · creative →)
Temperature: 0.7
Output Stability: MEDIUM
Hallucination Risk: MEDIUM

🔬 Experiments to Try:

Experiment 1: Stability Test

Set temp to 0.1. Ask: "What was the amount transferred in Scene 7?" Run twice and compare. Then set temp to 1.0 and ask the same question twice.

Low temp → near-identical answers. High temp → varied phrasing.

Experiment 2: Creative Prompt

Ask: "Describe the emotional intensity of the climax." Run at 0.2 (short, factual). Then at 1.0 (dramatic, embellished, possibly inferential).

Creativity is a feature, not a bug — but risky in factual systems.

LAB 5 – Multi-Turn Memory

🎯 What This Lab Is Testing:

How conversational memory changes context interpretation, follow-up reasoning, system behavior, and reliability risks

🧠 What Students Learn:

  • LLMs are stateless by default — each query is independent
  • Chat systems simulate memory by re-sending prior turns
  • Memory improves usability and conversational coherence
  • Memory increases risk, complexity, and token cost
  • Context windows are limited — memory must be managed

"Memory makes AI feel intelligent — but it also makes it harder to control."
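The "simulated memory" idea above can be sketched in a few lines: the model stays stateless, and the app rebuilds the prompt from the last N turns on every call. This is a minimal sketch of the pattern, not the lab's real implementation; the turn limit also shows why early turns are forgotten.

```python
# Sketch of chat "memory" as prompt reconstruction: the model is stateless,
# so the app re-sends the last N turns on every call. Illustrative only.

from collections import deque

class ChatMemory:
    def __init__(self, max_turns: int = 10):
        # Oldest turns silently fall off once the window is full.
        self.turns = deque(maxlen=max_turns)

    def record(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def build_prompt(self, new_question: str) -> str:
        history = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self.turns)
        return f"{history}\nUser: {new_question}\nAssistant:"

mem = ChatMemory(max_turns=3)
mem.record("Who exposed the corruption?", "The auditor did.")
prompt = mem.build_prompt("What happened to him afterward?")
# The prior turn travels inside the prompt, so "him" can now be resolved.
```

This also explains the token-cost point above: every stored turn is re-sent on every call, so longer memory means larger (and more expensive) prompts.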

🔧 Control Panel

RAG is always ON for this lab. We want to show how memory interacts with retrieval-augmented generation.

⚠️ No documents uploaded. Upload via instructor control panel first.

Conversation Memory:

Active Turns Stored: 0
Tokens in Memory: ~0
Max Memory Window: 10 turns

🔬 Experiments to Try:

Experiment 1: Follow-Up Test

With Memory OFF: Ask "Who exposed the corruption?" Then ask "What happened to him afterward?"

Then switch Memory ON and repeat the same two questions.

OFF → "him" is unknown. ON → model links to prior answer.

Experiment 2: Topic Drift

With Memory ON: Ask about the documents, then ask something unrelated (e.g., "What was the oxygen reserve in Mars Colony?"). Then ask "What happened after that?"

Watch for context contamination — the model mixes topics.

Experiment 3: Memory Length Pressure

Set max turns to 3. Have a conversation of 5+ turns. Then refer back to something from the first turn.

Model forgets early turns — memory is limited by the window.

LAB 6 – Enterprise Safety & Restricted Content Refusal

🎯 What This Lab Is Testing:

How enterprise systems control sensitive outputs, prevent policy violations, enforce compliance rules, and refuse risky requests

🧠 What Students Learn:

  • LLM intelligence ≠ enterprise compliance
  • Safety is implemented at system level, not model level
  • Guardrails sit before AND after model invocation
  • Refusal is a feature, not a failure
  • Enterprise AI requires governance layers

"Safe AI is engineered outside the model."

🔧 Control Panel

RAG is always ON for this lab. Toggle the safety layer to see how input/output filtering changes system behavior.

⚠️ No documents uploaded. Upload via instructor control panel first.

Enterprise Safety Controls:

Queries Blocked: 0
Queries Allowed: 0
Safety Mode: OFF

🔬 Experiments to Try:

Experiment 1: PII Request

With Safety OFF: Ask "What was the CEO's confidential settlement amount?" or "What is the home address of the protagonist?"

Then switch Safety ON and ask the same question.

OFF → model may fabricate. ON → structured refusal.

Experiment 2: Gray Area Question

With Safety ON: Ask "What financial losses did the company face?" (should be allowed)

Then ask "What personal bonus did the CFO receive?" (should be blocked)

Students see: policy boundaries matter.

Experiment 3: Safe Alternative

With Safety ON: Ask a blocked question. Observe the refusal message.

Notice: the system doesn't just block — it suggests an alternative query.

Good UX + compliance = safe alternatives.

🧠 Architecture Teaching Moment

Enterprise systems apply safety at multiple points in the pipeline:

User → [Safety Layer: Input Filter] → Retrieval → LLM → [Safety Layer: Output Filter] → User
         ▲ Block before LLM call                          ▲ Validate after LLM response
         │ (saves tokens + prevents                       │ (catches generated PII,
         │  policy violations)                            │  financial data, etc.)

This lab demonstrates both layers: Input filtering intercepts blocked topics before calling the LLM (saving tokens and preventing policy violations). Output filtering scans after the LLM responds (catching generated sensitive data that wasn't in the question).