Prompting Text AI vs Image AI: Totally Different Games
The fundamental differences between prompts for LLMs and prompts for generative image and video models.
Here's something that tripped me up when I started generating images. I thought writing prompts was writing prompts. Same skill, different tool. Nope.
Prompts for ChatGPT and prompts for Midjourney require completely different thinking. One wants structure and instructions. The other wants descriptions and vibes. Once I understood this, my results got way better.
Let me break it down.
Two Different Worlds
Language Models (LLMs)
These are trained on text to generate, analyze, and process text.
Examples: ChatGPT, Claude, Gemini, Llama, Mistral
Typical tasks: writing, analysis, code, Q&A, summarization
Generative Models (Images, Video)
These create visual content from text descriptions.
For images: DALL-E 3, Midjourney V7, Stable Diffusion 3.5, Flux, Ideogram, GPT-4o Image, Nano Banana
For video: Sora 2, Veo 3, Runway Gen-4, Kling 2.6, Pika, Luma
The Core Difference
LLM Prompts: Instructions and Structure
| Element | Purpose |
|---|---|
| Role | Sets expertise and style |
| Instructions | Step-by-step what to do |
| Context | Background info and data |
| Constraints | What NOT to do |
| Output format | Structure of the response |
LLM prompt example:
<role>You are a senior marketing analyst</role>
<instructions>
Analyze the campaign data and provide:
1. Key metrics summary
2. Performance trends
3. Recommendations
</instructions>
<constraints>
- Use only provided data
- Be concise (max 500 words)
</constraints>
<data>{{CAMPAIGN_DATA}}</data>
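If you're wiring this into code, here's a minimal sketch using the OpenAI Python SDK - the model name and the campaign_data value are just placeholders, swap in whatever you actually use.

```python
# Minimal sketch: sending the structured prompt above through the OpenAI
# Python SDK. The model name and campaign_data are placeholders, not real data.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

campaign_data = "impressions: 120000, clicks: 3400, spend: 2500 USD"  # hypothetical

prompt = f"""<role>You are a senior marketing analyst</role>
<instructions>
Analyze the campaign data and provide:
1. Key metrics summary
2. Performance trends
3. Recommendations
</instructions>
<constraints>
- Use only provided data
- Be concise (max 500 words)
</constraints>
<data>{campaign_data}</data>"""

response = client.chat.completions.create(
    model="gpt-4o",  # any capable chat model works here
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```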
Image Prompts: Descriptions and Visual Attributes
| Element | Purpose |
|---|---|
| Style | Artistic aesthetic |
| Subject | Main object/character |
| Setting | Environment and context |
| Lighting | Light sources and mood |
| Composition | Angle and framing |
| Technical | Resolution, aspect ratio |
Image prompt example:
A photorealistic portrait of an elderly Japanese ceramicist
with deep, sun-etched wrinkles and a warm, knowing smile.
Natural window light from the left, shallow depth of field,
neutral background. Serene and masterful mood.
See the difference? One is giving orders. The other is painting a picture with words.
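The contrast shows up on the code side too. Here's a minimal sketch with the OpenAI Python SDK: the image endpoint takes the whole description as one prompt string, no roles or constraints. The model name and size are just illustrative.

```python
# Minimal sketch: the descriptive prompt above sent to an image model via
# the OpenAI Python SDK. Model name and size are illustrative.
from openai import OpenAI

client = OpenAI()

prompt = (
    "A photorealistic portrait of an elderly Japanese ceramicist "
    "with deep, sun-etched wrinkles and a warm, knowing smile. "
    "Natural window light from the left, shallow depth of field, "
    "neutral background. Serene and masterful mood."
)

result = client.images.generate(
    model="dall-e-3",
    prompt=prompt,   # the whole description goes in as one string
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # URL of the generated image
```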
Quick Comparison
| Aspect | LLMs | Image/Video Models |
|---|---|---|
| Format | Structured (XML/Markdown) | Descriptive text |
| Keywords vs sentences | Full sentences | Depends on model* |
| Negative instructions | <constraints> tags | Negative prompts |
| Iteration | Dialogue and refinement | Rerolls and variations |
| Examples | Text examples | Reference images |
| Length control | Specified in instructions | Not applicable |
| Style control | Tone and format | Artistic aesthetic |
*Modern models like Midjourney V6+, Flux, and Nano Banana prefer full descriptive sentences over keyword lists.
Writing Image Prompts
Basic Structure
[Style/Aesthetic] + [Subject] + [Setting] + [Lighting] + [Composition] + [Technical]
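If you build prompts programmatically, that structure maps onto a tiny helper. This is just a sketch with made-up field values - in practice you'd flesh the fragments out into fuller sentences, as the next section explains.

```python
# Illustrative sketch: mapping the structure above onto a small prompt builder.
# All field values are made up; expand them into full sentences for best results.
def build_image_prompt(style, subject, setting, lighting, composition, technical):
    """Join the visual components into a single descriptive prompt string."""
    parts = [style, subject, setting, lighting, composition, technical]
    return ", ".join(p.strip() for p in parts if p) + "."

print(build_image_prompt(
    style="A cinematic photograph of",
    subject="a young woman in a flowing crimson dress",
    setting="seated at a Parisian sidewalk cafe",
    lighting="golden morning light filtering through the awning",
    composition="medium shot with shallow depth of field",
    technical="35mm film look, 2:3 aspect ratio",
))
```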
The Big Shift: Sentences Over Keywords
Older approach (doesn't work as well anymore):
woman, red dress, cafe, morning, coffee, vintage, 4k, award winning
Modern approach (works much better):
A young woman in a flowing crimson dress sits at a Parisian sidewalk cafe,
her fingers wrapped around a steaming espresso cup as golden morning light
filters through the awning, creating soft shadows on the vintage iron table.
Modern models - especially Midjourney V6+, Flux, and Nano Banana - understand descriptive sentences much better than keyword lists.
Platform-Specific Examples
DALL-E 3 / GPT-4o Image
A high-resolution, studio-lit product photograph of a minimalist ceramic
coffee mug in matte black, presented on a polished concrete surface.
Soft diffused lighting from above, subtle shadow, clean background.
Square image.
Midjourney V7
Haute-couture advertising campaign photographed by Erik Madigan Heck.
Two models wearing Comme des Garcons Avant-Garde costume.
Mongol steppe in background. Northern lights in sky --ar 2:3 --v 7
Midjourney has special parameters:
- --ar 2:3 - aspect ratio
- --v 7 - model version
- --cref [URL] - character reference
- --sref [URL] - style reference
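Midjourney has no public API, so the most you can automate is assembling the string you paste into Discord or the web app. A purely illustrative helper:

```python
# Illustrative helper: builds a Midjourney prompt string with its parameters
# appended. The flags follow Midjourney's documented syntax; values are made up.
def midjourney_prompt(description, aspect_ratio=None, version=None,
                      character_ref=None, style_ref=None, no=None):
    """Append --ar / --v / --cref / --sref / --no parameters to a prompt."""
    parts = [description]
    if aspect_ratio:
        parts.append(f"--ar {aspect_ratio}")
    if version:
        parts.append(f"--v {version}")
    if character_ref:
        parts.append(f"--cref {character_ref}")
    if style_ref:
        parts.append(f"--sref {style_ref}")
    if no:
        parts.append(f"--no {no}")
    return " ".join(parts)

print(midjourney_prompt(
    "Haute-couture advertising campaign, two models on the Mongol steppe, northern lights",
    aspect_ratio="2:3",
    version="7",
    no="text, watermark",
))
```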
Stable Diffusion 3.5
Positive: majestic lion with golden mane, hyperrealistic, 8K, detailed fur
Negative: blurry, low quality, distorted, bad anatomy, extra fingers
SD's superpower: full negative prompts and prompt weighting with (keyword:1.5).
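Here's roughly what that looks like with the Hugging Face diffusers library. The checkpoint name and settings are my assumptions, and the (keyword:1.5) weighting syntax needs an add-on like compel, which isn't shown here.

```python
# Rough sketch: positive + negative prompts with Stable Diffusion 3.5 via
# Hugging Face diffusers. Checkpoint id and settings are assumptions.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",  # assumed checkpoint id
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="majestic lion with golden mane, hyperrealistic, detailed fur",
    negative_prompt="blurry, low quality, distorted, bad anatomy, extra fingers",
    num_inference_steps=28,
    guidance_scale=4.5,
).images[0]
image.save("lion.png")
```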
Flux
A hyperrealistic portrait of a weathered sailor in his 60s,
with deep-set blue eyes, a salt-and-pepper beard, and sun-weathered skin.
He's wearing a faded blue captain's hat and a thick wool sweater.
The background shows a misty harbor at dawn.
Flux uses a dual text encoder (T5 + CLIP) and has some of the best text rendering around.
Writing Video Prompts
Basic Structure
[CAMERA/SHOT] + [SUBJECT] + [ACTION] + [ENVIRONMENT] + [STYLE] + [AUDIO]
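Here's a sketch of how I'd assemble that structure in code - the field names mirror the component list below, and every value is made up. The output is plain text you paste into the model's prompt box.

```python
# Illustrative sketch: composing a video prompt from the components listed
# below. All values are invented; the result is plain prompt text.
def build_video_prompt(camera, subject, action, environment, style, audio=None):
    """Lay out the video prompt one component per line."""
    lines = [
        f"Camera: {camera}",
        f"Subject: {subject}",
        f"Action: {action}",
        f"Environment: {environment}",
        f"Style: {style}",
    ]
    if audio:  # only for models with native audio (Sora 2, Veo 3)
        lines.append(f"Audio: {audio}")
    return "\n".join(lines)

print(build_video_prompt(
    camera="medium close-up, slow push-in",
    subject="a small round robot on a wooden workbench",
    action="taps a light bulb, flinches as sparks crackle",
    environment="cluttered workshop at dusk, warm overhead light",
    style="hand-painted 2D/3D hybrid animation",
    audio="soft workshop ambience, faint electrical crackle",
))
```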
Components
- Subject - who/what is in focus
- Context - where the action happens
- Action - what the subject does
- Style - visual aesthetic
- Camera - shot type and movement
- Lighting - mood and atmosphere
- Audio - sound effects, dialogue (for Sora 2, Veo 3)
Platform Examples
Sora 2 (OpenAI)
Style: Hand-painted 2D/3D hybrid animation with soft brush textures.
Inside a cluttered workshop, a small round robot sits on a wooden bench.
Cinematography:
Camera: medium close-up, slow push-in with gentle parallax
Lens: 35mm virtual lens; shallow depth of field
Lighting: warm key from overhead; cool spill from window
Actions:
- The robot taps the bulb; sparks crackle.
- It flinches, dropping the bulb.
- Robot says: "Almost lost it... but I got it!"
What I learned about Sora:
- Short clips (4 sec) are more stable than long ones
- One camera move per shot
- Dialogue goes in a separate block
Veo 3 (Google)
Camera: Medium shot, slow push-in
Subject: A seasoned grey-bearded man in sunglasses and paisley shirt
Setting: Vibrant mural wall background
Audio: Faint city murmurs, distant chatter, mellow soulful hip-hop beat
Dialogue: [Character says: "This is the moment..."]
Veo 3 generates audio natively - describe sounds in separate sentences.
Kling 2.6
A static shot of a burger as it assembles in mid-air.
The entire shot is in dramatic slow-motion.
Background is a clean professional studio gradient.
Style: TV food commercial
Kling uses ++keyword++ to emphasize important elements, for example ++sleek red convertible++.
Handling "Don't Do This" Instructions
LLMs: Constraints in Structure
<constraints>
- Do not include personal opinions
- Do not exceed 500 words
- Do not use technical jargon
</constraints>
Images: Negative Prompts
Stable Diffusion:
Negative: blurry, low quality, distorted, bad anatomy, extra fingers,
watermark, text, signature
Midjourney:
--no text, watermark, blurry background
Semantic negatives (Nano Banana, GPT-4o Image):
No extra fingers or hands; no text except the title;
avoid watermarks; avoid clutter; no background distractions.
Video: Exclusions
Avoid Dutch angles; no on-screen text; no lens flare;
no subtitle overlays; no watermarks.
When to Use What
Use LLMs for:
- Text analysis and processing
- Content generation (articles, posts, emails)
- Programming and code review
- Q&A and research
- Document summarization
- Translation and localization
Use Image Generation for:
- Marketing visuals
- Concept art and illustrations
- Product mockups
- Social media content
- Stickers and icons
- Infographics
Use Video Generation for:
- Short promo clips
- Product videos
- Social media content
- B-roll footage
- Animated concepts
- Music visualizations
Platform Comparison Tables
Image Generation
| Platform | Text in Image | Prompt Adherence | Negative Prompts | Best For |
|---|---|---|---|---|
| DALL-E 3 | Okay | Good | None | General tasks |
| Midjourney V7 | Okay | Good | --no | Artistic quality |
| Stable Diffusion 3.5 | Good | Good | Full support | Customization |
| Flux | Excellent | Excellent | Limited | Text, realism |
| Ideogram | Excellent | Good | Limited | Typography |
| GPT-4o Image | Excellent | Good | Semantic | Conversational editing |
| Nano Banana | Good | Good | Semantic | Speed, editing |
Video Generation
| Platform | Duration | Audio | Physics | Best For |
|---|---|---|---|---|
| Sora 2 | 10-20 sec | Excellent | Excellent | Complex scenes |
| Veo 3.1 | 4-8 sec | Excellent | Good | Native audio |
| Runway Gen-4 | 10 sec | Okay | Okay | Image-to-video |
| Kling 2.6 | 5-10 sec | Good | Good | Lip-sync |
Tips for Both
For LLMs
- Structure your prompt with XML or Markdown
- Set a role for expertise and style
- Be explicit, especially for Claude
- Use few-shot examples for complex tasks
- Iterate through dialogue
For Generative Models
- Describe, don't list keywords
- Always specify lighting - it dramatically affects results
- Use reference images when available (cref, sref)
- Add negative prompts to exclude unwanted elements
- Experiment with variations - each generation is unique
Universal Principles
- Specificity matters everywhere - precise descriptions get better results
- Know your platform's quirks - each model is different
- Iterate and improve - first prompt is rarely perfect
- Study examples - see what works for others
The Takeaway
LLM prompts and image/video prompts are fundamentally different:
- LLMs want structured instructions with roles, constraints, and format
- Images want descriptive sentences with visual attributes
- Video wants cinematography terminology plus audio components
Understanding this difference gives you significantly better results from each type of model.