Generative AI | Dec 23, 2025 | 12 min read

Prompting Text AI vs Image AI: Totally Different Games

The fundamental differences between prompts for LLMs and generative AI for images and video.

Here's something that tripped me up when I started generating images. I thought writing prompts was writing prompts. Same skill, different tool. Nope.

Prompts for ChatGPT and prompts for Midjourney require completely different thinking. One wants structure and instructions. The other wants descriptions and vibes. Once I understood this, my results got way better.

Let me break it down.


Two Different Worlds

Language Models (LLMs)

These are trained on text to generate, analyze, and process text.

Examples: ChatGPT, Claude, Gemini, Llama, Mistral

Typical tasks: writing, analysis, code, Q&A, summarization

Generative Models (Images, Video)

These create visual content from text descriptions.

For images: DALL-E 3, Midjourney V7, Stable Diffusion 3.5, Flux, Ideogram, GPT-4o Image, Nano Banana

For video: Sora 2, Veo 3, Runway Gen-4, Kling 2.6, Pika, Luma


The Core Difference

LLM Prompts: Instructions and Structure

Element        Purpose
Role           Sets expertise and style
Instructions   Step-by-step what to do
Context        Background info and data
Constraints    What NOT to do
Output format  Structure of the response

LLM prompt example:

<role>You are a senior marketing analyst</role>
<instructions>
Analyze the campaign data and provide:
1. Key metrics summary
2. Performance trends
3. Recommendations
</instructions>
<constraints>
- Use only provided data
- Be concise (max 500 words)
</constraints>
<data>{{CAMPAIGN_DATA}}</data>
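
If you reuse a prompt like this, the {{CAMPAIGN_DATA}} placeholder has to be filled in before the prompt is sent. Here is a minimal sketch of one way to do that in Python. The helper name fill_template and the sample data are made up for illustration; nothing here is tied to a particular LLM API.

```python
import re

# The template mirrors the example above; {{NAME}} marks a placeholder.
PROMPT_TEMPLATE = """<role>You are a senior marketing analyst</role>
<instructions>
Analyze the campaign data and provide:
1. Key metrics summary
2. Performance trends
3. Recommendations
</instructions>
<constraints>
- Use only provided data
- Be concise (max 500 words)
</constraints>
<data>{{CAMPAIGN_DATA}}</data>"""


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace every {{NAME}} placeholder, failing loudly if a value is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return values[name]

    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


# Hypothetical sample data, only to show the call.
prompt = fill_template(PROMPT_TEMPLATE, {"CAMPAIGN_DATA": "clicks=1200, spend=$340"})
print(prompt)
```

Raising on a missing value is a deliberate choice: a prompt that still contains a literal {{CAMPAIGN_DATA}} tends to produce confident nonsense rather than an error.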

Image Prompts: Descriptions and Visual Attributes

Element      Purpose
Style        Artistic aesthetic
Subject      Main object/character
Setting      Environment and context
Lighting     Light sources and mood
Composition  Angle and framing
Technical    Resolution, aspect ratio

Image prompt example:

A photorealistic portrait of an elderly Japanese ceramicist
with deep, sun-etched wrinkles and a warm, knowing smile.
Natural window light from the left, shallow depth of field,
neutral background. Serene and masterful mood.

See the difference? One is giving orders. The other is painting a picture with words.


Quick Comparison

Aspect                 LLMs                       Image/Video Models
Format                 Structured (XML/Markdown)  Descriptive text
Keywords vs sentences  Full sentences             Depends on model*
Negative instructions  <constraints> tags         Negative prompts
Iteration              Dialogue and refinement    Rerolls and variations
Examples               Text examples              Reference images
Length control         Specified in instructions  Not applicable
Style control          Tone and format            Artistic aesthetic

*Modern models like Midjourney V6+, Flux, and Nano Banana prefer full descriptive sentences over keyword lists.


Writing Image Prompts

Basic Structure

[Style/Aesthetic] + [Subject] + [Setting] + [Lighting] + [Composition] + [Technical]
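
The slot order above can be sketched as a small Python helper. This is an illustration rather than any platform's API: the function name build_image_prompt and its parameters are assumptions, and each slot is expected to hold a full descriptive phrase, not a bare keyword.

```python
def build_image_prompt(
    subject: str,
    style: str = "",
    setting: str = "",
    lighting: str = "",
    composition: str = "",
    technical: str = "",
) -> str:
    """Join the filled slots in the order Style, Subject, Setting, Lighting,
    Composition, Technical, one sentence per slot."""
    if not subject.strip():
        raise ValueError("a subject is required")
    slots = [style, subject, setting, lighting, composition, technical]
    # Drop empty slots and trailing periods so the joined text reads cleanly.
    sentences = [s.strip().rstrip(".") for s in slots if s.strip()]
    return ". ".join(sentences) + "."


prompt = build_image_prompt(
    subject="An elderly Japanese ceramicist with a warm, knowing smile",
    style="Photorealistic portrait",
    lighting="Natural window light from the left",
    composition="Shallow depth of field, neutral background",
)
print(prompt)
```

Only the subject is mandatory here. The other slots are optional because a prompt with two well-described slots usually beats one with six vague ones.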

The Big Shift: Sentences Over Keywords

Older approach (doesn't work as well anymore):

woman, red dress, cafe, morning, coffee, vintage, 4k, award winning

Modern approach (works much better):

A young woman in a flowing crimson dress sits at a Parisian sidewalk cafe,
her fingers wrapped around a steaming espresso cup as golden morning light
filters through the awning, creating soft shadows on the vintage iron table.

Modern models - especially Midjourney V6+, Flux, and Nano Banana - understand descriptive sentences much better than keyword lists.

Platform-Specific Examples

DALL-E 3 / GPT-4o Image

A high-resolution, studio-lit product photograph of a minimalist ceramic
coffee mug in matte black, presented on a polished concrete surface.
Soft diffused lighting from above, subtle shadow, clean background.
Square image.

Midjourney V7

Haute-couture advertising campaign photographed by Erik Madigan Heck.
Two models wearing Comme des Garcons Avant-Garde costume.
Mongol steppe in background. Northern lights in sky --ar 2:3 --v 7

Midjourney has special parameters:

  • --ar 2:3 - aspect ratio
  • --v 7 - model version
  • --cref [URL] - character reference (V6; V7 replaces it with --oref, Omni Reference)
  • --sref [URL] - style reference
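
If you generate Midjourney prompts from a script, the parameters can be appended mechanically. A minimal sketch, assuming a hypothetical helper called with_midjourney_params; it only builds the text you would paste into Midjourney and does not call any Midjourney service.

```python
from typing import Optional


def with_midjourney_params(
    prompt: str,
    aspect_ratio: Optional[str] = None,
    version: Optional[int] = None,
    style_ref: Optional[str] = None,
) -> str:
    """Append --ar, --v and --sref flags to a descriptive prompt."""
    flags = []
    if aspect_ratio:
        flags.append(f"--ar {aspect_ratio}")
    if version is not None:
        flags.append(f"--v {version}")
    if style_ref:
        flags.append(f"--sref {style_ref}")
    # Parameters always go at the end, after the description.
    return " ".join([prompt.strip(), *flags])


mj_prompt = with_midjourney_params(
    "Mongol steppe in background. Northern lights in sky",
    aspect_ratio="2:3",
    version=7,
)
print(mj_prompt)
```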

Stable Diffusion 3.5

Positive: majestic lion with golden mane, hyperrealistic, 8K, detailed fur
Negative: blurry, low quality, distorted, bad anatomy, extra fingers

SD's superpower: full negative prompts, plus prompt weighting with (keyword:1.5) in front ends such as AUTOMATIC1111 and ComfyUI.
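
The weighting syntax is easy to get wrong by hand, so here is a small sketch that formats it. The helper names weight and sd_prompt are hypothetical, and the returned dictionary is just a convenient container: how the (keyword:1.5) syntax gets interpreted depends on the front end you paste the text into.

```python
def weight(term: str, strength: float) -> str:
    """Format a term with emphasis, e.g. weight("golden mane", 1.5)."""
    return f"({term}:{strength:g})"


def sd_prompt(positive_terms: list[str], negative_terms: list[str]) -> dict[str, str]:
    """Build the positive and negative prompt strings as a pair."""
    return {
        "prompt": ", ".join(positive_terms),
        "negative_prompt": ", ".join(negative_terms),
    }


pair = sd_prompt(
    ["majestic lion", weight("golden mane", 1.5), "hyperrealistic", "detailed fur"],
    ["blurry", "low quality", "distorted", "bad anatomy"],
)
print(pair["prompt"])
print(pair["negative_prompt"])
```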

Flux

A hyperrealistic portrait of a weathered sailor in his 60s,
with deep-set blue eyes, a salt-and-pepper beard, and sun-weathered skin.
He's wearing a faded blue captain's hat and a thick wool sweater.
The background shows a misty harbor at dawn.

Flux uses a dual text encoder (T5 + CLIP) and is one of the strongest models for rendering text inside images.


Writing Video Prompts

Basic Structure

[CAMERA/SHOT] + [SUBJECT] + [ACTION] + [ENVIRONMENT] + [STYLE] + [AUDIO]

Components

  1. Subject - who/what is in focus
  2. Context - where the action happens
  3. Action - what the subject does
  4. Style - visual aesthetic
  5. Camera - shot type and movement
  6. Lighting - mood and atmosphere
  7. Audio - sound effects, dialogue (for Sora 2, Veo 3)
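
The components above map naturally onto a small data structure. The sketch below is an assumption-laden illustration: the class name VideoShot, its fields, and the labeled-line output format are my own choices, not something any video model requires. The sample values are invented for the demo.

```python
from dataclasses import dataclass


@dataclass
class VideoShot:
    camera: str
    subject: str
    action: str
    environment: str
    style: str = ""
    audio: str = ""  # only useful for models that generate sound

    def render(self) -> str:
        """Emit one labeled line per filled component, camera first."""
        fields = [
            ("Camera", self.camera),
            ("Subject", self.subject),
            ("Action", self.action),
            ("Environment", self.environment),
            ("Style", self.style),
            ("Audio", self.audio),
        ]
        return "\n".join(f"{label}: {value}" for label, value in fields if value)


shot = VideoShot(
    camera="Medium close-up, slow push-in",
    subject="A small round robot",
    action="Taps a light bulb and flinches as sparks crackle",
    environment="A cluttered workshop",
    style="Hand-painted animation with soft brush textures",
)
print(shot.render())
```

Keeping one camera move per VideoShot also makes it harder to cram several moves into a single clip.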

Platform Examples

Sora 2 (OpenAI)

Style: Hand-painted 2D/3D hybrid animation with soft brush textures.

Inside a cluttered workshop, a small round robot sits on a wooden bench.

Cinematography:
Camera: medium close-up, slow push-in with gentle parallax
Lens: 35mm virtual lens; shallow depth of field
Lighting: warm key from overhead; cool spill from window

Actions:
- The robot taps the bulb; sparks crackle.
- It flinches, dropping the bulb.
- Robot says: "Almost lost it... but I got it!"

What I learned about Sora:

  • Short clips (4 sec) are more stable than long ones
  • One camera move per shot
  • Dialogue goes in a separate block

Veo 3 (Google)

Camera: Medium shot, slow push-in
Subject: A seasoned grey-bearded man in sunglasses and paisley shirt
Setting: Vibrant mural wall background
Audio: Faint city murmurs, distant chatter, mellow soulful hip-hop beat
Dialogue: [Character says: "This is the moment..."]

Veo 3 generates audio natively - describe sounds in separate sentences.

Kling 2.6

A static shot of a burger as it assembles in mid-air.
The entire shot is in dramatic slow-motion.
Background is a clean professional studio gradient.
Style: TV food commercial
++burger assembling in mid-air++

Kling prompts can mark important elements with ++keyword++ for emphasis; emphasize something that is actually in the shot.


Handling "Don't Do This" Instructions

LLMs: Constraints in Structure

<constraints>
- Do not include personal opinions
- Do not exceed 500 words
- Do not use technical jargon
</constraints>

Images: Negative Prompts

Stable Diffusion:

Negative: blurry, low quality, distorted, bad anatomy, extra fingers,
watermark, text, signature

Midjourney:

--no text, watermark, blurry background

Semantic negatives (Nano Banana, GPT-4o Image):

No extra fingers or hands; no text except the title;
avoid watermarks; avoid clutter; no background distractions.

Video: Exclusions

Avoid Dutch angles; no on-screen text; no lens flare;
no subtitle overlays; no watermarks.
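
Since the same exclusion list gets written three different ways, a tiny helper can render it per style. This is a sketch under assumptions: the function render_exclusions and its style names are hypothetical, and the output is plain text for you to paste into the relevant tool.

```python
def render_exclusions(items: list[str], style: str) -> str:
    """Render one exclusion list in the style a given platform expects."""
    if not items:
        return ""
    if style == "negative_prompt":  # Stable Diffusion style
        return "Negative: " + ", ".join(items)
    if style == "no_flag":  # Midjourney style
        return "--no " + ", ".join(items)
    if style == "semantic":  # plain-language negatives inside the prompt
        sentence = "; ".join(f"no {item}" for item in items)
        return sentence[0].upper() + sentence[1:] + "."
    raise ValueError(f"unknown style: {style!r}")


unwanted = ["text", "watermarks", "lens flare"]
for style in ("negative_prompt", "no_flag", "semantic"):
    print(render_exclusions(unwanted, style))
```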

When to Use What

Use LLMs for:

  • Text analysis and processing
  • Content generation (articles, posts, emails)
  • Programming and code review
  • Q&A and research
  • Document summarization
  • Translation and localization

Use Image Generation for:

  • Marketing visuals
  • Concept art and illustrations
  • Product mockups
  • Social media content
  • Stickers and icons
  • Infographics

Use Video Generation for:

  • Short promo clips
  • Product videos
  • Social media content
  • B-roll footage
  • Animated concepts
  • Music visualizations

Platform Comparison Tables

Image Generation

Platform              Text in Image  Prompt Adherence  Negative Prompts  Best For
DALL-E 3              Okay           Good              None              General tasks
Midjourney V7         Okay           Good              --no              Artistic quality
Stable Diffusion 3.5  Good           Good              Full support      Customization
Flux                  Excellent      Excellent         Limited           Text, realism
Ideogram              Excellent      Good              Limited           Typography
GPT-4o Image          Excellent      Good              Semantic          Conversational editing
Nano Banana           Good           Good              Semantic          Speed, editing

Video Generation

Platform      Duration   Audio      Physics    Best For
Sora 2        10-20 sec  Excellent  Excellent  Complex scenes
Veo 3.1       4-8 sec    Excellent  Good       Native audio
Runway Gen-4  10 sec     Okay       Okay       Image-to-video
Kling 2.6     5-10 sec   Good       Good       Lip-sync

Tips for Both

For LLMs

  1. Structure your prompt with XML or Markdown
  2. Set a role for expertise and style
  3. Be explicit, especially for Claude
  4. Use few-shot examples for complex tasks
  5. Iterate through dialogue

For Generative Models

  1. Describe, don't list keywords
  2. Always specify lighting - it dramatically affects results
  3. Use reference images when available (for example, Midjourney's --sref style reference)
  4. Add negative prompts to exclude unwanted elements
  5. Experiment with variations - each generation is unique

Universal Principles

  1. Specificity matters everywhere - precise descriptions get better results
  2. Know your platform's quirks - each model is different
  3. Iterate and improve - first prompt is rarely perfect
  4. Study examples - see what works for others

The Takeaway

LLM prompts and image/video prompts are fundamentally different:

  • LLMs want structured instructions with roles, constraints, and format
  • Images want descriptive sentences with visual attributes
  • Video wants cinematography terminology plus audio components

Understanding this difference gives you significantly better results from each type of model.
