Exploring MCP: When Your Cool Idea Fails, Then Succeeds Differently

I spent the last few weeks building a Model Context Protocol (MCP) server for MediaDroppy with a specific vision: let AI agents remix videos. Create mashups, add effects, generate compilations—the whole creative suite powered by LLMs. What actually happened was more interesting: I hit hard technical limits, pivoted completely, and discovered what LLMs are actually exceptional at when integrated with media platforms.

The Vision: AI Video Remixing

The idea was compelling. MediaDroppy hosts user-uploaded videos, and MCP provides a protocol for giving LLMs tool-based access to external systems. Why not let an AI agent browse the video library, analyze content, and create new compositions? Users could ask for "a compilation of the funniest moments" or "create a 30-second highlight reel" and the LLM would orchestrate the entire workflow.

The technical pieces seemed straightforward.

Building Base64-Transactible Endpoints

MCP tools communicate through JSON. You can't send raw binary data through JSON without encoding it. The solution: base64 encoding. I built a series of Spring Boot endpoints specifically designed to make binary media data accessible to LLM agents:

@GetMapping(value = "/files/videos/listed_links/thumbnails/base64/{externalId}")
public String getLinkThumbnailBase64(@PathVariable String externalId) {
    return fileServiceV2.getThumbnailBase64ByLinkExternalId(externalId);
}

This pattern repeated across the API surface.

The MCP server wrapped these endpoints as tools, handling authentication and providing friendly interfaces for the LLM. I added range request support so agents could fetch partial video data—theoretically enabling streaming analysis without loading entire files into context.
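As a sketch of what those wrappers look like on the MCP side: the handler can return the backend's base64 string as an MCP image content block rather than plain text, so vision-capable models treat it as an image instead of a wall of characters. The interface below is an assumed shape for illustration, not the SDK's exact type:

```typescript
// Assumed shape of an MCP image content block (illustrative, not the SDK's exact type).
interface ImageContent {
  type: "image";
  data: string;      // base64-encoded bytes
  mimeType: string;
}

// Pure helper a get_thumbnail_base64 tool handler could use: wrap the
// backend's base64 string so the model receives it as an image.
function toImageContent(base64Jpeg: string): ImageContent {
  return { type: "image", data: base64Jpeg, mimeType: "image/jpeg" };
}
```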

On paper, this was elegant. Base64 encoding is a solved problem. The APIs were RESTful. The MCP integration was clean. Everything compiled and deployed successfully.

The Hard Reality: Context Limits Are Real

Then I actually tried to use it.

Even a small 10-second video clip at modest resolution is several megabytes. Base64 encoding inflates binary data by roughly 33%. A 3MB video becomes 4MB of base64 text. That's a massive chunk of an LLM's context window—and that's before you've done any analysis, processing, or reasoning.
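The inflation is easy to quantify. A quick sketch (the ~4 characters per token figure is a rough heuristic, not a measured value):

```typescript
// Exact size of base64 output for n input bytes: every 3 bytes become 4 characters.
function base64Size(inputBytes: number): number {
  return Math.ceil(inputBytes / 3) * 4;
}

// Rough token estimate at ~4 characters per token (a common heuristic).
function approxTokens(chars: number): number {
  return Math.ceil(chars / 4);
}

const videoBytes = 3 * 1024 * 1024;       // a 3MB clip
const encodedChars = base64Size(videoBytes); // 4,194,304 chars (~4MB of text)
const tokens = approxTokens(encodedChars);   // ~1M tokens, far beyond a 200K window
```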

Current LLM context windows are large (Claude supports 200K tokens; GPT-4 Turbo offers 128K), but they're not infinite. More importantly, filling context with base64-encoded video data crowds out everything the task actually needs: analysis, comparison, reasoning, and output generation.

Byte range requests helped marginally—fetching just the first megabyte instead of the full file—but the fundamental math didn't work. Video remixing requires analyzing multiple clips, comparing content, making editing decisions, and generating output. You can't do that when a single video consumes half your context budget.
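For reference, a range-limited fetch looks like this. The endpoint path is illustrative, not MediaDroppy's actual route:

```typescript
// Build an HTTP Range header for an inclusive byte window (per RFC 9110).
function rangeHeader(start: number, end: number): string {
  return `bytes=${start}-${end}`;
}

// Hypothetical sketch: fetch only the first megabyte of a video.
// The URL path is assumed for illustration.
async function fetchFirstMegabyte(baseUrl: string, externalId: string): Promise<ArrayBuffer> {
  const response = await fetch(`${baseUrl}/files/videos/${externalId}`, {
    headers: { Range: rangeHeader(0, 1024 * 1024 - 1) },
  });
  return response.arrayBuffer(); // server should answer 206 Partial Content
}
```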

The cool idea failed. Not because the code was wrong, but because the use case didn't align with how LLMs actually work in practice.

The Pivot: What LLMs Are Actually Great At

Here's where it got interesting. While wrestling with context limits, I realized I was sitting on a different capability entirely: LLM vision models are excellent at perceiving and analyzing visual content, not manipulating it.

MediaDroppy is a public media sharing platform. Users upload videos. Some get shared publicly. And like any platform with user-generated content, moderation is a challenge. Traditional content moderation requires human review queues, keyword filters, or purpose-trained classifiers.

But an LLM with vision capabilities can look at a video thumbnail and reason about appropriateness, context, and potential policy violations. This is perception, not manipulation. And thumbnails are small: typically under 100KB as compressed JPEGs.

Building Content Moderation Tools

I refocused the MCP server on content perception and moderation. The tools shifted from "fetch and remix videos" to "scan and evaluate content":

mcpServer.tool(
    "post_scan_report",
    "Send the examined IDs as well as detailed flagging records for scanned public links. " +
    "Each flagged item should include the external ID, confidence level (0-100), and reasoning for the flag.",
    {
        scannedIds: z.array(z.string()),
        flaggingRecords: z.array(z.object({
            externalId: z.string(),
            confidence: z.number().min(0).max(100),
            reasoning: z.string()
        }))
    },
    async ({scannedIds, flaggingRecords}) => {
        // Submit moderation report to backend
    }
);
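A payload the agent might submit could look like the sample below, with the zod constraints mirrored as a plain runtime check (the IDs and reasoning here are made up for illustration):

```typescript
interface FlaggingRecord {
  externalId: string;
  confidence: number;   // 0-100, mirroring z.number().min(0).max(100)
  reasoning: string;
}

interface ScanReport {
  scannedIds: string[];
  flaggingRecords: FlaggingRecord[];
}

// Plain-TypeScript mirror of the zod constraints above.
function isValidReport(r: ScanReport): boolean {
  return r.flaggingRecords.every(
    (f) => f.confidence >= 0 && f.confidence <= 100 && f.reasoning.length > 0
  );
}

// Sample payload (illustrative IDs).
const sample: ScanReport = {
  scannedIds: ["abc123", "def456"],
  flaggingRecords: [
    { externalId: "def456", confidence: 92, reasoning: "Thumbnail conflicts with stated title." },
  ],
};
```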

The workflow became:

  1. List public videos using list_public_videos (returns metadata: titles, descriptions, tags, file info)
  2. Fetch thumbnails using get_thumbnail_base64 (small JPEG images, minimal context impact)
  3. Analyze content using both visual data (thumbnail) and textual metadata (title, description, user-provided tags)
  4. Take action using toggle_public_link_listing to delist violating content
  5. Submit findings using post_scan_report with confidence scores and reasoning
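The five steps above can be sketched as a single loop. The tool calls are stubbed as injected functions, and a placeholder stands in for the LLM's actual visual analysis; the 70/95 thresholds follow the tiered policy described later in this post:

```typescript
type Verdict = { confidence: number; reasoning: string };

// Minimal sketch of one moderation pass; real analysis happens inside the LLM.
async function moderationPass(
  listPublicVideos: () => Promise<string[]>,            // step 1
  getThumbnailBase64: (id: string) => Promise<string>,  // step 2
  analyze: (thumb: string) => Promise<Verdict>,         // step 3 (the LLM)
  toggleListing: (id: string) => Promise<void>,         // step 4
) {
  const scannedIds: string[] = [];
  const flaggingRecords: { externalId: string; confidence: number; reasoning: string }[] = [];
  for (const id of await listPublicVideos()) {
    scannedIds.push(id);
    const verdict = await analyze(await getThumbnailBase64(id));
    if (verdict.confidence >= 70) {
      flaggingRecords.push({ externalId: id, ...verdict });
      if (verdict.confidence >= 95) await toggleListing(id); // auto-delist clear violations
    }
  }
  return { scannedIds, flaggingRecords };               // step 5: post_scan_report payload
}
```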

Suddenly the context math worked. A thumbnail is 50-80KB. Base64-encoded, maybe 100KB of text. The metadata (title, description, tags) adds minimal overhead—just a few hundred bytes. An LLM can process dozens of videos with full context in a single window, analyze them for policy violations, and provide structured reports with reasoning.

The Metadata Trust Problem

The system doesn't just analyze thumbnails in isolation. The list_public_videos endpoint returns comprehensive metadata for each video: filename, description, user-provided tags, content type, dimensions, upload date. This metadata provides valuable context for moderation decisions.
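Concretely, a single entry might look like this; the field names are illustrative, not MediaDroppy's exact response shape:

```typescript
// Illustrative shape of one list_public_videos entry; real field names may differ.
interface PublicVideoMetadata {
  externalId: string;
  filename: string;
  description: string;
  tags: string[];       // user-provided, therefore untrusted
  contentType: string;  // e.g. "video/mp4"
  width: number;
  height: number;
  uploadedAt: string;   // ISO-8601
}

const example: PublicVideoMetadata = {
  externalId: "abc123",
  filename: "lecture.mp4",
  description: "Educational Biology Lecture",
  tags: ["education", "biology"],
  contentType: "video/mp4",
  width: 1280,
  height: 720,
  uploadedAt: "2024-01-15T10:30:00Z",
};
```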

But user-provided metadata introduces a trust problem. Unlike a thumbnail (which objectively shows what the video looks like), descriptions and tags are whatever the uploader claims them to be. Titles can be misleading, tags can be gamed for discovery, and descriptions can deliberately misrepresent what the video contains.

This is where LLM-based moderation reveals a unique strength: it can reason about discrepancies between visual content and textual claims. Traditional keyword filters would flag "explicit" in a title but miss a mislabeled video. Traditional computer vision would analyze pixels but ignore context. An LLM can do both simultaneously and detect inconsistencies.

For example, if a thumbnail shows suggestive content but the title claims "Educational Biology Lecture," the LLM can flag that mismatch with higher confidence. Conversely, if both the thumbnail and description align with policy-compliant content, confidence in the "safe" classification increases. The metadata isn't trusted blindly—it's cross-referenced against visual evidence.

This multi-modal analysis makes the system more robust than either approach alone. You're not relying on user honesty, but you're not ignoring useful signals either. The LLM weighs both sources of information and reasons about their consistency.

But here's what makes this truly powerful: the system isn't just reporting violations—it can take action. The toggle_public_link_listing tool gives the agent the ability to delist content directly:

mcpServer.tool(
    "toggle_public_link_listing",
    "Toggle public links' listing",
    {
        externalId: z.string().describe("The external ID of the file link")
    },
    async ({externalId}) => {
        const url = `${BASE_URL}/v2/links/${externalId}/toggle_listing`;
        const response = await patchWithAuth(url, {});
        return response;
    }
);

This transforms the system from passive monitoring to active moderation. For high-confidence violations (95%+ confidence that content violates policy), the agent can immediately delist the content and submit a detailed report. For medium-confidence flags (70-94%), the agent submits a report without taking action, documenting the concern for a future human review workflow. For low-confidence observations, it simply records the pattern without flagging a violation.

This tiered approach balances automation with safety. You're not blindly trusting an AI to make all moderation decisions—you're using confidence scores to determine the appropriate level of intervention. Currently, the system handles the extremes well (auto-delist obvious violations, ignore low-confidence noise), but the middle ground—routing medium-confidence flags to human moderators—remains future work.
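The routing logic itself is tiny. A sketch with the thresholds from above (tune them for your own platform):

```typescript
type ModerationAction = "delist_and_report" | "report_only" | "log_only";

// Tiered policy: confidence determines the level of automated intervention.
function routeByConfidence(confidence: number): ModerationAction {
  if (confidence >= 95) return "delist_and_report"; // clear violation: act immediately
  if (confidence >= 70) return "report_only";       // document for eventual human review
  return "log_only";                                // record the pattern, no flag
}
```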

What This Actually Solves

This isn't just a theoretical exercise; it addresses real moderation challenges.

The agent can scan hundreds of videos efficiently, take immediate action on clear violations, flag edge cases with reasoning for human review, and integrate seamlessly into existing moderation workflows. This scales in a way that manual review can't and provides context in a way that traditional ML models don't.

Designing for Safety: Guardrails on Automated Action

Giving an AI agent the power to delist content raises important questions about safety and accountability. Several design decisions help balance automation with control: automated action requires very high confidence, every decision ships with explicit reasoning for the audit trail, and delisting is a reversible toggle rather than a deletion.

This isn't about replacing human moderators—it's about triaging at scale. The agent handles the obvious violations instantly, documents edge cases with detailed reasoning (ready for human review infrastructure when built), and creates an audit trail for accountability. The pieces are in place for a complete human-in-the-loop system; the review queue is the next logical step.

Lessons From the Journey

1. Start with use case, not capability
I started with "LLMs can do complex workflows" and tried to force-fit video remixing. Better approach: identify a real problem (content moderation) and evaluate whether LLM capabilities (vision, reasoning, language) solve it effectively.

2. Context limits are design constraints, not suggestions
Reading "200K token context window" feels like infinite headroom until you try to cram base64-encoded videos into it. Treat context as a limited resource. Design accordingly.

3. Perception beats manipulation for multimodal LLMs
Current LLMs excel at understanding and reasoning about content—describing images, analyzing sentiment, detecting patterns. They're not optimized for creating or transforming binary media. Use them for what they're good at.

4. Failed experiments reveal better paths
The base64 thumbnail endpoint was built for video remixing. It became essential for content moderation. The "failed" infrastructure enabled the successful pivot.

5. MCP shines for domain-specific integration
MCP isn't about giving LLMs generic capabilities—it's about connecting them to your specific data and workflows. MediaDroppy's MCP server exposes public video metadata, thumbnail access, and moderation reporting. That's a bespoke integration that wouldn't make sense as a general API, but it's perfect for agent-based content management.

What's Next

The moderation tools are working, but there's critical infrastructure still to build, starting with a human review queue for medium-confidence flags.

The Real Discovery

I set out to build an AI video remix tool and ended up with a content moderation system. That's not failure—it's discovery.

The lesson isn't "don't try ambitious ideas." It's "pay attention when reality pushes back." Context limits killed video remixing, but they revealed a better fit: LLMs analyzing thumbnails at scale. The same base64 infrastructure, the same MCP integration, the same Spring Boot endpoints—just aimed at a problem where the math actually works.

Building with emerging technology means half your assumptions will be wrong. The key is building small enough to pivot when you discover which half.

Try It Yourself

The MediaDroppy MCP server is running at https://mcp.mediadroppy.com/mcp with public endpoints for listing videos and fetching thumbnails. The moderation tools require authentication, but the perception capabilities are accessible through any MCP-compatible client.

If you're exploring MCP integration: keep your tools small and composable, treat the context window as a hard budget, and build modularly enough that you can pivot when your assumptions break.

The cool idea might fail. But if you build it modular and stay curious, you'll find the idea that works.