Exploring MCP: When Your Cool Idea Fails, Then Succeeds Differently

I spent the last few weeks building a Model Context Protocol (MCP) server for MediaDroppy with a specific vision: let AI agents remix videos. Create mashups, add effects, generate compilations—the whole creative suite powered by LLMs. What actually happened was more interesting: I hit hard technical limits, pivoted completely, and discovered what LLMs are actually exceptional at when integrated with media platforms.

The Vision: AI Video Remixing

The idea was compelling. MediaDroppy hosts user-uploaded videos, and MCP provides a protocol for giving LLMs tool-based access to external systems. Why not let an AI agent browse the video library, analyze content, and create new compositions? Users could ask for "a compilation of the funniest moments" or "create a 30-second highlight reel" and the LLM would orchestrate the entire workflow.

The technical pieces seemed straightforward.

Building Base64-Transactible Endpoints

MCP tools communicate through JSON. You can't send raw binary data through JSON without encoding it. The solution: base64 encoding. I built a series of Spring Boot endpoints specifically designed to make binary media data accessible to LLM agents:

@GetMapping(value = "/files/videos/listed_links/thumbnails/base64/{externalId}")
public String getLinkThumbnailBase64(@PathVariable String externalId) {
    return fileServiceV2.getThumbnailBase64ByLinkExternalId(externalId);
}

This pattern repeated across the API surface.

The MCP server wrapped these endpoints as tools, handling authentication and providing friendly interfaces for the LLM. I added range request support so agents could fetch partial video data—theoretically enabling streaming analysis without loading entire files into context.
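As a sketch of what those wrappers look like on the MCP side: the handler can return the backend's base64 string as an MCP image content block rather than plain text, so vision-capable models treat it as an image instead of a wall of characters. The interface below is an assumed shape for illustration, not the SDK's exact type:

```typescript
// Assumed shape of an MCP image content block (illustrative, not the SDK's exact type).
interface ImageContent {
  type: "image";
  data: string;      // base64-encoded bytes
  mimeType: string;
}

// Pure helper a get_thumbnail_base64 tool handler could use: wrap the
// backend's base64 string so the model receives it as an image.
function toImageContent(base64Jpeg: string): ImageContent {
  return { type: "image", data: base64Jpeg, mimeType: "image/jpeg" };
}
```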

On paper, this was elegant. Base64 encoding is a solved problem. The APIs were RESTful. The MCP integration was clean. Everything compiled and deployed successfully.

The Hard Reality: Context Limits Are Real

Then I actually tried to use it.

Even a small 10-second video clip at modest resolution is several megabytes. Base64 encoding inflates binary data by roughly 33%. A 3MB video becomes 4MB of base64 text. That's a massive chunk of an LLM's context window—and that's before you've done any analysis, processing, or reasoning.
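The inflation is easy to quantify. A quick sketch (the ~4 characters per token figure is a rough heuristic, not a measured value):

```typescript
// Exact size of base64 output for n input bytes: every 3 bytes become 4 characters.
function base64Size(inputBytes: number): number {
  return Math.ceil(inputBytes / 3) * 4;
}

// Rough token estimate at ~4 characters per token (a common heuristic).
function approxTokens(chars: number): number {
  return Math.ceil(chars / 4);
}

const videoBytes = 3 * 1024 * 1024;       // a 3MB clip
const encodedChars = base64Size(videoBytes); // 4,194,304 chars (~4MB of text)
const tokens = approxTokens(encodedChars);   // ~1M tokens, far beyond a 200K window
```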

Current LLM context windows are large (Claude supports 200K tokens; GPT-4 Turbo offers 128K), but they're not infinite. More importantly, filling context with base64-encoded video data crowds out everything the task actually needs: analysis, comparison, reasoning, and output generation.

Byte range requests helped marginally—fetching just the first megabyte instead of the full file—but the fundamental math didn't work. Video remixing requires analyzing multiple clips, comparing content, making editing decisions, and generating output. You can't do that when a single video consumes half your context budget.
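For reference, a range-limited fetch looks like this. The endpoint path is illustrative, not MediaDroppy's actual route:

```typescript
// Build an HTTP Range header for an inclusive byte window (per RFC 9110).
function rangeHeader(start: number, end: number): string {
  return `bytes=${start}-${end}`;
}

// Hypothetical sketch: fetch only the first megabyte of a video.
// The URL path is assumed for illustration.
async function fetchFirstMegabyte(baseUrl: string, externalId: string): Promise<ArrayBuffer> {
  const response = await fetch(`${baseUrl}/files/videos/${externalId}`, {
    headers: { Range: rangeHeader(0, 1024 * 1024 - 1) },
  });
  return response.arrayBuffer(); // server should answer 206 Partial Content
}
```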

The cool idea failed. Not because the code was wrong, but because the use case didn't align with how LLMs actually work in practice.

The Pivot: What LLMs Are Actually Great At

Here's where it got interesting. While wrestling with context limits, I realized I was sitting on a different capability entirely: LLM vision models are excellent at perceiving and analyzing visual content, not manipulating it.

MediaDroppy is a public media sharing platform. Users upload videos. Some get shared publicly. And like any platform with user-generated content, moderation is a challenge. Traditional content moderation requires human review queues, keyword filters, or purpose-trained classifiers.

But an LLM with vision capabilities can look at a video thumbnail and reason about appropriateness, context, and potential policy violations. This is perception, not manipulation. And thumbnails are small: typically under 100KB as compressed JPEGs.

Building Content Moderation Tools

I refocused the MCP server on content perception and moderation. The tools shifted from "fetch and remix videos" to "scan and evaluate content":

mcpServer.tool(
    "post_scan_report",
    "Send the examined IDs as well as detailed flagging records for scanned public links. " +
    "Each flagged item should include the external ID, confidence level (0-100), and reasoning for the flag.",
    {
        scannedIds: z.array(z.string()),
        flaggingRecords: z.array(z.object({
            externalId: z.string(),
            confidence: z.number().min(0).max(100),
            reasoning: z.string()
        }))
    },
    async ({scannedIds, flaggingRecords}) => {
        // Submit moderation report to backend
    }
);
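A payload the agent might submit could look like the sample below, with the zod constraints mirrored as a plain runtime check (the IDs and reasoning here are made up for illustration):

```typescript
interface FlaggingRecord {
  externalId: string;
  confidence: number;   // 0-100, mirroring z.number().min(0).max(100)
  reasoning: string;
}

interface ScanReport {
  scannedIds: string[];
  flaggingRecords: FlaggingRecord[];
}

// Plain-TypeScript mirror of the zod constraints above.
function isValidReport(r: ScanReport): boolean {
  return r.flaggingRecords.every(
    (f) => f.confidence >= 0 && f.confidence <= 100 && f.reasoning.length > 0
  );
}

// Sample payload (illustrative IDs).
const sample: ScanReport = {
  scannedIds: ["abc123", "def456"],
  flaggingRecords: [
    { externalId: "def456", confidence: 92, reasoning: "Thumbnail conflicts with stated title." },
  ],
};
```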

The workflow became:

  1. List public videos using list_public_videos (returns metadata: titles, descriptions, tags, file info)
  2. Fetch thumbnails using get_thumbnail_base64 (small JPEG images, minimal context impact)
  3. Analyze content using both visual data (thumbnail) and textual metadata (title, description, user-provided tags)
  4. Take action using toggle_public_link_listing to delist violating content
  5. Submit findings using post_scan_report with confidence scores and reasoning
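The five steps above can be sketched as a single loop. The tool calls are stubbed as injected functions, and a placeholder stands in for the LLM's actual visual analysis; the 70/95 thresholds follow the tiered policy described later in this post:

```typescript
type Verdict = { confidence: number; reasoning: string };

// Minimal sketch of one moderation pass; real analysis happens inside the LLM.
async function moderationPass(
  listPublicVideos: () => Promise<string[]>,            // step 1
  getThumbnailBase64: (id: string) => Promise<string>,  // step 2
  analyze: (thumb: string) => Promise<Verdict>,         // step 3 (the LLM)
  toggleListing: (id: string) => Promise<void>,         // step 4
) {
  const scannedIds: string[] = [];
  const flaggingRecords: { externalId: string; confidence: number; reasoning: string }[] = [];
  for (const id of await listPublicVideos()) {
    scannedIds.push(id);
    const verdict = await analyze(await getThumbnailBase64(id));
    if (verdict.confidence >= 70) {
      flaggingRecords.push({ externalId: id, ...verdict });
      if (verdict.confidence >= 95) await toggleListing(id); // auto-delist clear violations
    }
  }
  return { scannedIds, flaggingRecords };               // step 5: post_scan_report payload
}
```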

Suddenly the context math worked. A thumbnail is 50-80KB. Base64-encoded, maybe 100KB of text. The metadata (title, description, tags) adds minimal overhead—just a few hundred bytes. An LLM can process dozens of videos with full context in a single window, analyze them for policy violations, and provide structured reports with reasoning.

The Metadata Trust Problem

The system doesn't just analyze thumbnails in isolation. The list_public_videos endpoint returns comprehensive metadata for each video: filename, description, user-provided tags, content type, dimensions, upload date. This metadata provides valuable context for moderation decisions.
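Concretely, a single entry might look like this; the field names are illustrative, not MediaDroppy's exact response shape:

```typescript
// Illustrative shape of one list_public_videos entry; real field names may differ.
interface PublicVideoMetadata {
  externalId: string;
  filename: string;
  description: string;
  tags: string[];       // user-provided, therefore untrusted
  contentType: string;  // e.g. "video/mp4"
  width: number;
  height: number;
  uploadedAt: string;   // ISO-8601
}

const example: PublicVideoMetadata = {
  externalId: "abc123",
  filename: "lecture.mp4",
  description: "Educational Biology Lecture",
  tags: ["education", "biology"],
  contentType: "video/mp4",
  width: 1280,
  height: 720,
  uploadedAt: "2024-01-15T10:30:00Z",
};
```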

But user-provided metadata introduces a trust problem. Unlike a thumbnail (which objectively shows what the video looks like), descriptions and tags are whatever the uploader claims them to be. Titles can be misleading, tags can be gamed for discovery, and descriptions can deliberately misrepresent what the video contains.

This is where LLM-based moderation reveals a unique strength: it can reason about discrepancies between visual content and textual claims. Traditional keyword filters would flag "explicit" in a title but miss a mislabeled video. Traditional computer vision would analyze pixels but ignore context. An LLM can do both simultaneously and detect inconsistencies.

For example, if a thumbnail shows suggestive content but the title claims "Educational Biology Lecture," the LLM can flag that mismatch with higher confidence. Conversely, if both the thumbnail and description align with policy-compliant content, confidence in the "safe" classification increases. The metadata isn't trusted blindly—it's cross-referenced against visual evidence.

This multi-modal analysis makes the system more robust than either approach alone. You're not relying on user honesty, but you're not ignoring useful signals either. The LLM weighs both sources of information and reasons about their consistency.

But here's what makes this truly powerful: the system isn't just reporting violations—it can take action. The toggle_public_link_listing tool gives the agent the ability to delist content directly:

mcpServer.tool(
    "toggle_public_link_listing",
    "Toggle public links' listing",
    {
        externalId: z.string().describe("The external ID of the file link")
    },
    async ({externalId}) => {
        const url = `${BASE_URL}/v2/links/${externalId}/toggle_listing`;
        const response = await patchWithAuth(url, {});
        return response;
    }
);

This transforms the system from passive monitoring to active moderation. For high-confidence violations (95%+ confidence that content violates policy), the agent can immediately delist the content and submit a detailed report. For medium-confidence flags (70-94%), the agent submits a report without taking action, documenting the concern for a future human review workflow. For low-confidence observations, it simply records the pattern without flagging a violation.

This tiered approach balances automation with safety. You're not blindly trusting an AI to make all moderation decisions—you're using confidence scores to determine the appropriate level of intervention. Currently, the system handles the extremes well (auto-delist obvious violations, ignore low-confidence noise), but the middle ground—routing medium-confidence flags to human moderators—remains future work.
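The routing logic itself is tiny. A sketch with the thresholds from above (tune them for your own platform):

```typescript
type ModerationAction = "delist_and_report" | "report_only" | "log_only";

// Tiered policy: confidence determines the level of automated intervention.
function routeByConfidence(confidence: number): ModerationAction {
  if (confidence >= 95) return "delist_and_report"; // clear violation: act immediately
  if (confidence >= 70) return "report_only";       // document for eventual human review
  return "log_only";                                // record the pattern, no flag
}
```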

What This Actually Solves

This isn't just a theoretical exercise; it addresses real moderation challenges.

The agent can scan hundreds of videos efficiently, take immediate action on clear violations, flag edge cases with reasoning for human review, and integrate seamlessly into existing moderation workflows. This scales in a way that manual review can't and provides context in a way that traditional ML models don't.

Designing for Safety: Guardrails on Automated Action

Giving an AI agent the power to delist content raises important questions about safety and accountability. Several design decisions help balance automation with control: automated action requires very high confidence, every decision ships with explicit reasoning for the audit trail, and delisting is a reversible toggle rather than a deletion.

This isn't about replacing human moderators—it's about triaging at scale. The agent handles the obvious violations instantly, documents edge cases with detailed reasoning (ready for human review infrastructure when built), and creates an audit trail for accountability. The pieces are in place for a complete human-in-the-loop system; the review queue is the next logical step.

Lessons From the Journey

1. Start with use case, not capability
I started with "LLMs can do complex workflows" and tried to force-fit video remixing. Better approach: identify a real problem (content moderation) and evaluate whether LLM capabilities (vision, reasoning, language) solve it effectively.

2. Context limits are design constraints, not suggestions
Reading "200K token context window" feels like infinite headroom until you try to cram base64-encoded videos into it. Treat context as a limited resource. Design accordingly.

3. Perception beats manipulation for multimodal LLMs
Current LLMs excel at understanding and reasoning about content—describing images, analyzing sentiment, detecting patterns. They're not optimized for creating or transforming binary media. Use them for what they're good at.

4. Failed experiments reveal better paths
The base64 thumbnail endpoint was built for video remixing. It became essential for content moderation. The "failed" infrastructure enabled the successful pivot.

5. MCP shines for domain-specific integration
MCP isn't about giving LLMs generic capabilities—it's about connecting them to your specific data and workflows. MediaDroppy's MCP server exposes public video metadata, thumbnail access, and moderation reporting. That's a bespoke integration that wouldn't make sense as a general API, but it's perfect for agent-based content management.

What's Next

The moderation tools are working, but there's critical infrastructure still to build, starting with a human review queue for medium-confidence flags.

The Real Discovery

I set out to build an AI video remix tool and ended up with a content moderation system. That's not failure—it's discovery.

The lesson isn't "don't try ambitious ideas." It's "pay attention when reality pushes back." Context limits killed video remixing, but they revealed a better fit: LLMs analyzing thumbnails at scale. The same base64 infrastructure, the same MCP integration, the same Spring Boot endpoints—just aimed at a problem where the math actually works.

Building with emerging technology means half your assumptions will be wrong. The key is building small enough to pivot when you discover which half.

Try It Yourself

The MediaDroppy MCP server is running at https://mcp.mediadroppy.com/mcp with public endpoints for listing videos and fetching thumbnails. The moderation tools require authentication, but the perception capabilities are accessible through any MCP-compatible client.

If you're exploring MCP integration: keep your tools small and composable, treat the context window as a hard budget, and build modularly enough that you can pivot when your assumptions break.

The cool idea might fail. But if you build it modular and stay curious, you'll find the idea that works.