Deploying to Kubernetes: Lessons from a Real-World Project
Deploying MediaDroppy to Kubernetes was both empowering and humbling. While Kubernetes provides incredible capabilities for orchestrating containerized applications, it comes with a steep learning curve and numerous opportunities to make costly mistakes. This article shares practical lessons learned from deploying and managing a multi-service application on Kubernetes.
Helm Charts: Managing Complexity
Rather than managing raw Kubernetes manifests, I used Helm to template and organize the deployment configuration. Each service (auth, user, files, thumbs, webserver) has its own set of deployment, service, and ingress templates within a single Helm chart.
The benefit: Helm's templating allowed environment-specific configurations through values files (values.yaml, values.local.yaml). Promoting changes from development to production became manageable through value overrides rather than maintaining separate manifest files.
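The pattern can be sketched with a template fragment and two values files. This is a minimal illustration, not MediaDroppy's actual chart: the key names, image reference, and replica counts are assumptions.

```yaml
# templates/auth-deployment.yaml -- illustrative fragment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-auth
spec:
  replicas: {{ .Values.auth.replicas }}
---
# values.yaml -- production defaults
# auth:
#   replicas: 3
#   image: registry.example.com/auth:2.0.1   # hypothetical registry path
---
# values.local.yaml -- local overrides, applied last wins:
#   helm upgrade --install mediadroppy . -f values.yaml -f values.local.yaml
# auth:
#   replicas: 1
```

The same template renders for every environment; only the values files differ, which is exactly what makes promotion between environments a diff of overrides rather than of manifests.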
The pitfall: Over-templating. I initially wrote overly complex templates riddled with conditional blocks, which made the charts hard to understand and debug. I learned to keep templates simple and to prefer explicit configuration over clever abstraction.
Persistent Storage Challenges
MediaDroppy requires persistent storage for uploaded files and MongoDB data. Kubernetes persistent volumes (PVs) and persistent volume claims (PVCs) manage this, but the abstraction comes with complexity.
Key lessons learned:
- Storage classes matter: Different storage classes have different performance characteristics and reclaim policies. Understanding these differences upfront prevents data loss scenarios.
- StatefulSets for databases: Initially using a Deployment for MongoDB caused issues with volume mounting when pods rescheduled. StatefulSets provide stable network identities and persistent storage guarantees that databases require.
- Backup strategies are critical: Kubernetes doesn't automatically back up persistent volumes. Implementing a backup strategy from day one is essential, not optional.
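The StatefulSet fix for MongoDB looks roughly like the sketch below. The storage class name, image tag, and size are assumptions; the important parts are the headless `serviceName` (stable network identity) and `volumeClaimTemplates` (a PVC per pod that survives rescheduling).

```yaml
# Sketch of a StatefulSet for MongoDB -- illustrative values throughout.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  serviceName: mongodb          # headless Service gives pods stable DNS names
  replicas: 1
  selector:
    matchLabels:
      app: mongodb
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      containers:
        - name: mongodb
          image: mongo:6.0
          volumeMounts:
            - name: data
              mountPath: /data/db
  volumeClaimTemplates:         # one PVC per pod, not shared across replicas
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard   # check its reclaimPolicy before relying on it
        resources:
          requests:
            storage: 10Gi
```

With a plain Deployment, a rescheduled pod has no guaranteed claim on the same volume; the StatefulSet binds pod identity and storage together.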
Ingress Configuration and Routing
Exposing multiple services through a single ingress controller required careful planning. MediaDroppy uses path-based routing: /api/auth routes to the auth service, /api/user to the user service, and /api/files to the file service.
Challenge: Request size limits. Default ingress configurations impose body size limits that were too small for media file uploads. Setting the appropriate annotations (nginx.ingress.kubernetes.io/proxy-body-size) was necessary but easy to overlook.
Challenge: CORS configuration. Cross-origin requests from the React frontend to backend APIs required proper CORS headers. Handling this at the ingress level versus within each service was a decision that affected both configuration complexity and flexibility.
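The routing and annotation concerns above can be combined into a single ingress manifest. This sketch assumes the NGINX ingress controller; the service names, port, body-size value, and allowed origin are illustrative placeholders.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mediadroppy
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "512m"   # raise the upload limit
    nginx.ingress.kubernetes.io/enable-cors: "true"        # CORS handled at the edge
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://app.example.com"
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /api/auth
            pathType: Prefix
            backend:
              service:
                name: auth-service
                port:
                  number: 8080
          - path: /api/user
            pathType: Prefix
            backend:
              service:
                name: user-service
                port:
                  number: 8080
          - path: /api/files
            pathType: Prefix
            backend:
              service:
                name: files-service
                port:
                  number: 8080
```

Handling CORS at the ingress keeps the services themselves origin-agnostic, at the cost of one more place where routing behavior is configured.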
Resource Management: Requests and Limits
One of the most impactful lessons involved resource requests and limits. Initially, I deployed without specifying these, leading to several problems:
- Pods were scheduled on nodes without sufficient resources, causing out-of-memory kills
- Resource-intensive operations (thumbnail generation) impacted other services on the same node
- Lack of guaranteed resources made performance unpredictable
Setting appropriate resource requests and limits required profiling each service under realistic load. The thumbnail service, for instance, needed significantly more CPU and memory than the authentication service.
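A container spec fragment shows the shape of the fix. The numbers below are illustrative only; real values must come from profiling each service under load, as described above.

```yaml
# Pod spec fragment -- figures are placeholders, not measured values.
containers:
  - name: thumbs                # CPU/memory heavy: image decoding and resizing
    image: registry.example.com/thumbs:2.0.1   # hypothetical image reference
    resources:
      requests:                 # what the scheduler reserves on a node
        cpu: 500m
        memory: 512Mi
      limits:                   # hard ceiling; exceeding memory gets the pod OOM-killed
        cpu: "2"
        memory: 1Gi
  - name: auth                  # lightweight: far smaller footprint
    image: registry.example.com/auth:2.0.1     # hypothetical image reference
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 256Mi
```

Requests drive scheduling decisions; limits contain noisy neighbors like the thumbnail service. Setting both addresses all three problems listed above.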
Service-to-Service Communication
Within Kubernetes, services communicate using DNS-based service discovery. While this works well, debugging connectivity issues required understanding how Kubernetes networking operates at a deeper level than I initially anticipated.
Lessons learned include:
- Use service names as hostnames (e.g., http://auth-service:8080) rather than pod IPs
- Network policies can inadvertently block inter-service communication; start permissive and tighten as needed
- Implement health checks (readiness and liveness probes) early; they're what lets Kubernetes route traffic only to pods that are ready and restart the ones that aren't
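The probe setup from the last point can be sketched as follows. The `/healthz` endpoint name and timing values are assumptions; the port matches the `http://auth-service:8080` convention above.

```yaml
# Container fragment -- endpoint path and delays are illustrative.
containers:
  - name: auth
    image: registry.example.com/auth:2.0.1   # hypothetical image reference
    ports:
      - containerPort: 8080
    readinessProbe:            # gate traffic: not ready => removed from the Service
      httpGet:
        path: /healthz         # assumed health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:             # self-healing: repeated failure => container restart
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
```

With the readiness probe in place, other services calling `http://auth-service:8080` are only routed to pods that have reported healthy.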
Deployment Strategies and Versioning
Kubernetes supports various deployment strategies (rolling updates, blue-green, canary). MediaDroppy uses rolling updates, which gradually replace old pods with new ones.
Critical lesson: Image versioning discipline is essential. Using :latest tags in production is tempting but dangerous. I learned to tag every build with a specific version (e.g., 2.0.1) and update the Helm values accordingly. This makes rollbacks reliable and audit trails clear.
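Both lessons show up in a few lines of the deployment spec. The surge/unavailability numbers and image reference below are illustrative choices, not MediaDroppy's actual settings.

```yaml
# Deployment fragment -- rolling update with a pinned image tag.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # bring up one new pod before retiring an old one
      maxUnavailable: 0    # never drop below full capacity during a rollout
  template:
    spec:
      containers:
        - name: webserver
          image: registry.example.com/webserver:2.0.1   # pinned tag, never :latest
```

Because every release is a distinct tag recorded in the Helm values, `helm rollback` restores a known-good state; with `:latest` there is nothing concrete to roll back to.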
Monitoring and Observability
Running applications in Kubernetes without proper monitoring is like flying blind. Implementing logging aggregation and metrics collection should happen early, not as an afterthought.
I integrated a log aggregation service (log-eater) that collects logs from all services, making debugging distributed issues tractable. Without centralized logging, correlating events across multiple services would have been nearly impossible.
Key Takeaways
- Start simple with Kubernetes: Don't try to implement every advanced feature immediately. Get basic deployments working, then incrementally add complexity.
- Infrastructure as code is non-negotiable: Helm charts, stored in version control, make deployments repeatable and auditable.
- Understand storage before you need it: Storage is stateful and complex; plan your persistence strategy early.
- Resource management prevents surprises: Always specify resource requests and limits based on profiling, not guesses.
- Observability is a requirement, not a feature: Build logging, metrics, and tracing into your deployment from the start.
- Security configurations matter: Network policies, RBAC, and secrets management aren't optional in production.
Kubernetes is powerful but complex. The learning curve is real, and mistakes in production can be expensive. However, the operational benefits—auto-scaling, self-healing, rolling updates—justify the investment for applications that need these capabilities. The key is approaching Kubernetes with respect for its complexity and investing time in understanding its core concepts before deploying critical workloads.