What I learned from a model deployment failure

Building lessons for ML in the future

Jul 17, 2023

I spent the entire week trying to deploy my cute stable diffusion model on SageMaker and failed. Here is what I learned about building and deploying ML models.

Points of failure

1. SageMaker is way too expensive. I spent $44 using it for three days, and I cannot afford that.

2. The AI-generated images produced by SageMaker were of poor quality. Despite using their stable AI model, the output looked terrible.

3. The technical difficulty is quite high. I was stuck for two days on getting generated images back from the endpoint, because an image is a very large output when represented as raw RGB numbers. I couldn’t find any tutorial on how to automatically upload the images to S3 and send them to the front end as a link.
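A minimal sketch of how the upload-and-link step in point 3 might look, assuming the endpoint returns the image as rows of (r, g, b) values. The function names, the `generated/` key prefix, and the one-hour expiry are hypothetical choices for illustration, not something from my actual setup. The idea is to compress the raw numbers into PNG bytes, store those bytes in S3, and hand the front end a short-lived presigned URL instead of the pixel data itself.

```python
import struct
import uuid
import zlib


def encode_png(pixels):
    """Encode rows of (r, g, b) tuples (0-255) as PNG bytes, stdlib only."""
    height = len(pixels)
    width = len(pixels[0])

    # Each PNG scanline starts with a filter byte; 0 means "no filter".
    raw = bytearray()
    for row in pixels:
        raw.append(0)
        for r, g, b in row:
            raw.extend((r, g, b))

    def chunk(tag, data):
        # A PNG chunk is: length, tag, data, CRC over tag + data.
        body = tag + data
        crc = zlib.crc32(body) & 0xFFFFFFFF
        return struct.pack(">I", len(data)) + body + struct.pack(">I", crc)

    # Bit depth 8, color type 2 (RGB), default compression/filter/interlace.
    header = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + chunk(b"IHDR", header)
        + chunk(b"IDAT", zlib.compress(bytes(raw)))
        + chunk(b"IEND", b"")
    )


def upload_image_and_get_link(s3_client, bucket, pixels, expires_in=3600):
    """Upload the image as a PNG and return a temporary download link.

    s3_client is assumed to be a boto3 S3 client, for example
    boto3.client("s3"); the bucket must already exist.
    """
    key = f"generated/{uuid.uuid4().hex}.png"  # hypothetical key layout
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=encode_png(pixels),
        ContentType="image/png",
    )
    # The front end receives only this URL, not the raw RGB numbers.
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
```

In a real project an imaging library would probably replace the hand-written PNG encoder. It is written out here only to keep the sketch self-contained.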

What did I learn from my failure?

1. Gaining more information at the early stage may be more important than completing a single subproblem. I spent a lot of time working on the (3) endpoint technical difficulty, which caused me to overlook other important aspects, such as (2) the output quality. I was overly excited about trying AWS and had too much trust in their models. However, I should have first tested the AI image generation output. While software engineering involves breaking down problems into subproblems, starting with tracer bullets (proofs that the architecture, prototype, and outputs are compatible and feasible) should still be the primary focus.

Tracer bullets, from the book The Pragmatic Programmer

2. Deploying machine learning models requires much more attention than just deploying the models themselves. I encountered many extra tasks beyond the deployment process. First, my computer would easily crash while running. Second, I needed to constantly monitor billing. Third, there were many new things to learn about setting up the system, such as dealing with permission problems when requesting quota increases. In order to design an effective system, I need to feel comfortable navigating AWS and understanding its potential constraints.

3. Gain an overall understanding of the system before following the setup steps. Rather than treating the AWS documentation like a user manual and following it step by step, like assembling a bike, I should approach it like reading a research paper and browse through everything first to gain an overall understanding of each part. This will help me avoid the fear of the unknown and make it easier to unblock myself when issues arise.

4. When I wake up in the morning, I should spend 10 minutes asking myself:

“Am I solving the right problem?”

I usually wake up and start thinking about problems like a teenage boy in a messy love affair. But is that the right problem to pursue at this stage, or should I focus on fixing other parts first? What are the most important problems related to feasibility that I need to address now? I shouldn’t shy away from scary bottleneck problems and potentially call the project quits if it doesn’t pass feasibility and usability tests.

5. Time boxing when trying something new. When experimenting with new technology, there are always unexpected bottlenecks. Timing each task is the best way to determine where I get stuck and how much time I spend on each task. Although this week’s failure is disappointing, I can still see that Tuesday and Wednesday were quite productive, and the bottleneck was the same problem as last Thursday and Friday.

6. Put more effort into technical planning during the initial stages and create hypothesis-driven documentation throughout the process. Although documentation and technical planning may seem like a waste of time, I have learned that they provide significant value in ensuring that you are on the right track. To improve my documentation process, I started using a new system that clearly lays out different MVP versions, proof of concepts, and technical planning. Additionally, I documented all the encountered problems along with their potential problem hypotheses during the development process.

This methodology is helpful when dealing with complex problems because it allows you time to reflect and readjust before pursuing every new idea that comes your way. A simple way to implement this is to imagine that you are reporting your progress to your manager.

7. Remember to take a break. Last week, I overworked myself while thinking about a problem. As a result, my wrist and fingers hurt this week, making it difficult to do anything.

8. Unblocking Your Emotions: I’ve learned that unblocking my emotions is a prerequisite to unblocking problems in the project. For instance, I was demotivated when I learned how expensive SageMaker was after trying it out on the first day. Should I stop the project because it’s not feasible?

To unblock my motivation issue, I created a brainstorming page called

I am only willing to tackle the technical challenges in my path when I can unblock my emotions. Although I may give up on this project for now, I am confident that I will resolve it at some point in my life.

Although it was a disappointing week, I didn’t let it defeat me. Instead, it encouraged me to learn about the entire machine-learning production process. To do so, I started reading “Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications” by Chip Huyen. Without this pain, I wouldn’t have picked up a technical book, and without this experience, I wouldn’t understand the terms and processes in the book either.

Esther is a confused human being
