Hi,

I’m Suyog Ghimire, a second-year Computer Engineering student at the Institute of Engineering (IOE), Paschimanchal Campus. I write blogs as a way to learn and to document my journey so far. I’m especially curious about multimodal AI, vision-and-language models, and how machines can connect perception with understanding. Along the way, I also share thoughts on programming, machine learning, and anything else that sparks my interest.

Understanding Transformer Architecture from First Principles: A Detailed Exploration

Introduction The transformer architecture lies at the heart of modern AI models. Nearly all state-of-the-art large language models (LLMs), such as ChatGPT, LLaMA, and Gemini, are built on it. The architecture was introduced in the paper Attention Is All You Need [^1]. Although it is used everywhere these days, it was originally proposed for language translation and was quickly generalized to other tasks. Since then, thanks to its scalability and its self-attention mechanism, it has found its way not only into natural language processing (NLP) but also into computer vision, speech, and multimodal AI systems. ...
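The self-attention mechanism mentioned above can be sketched in a few lines of NumPy. This is a minimal single-head illustration under my own assumptions (function name, shapes, and random weights are illustrative, not from the post):

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project tokens to queries, keys, values
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                  # pairwise token similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ v                               # each output is a weighted mix of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                          # 4 tokens, model dimension 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)                                     # (4, 8): one mixed vector per token
```

The key point is that every token attends to every other token in one matrix operation, which is what makes the mechanism so parallelizable and scalable.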

April 28, 2025 · 10 min · Suyog Ghimire

ResNets Explained - Solving Deep Network Degradation with Residual Learning

Introduction Deep convolutional neural networks have been around for a while and have completely revolutionized how we tackle image recognition tasks in computer vision. When AlexNet came out in 2012, it transformed how we use CNNs: it was the first time we saw an architecture with consecutive convolutional layers deliver a significant improvement in training speed and performance, achieved by leveraging a deeper architecture and GPU acceleration. This 8-layer deep CNN was one of the first to reach this kind of performance on large-scale image classification. ...
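The residual learning idea in the title can be sketched with a tiny NumPy example. This is a simplified fully connected block of my own construction (the actual ResNet blocks are convolutional); the point is the skip connection, which lets the block learn a residual F(x) on top of the identity:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, w1, w2):
    """A toy residual block: output = ReLU(F(x) + x), where F is two dense layers."""
    h = relu(x @ w1)        # first transformation
    f = h @ w2              # residual function F(x)
    return relu(f + x)      # skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(8,))
w1 = rng.normal(size=(8, 8)) * 0.1
w2 = np.zeros((8, 8))       # F(x) = 0 here, so the block is a near-identity mapping
out = residual_block(x, w1, w2)
print(np.allclose(out, relu(x)))  # True: a zeroed residual just passes the input through
```

This is why residual connections help with degradation: a deep stack of such blocks can trivially represent an identity mapping, so adding layers should never make training accuracy worse in principle.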

February 3, 2025 · 5 min · Suyog Ghimire