Winter 2026 Course Review: EECE 504¶
2026-05-07
Course Title: Foundations of Computer Vision
Rating: 3/5
There are two other CV courses, EECS 442 and 542. They didn't let me take 442 because I'm a grad student. Weirdly, the same professor has poor ratings for 442 but great ratings for 542; I never found out why. I chose 504 because it counts toward credits in my major area.
Instructor (Jason Corso)¶
It's obvious he cares what we think of him and his course: he took the course evaluation very seriously and even offered extra credit for rating him on RateMyProfessors. I commented that the lectures have room for improvement, especially the flipped classroom format, which isn't a problem in itself but could use better materials than his 2018/2019 lecture recordings spliced together.
Course topics¶
The course focuses on "classical" CV, namely theories and techniques from the 70s through the early 2000s. We read a paper related to the week's topic every week, and it's interesting to see what cutting-edge CV research looked like back then. The material is loosely organized, so I'm just jotting down what I remember.

One common trick people use to optimize a thing (e.g. finding the best segmentation of an image) is to formulate the "badness" of a solution as an energy function (aka loss function), and then minimize it, either symbolically or numerically. Early papers often had sections dedicated to the mathematical derivation (which I usually did not understand). Such a function often has parameters, and tuning them lets you penalize one thing more than another.
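To make that concrete, here's a toy sketch of my own (not from the course): denoising a 1-D signal by minimizing an energy that is a data term plus a smoothness term, where `lam` is the tunable penalty.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
noisy = np.sin(np.linspace(0, 4, 100)) + 0.3 * rng.standard_normal(100)

lam = 5.0  # tuning parameter: larger values penalize roughness more

def energy(x):
    data_term = np.sum((x - noisy) ** 2)   # stay close to the observation
    smooth_term = np.sum(np.diff(x) ** 2)  # penalize jumps between neighbors
    return data_term + lam * smooth_term

denoised = minimize(energy, x0=noisy).x    # minimize numerically
```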
Images can be transformed: translation, rotation, scaling, shear, distortion, reflection, etc. In my SJTU capstone I dealt with translation and rotation of a 3D object, otherwise known as a Euclidean transformation, which has 6 degrees of freedom (3 for rotation, 3 for translation). The most general case is a homography, which in 3D has 15 (a 4×4 matrix defined up to scale).
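For the curious, the homogeneous-coordinate bookkeeping looks roughly like this (my own illustration; the angle and translation are arbitrary):

```python
import numpy as np

theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])  # rotation about z
t = np.array([1.0, 2.0, 3.0])                       # translation

T = np.eye(4)   # 4x4 Euclidean transform: 6 DOF total
T[:3, :3] = R
T[:3, 3] = t
# A general 3-D homography would be any 4x4 matrix up to scale: 16 - 1 = 15 DOF.

point = np.array([1.0, 0.0, 0.0, 1.0])  # homogeneous 3-D point
moved = T @ point
```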
We can treat images as different mathematical objects. Treated as functions, we can convolve them with kernels to achieve e.g. Gaussian blur. Treated as vectors, we can compare distances between images in the vector space (this is the theory behind Eigenfaces, my favorite topic of the course). Treated as graphs, we can represent pixels (or clusters thereof, called superpixels) as vertices and manipulate the edges to segment them into e.g. foreground and background.
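As an illustration of the image-as-function view, here's a hand-rolled Gaussian blur via 2-D convolution (the kernel size and sigma are arbitrary choices of mine):

```python
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(64, 64)  # stand-in grayscale image

# Build a 7x7 Gaussian kernel, normalized so it doesn't change brightness.
ax = np.arange(-3, 4)
xx, yy = np.meshgrid(ax, ax)
kernel = np.exp(-(xx**2 + yy**2) / (2 * 1.5**2))
kernel /= kernel.sum()

blurred = convolve2d(image, kernel, mode="same")  # the Gaussian blur
```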
Cameras have intrinsic parameters (like focal length) and extrinsic parameters (position and orientation) that determine how objects in 3D space project onto a 2D image. Given enough known correspondences between 3D points and their 2D projections, you can calibrate a camera.
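In equation-as-code form, the pinhole projection x = K [R | t] X looks like this (the intrinsic values are made up for illustration):

```python
import numpy as np

K = np.array([[800,   0, 320],   # intrinsics: focal lengths and principal point
              [  0, 800, 240],
              [  0,   0,   1]])
R = np.eye(3)                    # extrinsics: camera orientation...
t = np.array([[0], [0], [5]])    # ...and position relative to the world

X = np.array([[0.5], [0.2], [10], [1]])  # homogeneous 3-D point
x = K @ np.hstack([R, t]) @ X            # project into the image
u, v = (x[:2] / x[2]).ravel()            # divide out depth to get pixels
```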
We also touched on the concept of machine learning, just enough to tell a dense network from a convolutional one. This formed the basis of our final project.
Project¶
Our project goal was to compare classical methods against "modern" methods in finding an embedding for human face recognition. By classical we mean Eigenfaces from 1991, and by "modern" we mean AlexNet from 2012.
An "embedding", as I understand it, is a way to compress an image as a bunch of metrics that describe it within a dataset. For example, you could theoretically have a metric for the aspect ratio of a face, and you'd be able to distinguish between Eric Cartman and Stan Marsh, and this metric would be scale and rotationally invariant in that the aspect ratio does not change no matter what angle you look at Cartman (maybe not a top-down profile though).
In practice, it's impractical to come up with such metrics purely by hand, so we let the computers do it themselves. A good embedding captures the majority of the information needed to distinguish one person from another within the dataset using the fewest metrics. We want fewer metrics to make the next step, classification, faster. A few hundred or thousand metrics is much more compact than the raw image, which could be tens of thousands of pixels.
Eigenfaces is an early attempt at embeddings. Basically it treats human face images as points in a vector space, finds the subspace in which the faces vary the most, and projects the images onto that subspace. This projection assigns a vector to each image. The limitations are that the images must be the same size, and ideally everyone's face has to be upright and facing forward; this sparked our project name, Face Forward.
Eigenfaces are not specific to faces; the technique was only named that because the original paper dealt with faces. In fact, we did digit recognition in our homework, and it was the most fun I've had in all the homework of this course.
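Here's a minimal Eigenfaces-style sketch using scikit-learn's PCA on its bundled digits dataset, in the spirit of that homework (my own toy example, not our actual homework code):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()                       # 8x8 images flattened to 64-vectors
pca = PCA(n_components=16)                   # the 16 directions of most variance
embeddings = pca.fit_transform(digits.data)  # project each image onto the subspace

# pca.components_ are the "eigen-digits"; reshape one to 8x8 to look at it.
eigen_digit = pca.components_[0].reshape(8, 8)
```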

AlexNet is a CNN, where the C stands for convolutional, and because convolution works in small windows, it is attuned to local features. It was trained on a dataset to classify 1000 object classes, like ships and mushrooms and leopards. However, we wanted to tell apart people, not objects. Fortunately a neural network is like a lasagna: if you don't like the top layer (the output), you can peel it off and slap on your own, and that's what we did. We stripped AlexNet of its fully-connected (dense) top layer, and later all the dense layers, to harness more of its power to identify general-purpose features rather than just classifying objects.
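In torchvision terms, the lasagna surgery looks roughly like this (a sketch only; the exact layers we cut in the project may have differed):

```python
import torch
from torchvision import models

# Load AlexNet pretrained on ImageNet (torchvision >= 0.13 weights API).
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.eval()

def embed(batch):
    """Embed (N, 3, 224, 224) ImageNet-normalized images, skipping all dense layers."""
    with torch.no_grad():
        x = model.features(batch)   # convolutional trunk only
        x = model.avgpool(x)        # (N, 256, 6, 6)
        return torch.flatten(x, 1)  # (N, 9216) general-purpose feature vector
```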
At this point, both Eigenfaces and AlexNet can convert images to vectors. All we need to do is to feed the vectors to classifiers, which will tell us who the person is. To do this, we have to train them.
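That step might look like the snippet below; I'm showing a k-nearest-neighbors classifier from scikit-learn as one plausible stand-in (the random arrays are placeholders for the actual embedding vectors):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
train_embeddings = rng.standard_normal((60, 128))  # would come from Eigenfaces/AlexNet
train_labels = rng.integers(0, 12, size=60)        # 12 people (our subset, described below)
test_embeddings = rng.standard_normal((10, 128))

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(train_embeddings, train_labels)            # "training" the classifier
predicted_people = clf.predict(test_embeddings)    # who is each test face?
```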
We originally planned to recognize characters from TV shows; this idea was scrapped due to the complexity of dealing with videos. We ended up using an extremely cherry-picked subset of the Labeled Faces in the Wild (LFW) dataset from UMass Amherst, where we curated a set of 12 people with more than 50 images each. We used 10 as a test set and the rest as a training pool, from which the training set is selected. This is the reason our presentation had the phrase "pool of George W. Bushes". (We didn't need to train AlexNet; it had already been trained on its own dataset, and we assumed it's general-purpose and good enough for us. We did need to compute the Eigenfaces from the training set though.)
Even at this scale, which modern CV researchers would consider Fisher-Price, I was worried about computation. We got access to an HPC cluster, but nobody knew how to use it, not even the professor; and though my teammate has a 4090, I don't think he has CUDA installed correctly, so his runs were still on the CPU. All I knew was that my Ryzen + Radeon laptop was not made for AI training. I feared it would be atrociously slow, but wait! We were doing things that were cutting edge 14 years ago, and hardware has dramatically improved. Even on my CPU, a typical iteration took no more than a minute. My hyperparameter sweeps needed hundreds of iterations, but I could just leave them running overnight and they'd be ready by 3 AM.
It was at this point that I realized what we were doing is nothing advanced, even laughably simple by today's standards, where everyone's competing for the most tokens per second on an overpriced Mac Studio designed to create the next artistic masterpiece but forced to churn out TikSlops of gorillas singing We Are Charlie Kirk.

But I don't care. While playing with it, I realized neural networks are nothing scary, and while I remain ignorant of the theory behind transformers, I now have one more stepping stone toward them.
Verdict¶
The paper-reading experience was not as bad as 571's last semester, because we started with foundational papers before moving on to more modern ones. The homework was not too bad, though I did forget how Kruskal's algorithm worked and had to look up my 281 slides. I'm not a fan of the weekly quizzes, but maybe that's on me for not reviewing the slides. The exam was more reasonable, as it required understanding over memorization, and was graded generously. The project was a lot of fun, and it was the first time I didn't feel extremely pressed for time; most things were done (or almost done) days in advance. I would rate this course #1 of the four grad courses I've taken so far.
