Kyle McDonald is an artist and researcher in New York with a background in computer science and philosophy. In October 2011, he released FaceOSC, a tool for prototyping face-based interaction. FaceOSC was based on the work of Jason Saragih, a research scientist at CSIRO. In addition to FaceOSC, Kyle has produced an addon for working with face tracking in OpenFrameworks as well as a growing body of work that uses face tracking in an artistic context, notably Face Substitution with Arturo Castro.

I interviewed Kyle about FaceTracker on March 20, 2012. I asked him to explain how the FaceTracker algorithm works. We started the conversation by talking about his prior experience with face tracking and how he first came across Saragih’s work.


Hi Kyle.

Hey, what’s up Greg?

How’s it going?

I’m pretty good.

Good. So I wanted to ask you a little bit about FaceTracker.


So first of all how did you hear about it in the first place? How did you discover Jason’s work and how did you come across it?

I think it was from this guy Parag Mital, who’s another interactive artist, and he’s always exploring really crazy high level algorithms and taking code from one researcher and then wrapping it up and trying to figure out how to use it in an arts context. I think he was the first person I saw using FaceTracker and he was also testing another library called AAMLib, which is based on OpenCV, but it does something similar – it’s just really slow and not as accurate. So I think it was from him. He was doing some research with face tracking.

So you saw him actually using the code that Jason had released?

Yeah, exactly. Well I don’t know if he had it finished into a piece yet. I think he was just doing some experiments with it, and I forget even what the experiments were. I don’t know. I’m gonna look it up, actually.

Yeah, go for it. I’m really curious what that was like — I think a lot of people find the results of FaceTracker pretty striking the first moment they see them, so I’m curious what your experience was at that moment.

Yeah, so Parag – going to his Vimeo. So maybe it was – maybe it was this one, like 9 months ago or something. Is this it? Let’s see… Nah, it was – it must have been longer than that. … Well, definitely this one.

6 DOF Head Tracking

But I don’t think – I don’t know if this one is actually using FaceTracker. Oh, this one is using FaceAPI. So FaceAPI is a commercial version of FaceTracker but written by completely different people, and it’s been around for much longer, and only recently did they allow it to be used for free for noncommercial projects. Before that you used to have to contact them even if it was noncommercial and try to get a copy and license it from them. So he was using that. So that was like a year ago. And then it looks like a little later: so this is 9 months ago [referring to this video]. So then he says “using Jason Saragih’s FaceTracker code”. And you can see he has lots of low level stuff he’s doing already. There’s “load an existing AAM model,” “reinitialize the FaceTracker,” “train an AAM model.”

So he’s doing some extra stuff on top of what Jason was already doing, and then his next video is actually, in a way, really similar to what Arturo and I did much later.

He was reprojecting his own face back onto his tracked face. So I think he was taking the trained model — the trained appearance model — and then drawing that on top of his face from the current estimated parameters. So he was doing Active Appearance Modeling instead of just Jason’s Active Shape Modeling, ASM. And so one of the differences between those two is that Active Shape Modeling is just about getting these contours and it’s based on the landmarks, which are these little patches of the face that you’re tracking. But if you want to do Active Appearance Modeling, then you start to take into account all of that information between the landmarks. So it’s more like instead of having little patches you have a full texture map that you’re trying to map, and you’re deforming the texture map. So, Parag was already going one step further by not just using Jason’s code by itself but actually using it with AAM and doing complete texture deformation. So, I think this is the first place that I saw Jason’s stuff being used.

So then what was your next step? Did you get in touch with him or did you…?

Yeah, I got super excited (laughs) and I think I heard about it, I mean I follow Parag’s stuff but I think Zach Lieberman actually sent it to me and I was like, “Oh, I have to play with this! Do you have Jason’s e-mail?” and I was looking around and then I found it buried on Jason’s webpage. I sent an e-mail and then Jason very kindly sent the code back. I think Parag had a wrapper at the time already for FaceTracker, he had an example of it working with OpenFrameworks, but it was really minimal and it was just taking the code that Jason provided and throwing it into OF. So when Jason sends the code to people he sends a demo application that you can compile and run from the command line.

And if you can read through it you can see what it’s doing and you can make guesses as to how you would do the same thing in the OF app. And Parag basically did that. He took this demo code from Jason and then threw it into OF, and instead of having it run through the command line he was feeding it stuff from the OF camera and getting the data back into OF. So it worked and you could see the results and start doing stuff, since it was in OF. But the more that I read through the code, and as I started reading Jason’s paper, I realized there was a lot of higher level stuff you could do with the library. But you had to know where to look for that. So for example, orientation — the direction that your head is facing — there is one matrix hidden deep within Jason’s code, and if you extract the values from that matrix you can get the head orientation. But Parag wasn’t wrapping that, as he didn’t quite know where to find it. I talked a lot with Jason and asked what the different matrices meant and how I could use them. And then I wrapped them in a way that was more useful, so people didn’t have to dig through Jason’s code; they could just see a function called ‘getOrientation()’ or something like that. So yeah, Parag initially had a very basic wrapper that got me kickstarted on my way, but then I took a completely different direction by creating a full-featured addon.
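[A rough sketch of the kind of wrapping being described: assume the tracker exposes a 3×3 rotation matrix somewhere in its internals, and a friendly helper converts it into readable pitch/yaw/roll angles. The function name, the ZYX Euler convention, and the use of Python/NumPy are illustrative assumptions, not the actual ofxFaceTracker code. — ed.]

```python
import numpy as np

# Hypothetical wrapper: given the 3x3 rotation matrix buried in the
# tracker's internals, return pitch/yaw/roll in degrees (assuming a
# ZYX Euler angle convention).
def get_orientation(R):
    pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    yaw = np.degrees(np.arcsin(-np.clip(R[2, 0], -1.0, 1.0)))
    roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    return pitch, yaw, roll

# A pure 30-degree roll about the camera axis:
c, s = np.cos(np.radians(30)), np.sin(np.radians(30))
R = np.array([[c, -s, 0.0],
              [s,  c, 0.0],
              [0.0, 0.0, 1.0]])
print(get_orientation(R))  # pitch and yaw near 0, roll near 30
```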

So did you know about ASM and AAM and this field of face tracking before that?

Yeah, a little bit. One of the first times I was exposed to it was working on a project with Theo Watson called Portrait Machine. And we were doing high level analysis of people’s faces: things about skin color and the shape of their forehead as far as where their hair meets the forehead and the shape of that, what color clothing they were wearing and whether it was patterned clothing or just a solid color, whether they were tall or short, if they were tilting their head one way or another. So we were doing all types of analysis on these images and we wanted to get more information about their facial expressions as well. We had a library that let us do smile detection, it told us how much someone was smiling, but that was it. So we looked into using AAM with one of these older libraries before Jason’s FaceTracker. And it was the kind of thing where if you wanted to do a reasonably-sized image, something like 320 by 240, it took like 2 seconds to process the image, and it kind of worked, but it could get really messed up: the face could get tracked really poorly and it wasn’t very robust to different lighting situations. Jason has a lot of novel features in the way he wrote his code that make it more robust to different lighting, and that was really important for what Theo and I were doing. So in the end we weren’t able to use this other code for that project, so we just did everything by hand with Haar face detection instead.

So at that point you thought that you wanted to do more face tracking than just smile tracking and that was when you started searching around to see what research had been done in that regard. And that was when you came across AAM and ASM?

Right, exactly. Yeah, yeah, yeah. So Theo already knew about the smile detection stuff and we thought ‘oh that’s great, but maybe we can get some more things, like how much the eyebrows are raised.’ With OpenCV you can use Haar detection to just find, say, the eye. But then analyzing an eye to find out how high the eyebrow is raised isn’t really trivial, because everyone has different face shapes and different eyebrow colors. So it’s not easy to do it by hand. You would need to re-implement this for each feature. If you want to see how open or closed the eyes are, how high or low the eyebrows are, or how wide or narrow the mouth is, you would have to write really specific code for each of these things. And we did for a few of them, but we were hoping to find something that would do that for us, and that’s when we found AAM and ASM. But it was just too slow at the time and there were no good implementations that we could use. We were looking into using FaceAPI actually, but they wanted an unreasonable amount of money for that.

So then when you got Jason’s code, did you also get his research? Like, his paper that he published about it? When did you start looking at that?

Yeah. So the first thing that I got from him was just the code. And then I asked him, I think, a few days later. I was like, “Hey, can I get a copy of the paper that went with this?” (Deformable Model Fitting by Regularized Landmark Mean-Shift — ed.) Because normally those kinds of scientific papers are published in journals, where the journal makes a lot of its money from people being subscribed through their institution or individually paying for articles. I think I had just finished school so I wasn’t associated with any institution at the time, and I didn’t have an easy way of getting access to the article, and I didn’t feel like spending forty dollars or whatever on a few pieces of paper. So I just sent Jason an email and he sent it over to me, and we could talk about it a little more and I could ask him questions about what’s going on in there.

So that helped in figuring out the code as well as understanding the algorithm and how it’s working?

Yeah, definitely. One of the things that the paper really helped with is understanding the failure modes of the algorithm, like knowing when to expect it to break. Because I got a feeling from playing with it, but knowing what Jason thought was the problem with it was a totally different experience because then I could look at it and try to set up that situation that was a failure and then I don’t just see the failure but I also understand why it exists and that influences the way I use the algorithm after that. And then sometimes you design things for that as well, sometimes you design things to actually take advantage of that rather than just avoid it. So when you were doing stuff with the pareidolia experiments, that’s taking advantage of the failure mode of the algorithm instead of working around it.

Will you talk about those failure modes a little bit?

Yeah. With FaceTracker it’s so much based on lighting differences in little areas, or just color differences in little areas, depending on how you think about it. Like with the eyebrow it’s just a color difference, because eyebrows are generally darker than the skin, but with this part of your face [under the chin] it’s a lighting difference, because there’s a shadow beneath here and not here. So if you had really unusual lighting, like someone had a flashlight beneath their face and that was the only light in the room, that would be so strange — I don’t know, I’ll have to try it sometime — but that’s something I would expect to cause a problem.

You can also have a problem if you have really directional light that casts really harsh shadows, and then the FaceTracker might think that the edge of a shadow is the edge of a feature. Like here [indicates his jawline] where it transitions from light to dark, the FaceTracker says, “Oh, there’s an edge there. There’s something I can track there.” But if you get that transition somewhere else because of a shadow, then it’s going to make a bad guess about that.

Golan Levin has a funny experience with the FaceTracker because his beard is turning grey here [chin] and not over here [cheek] — or maybe it’s the inverse, I can’t remember — and so it kind of looks to the FaceTracker like a really weird lighting situation, and it will end up bringing his chin up to here [just below his lower lip] all the time.

And a big reason that these failure modes exist is because of the dataset that Jason trains the library on. There’s this dataset from CMU called the Multi-PIE database. It’s just a huge collection of, I think, tens of thousands or hundreds of thousands of images of faces that are all marked up with information about where the face is and where the features are. And the way FaceTracker works is that Jason didn’t just write code to detect things, but he wrote code to train those detectors on this database. So the things that FaceTracker is detecting are based on things that it saw in the Multi-PIE database. And the Multi-PIE database is constrained to real world situations. So it represents, on average, the real world, because it’s taken in a lot of different conditions. And it’s not like it’s pictures of people with a flashlight beneath their face. It’s people in daylight, people indoors with lighting above. Because of that, that’s what FaceTracker is best suited for tracking — because it’s been trained on those situations. So when you start to get really unusual lighting or just really unusual facial features, then it starts to have problems. But on the other hand, because it was trained on lots of normal lighting scenarios (indoors with lighting above, outdoors with overcast, outdoors with directional lighting), all of those situations work kind of OK.

Maybe that’s a good transition to actually start talking about how the algorithm works, since we’re talking about the learning behind it. So let’s do what we were doing before: going through the abstract and having you talk us through it little by little might be a good starting point.

Yeah. Definitely. So this is the paper from Jason. I mean, he’s written a few papers on this, but this is a really nice one explaining the core of FaceTracker and how that library works. The paper is called “Deformable Model Fitting via Regularized Landmark Mean Shift”. The title itself is kind of a bundle of words, but we can actually unpack it a little bit. So we know what “deformable” means. Deformable means it can kind of be twisted and pushed around. “Model fitting”? That’s the process of taking something that is a form, or that’s perfect in some way: whether it’s a mask of someone’s face or an equation describing some curve, those are all models of something. A mannequin would also be kind of a model, a perfected form that represents the human body. So, deformable model fitting is something like taking a form of the face and then squishing it until it fits some target; in this case it’s a photo that you feed in or a camera feed.

And then he says, “by regularized landmark mean shift”. So, this is the technique that’s being used for doing this: deformable model fitting using regularized landmark mean shift. Well, first of all, “landmark”: that’s what I was talking about earlier with the little patches. All of the different points on the face are being detected using little patches of the image. It’s not analyzing the whole thing and then instantly understanding how the face exists there; it analyzes lots of little portions. I don’t understand what “regularized” means. I have to read the paper more thoroughly again. And “mean shift” is referring to this algorithm called mean shift, which has to do with making iterative refinements to estimations. So mean shift might be something like — what’s that game where you throw the balls and then whoever gets closest…

Bocce ball?

Yeah, bocce ball. So mean shift is kind of like bocce ball except… no-no-no. Even better! It’s like… curling. It’s like curling. So, mean shift is like curling. You throw it in generally the right direction and then everyone is trying to make up for little inconsistencies along the way. Mean shift is just that idea: making an initial guess, like throwing the whatever, whatever it’s called in curling…

The iron?

Yeah. Throwing the iron or whatever it is. The stone, yeah-yeah, the stone. So, you make this initial guess, and then you make some estimation about the direction that you really wanted to be going or where you really want to land, and then you just move it a little bit. And the reason you do that with curling is because you can’t throw it the whole way, right? It would be impossible. It’s way too hard. The reason you do it here is because you actually don’t know your answer in advance just by looking at the image, but you have a general sense of ‘maybe there is something here that matches up with this feature I am looking for, and it’s part of the same face that has a feature over here, so maybe we can use those things to inform each other about the overall structure and individual positions of where things are’. So mean shift just means that iterative refinement of things.

So, again. “Deformable model fitting”: that’s squishing something around using these little features that are moved around a little bit. And the mean shift is actually related to the deformation [the mean shift is how the model is deformed — ed.].
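[To make the curling analogy concrete, here is a tiny sketch of the mean-shift idea: start from a rough guess, then repeatedly move toward the weighted mean of nearby evidence until the estimate settles. This is a generic 1-D illustration with invented numbers, not the landmark-specific variant in Jason’s paper. — ed.]

```python
import numpy as np

# Generic 1-D mean shift: like the curling stone, the estimate starts in
# roughly the right place and is nudged toward the weighted mean of
# nearby samples on every iteration.
def mean_shift_1d(samples, guess, bandwidth=1.0, iterations=50):
    x = float(guess)
    for _ in range(iterations):
        # Gaussian weights: samples near the current estimate count most.
        w = np.exp(-((samples - x) ** 2) / (2 * bandwidth ** 2))
        x_new = np.sum(w * samples) / np.sum(w)
        if abs(x_new - x) < 1e-6:  # converged
            break
        x = x_new
    return x

# Samples clustered around 5.0 plus one outlier; a rough "throw" at 3.0
# drifts to the cluster.
samples = np.array([4.8, 5.0, 5.1, 5.2, 4.9, 9.0])
print(round(mean_shift_1d(samples, 3.0), 1))  # → 5.0
```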

And the abstract is a massive bundle of words. It makes more sense to actually read through the whole paper first and then go back to the abstract and be like, “Oh! I get it now.” Because when you read the abstract the first time, if you’re not familiar with the terminology, then it can be kind of overwhelming. I mean, if you just pick a random section here it will say something like: “in this work a principled optimization strategy is proposed where non-parametric representations of these likelihoods are maximized within a hierarchy of smoothed estimates.” What?!? There were at least, like, seven terms there that maybe I’m not familiar with. I definitely still don’t know some of those terms. But when you read through the whole thing instead of just the abstract, then you can start to get a feeling for what’s going on. The only scientific papers I feel like I really understand are some of the ones around 3D scanning, because I’ve read each of them five or six times and I’ve read a lot of them. I haven’t just read one paper; I’ve read other papers that are on the same topic. This is one of the only face tracking papers I’ve read. I’ve read a more general description of how it works, but a lot of these terms I’m not familiar with because I’ve never implemented it myself and I haven’t seen them show up in other places.

But if we just go through the beginning of the introduction. He starts off and he says “Deformable model fitting is the problem of registering” — there we’re already at this weird phrase. So, “registering”. All of these words are super technical words, but they’re technical not because anyone’s trying to confuse anyone else, but because they have a very specific meaning that they want to communicate. And when you say “registering”, all he really means is ‘lining up’. If I have two things that are not aligned and I do this [brings two separate hands together to form a vertical line], then they’re registered now. But “registering”, that word, has this huge history of image registration and all these other kinds of registration. So when a scientist reads that word and says “Ahh, registering…”, they immediately have an image not just of lining something up, but of all the research that has been done about image registration in the past and other kinds of registration.

So “deformable model fitting is the problem of registering” — or lining up — “a parameterized shape model”. And again there’s a bunch of stuff in here that I’m not super familiar with. So when he says “parameterized”? Parameters: you can think of them as knobs. But when he says “parameterized shape model” I think what he’s saying is that you have this mask, this model, and there’s different ways of deforming it. You can think of each of those deformations as a knob, and those knobs are the parameters. So maybe if I go like this [opens mouth], then that’s one kind of parameter. It’s one kind of deformation. So he doesn’t just have a shape model, he doesn’t just have a mask, but he actually has knowledge about the ways in which that mask deforms. So, if I just had the mask, then I could take it in my hand and twist it in random directions, but our face doesn’t do just anything. Our face actually just does a few things. It does this [raises eyebrows] and it does this [lowers jaw], and it does this [smiles], and that’s it. So, when he said “parameterized”, I’m pretty sure that’s what he’s talking about.

So that deformable model is actually a 3D model of a face?

Right. So it’s a 3D model of a face and it looks something like this.

FaceTracker reference 3D model

And you can see, it kind of matches up with what you would expect, but it’s kind of weird in some ways. It looks weirdly elongated. This [the model’s cheeks] should be out more, right along here, and this [the model’s right cheekbone] should be protruding more, but because it’s not important, there are no features here to track, so there’s no reason for the 3D model to represent that. All of the points here are in a good place, but the overall structure can be strange sometimes. Also here across the nose, it makes it look like there’s a giant triangle right here. But again, that’s because there are no features to track there. So there’s kind of this thing inside FaceTracker, and he talks about it. This is the model. It’s a bunch of triangles, and there’s a point for each of these triangles.

FaceTracker reference 3D model in wireframe

This is the model. It’s made up of a bunch of triangles. And when he says “deformable”, he’s just saying that it’s not a fixed mask that we’re moving and orienting, but a stretchy mask that we’re stretching in different ways. And when he says “parameters”, that we have parametric control, or that it’s a “parameterized” shape model, that means it’s not just deforming in arbitrary ways; there are very specific ways in which it deforms, like the mouth moving.
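[A toy sketch of a “parameterized shape model”: landmark positions are a mean shape plus a weighted sum of deformation bases, and each weight is one “knob”. The landmarks and the single “open mouth” basis below are invented for illustration; the real model has many more points and bases. — ed.]

```python
import numpy as np

# Toy shape model: three 2-D landmarks forming a mouth. Values invented.
mean_shape = np.array([[0.0, 0.0],   # left mouth corner
                       [1.0, 0.0],   # right mouth corner
                       [0.5, 0.2]])  # lower lip

# One deformation basis ("knob"): opening the mouth pushes the lower lip
# down while leaving the corners alone.
open_mouth = np.array([[0.0, 0.0],
                       [0.0, 0.0],
                       [0.0, 0.5]])

def deform(params):
    # shape = mean shape + sum over knobs of (knob value * basis)
    return mean_shape + params[0] * open_mouth

print(deform([0.0])[2])  # knob at 0: lower lip stays at (0.5, 0.2)
print(deform([1.0])[2])  # knob at 1: lower lip drops to (0.5, 0.7)
```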

So then he says, “registering a parameterized shape model to an image…” (something like a camera image) “…such that its landmarks” (the landmarks of the shape model). And there are landmarks around each of these points, and each is like a little patch of the image that we’re looking at. The landmarks correspond to “consistent locations”. So these landmarks from the face model correspond to the same points on the input image; when he says “consistent locations”, that’s all it means. “…on the object of interest”: so, the object in the image. So this whole sentence is the description of, well, he says it’s “the problem”. “Deformable model fitting” is the problem of doing this, which is weird because he’s also describing what the solution is. This is what’s happening: this is not the first paper about deformable model fitting, and what he’s saying is that there’s a whole class of things that people do called deformable model fitting, and this is generally how they work. Then the rest of the paper is him going through his solution for that problem, which is the best way of accomplishing this kind of model fitting.

Again, it’s similar to how some other algorithms work. So, I’ll go on just a little more. He says, “It’s a difficult problem,” this whole thing of figuring out how this mask fits on someone’s face, “as it involves an optimization in high dimensions.” So that’s another really weird phrase — optimization in high dimensions. Optimization is just a scientific way of saying getting something right. So, if you have a paper that you’re writing for a class and you’re making it better and better, you could say you’re optimizing it. If you have a car that isn’t running at quite the right miles per gallon and you tweak some things, then you’re optimizing it. When you have a model of a face that you’re fitting and it’s kind of misaligned, then you want to make it fit a little better, and that would be optimizing the match. When he says “high dimensions”, what that means is that there’s a lot of knobs to tweak. If all I was optimizing was the position of my mouse on the screen, and the optimal position of my mouse on the screen would be right on the nose, then I would move my mouse slowly and now it’s optimized. The position of my mouse is optimized to be on the nose.

That’s only two dimensions. I’m moving my mouse in two dimensions, on an X-axis and a Y-axis. If I want to do something more complicated, like optimize this [indicating the default 3D model that comes with FaceTracker, which is on screen] towards having the face facing perfectly right and zoomed in so that it expands to fit the whole screen, then I already have three dimensions for rotation that I’m talking about. [rotates the model] That’s kind of facing right. And then I have another dimension for zoom, and I can get it to do what I was going for. So it’s now four dimensions we’re talking about. But for something like fitting a face, when he says “high dimensions” he’s talking about a ton of dimensions, not just two or four or something like that; actually each one of these points is a two-dimensional point and it has to move all of those points. They can’t really move completely independently, but that’s about the order of magnitude of the dimensionality problem; that’s about how many knobs there are.

So again, when I’m moving my mouse, there are kind of two knobs: the X and the Y. Then when I’m doing something like positioning the face, there’s only like three or four knobs for the orientation and the zoom. When I’m doing something like face fitting, there’s like hundreds of knobs, because there are 66 points in the model and each point has two values, or three values depending on how things are organized. So there’s a lot of knobs to tweak to get the face to fit correctly. When you have more knobs to tweak, the problem is generally harder to solve. So he says it’s a difficult problem as it involves optimization — solving a problem in high dimensions. So, it’s a difficult problem because there’s a lot of knobs to tweak.

“Appearance can vary greatly between instances of the object” — the object being the face [laughs] it’s really funny language. “Appearance” can vary greatly, so the way that a face looks to a camera can really be different “due to lighting conditions, image noise, resolution, and intrinsic sources of variability”, which probably means something like some people have beards.

And so, even here, there’s some scientific language, like “intrinsic sources of variability”. “Variability” means something like ‘there’s differences from one case to another’, but “intrinsic sources” can have a lot of different meanings. It could be something like just because of the way the object is (like just because someone happens to have a beard) or it could mean something like, maybe cameras are generally distorted in some way, and that’s an intrinsic source of variability. So there’s a lot of knobs to tweak for one, but not only are there a lot of knobs to tweak, but we’re not even sure all the time what we’re optimizing towards. We’re not always sure what the goal we’re getting towards is supposed to be.

If you put yourself in the computer’s mindset for a second, you have to imagine that you’ve just been given a face of someone from the middle of Papua New Guinea who is decked out in all of these decorations on their face. Then someone asks you to draw an outline of what you think their face is and they say, “One, put lines on where their eyebrows are. Two, put lines on where their chin is. Three, put a line on where their nose is.” You get this image and you’re like, “Well, what am I supposed to do? This nose doesn’t look like any other nose I’ve seen before. What I’m supposed to do with this chin, because it doesn’t look like any other chin I’ve seen before. And I don’t think this person even has eyebrows.” The computer is going through that experience with every single face you give it, because it’s kind of seen a lot of faces from the database that it was trained on. Still, it’s never seen the face that you’re giving it right now, and the face that you’re giving it can change in a lot of different ways. Like he says, the ways that it can change are the things like lighting or image noise or having a beard or not having a beard. So, there’s a lot of issues involved here.

How does it compare individual features in the image that it’s seeing? How does it look through the image in order to find parts of the image that match those features?

So now we’re getting into how the algorithm actually works. So far he’s just framed the general problem, which is that we want to fit this model to the face, and he said why it’s difficult. So now, to get into how to actually solve it: one of the first steps is what’s called Haar face detection [also known as Viola-Jones face detection after the authors of the original paper]. And Haar detection is kind of a classic face detection technique. There are some people you can talk to more about Haar face detection [See our interview with Adam Harvey]. But the basic idea with Haar detection is that you decompose the image into just light and dark regions and then you look for a comparison between the light and dark regions. So, in general, if you have a face, then the face has a darker area here [indicating below the chin] and darker areas here [indicating the eyes] and some dark area here [indicating the temples] – maybe a dark area here [indicating the mouth] – and they’re all kind of arranged in this manner.
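[A toy sketch of the light/dark comparison just described: a single Haar-like feature compares the summed brightness of two rectangles, and an integral image makes each rectangle sum four lookups regardless of its size, which is why a cascade of thousands of these features can still run in real time. The “eyes vs. cheeks” patch below is invented for illustration. — ed.]

```python
import numpy as np

def integral_image(gray):
    # ii[y, x] = sum of all pixels above and to the left of (y, x),
    # padded so that rect_sum needs no boundary checks.
    ii = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1))
    ii[1:, 1:] = gray.cumsum(0).cumsum(1)
    return ii

def rect_sum(ii, y, x, h, w):
    # Sum of any rectangle in four lookups.
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

# A crude "face-like" patch: a dark band (eyes) above a bright band (cheeks).
patch = np.vstack([np.full((4, 8), 40.0), np.full((4, 8), 200.0)])
ii = integral_image(patch)
eyes = rect_sum(ii, 0, 0, 4, 8)    # top band
cheeks = rect_sum(ii, 4, 0, 4, 8)  # bottom band
print(eyes < cheeks)  # → True: the feature "fires" on this arrangement
```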

So all Haar detection is doing is trying to find some area of an image that matches that collection of bright and dark regions. So everything that’s not dark that I mentioned is probably bright — the cheeks are bright and the forehead is bright. And this can be a problem actually. This algorithm can be a problem if you’re dealing with people who have like darker skin for example because it was designed by white and Asian researchers. The algorithm has to make assumptions about the relative brightness of different features. So if there’s a light above you, then there’s going to be a dark spot in your eye because it’s a shadow being cast, but not here [indicating cheek] because there’s no shadow being cast. If you have light skin it’s really easy to see that difference, but if you have darker skin then it can be harder to see that distinction. So you also have to do just some kind of preprocessing on the image in order to get a good result. You want to do what’s called image histogram equalization first and Jason is doing that in his code. He does histogram equalization which makes it so that people with all different skin tones have the same relative differences so that shadows pop out a lot. Once you’ve done the histogram equalization, then you can run the Haar detector and it will tell you, “Hey, I just found this area of the image that has something that looks kind of face like, it has the dark and light spots in the right places.”
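[A minimal sketch of the histogram equalization step mentioned above: remap gray levels so their cumulative distribution spreads over the full range, which exaggerates relative differences regardless of overall brightness. OpenCV’s equalizeHist does essentially this; the implementation below is a simplified stand-in. — ed.]

```python
import numpy as np

def equalize(gray):
    # Histogram equalization for an 8-bit grayscale image: build the
    # cumulative distribution of gray levels and use it as a lookup
    # table that stretches the used levels across 0..255.
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    lut = np.clip(
        np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255), 0, 255
    ).astype(np.uint8)
    return lut[gray]

# A dim, low-contrast patch (values 100..120) stretches to the full range.
patch = np.tile(np.arange(100, 121, dtype=np.uint8), (8, 1))
eq = equalize(patch)
print(eq.min(), eq.max())  # → 0 255
```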

So the Haar detector assumes that your face is oriented like this [vertical with good posture], kind of straight and straight and straight, and if I’m like this [tilts head to the side] then the Haar detector can’t see me, and if I’m like this [tilts head back] the Haar detector has trouble seeing me. And if I do something like this [covers right eye with right hand] it definitely can’t see me; if I just cover up half my face it has no idea, because it needs all of those things to be there. One way of talking about it is that it’s a holistic approach. Jason uses the word “holistic” in the paper to refer to something else, but I’ll say this is a kind of holistic approach with Haar detection. So, that’s the first thing that happens with FaceTracker: it runs a Haar detector that figures out generally where something that looks like a face might be in the image.

And then from there it gets into the landmark fitting and that’s what this image is showing.

Saragih's Landmark fitting illustration

This image is showing that from the general idea of where the face can be you can throw down some initial markers — you can throw your curling stone in the generally right direction (I like that example; I can go to Canada and use it). So you throw your stone in the right direction, you make some guesses based on this initial Haar detection and then from each of these guesses you have some information that looks like this [indicating the highlighted regions in the section of the diagram labeled “Image and Search Windows”]. So he shows if you’re looking at a chin, at this landmark on the chin, then you get some kind of feature that looks like this [indicating the zoomed images labeled “p(ln|x)”]. And this looks really funny. It doesn’t look like the chin because it looks like it’s been normalized and inverted. So even though this part of the chin is darker, here it appears brighter. I think it’s inverted or might just be equalized in a weird way. I’m not sure. There’s some kind of processing that happens on the landmark so that it will be more consistent in different lighting situations and different skin colors and all that sort of thing.

After you get this initial guess and you lay down all these landmarks, then you feed them into this model [indicating the middle image in the diagram, labeled “Optimization”], which tells you what that landmark should look like. So this is what it looks like right now which is our initial guess, but we have something to compare it to which is what it should look like. Then you run through that mean shift algorithm — I mean it’s not exactly mean shift; it’s related to mean shift — but we run through that algorithm slowly refining the position of the landmark until it matches up better.
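
The refinement loop described above can be sketched as follows. This is a toy illustration of the mean-shift-flavored idea, with a made-up `response` map standing in for the per-landmark comparison against the model:

```python
import numpy as np

def refine_landmark(response, x, y, radius=3, iters=10):
    """Nudge landmark (x, y) toward the response-weighted mean of its window.

    `response` is a 2D map where response[row, col] scores how well the
    local image patch at that pixel matches the landmark's appearance model.
    Repeating the shift until it stops moving is the mean-shift-like step.
    """
    h, w = response.shape
    for _ in range(iters):
        r0, r1 = max(0, y - radius), min(h, y + radius + 1)
        c0, c1 = max(0, x - radius), min(w, x + radius + 1)
        window = response[r0:r1, c0:c1]
        total = window.sum()
        if total <= 0:
            break  # no evidence nearby; give up
        rows, cols = np.mgrid[r0:r1, c0:c1]
        # Shift to the response-weighted centroid of the search window.
        ny = int(round((rows * window).sum() / total))
        nx = int(round((cols * window).sum() / total))
        if (nx, ny) == (x, y):
            break  # converged
        x, y = nx, ny
    return x, y
```

The real algorithm runs a step like this for every landmark at once, and, as Kyle explains next, couples the landmarks to each other rather than letting each one wander independently.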

And there’s one more thing going on, which is that the positions of these landmarks can’t move completely independently. The eyebrows are great examples actually. If I have these landmarks here and these landmarks here [indicating both rows of landmarks on the eyebrows in the “Image and Search Windows” section of the diagram]. Well, let’s just say this one and this one, the two outermost eyebrow landmarks. Each one has an optimal way that it should look. Right now it looks like this [indicating the appearance of the landmark in the row of images labeled “p(n|x)”] but it should look like this [indicating the appearance of the landmark in the image labeled “Optimization”]. And if we were to optimize each of those landmarks individually, then we would move them around until they were in just the right spot and then we’re done. And then we’d move this next one into the right spot until it’s done. However, they are kind of tied to each other, so with FaceTracker all of the different features are tied to each other a little bit. The eyebrows are always assumed to be the same height. They can move independently a little bit, but in general if you raise your eyebrows you raise them at the same time. It’s very uncommon in day-to-day experience to raise just one eyebrow. And because FaceTracker was trained on day-to-day images it doesn’t model different eyebrows very well.

So it makes some guesses about things that are tied together. Some other things that are tied together are the eye openness. You can’t really tell it to detect one eye being open while the other eye is closed. The mouth is another good example. If you open up your mouth, then it’s not like one side of your mouth could get detected as being closed and the other side could be detected as open. The mouth exists in one of these states that is two-dimensional, where you’re opening it a certain amount and you’re making it wider a certain amount. So the idea is that all these points are kind of constrained to each other and there are some relationships between them, but they also have individual properties that you are trying to optimize for. You’re trying to twist those knobs until the shape looks the way it should. And so this mean shift process isn’t just about fitting each of these landmarks by itself, but a lot about fitting these landmarks relative to the other landmarks as well.
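
One standard way to encode "points constrained to each other", in the spirit of what's described above (my own sketch, not Jason's formulation), is a point distribution model: stack the landmarks into one vector, learn the few directions the training faces actually vary along, and project candidate shapes back onto that subspace:

```python
import numpy as np

def fit_shape_model(shapes, n_modes=2):
    """Learn a mean shape plus principal modes of variation.

    `shapes` is (n_examples, 2 * n_landmarks): each row is one face's
    landmarks flattened as [x0, y0, x1, y1, ...].
    """
    mean = shapes.mean(axis=0)
    # PCA via SVD of the centered data: the rows of vt are the modes.
    _, _, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    return mean, vt[:n_modes]

def constrain(shape, mean, modes):
    """Project a candidate shape onto the learned subspace.

    Configurations the training data never showed (like one independently
    raised eyebrow) get squashed out; only the learned 'knobs' survive.
    """
    coeffs = modes @ (shape - mean)
    return mean + coeffs @ modes
```

With, say, 66 landmarks that's a 132-dimensional problem reduced to a handful of coefficients, which is exactly the "bottom hundred knobs all have the same value" hint from the analogy below.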

So, those constraints are to reduce the difficulty of the optimization.

Exactly, it’s to reduce the number of knobs you have to deal with, or to reduce the number of dimensions that you are talking about. And like I said earlier, when the number of dimensions for a problem is really high it makes it really hard to solve because there are so many different ways of solving it. Imagine you sit down in front of a box that has two knobs and someone says, “Please turn these knobs until you hear the tone.” And there is only one right answer, and it’s 5 o'clock and 6 o'clock, and then you hear the tone: “Ah, phew! Got it.” Now if someone sits you down in front of a box with 200 knobs and they are like, “Please turn the knobs until you hear the tone.” And the answer is like 5 o'clock, 6 o'clock, 7 o'clock, 5 o'clock, 2 o'clock, then you are never going to find it because it takes too much time to try tweaking all of them and they are all independent of each other. So we know what this is doing. It’s kind of like someone giving you a hint like: “By the way, the bottom hundred knobs all have the same value.” Then all of a sudden your problem is significantly reduced. So, what he is doing here is saying we can make some guesses about how the features of the face move together. So instead of trying to optimize them independently we can kind of use the information from one to reduce the number of dimensions.

So is that one of the things that makes this algorithm run faster than the slow ones you were talking about?

Yeah. I think so. I think that’s one of the things that’s going on here. The other algorithms have some kind of dependency between the landmarks as well, but I think he is using it in a different way here because he keeps saying in the paper that it converges faster than the other algorithms, which I suspect would mean that he is using this kind of dimensionality reduction in a really smart way. So, yeah, this is my high-level understanding.

And then the thing we’re comparing against as we move each of the features around — the thing we’re comparing it to — is the result of the training data?


The middle graph part of that graphic.

Yeah. So the training data tells you what the thing you’re trying to optimize should look like, yeah.

Great! I think that really covers it well. Were there any other things to talk about?

I thought it was actually pretty good. Let’s see if there are any other cool pictures, just for a second. Yeah, actually, so here.

Saragih's response maps illustration

He’s showing what the responses for the different landmarks look like. This is good. So you can see — this is maybe for different databases, like what different noses look like in different databases. Oh no, sorry, these aren’t for different databases; I’m not actually sure what that’s comparing.

Saragih mean-shift illustration

This is just another description of what mean shift generally looks like: this idea of picking a point then refining it.

It’s always fun to just jump through the paper and look for images and see if you get a feeling for something like that…

Is that also a way for the author to tell you what’s important in the paper?

I think so. I think sometimes, but it’s complicated. You can’t quite make that assumption because when people publish papers they’re often constrained in different ways. Sometimes the person that they are working under tells them that they have to have certain images, or the person that they’re trying to get it published with tells them you can’t use these images because they don’t add to the paper, even though the author might think they’re really important. So, in general I would say that images are demonstrating important things about the paper, or things that might not be intuitive from reading pseudocode or text, but I wouldn’t quite make the assumption that those are always the important things.

Dustyn was telling me that oftentimes the journals actually end up owning the copyright of the images too, so you can’t reproduce them yourself. So the images people generate specifically for journals are not necessarily the ones they most treasure.

Yeah, I think I have heard that. Or generate a huge database of different instances so that no one owns all of those instances. Yeah.

This is a really nice one.

Saragih landmark optimizations under occlusion

Jason shows that if you take the same sequence of four images 1, 2, 3, 4 and then use different algorithms to optimize the landmark positions, you can get different results over the course of a face being occluded. He moves this newspaper across his face and you can see that with good tracking it should look like this [pointing at an image from the leftmost column], you should have the face in a good position. Then it’s messed up, but that’s only because he’s completely obscuring his eyes. And then [after he removes the newspaper] it’s still a little messed up, but then when he takes it away it’s in the right place again. And then there’s another algorithm just above that where it shows the same thing going on, but then when he takes the newspaper off it’s really messed up. And then there are others still. And you can see these shapes don’t look anything like the faces. You can see there’s the jaw here, and the nose here, and the eye here, but — I’d have to read through again — whatever algorithm he’s using to do this is probably really ignoring those kinds of constraints that I was talking about earlier, the things that are learned from the training data.

One other small point that was not obvious: you said that using that 3D model means that there’s actually some 3D information associated with the face points.

Right. That’s a really good point. The way that FaceTracker works, it’s not just working in 2D. The way I understand it, it bootstraps everything by making a 2D guess about stuff. There’s the Haar detection, and it figures out the general location, size, and orientation of the face. And then you put down some initial points in 2D. But this process of optimizing them doesn’t just have to do with the image content. It doesn’t just have to do with the landmarks or the features. It also has to do with the constraints between the features that I was talking about, like the eyebrows moving together. And it has to do with constraints of the face as a whole. So if this algorithm keeps moving a bunch of the features in a way that it’s leading you to think that it’s a rotated face — if it’s moving the features at the top to the right, and the features at the bottom to the left — then what’ll happen is, internally FaceTracker will start to think, ‘oh, this face looks like it’s rotated’. So when it makes more new guesses about the positions of the landmarks, it’s going to assume that the landmarks are rotated now. And it’ll do comparisons based on that. So not just are the eyebrows tied together and the mouth points tied together, but actually all of the points on the face are tied together by the orientation of the face. And there’s some kind of complex feedback I don’t understand between the points in 2D space and the points in 3D space. Because once you have the points in 2D space, you can get an idea of what the 3D points should be like, and then you can smooth out those points in 3D space to represent a face better and then feed that back into 2D space. And there’s a name for this. In 2D space when you’re moving things around I think it’s called procrustes alignment [original paper: here and another interesting one here], and that’s just referring to jiggling things around a little bit.
But when you’re in 3D space then it has another name and you’re orienting and scaling stuff. So, internally there is a 3D representation of the face, but it’s only 3D in a faux 3D sense because you don’t really know about the field of view of the camera, for example. Since you don’t know about the field of view of the camera or other features of the camera, you don’t really know how far away the face is, or how big it is, or the absolute distance from the camera to the face, or any of that stuff.
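
Procrustes alignment, the 2D "jiggling" mentioned above, finds the similarity transform (rotation, scale, translation) that best maps one point set onto another. This is my own minimal sketch of the standard SVD-based solution, not FaceTracker's code:

```python
import numpy as np

def procrustes_align(src, dst):
    """Return src mapped onto dst by the best-fit similarity transform.

    Finds scale s, rotation R, and translation t minimizing
    sum ||s * R @ src_i + t - dst_i||^2 (the classic Procrustes problem).
    `src` and `dst` are (n_points, 2) arrays of corresponding landmarks.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    a, b = src - mu_s, dst - mu_d          # center both point sets
    m = a.T @ b                            # cross-covariance matrix
    u, _, vt = np.linalg.svd(m)
    r = vt.T @ u.T                         # optimal rotation (Kabsch)
    if np.linalg.det(r) < 0:               # fix improper reflections
        vt[-1] *= -1
        r = vt.T @ u.T
    s = np.trace(r @ m) / (a ** 2).sum()   # optimal scale
    return s * a @ r.T + mu_d
```

The same machinery extends to 3D point sets by making the matrices 3x3, which is the orienting-and-scaling version Kyle alludes to.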

Does that 3D information help with occlusion, like knowing you shouldn’t be looking for the left eye because the face is turned to a three-quarter view?

Yeah, it’s hard to say. It should. It should help with that, but I also know that there are just some modes inside FaceTracker where it’s either in frontal mode or left-facing mode or right-facing mode. So I think it probably switches modes as it’s tracking the face. It also tracks the face from frame to frame. It’s not reinitializing from the Haar detection every frame. It does that once to find the face, and then as the face moves from frame to frame it just refines its guess. So you have to start by looking at it straight so the Haar detector can get you, and then you can do all sorts of crazy stuff because it’s refining its guess to catch up with you. But as you start moving slowly to look in an extreme direction, then it might just have some mode internally where it says, ‘oh, you’ve looked too far, stop looking for the points on that side of the face because you can’t see them anymore’. I’m guessing it just has a mode once you reach that orientation.
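
The detect-once-then-refine behavior described here can be sketched as a simple loop. `detect_face` and `refine_landmarks` are hypothetical stand-ins for the slow Haar search and the fast per-frame refinement, not FaceTracker's real API:

```python
def track(frames, detect_face, refine_landmarks):
    """Detect once with the (slow) Haar detector, then refine frame to frame.

    `detect_face(frame)` returns initial landmarks or None (slow, Haar-based);
    `refine_landmarks(frame, prev)` nudges last frame's landmarks (fast)
    and returns None when it loses the face, triggering re-detection.
    """
    landmarks = None
    for frame in frames:
        if landmarks is None:
            landmarks = detect_face(frame)                  # slow path: searching
        else:
            landmarks = refine_landmarks(frame, landmarks)  # fast path: tracking
        yield landmarks
```

This structure is also why the frame rate drops while the tracker is searching and recovers once it locks on, which comes up in the next question.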

So that initial Haar detection phase is part of why FaceTracker runs slower when it’s searching for a face than once it’s found one?

Exactly, yeah. Because Haar detection is a much slower algorithm than doing the landmark mean shift deformation.

And then when it switches to the mode where it uses the time-based aspect, you’re saying it’s refining based on where it knew the points were last frame, for example. So it looks nearby to that to start?

Yeah, right. Exactly. That’s also why you’ll get errors when you move your face really fast, especially in and out, because it makes assumptions about how far or how fast you could possibly be moving your face, and if you start moving it really fast then it gets surprised and doesn’t know where to look for things, so it can lose your face a little bit. Especially in and out is really uncommon. It’s pretty common for people to do this [tilts head side-to-side]. Not so common for people to do this [raises head up and down], and definitely uncommon for people to do this [pushes head forward and back]. So it makes some guesses about what kind of movement is reasonable and it searches within reasonable limits for that kind of movement.

So I guess, as a final question: what other avenues of research has this opened up for you? After having gone through this and understood and learned about it, what now do you think is the next step? What’s the next door in the house of knowledge that opens up for you?

This is kind of connected to the Microsoft skeleton tracking paper in a way, because they are both using this idea of landmarks and identifying parts of something based on local information, but also refining in a global sense as well. So I don’t know if I can say that one really led me to the other, but I would say that seeing that both of the techniques are based on that local refinement and global refinement simultaneously helps me understand both of them together better. And this has also got me interested in other techniques for doing 6 degree of freedom head pose estimation. That just means 3 degrees for translation or position: x, y, and z, and then 3 degrees for orientation, which is like, you know, this [turns head left and right, tilts head up and down, wobbles head side-to-side]. There are some other techniques for doing 6 degree of freedom head tracking. One of them is with the Kinect: using depth data you can do something similar to what they did in the Microsoft skeleton tracking paper for tracking the head position and orientation. It’s completely different from anything that Jason is doing, so it’s really stable in some situations and gives you weird results in other situations. You can fake it out by making something small and face-shaped with your hand and it will track it as it’s moving around.

Is that how the face tracking stuff in the Microsoft Kinect for Windows API works?

I don’t know I haven’t looked at that. I wouldn’t be surprised because I think that they are really leveraging that skeleton tracking algorithm.

Okay great thanks a lot. This was great.