Eye State Detection in ML.NET - Part 2/2

This two-part blog series explores the use of ONNX models in the ML.NET framework to build a pipeline for detecting eye states (open/closed) in images and video frames captured via webcams. Part two (this blog) focuses on using the MediaPipe Face Landmarks Detection model to identify the landmarks of the left and right eyes in the cropped images, and on computing the Eye Aspect Ratio (EAR) to detect eye state.
Previous in the Series - Eye State Detection in ML.NET - Part 1/2
Overview
In the first part of this series, we tackled the challenge of finding faces in images using ML.NET and the YuNet model. Now, let’s take things a step further!
In this post, we’ll dive into the fascinating world of facial landmarks—specifically, how to pinpoint the eyes and figure out if they’re open or closed. We’ll use the MediaPipe Face Landmarks model, but with a twist: we’ll convert it from TensorFlow Lite to ONNX so it plays nicely with .NET.
Along the way, I’ll show you how to prepare your data, run the model, visualize the results, and finally, use a clever metric called the Eye Aspect Ratio (EAR) to detect eye state. Whether you’re building a drowsiness detector or just curious about computer vision, you’ll find plenty of practical tips here!
Converting the MediaPipe Model (TFLite to ONNX)
You should download the MediaPipe Face Landmarker task before you continue.
The MediaPipe Face Landmarks model is distributed as a .task file, which is actually a compressed archive containing the TFLite model and some metadata.
Here's a little hack - you can rename the .task file to .tar.gz, extract it, and you'll find the TFLite model inside (usually named something like model.tflite).
⚠️ Note: This trick works for the MediaPipe task file used in this post, but may not work for other task files. Google may change the format or structure in the future, so always check the contents after extraction!
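If you'd rather script the extraction than rename the file by hand, a few lines of C# will do it. This is just a sketch: it assumes .NET 7 or later (for System.Formats.Tar), uses illustrative paths, and relies on the task file really behaving like a gzipped tar archive, as the trick above suggests.

using System.IO;
using System.IO.Compression;
using System.Formats.Tar;

// illustrative path - point this at wherever you saved the downloaded task file
using var taskFile = File.OpenRead("./models/face_landmarker.task");
// treat the .task file as a gzipped tar (as described above) without renaming it
using var gzip = new GZipStream(taskFile, CompressionMode.Decompress);
TarFile.ExtractToDirectory(gzip, "./models", overwriteFiles: true);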
Once you have the TFLite model, you can convert it to ONNX format using Python and tf2onnx:
%pip install numpy==1.26.0 tf2onnx tensorflow

import tf2onnx

(onnx_model, metadata) = tf2onnx.convert.from_tflite(
    tflite_path="./models/face_landmarks_detector.tflite"
)
print(metadata)

with open("./models/face_landmarks_detector.onnx", 'wb') as f:
    f.write(onnx_model.SerializeToString())
Unlike the other code blocks in this post (and in the first part), the conversion should run in a Python environment, preferably in a Jupyter notebook. It is possible to run it in the .NET interactive notebook as well, but to keep the dependencies manageable, I suggest running it in a separate Python environment.
Why numpy 1.26.0? Some versions of tf2onnx and TensorFlow have compatibility quirks with newer numpy releases. Sticking to 1.26.0 ensures a smooth conversion process—no mysterious errors or version headaches!
This quick Python snippet takes the original TFLite model and converts it to ONNX, making it compatible with ONNX Runtime and .NET. The conversion is fast and preserves all the model’s capabilities—plus, you get to peek at the model’s metadata for extra insight!
Setting Up the Environment
I recommend continuing in the same directory you used for the first post. You should already have the cropped images in the images/cropped folder, along with the converted onnx model for landmark detection.
EyeStateDetection
|__images
| |__original
| | |__image1.jpg
| | |__image2.jpg
| | |__...
| |__cropped
| | |__0.jpg
| | |__1.jpg
| | |__...
| |__...
|__models
| |__face_detection_yunet_2023mar.onnx
| |__face_landmarks_detector.onnx // created after conversion
| |__face_landmarks_detector.tflite // obtained after extracting the task file (after renaming to .tar.gz)
|__MediaPipeLandmarkDetection.dib // the interactive notebook to run code in this post
|__MediaPipeModelConversion.ipynb // the python notebook used for model conversion
|__YuNetFaceDetection.dib
Just like before, we’ll use ONNX Runtime for inference and ImageSharp for image processing. The workflow assumes you’ve already got your cropped face images from the previous step—so we’re ready to roll!
#r nuget:Microsoft.ML.OnnxRuntime
#r nuget:SixLabors.ImageSharp
#r nuget:SixLabors.ImageSharp.Drawing
using System.IO;
using System.Numerics;
using System.Threading.Tasks;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using SixLabors.Fonts;
using SixLabors.ImageSharp;
using SixLabors.ImageSharp.Processing;
using SixLabors.ImageSharp.PixelFormats;
using SixLabors.ImageSharp.Drawing;
using SixLabors.ImageSharp.Drawing.Processing;
Here we bring in all the libraries we’ll need. ONNX Runtime powers the model inference, while ImageSharp handles all things image-related. This setup is flexible and works great for .NET-based computer vision projects!
Set up directories and load the cropped face images:
// image directories
const string croppedImages = "./images/cropped";
const string landmarkedImages = "./images/landmarked";
// create directories
foreach (var path in new string[]{croppedImages, landmarkedImages}) {
    if (!Directory.Exists(path)) {
        Directory.CreateDirectory(path);
    }
}
// Load the images
var images = Directory.GetFiles(croppedImages).Select(imagePath => Image.Load<Rgb24>(imagePath)).ToArray();
Organizing your images into clear directories makes it much easier to keep track of your data as it moves through each stage of the pipeline. This code ensures all necessary folders exist and loads your cropped face images for processing.
Loading the MediaPipe Face Landmarks Model
After converting the model, it’s time to load it up in C# and take a peek at its metadata:
var session = new InferenceSession("./models/face_landmarks_detector.onnx");
(session.ModelMetadata, session.InputMetadata, session.OutputMetadata)
This line loads the ONNX model into an inference session, ready for predictions. Checking the metadata is a great way to confirm the model’s input and output expectations before you start feeding it data!
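If you want that metadata in a more readable form, a small sketch like the one below lists each declared input and output with its shape and element type (the names come straight from the session, nothing is hard-coded):

// optional: list the model's declared inputs and outputs with their shapes
foreach (var (name, meta) in session.InputMetadata)
    Console.WriteLine($"input  {name}: [{string.Join(", ", meta.Dimensions)}] ({meta.ElementType})");
foreach (var (name, meta) in session.OutputMetadata)
    Console.WriteLine($"output {name}: [{string.Join(", ", meta.Dimensions)}] ({meta.ElementType})");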
Preprocessing Images
The model expects 256x256 images in NHWC format, with pixel values normalized to roughly the [-1, 1] range. Here's how you can get your images ready:
int N = images.Length;
const int H = 256;
const int W = 256;
const long imageDataLength = 3 * H * W; // C*H*W
long[] inputShape = [N, H, W, 3]; // this model supports batching

// convert images to OrtValue in NHWC format
float[] input = new float[N * imageDataLength];
Parallel.For(0, N, n => {
    var image = images[n];
    Span<Rgb24> pixels = new Rgb24[H * W];
    image.CopyPixelDataTo(pixels);
    for (int idx = 0; idx < pixels.Length; idx++) {
        var pixel = pixels[idx];
        // mediapipe model expects normalized channel values
        input[n * imageDataLength + 3 * idx] = (pixel.R - 127f) / 128f;
        input[n * imageDataLength + 3 * idx + 1] = (pixel.G - 127f) / 128f;
        input[n * imageDataLength + 3 * idx + 2] = (pixel.B - 127f) / 128f;
    }
});
Want to know all the details about the model’s inputs, outputs, and training data? Check out the official MediaPipe Face Mesh V2 Model Card. It’s a goldmine for understanding what the model expects and how it was built!
Preprocessing is crucial! Here, we normalize each pixel and pack all the crops into a single batched NHWC buffer; the crops from the previous step should already be 256x256, which is exactly what CopyPixelDataTo relies on. The normalization step helps the model make accurate predictions, and batching speeds up inference when you have multiple images.
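If your crops might arrive at a different size, a defensive resize before the pixel-copy loop above would look like this (purely an optional guard, not part of the original pipeline):

// optional guard: make sure every crop is 256x256 before filling the input tensor
foreach (var image in images)
{
    if (image.Width != W || image.Height != H)
        image.Mutate(ctx => ctx.Resize(W, H));
}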
Running Inference and Extracting Landmarks
Now for the magic! We run the model and extract the 3D facial landmarks for each image:
var ortValue = OrtValue.CreateTensorValueFromMemory(input, inputShape);
var sessionInputs = new Dictionary<string, OrtValue> { { session.InputNames[0], ortValue } };
var runOptions = new RunOptions();
var sessionOutputs = session.Run(runOptions, sessionInputs, session.OutputNames);
var result = new Dictionary<string, float[]>(
    session.OutputNames.Select((name, index) => {
        var output = sessionOutputs[index];
        return new KeyValuePair<string, float[]>(name, output.GetTensorDataAsSpan<float>().ToArray());
    })
);
runOptions.Dispose();
sessionOutputs.Dispose();
This block runs the ONNX model and collects the output for each image. The result contains all the facial landmark coordinates you’ll need for the next steps. It’s fast, efficient, and works seamlessly with .NET!
The model returns 478 landmarks per image, each with x, y, and z coordinates, so the "Identity" output holds 478 × 3 = 1434 values per image (which is where the 1434 stride in the indexing code below comes from).
Why three coordinates? The z value gives you depth information, which is super useful for augmented reality (AR) and 3D applications. Even if you’re only using x and y for 2D overlays, it’s cool to know you’ve got a full 3D face mesh at your fingertips!
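To make that flat output a little more tangible, here's a hypothetical helper (not used by the pipeline below, which indexes the array directly) that reads a single landmark's x, y, and z from the "Identity" output:

// sketch: the "Identity" output is a flat buffer of 478 * 3 floats per image,
// laid out as (x, y, z) triplets; this hypothetical helper pulls out one landmark
(float X, float Y, float Z) GetLandmark(float[] output, int imageIndex, int landmarkIndex)
{
    int offset = imageIndex * 478 * 3 + landmarkIndex * 3;
    return (output[offset], output[offset + 1], output[offset + 2]);
}

var firstLandmark = GetLandmark(result["Identity"], 0, 0); // landmark 0 of the first image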
Visualizing Eye Landmarks
Here’s where things get fun! We focus on the landmarks for the left and right eyes, and draw them on the images so you can see exactly what the model found.
If you want to see where every landmark is on the face, check out this landmarks reference image. It’s a fantastic visual guide for mapping indices to facial features!
public struct Landmark {
    public float X;
    public float Y;
}

const int numLandmarks = 478;
int[] leftEyeLandmarks = [263, 387, 385, 362, 380, 373];
int[] rightEyeLandmarks = [33, 160, 158, 133, 153, 144];
int[] bothEyeLandmarks = [..leftEyeLandmarks, ..rightEyeLandmarks];
List<Landmark>[] allLandmarks = Enumerable.Range(0, N).Select(_ => new List<Landmark>()).ToArray();
var font = SystemFonts.CreateFont("Arial", 12);

Parallel.For(0, N, n => {
    var image = images[n];
    var landmarks = allLandmarks[n];
    // upscale the 256x256 crop 3x so the drawn points and labels are easier to see
    image.Mutate(ctx => ctx.Resize(3 * W, 3 * H));
    foreach (int i in bothEyeLandmarks) {
        // each landmark occupies 3 floats (x, y, z) => 478 * 3 = 1434 values per image
        float x = result["Identity"][1434 * n + 3 * i];
        float y = result["Identity"][1434 * n + 3 * i + 1];
        landmarks.Add(new() { X = x, Y = y });
        var color = leftEyeLandmarks.Contains(i) ? Color.Green : Color.Blue;
        image.Mutate(ctx => {
            ctx.Draw(color, 1, new EllipsePolygon(3 * x, 3 * y, 2));
            ctx.DrawText($"{bothEyeLandmarks.ToList().IndexOf(i)}", font, color, new PointF(3 * x, 3 * y));
        });
    }
    image.SaveAsJpeg($"{landmarkedImages}/{n}.jpg");
});
This visualization step is not just for show—it’s a fantastic way to debug and understand what the model is doing. By drawing the landmarks, you can instantly see if the model is accurately finding the eyes, and you’ll have images to share or use in presentations!
Eye Aspect Ratio (EAR) and Eye State Detection
The Eye Aspect Ratio (EAR) is a clever little metric for figuring out if an eye is open or closed. It’s based on the distances between specific eye landmarks. If the EAR drops below a certain threshold, the eye is probably closed!
Why use 6 points for EAR? This approach is inspired by research (see this paper), which shows that using six well-chosen landmarks around the eye gives a robust, rotation-invariant measure of eye openness. It's a simple formula, but it works wonders for blink and drowsiness detection!
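Concretely, with the six points ordered P1 through P6 as in that paper (P1 and P4 are the horizontal eye corners, P2/P3 sit on the upper eyelid, P5/P6 on the lower one), the ratio is:

EAR = (|P2 - P6| + |P3 - P5|) / (2 * |P1 - P4|)

where |A - B| is the Euclidean distance between two landmarks. The value stays comfortably above the threshold while the eye is open and collapses toward zero as it closes; this is exactly what CalculateEAR computes below.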
public struct EyeStateDetection {
    public float LeftEyeAspectRatio;
    public float RightEyeAspectRatio;
    public bool LeftEyeOpen;
    public bool RightEyeOpen;
}

EyeStateDetection[] eyeStateDetections = Enumerable.Range(0, N).Select(_ => new EyeStateDetection()).ToArray();

float Distance(Landmark A, Landmark B) => new Vector2(A.X - B.X, A.Y - B.Y).Length();
float CalculateEAR(Landmark P1, Landmark P2, Landmark P3, Landmark P4, Landmark P5, Landmark P6) =>
    (Distance(P2, P6) + Distance(P3, P5)) / (2 * Distance(P1, P4));

Parallel.For(0, N, n => {
    // indices 0-5 are the left-eye landmarks, 6-11 the right-eye landmarks (see bothEyeLandmarks above)
    var ears = new int[][] {[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]}.Select(indices => {
        var coords = indices.Select(index => allLandmarks[n][index]).ToArray();
        return CalculateEAR(coords[0], coords[1], coords[2], coords[3], coords[4], coords[5]);
    }).ToArray();
    eyeStateDetections[n].LeftEyeAspectRatio = ears[0];
    eyeStateDetections[n].RightEyeAspectRatio = ears[1];
    // eye considered closed if EAR <= 0.2
    eyeStateDetections[n].LeftEyeOpen = ears[0] > 0.2;
    eyeStateDetections[n].RightEyeOpen = ears[1] > 0.2;
});
eyeStateDetections
The EAR formula is simple but powerful. It’s widely used in blink and drowsiness detection research. By comparing the aspect ratio to a threshold, you can reliably tell if someone’s eyes are open or closed—no fancy classifiers needed!
If the EAR is below a certain threshold (commonly 0.2), the eye is considered closed.
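If you take this to live webcam video, a single low-EAR frame usually just means a blink. A common trick (not part of this post's pipeline, shown here only as a hedged sketch with illustrative names and numbers) is to require a run of consecutive closed frames before raising a drowsiness alert:

// hypothetical sketch: flag drowsiness only after several consecutive closed frames
const float earThreshold = 0.2f;
const int closedFramesForAlert = 15; // illustrative, roughly half a second at 30 FPS
int closedStreak = 0;

bool IsDrowsy(float leftEar, float rightEar)
{
    bool bothClosed = leftEar <= earThreshold && rightEar <= earThreshold;
    closedStreak = bothClosed ? closedStreak + 1 : 0;
    return closedStreak >= closedFramesForAlert;
}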
Cleanup
Don’t forget to tidy up! Dispose of the inference session and images to free up resources:
session.Dispose();
foreach(var image in images) { image.Dispose(); }
Always clean up resources when you’re done. This keeps your application running smoothly and prevents memory leaks, especially when working with large images or running many inferences in a row.
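As a side note, if you move this code out of a notebook and into a regular console app, C# using declarations give you the same cleanup automatically; a minimal sketch:

// sketch: scope-based disposal in a plain console app
using var session = new InferenceSession("./models/face_landmarks_detector.onnx");
using var runOptions = new RunOptions();
// ... preprocessing and session.Run(...) as shown above ...
// both objects are disposed automatically when they go out of scope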
And that’s it! You’ve now got a full pipeline: from face detection and cropping (in Part 1) to landmark detection and eye state analysis (in this post), all running smoothly in .NET with ONNX Runtime and ML.NET. Whether you’re building a real-world application or just experimenting, I hope you found this journey as exciting as I did.
Happy coding! 🚀
References
- ONNX Runtime in C#
- ImageSharp Library
- MediaPipe Face Landmark Detection
- EAR Research Paper
- tf2onnx Python Library
- GitHub Gist with Source Code
This blog was written with AI assistance.
You can also read this post on Medium.