Eye State Detection in ML.NET - Part 1/2

June 22, 2025 · 10 minute read

This two-part blog series explores how to use ONNX models with the ML.NET framework to build a pipeline for detecting eye states (open/closed) in images and in video frames captured via webcam. Part one (this post) focuses on leveraging the YuNet Face Detection model to identify and crop the most prominent face.


Overview

Face detection is the essential first step in any robust eye state detection pipeline. Before we can analyze the state of the eyes, we must first accurately locate the face within an image. This is because subsequent steps—such as facial landmark detection and eye region analysis—depend on a reliable face region to work from. Without precise face detection, landmark detection can fail or produce inaccurate results, leading to unreliable eye state classification.

In this post, we’ll focus on the face detection stage—using ML.NET, ONNX Runtime, and the YuNet face detection model. We’ll walk through the workflow, from setting up the environment and preprocessing images, to running inference, post-processing results, and visualizing detections. Each step builds on the last, so by the end, you’ll have a clear picture of how to prepare your data for more advanced facial analysis.

The code snippets below are taken from an interactive notebook (Polyglot Notebook) and are intended to run on the .NET Interactive runtime rather than the standard .NET runtime.


Environment Setup

To follow the code in this post, you should first download the YuNet Face Detection model. I recommend organizing your project with a folder structure similar to this — it will make the workflow much easier to follow through each stage:

EyeStateDetection
|__images
|   |__original
|     |__image1.jpg
|     |__image2.jpg
|     |__...
|__models
|  |__face_detection_yunet_2023mar.onnx
|__YuNetFaceDetection.dib // the interactive notebook containing the code blocks below

To get started, we need to bring in a couple of powerful libraries. ONNX Runtime lets us run machine learning models in the ONNX format, making our solution flexible and high-performance. ImageSharp, on the other hand, is a modern .NET library for working with images—perfect for loading, manipulating, and saving our data.

#r "nuget:Microsoft.ML.OnnxRuntime"
#r "nuget:SixLabors.ImageSharp"
#r "nuget:SixLabors.ImageSharp.Drawing"

Next, let’s import the namespaces we’ll use throughout the project. These cover everything from file I/O and math to image processing and drawing.

using System.IO;
using System.Numerics;
using System.Threading.Tasks;

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

using SixLabors.ImageSharp;
using SixLabors.ImageSharp.Processing;
using SixLabors.ImageSharp.PixelFormats;

using SixLabors.ImageSharp.Drawing;
using SixLabors.ImageSharp.Drawing.Processing;

ONNX Runtime is our engine for running the face detection model, while ImageSharp gives us all the tools we need to handle images in .NET. With these in place, we’re ready to start working with our data.


Directory Preparation and Image Loading

Organization is key! We’ll set up directories for original, resized, detected, and cropped images. This way, we can keep track of our data at every stage and avoid confusion as we process each image.

// image directories
const string originalImages = "./images/original";
const string resizedImages = "./images/resized";
const string detectedImages = "./images/detected";
const string croppedImages = "./images/cropped";

// create directories
foreach (var path in new string[]{originalImages, resizedImages, detectedImages, croppedImages}) {
    if (!Directory.Exists(path)) {
        Directory.CreateDirectory(path);
    }
}

// Load the images
var images = Directory.GetFiles(originalImages).Select(imagePath => Image.Load<Rgb24>(imagePath)).ToArray();

You might notice we’re using the Rgb24 format. This means each pixel is stored as three 8-bit values—one for red, green, and blue. It’s a simple, efficient way to handle color images and works seamlessly with most image processing tasks.
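To make that concrete, here's a minimal sketch (using the images array loaded above) that reads the three channel values of a single pixel:

// Peek at the top-left pixel of the first loaded image.
// Each Rgb24 pixel exposes its channels as three bytes (0-255).
var firstPixel = images[0][0, 0]; // indexer is [x, y]
Console.WriteLine($"R={firstPixel.R}, G={firstPixel.G}, B={firstPixel.B}");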


Model Loading

Now it’s time to load the YuNet ONNX model. We use ONNX Runtime to create an inference session, and we also take a look at the model’s metadata. This tells us what kind of data the model expects and what it will give us back.

// Load the YuNet Model and create a new inference session
// Model Download Link: https://huggingface.co/opencv/face_detection_yunet/blob/main/face_detection_yunet_2023mar.onnx

var session = new InferenceSession("./models/face_detection_yunet_2023mar.onnx");

(session.ModelMetadata, session.InputMetadata, session.OutputMetadata)

Why does this matter? Well, the metadata helps us make sure we’re feeding the model the right kind of data (like the correct image size and color order) and helps us interpret the results. YuNet is a great choice here because it’s fast, accurate, and has a strong community behind it—so you’re not alone if you run into questions.
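For example, a quick way to sanity-check the expected input shape in the notebook is to enumerate the metadata dictionaries directly (a small sketch using the session created above):

// Print each model input and output with its element type and expected dimensions.
foreach (var (name, meta) in session.InputMetadata)
    Console.WriteLine($"input  {name}: {meta.ElementType} [{string.Join(", ", meta.Dimensions)}]");

foreach (var (name, meta) in session.OutputMetadata)
    Console.WriteLine($"output {name}: {meta.ElementType} [{string.Join(", ", meta.Dimensions)}]");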


Image Preprocessing

Before we can run our images through the model, we need to get them into the right shape and format. That means resizing them to 640x640 pixels and converting them to a float array in NCHW format (that’s batch, channel, height, width).

// Preprocessing implemented by inspecting the Netron graph for the downloaded model
int N = images.Length;

const int H = 640;
const int W = 640;

const long imageDataLength = 3 * H * W; // C*H*W

long[] inputShape = [1, 3, H, W]; // this model doesn't support batching

// convert images to OrtValue in NCHW format
float[][] inputs = Enumerable.Range(0, N).Select(_ => new float[imageDataLength]).ToArray();

Parallel.For(0, N, n => {
    var image = images[n];
    // resize the image to match the model's input dimensions
    image.Mutate(ctx => ctx.Resize(new ResizeOptions{
            Mode = ResizeMode.BoxPad,
            Size = new(W, H) // Size(width, height)
        }));

    // save the resized image
    image.SaveAsJpeg($"{resizedImages}/{n}.jpg");

    // populate the inputs using image pixel data
    Span<Rgb24> pixels = new Rgb24[H * W];
    image.CopyPixelDataTo(pixels);

    for (int idx = 0; idx < pixels.Length; idx++) {
        inputs[n][idx] = pixels[idx].B;
        inputs[n][idx + pixels.Length] = pixels[idx].G;
        inputs[n][idx + 2 * pixels.Length] = pixels[idx].R;
    }
});

If you’re curious about how the model is structured, check out Netron—it’s a fantastic tool for visualizing ONNX models. Also, you might see terms like NCHW and NHWC; these are just different ways of organizing image data. YuNet expects NCHW, and since it’s part of the OpenCV family, it uses BGR color order. We use a float[][] array because the model doesn’t support batching, so each image is handled separately. And by using Parallel.For, we can process multiple images at once, making things much faster than a regular for loop.
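To make the layout concrete, here is a small sketch of the index arithmetic the loop above relies on: in a planar NCHW buffer, every blue value comes first, then every green, then every red, so a single pixel's channel value lives at offset channel * H * W + y * W + x (the helper name below is just for illustration):

// Flat offset into a single-image planar NCHW buffer.
// channel: 0 = B, 1 = G, 2 = R here, matching YuNet's BGR channel order.
int NchwOffset(int channel, int y, int x, int height, int width)
    => channel * height * width + y * width + x;

// Example: the green value of the pixel at (x: 10, y: 5) in a 640x640 image
// sits at offset 1*640*640 + 5*640 + 10 in the flattened float array.
var greenOffset = NchwOffset(1, 5, 10, H, W);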


Running Inference

With our images prepped, we're ready to run them through the model. We use ONNX Runtime's OrtValue to pass data to the model efficiently, and after inference we parse each image's results into a dictionary, stored by image index so the outputs stay aligned with the inputs.

// run the inference
// store results by image index (List<T>.Add is not thread-safe inside Parallel loops,
// and we need the output order to match the input images)
var results = new Dictionary<string, float[]>[N];

Parallel.For(0, N, n => {

    using var ortValue = OrtValue.CreateTensorValueFromMemory(inputs[n], inputShape);

    var sessionInputs = new Dictionary<string, OrtValue> { { session.InputNames[0], ortValue } };

    using var runOptions = new RunOptions();
    using var sessionOutputs = session.Run(runOptions, sessionInputs, session.OutputNames);

    results[n] = new Dictionary<string, float[]>(
        session.OutputNames.Select((name, index) => {
            var output = sessionOutputs[index];
            return new KeyValuePair<string, float[]>(name, output.GetTensorDataAsSpan<float>().ToArray());
        }
    ));
});

OrtValue acts as a bridge between our .NET code and the ONNX Runtime engine. Once inference is done, we get the outputs for each image as a dictionary mapping output names to float arrays, which makes it easy to grab exactly the data we need for the next step.
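If you want to confirm what came back, a quick notebook check is to list each output name and how many values it holds (a minimal sketch against the results array built above; the exact names and sizes depend on the model):

// Inspect the raw outputs for the first image: output name -> number of floats.
foreach (var (name, values) in results[0])
{
    Console.WriteLine($"{name}: {values.Length} values");
}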


Post-processing Model Output

The model gives us a lot of raw data, but we need to turn that into something useful—like bounding boxes around faces. We do this by applying anchor-based calculations and confidence thresholds, filtering out anything that’s not likely to be a face.

// YuNet Post Processing Reference (Method - PostProcess): https://github.com/opencv/opencv/blob/b8e3bc9dd866b028e33b769e3c0992fc2b55a660/modules/objdetect/src/face_detect.cpp

// post processing result and applying confidence threshold
public struct BoundingBox{
    public int X1;
    public int Y1;
    public int X2;
    public int Y2;
    public float Confidence;
}

List<BoundingBox>[] boundingBoxes = Enumerable.Range(0, N).Select(_ => new List<BoundingBox>()).ToArray();

int[] strides = [8, 16, 32];

const float scoreThreshold = 0.7f;

Parallel.For(0, N, n => {
    var result = results[n];

    foreach (int stride in strides) {

        int cols = W / stride;
        int rows = H / stride;

        float[] cls = result[$"cls_{stride}"];
        float[] obj = result[$"obj_{stride}"];
        float[] bbox = result[$"bbox_{stride}"];

        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {

                int idx = r * cols + c;

                var score = (float)Math.Sqrt(Math.Clamp(cls[idx], 0, 1)*Math.Clamp(obj[idx], 0, 1));

                if (score < scoreThreshold) {
                    continue;
                }

                float cx = ((c + bbox[idx * 4 + 0]) * stride);
                float cy = ((r + bbox[idx * 4 + 1]) * stride);
                float w = (float)Math.Exp(bbox[idx * 4 + 2]) * stride;
                float h = (float)Math.Exp(bbox[idx * 4 + 3]) * stride;

                float x1 = cx - w / 2;
                float y1 = cy - h / 2;

                var boundingBox = new BoundingBox()
                {
                    Confidence = score,
                    X1 = (int)Math.Round(x1),
                    Y1 = (int)Math.Round(y1),
                    X2 = (int)Math.Round(x1 + w),
                    Y2 = (int)Math.Round(y1 + h)
                };

                boundingBoxes[n].Add(boundingBox);
            }
        }
    }
    boundingBoxes[n] = boundingBoxes[n].OrderByDescending(box => box.Confidence).ToList();
});

If you want to dig deeper into how this works, the OpenCV YuNet implementation is a great resource. But in short, we’re turning the model’s output into a list of possible faces, sorted by how confident the model is in each one.


Non-Maximum Suppression (NMS)

Sometimes, the model finds more than one box for the same face. To clean this up, we use Non-Maximum Suppression (NMS). This algorithm keeps only the box with the highest confidence for each face, so we don’t end up with duplicates.

// helper function to compute IoU
float CalculateIoU(BoundingBox boxA, BoundingBox boxB) {
    // box of intersection
    int iX1 = Math.Max(boxA.X1, boxB.X1);
    int iX2 = Math.Min(boxA.X2, boxB.X2);
    int iY1 = Math.Max(boxA.Y1, boxB.Y1);
    int iY2 = Math.Min(boxA.Y2, boxB.Y2);

    // area of intersection
    float areaOfIntersection = Math.Max(0, iX2 - iX1) * Math.Max(0, iY2 - iY1);

    // area of union
    float areaOfUnion = (boxA.X2 - boxA.X1) * (boxA.Y2 - boxA.Y1) + (boxB.X2 - boxB.X1) * (boxB.Y2 - boxB.Y1) - areaOfIntersection;

    return areaOfUnion == 0 ? 0 : areaOfIntersection/areaOfUnion;
}

// Non-Maximum Suppression (NMS)
const float nmsThreshold = 0.5f;

var bestDetections = new BoundingBox[N];

Parallel.For(0, N, n => {

    var boundingBoxSet = boundingBoxes[n];

    var kept = new List<BoundingBox>();
    bool[] suppressed = new bool[boundingBoxSet.Count];

    for (int i = 0; i < boundingBoxSet.Count; i++)
    {
        if (suppressed[i])
        {
            continue;
        }

        // this box survives NMS
        kept.Add(boundingBoxSet[i]);

        // Suppress all subsequent boxes that overlap this one too much.
        for (int j = i + 1; j < boundingBoxSet.Count; j++)
        {
            if (suppressed[j])
            {
                continue;
            }

            float iou = CalculateIoU(boundingBoxSet[i], boundingBoxSet[j]);

            if (iou > nmsThreshold)
            {
                suppressed[j] = true;
            }
        }
    }

    // the candidates are sorted by confidence, so the first surviving box is the most prominent face
    if (kept.Count > 0)
    {
        bestDetections[n] = kept[0];
    }
});

bestDetections

NMS works by comparing all the boxes and removing any that overlap too much with a higher-confidence box, so each face is marked just once. Because our candidates are already sorted by confidence, the first surviving box is the most prominent face, and that is the detection we carry forward to cropping.
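As a quick sanity check of the IoU helper, here is a small worked example with two hypothetical 100x100 boxes offset by 50 pixels: they share a 50x50 intersection (2,500 px) over a union of 17,500 px, so the IoU lands around 0.14, well below the 0.5 threshold.

// Worked example (hypothetical boxes): IoU = 2,500 / 17,500 ≈ 0.14
var boxA = new BoundingBox { X1 = 0,  Y1 = 0,  X2 = 100, Y2 = 100, Confidence = 0.9f };
var boxB = new BoundingBox { X1 = 50, Y1 = 50, X2 = 150, Y2 = 150, Confidence = 0.8f };

Console.WriteLine(CalculateIoU(boxA, boxB)); // ~0.1428 -> below nmsThreshold, neither box suppresses the other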


Visualization and Cropping

Now for the fun part—seeing the results! We draw bounding boxes on the detected faces and save these images. We also crop out the face region, resize it, and save it for the next stage of the pipeline.

Parallel.For(0, N, n => {
    var bestDetection = bestDetections[n];
    var image = images[n];

    var width = bestDetection.X2 - bestDetection.X1;
    var height = bestDetection.Y2 - bestDetection.Y1;

    var cropDimension = Math.Min( Math.Max(W, H) - 12, Math.Max(width, height) + 12);

    var wDiff = cropDimension - width;
    var hDiff = cropDimension - height;

    // clamp the crop origin so the square stays inside the 640x640 frame
    var boundingBoxRectangle = new Rectangle(
        Math.Clamp(bestDetection.X1 - wDiff/2, 0, W - cropDimension),
        Math.Clamp(bestDetection.Y1 - hDiff/2, 0, H - cropDimension),
        cropDimension,
        cropDimension
    );

    var mutableImage = image.Clone();
    // draw the bounding box on detected images
    mutableImage.Mutate(ctx => {
        ctx.Draw(Color.Red, 3, boundingBoxRectangle);
    });
    mutableImage.SaveAsJpeg($"{detectedImages}/{n}.jpg");

    mutableImage.Dispose();

    // crop the images, enlarge them and save copies
    image.Mutate(ctx => {
        ctx.Crop(boundingBoxRectangle);
        ctx.Resize(256, 256);
    });

    image.SaveAsJpeg($"{croppedImages}/{n}.jpg");
});

Resizing the cropped face image isn’t just for looks—it ensures that every face is the same size, which is super important for the next step: landmark detection. Consistent input sizes make downstream models more accurate and easier to work with.


Cleanup

Finally, we dispose of the inference session and all loaded images to free up resources.

// Clean Up
session.Dispose();
foreach(var image in images) { image.Dispose(); }

Conclusion

This workflow demonstrates how to use ML.NET and ONNX Runtime to perform face detection with the YuNet model, including all steps from data preparation to visualization. By following this approach, you can efficiently detect and crop faces in images, paving the way for further tasks such as eye state analysis or emotion recognition. The modular structure allows for easy adaptation to other models or image processing tasks.



Next in the Series - Eye State Detection in ML.NET - Part 2/2


This blog was written with AI assistance.
You can also read this post on Medium.
