
ASMNet: A Lightweight Deep Neural Network for Face Alignment and Pose Estimation

This article explains ASMNet, a lightweight Convolutional Neural Network (CNN) for facial landmark points detection (a.k.a. face alignment) and face pose estimation in the wild.

The code and the pre-trained models are available on GitHub here. You can also read the original paper here.

Introduction

Facial landmark points detection is an essential task in many facial image analyses and applications. It is crucial for facial image alignment, face recognition, pose estimation, and facial expression recognition. Several methods have been proposed for facial landmark points detection, such as Constrained Local Model based methods [1, 2], Active Appearance Models (AAM) [3, 4], part models [5], and Deep Learning (DL) based methods [6, 7]. Although DL-based methods are considered state of the art, facial landmark points detection is still challenging for faces with large pose variations. Moreover, the price paid for high accuracy is usually a rise in computational complexity and a drop in efficiency.

Furthermore, the information contained in the features of a deep neural network is distributed hierarchically: lower layers contain information about edges and corners, and hence are more appropriate for localization tasks such as facial landmark points detection and pose estimation, while deeper layers contain more abstract information that is more suitable for classification tasks. Inspired by the idea of multi-task learning, we design our CNN model, and hence the associated loss function, to learn multiple correlated tasks simultaneously.

Recent methods have focused on improving accuracy, which is normally achieved by introducing new layers and increasing the number of parameters, at the cost of longer inference times. These methods are accurate and successful in desktop and server applications, but with the growth of IoT, mobile devices, and robotics, there is a growing need for algorithms that are both accurate and efficient.

We propose a new network structure that is inspired by MobileNetV2 and is specifically designed for facial landmark points detection, with the focus on making the network shallow and small without losing much accuracy. To achieve this goal, we propose a new loss function that employs ASM as an assistant loss and uses multi-task learning to improve accuracy. Fig. 1 depicts the general framework of our proposed idea. We tested our method on the challenging 300W [8] dataset and the Wider Facial Landmarks in the Wild (WFLW) [9] dataset. Our experimental results show that the accuracy of facial landmark points detection and pose estimation is comparable with state-of-the-art methods, while the network is about 2 times smaller than MobileNetV2.

ASM Network

Fig. 1: ASM Network (Image by author)

We have designed a network that is about two times smaller than MobileNetV2, both in terms of the number of parameters and FLOPs. In designing ASMNet, we only use the first 15 blocks of MobileNetV2, while the original architecture has 16 blocks. However, simply making the network shallower would lower the final accuracy of the system. To avoid this problem, we purposefully add a few new layers. Fig. 1 shows the architecture of ASMNet.

As noted earlier, in a CNN the lower layers capture features such as edges and corners, which are more suitable for tasks like landmark localization and pose estimation, while the deeper layers contain more abstract features that are more suitable for tasks like image classification and object detection. Accordingly, training a network for correlated tasks simultaneously builds a synergy that can improve the performance of each task.

Accordingly, we designed a multi-task CNN to detect facial landmarks and estimate the pose of the face (pitch, roll, and yaw) simultaneously. In order to use features from different layers, we create shortcuts from block-1-batch-normalization, block-3-batch-normalization, block-6-batch-normalization, block-10-batch-normalization, and block-13-batch-normalization. We connect each of these shortcuts to the output of block 15 of MobileNetV2, block-15-add, using a global average pooling layer, and finally concatenate all the pooled outputs. This architecture enables us to use features that are available in different layers of the network while keeping the number of FLOPs small. In other words, since the original MobileNetV2 is designed for image classification, where only the more abstract features are required, it is not ideal for face alignment, which needs both the abstract features available in the deeper layers and low-level features such as edges and corners available in the lower layers.
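To make the idea concrete, here is a minimal tf.keras sketch of this multi-branch design. It is an illustration only: the tapped layer names follow the stock Keras MobileNetV2 implementation rather than the repository's own model definition, and the head sizes are assumptions.

import tensorflow as tf

def build_asmnet_sketch(num_landmarks=68, input_shape=(224, 224, 3)):
    backbone = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=None)
    # Tap features at several depths (blocks 1, 3, 6, 10, 13, and 15).
    tap_names = ['block_1_project_BN', 'block_3_project_BN',
                 'block_6_project_BN', 'block_10_project_BN',
                 'block_13_project_BN', 'block_15_add']
    taps = [backbone.get_layer(name).output for name in tap_names]
    # Global average pooling turns each tapped feature map into a vector
    # regardless of its spatial size, keeping the FLOP count small.
    pooled = [tf.keras.layers.GlobalAveragePooling2D()(t) for t in taps]
    features = tf.keras.layers.Concatenate()(pooled)
    # Two heads: (x, y) per landmark, and (yaw, pitch, roll) for pose.
    landmarks = tf.keras.layers.Dense(num_landmarks * 2, name='landmarks')(features)
    pose = tf.keras.layers.Dense(3, name='pose')(features)
    return tf.keras.Model(backbone.input, [landmarks, pose])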

Furthermore, we add another correlated task to the network. As Fig. 1 shows, the proposed network predicts 2 different outputs: the facial landmark points (the main output of the network) and the face pose. While the correlation and synergy between these two tasks can improve accuracy on both, we also wanted our lightweight ASMNet to predict face pose on its own, so that it can be used in more applications.

ASM Assisted Loss Function

We first review the Active Shape Model (ASM) algorithm and then we explain our customized loss function based on ASM that improves the accuracy of the network.

Active Shape Model Review

Active Shape Model (ASM) is a statistical model of shape. Each shape is represented by n points, forming the set S defined in Eq. 1:
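In the usual ASM notation, S stacks the n coordinate pairs:

S = \{ x_1, y_1, x_2, y_2, \dots, x_n, y_n \} \quad \text{(Eq. 1)}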

To simplify the problem and learn shape components, Principal Component Analysis (PCA) is applied to the covariance matrix calculated from a set of K training shape samples. Once the model is built, an approximation of any training sample (S) is calculated using Eq. 2:
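In the standard ASM formulation, with \bar{S} the mean shape over the K training samples and V the matrix of the leading eigenvectors, this approximation is:

S \approx \bar{S} + V b \quad \text{(Eq. 2)}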

Consequently, a set of parameters of a deformable model is defined by the vector b, so that by varying the elements of the vector, the shape of the model changes. Suppose the statistical variance (i.e., eigenvalue) of the ith parameter of b is λ_i. To make sure the generated shape after applying ASM is relatively similar to the ground truth, the parameter b_i of vector b is usually limited to ±3√λ_i [7]. This constraint ensures that the generated shape is similar to those in the original training set. Hence, we create a new shape S_New after applying this constraint, according to Eq. 3:
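Mirroring Eq. 2 with the constrained parameters:

S_{New} = \bar{S} + V \tilde{b} \quad \text{(Eq. 3)}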

where b̃ is the constrained b. We also define the ASM operator according to Eq. 4:
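In a form consistent with how it is used below, the operator maps each point of an input shape to its constrained ASM reconstruction:

A_i = \mathrm{ASM}(P_i), \quad i = 1, \dots, n \quad \text{(Eq. 4)}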

ASM transforms each input point (P_i^x, P_i^y) to a new point (A_i^x, A_i^y) using Eqs. 1, 2, and 3.
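The ASM operator itself fits in a few lines of NumPy. This is a minimal sketch assuming flattened (x1, y1, ..., xn, yn) shape vectors and a precomputed PCA model; the function name and signature are illustrative, not the repository's API.

import numpy as np

def asm_operator(shape, mean_shape, eigenvectors, eigenvalues):
    # shape, mean_shape: arrays of length 2n; eigenvectors: (2n, k) matrix
    # of the k leading components; eigenvalues: array of length k.
    b = eigenvectors.T @ (shape - mean_shape)   # project onto PCA basis (Eq. 2)
    limit = 3.0 * np.sqrt(eigenvalues)          # the +/- 3*sqrt(lambda_i) bound
    b_tilde = np.clip(b, -limit, limit)         # constrained parameters
    return mean_shape + eigenvectors @ b_tilde  # reconstructed shape (Eq. 3)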

Fig. 2: ASM Loss (Image by author)

ASM Assisted Loss

We describe the loss functions for two different tasks. These tasks are responsible for facial landmark points detection and pose estimation.

Facial landmark points detection task: The common loss function for facial landmark points detection is Mean Square Error (MSE). We propose a new loss function that includes MSE as the main loss, together with an assistant loss, called ASM-LOSS, which utilizes ASM to improve the accuracy of the network.

The proposed ASM-LOSS guides the network to first learn the smoothed distribution of the facial landmark points. In other words, during the training process, the loss function compares the predicted facial landmark points with their corresponding ground truth as well as with the smoothed version of the ground truth, which is generated using ASM. Accordingly, in the early stages of training, we assign a larger weight to the ASM-LOSS than to the main loss (MSE), since the variation of the smoothed facial landmark points is much lower than that of the original landmark points and, as a rule of thumb, is easier for a CNN to learn. Then, by gradually decreasing the weight of the ASM-LOSS, we lead the network to focus more on the original landmark points. In practice, we found that this method, which can also be viewed as a form of transfer learning, works well and results in more accurate models.

We also discovered that although face pose estimation relies heavily on face alignment, it can achieve good accuracy with the assistance of smoothed facial landmark points as well. In other words, if the performance of the facial landmark points detection task is acceptable, meaning the network predicts facial landmarks such that the overall shape of the face is correct, pose estimation can achieve good accuracy. Accordingly, using smoothed landmark points and training the network with ASM-LOSS also results in higher accuracy on the pose estimation task.

Consider that for each image in the training set, there exist n landmark points in a set G, such that (G_i^x, G_i^y) is the coordinate pair of the ith landmark point. Similarly, the predicted set P contains n points, such that (P_i^x, P_i^y) is the predicted coordinate pair of the ith landmark point.

We apply PCA on the training set and calculate the eigenvectors and eigenvalues. Then, we calculate the set A, which contains n points, each the transformation of the corresponding point in G under the ASM operator of Eq. 4:
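That is, in the notation of Eq. 4:

A = \{ A_1, \dots, A_n \}, \quad A_i = \mathrm{ASM}(G_i)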

We define the main facial landmark points loss, Eq. 7, as the Mean Square Error between the ground truth (G) and the predicted landmark points (P):
where N is the total number of images in the training set and G_i^j denotes the ith landmark of the jth sample. We calculate ASM-LOSS as the error between the ASM points (A) and the predicted landmark points (P) using Eq. 8:
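Analogously:

L_{asm} = \frac{1}{nN} \sum_{j=1}^{N} \sum_{i=1}^{n} \left\| A_i^j - P_i^j \right\|_2^2 \quad \text{(Eq. 8)}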

Finally, we calculate the total loss for the facial landmark task according to Eq. 9:
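This is a weighted sum, with α the ASM-LOSS weight described below:

L_{landmark} = L_{mse} + \alpha \, L_{asm} \quad \text{(Eq. 9)}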

The ASM points (A) depend heavily on the accuracy of PCA: the more accurate the PCA model, the smaller the discrepancy between the ground truth (G) and the ASM points (A). To be more specific, as the accuracy of PCA decreases, the generated ASM points (A) become more similar to the average point set, the average of all the ground-truth face shapes in the training set. Consequently, predicting the points in A is easier than predicting the points in G, since the variation of the former is lower than that of the latter. We use this property to design our loss function such that we first guide the network towards learning the distribution of the smoothed landmark points, which is easier to learn, and gradually harden the problem by decreasing the weight of the ASM-LOSS. We define the ASM-LOSS weight α using Eq. 10:
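The schedule is piecewise, matching the description below; the initial weight of 2 is taken from the paper:

\alpha = \begin{cases} 2, & i \le l/3 \\ 1, & l/3 < i \le 2l/3 \\ 0.5, & \text{otherwise} \end{cases} \quad \text{(Eq. 10)}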

where i is the epoch number and l is the total number of training epochs. As Eq. 10 shows, at the beginning of training the value of α is higher, which means we put more emphasis on the ASM-LOSS; hence the network focuses on the simpler task first and converges faster. After one-third of the total epochs, we reduce α to 1 and put equal emphasis on the main MSE loss and the ASM-LOSS. Finally, after two-thirds of the total epochs, by reducing α to 0.5, we direct the network toward predicting the main ground truth, while still using the smoothed points generated by ASM as an assistant.
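Putting Eqs. 7 to 10 together, here is a minimal TensorFlow sketch of the landmark loss and the α schedule, assuming batched tensors of flattened landmarks and ASM points precomputed with the ASM operator above:

import tensorflow as tf

def asm_loss_weight(epoch, total_epochs):
    # The piecewise schedule for alpha described above (Eq. 10).
    if epoch <= total_epochs / 3:
        return 2.0
    elif epoch <= 2 * total_epochs / 3:
        return 1.0
    return 0.5

def landmark_loss(gt_points, asm_points, pred_points, alpha):
    l_mse = tf.reduce_mean(tf.square(gt_points - pred_points))   # Eq. 7
    l_asm = tf.reduce_mean(tf.square(asm_points - pred_points))  # Eq. 8
    return l_mse + alpha * l_asm                                 # Eq. 9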

Pose estimation task: We use Mean Square Error to calculate the loss for the head pose estimation task. Eq. 11 defines the loss function L_pose, where y_p, p_p, and r_p are the predicted yaw, pitch, and roll, and y_t, p_t, and r_t are the corresponding ground truths.
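In standard MSE form over the three angles:

L_{pose} = \frac{1}{3N} \sum_{j=1}^{N} \left[ (y_p^j - y_t^j)^2 + (p_p^j - p_t^j)^2 + (r_p^j - r_t^j)^2 \right] \quad \text{(Eq. 11)}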

Implementation Details

The code is available on GitHub here, along with all the documentation.

Installing the requirements

In order to run the code, you need Python >= 3.5. The required libraries can be installed using the following command:

pip install -r requirements.txt

Using the pre-trained models

You can test and use the pre-trained models using the following codes which are available in the following file: https://github.com/aliprf/ASMNet/blob/master/main.py

tester = Test()
tester.test_model(ds_name=DatasetName.w300,
                  pretrained_model_path='./pre_trained_models/ASMNet/ASM_loss/ASMNet_300W_ASMLoss.h5')

Training Network from scratch

Preparing Data

Data needs to be normalized and saved in npy format.

PCA creation

You can use the PCAUtility class in pca_utility.py to create the eigenvalues, eigenvectors, and the mean vector:

pca_calc = PCAUtility()
pca_calc.create_pca_from_npy(dataset_name=DatasetName.w300,
                             labels_npy_path='./data/w300/normalized_labels/',
                             pca_percentages=90)

Training

The training implementation is located in the Train class in train.py. You can use the following code to start training:

trainer = Train(arch=ModelArch.ASMNet,
                dataset_name=DatasetName.w300,
                save_path='./',
                asm_accuracy=90)

Results

ASMNet performs face alignment and pose estimation with high visual accuracy. Please visit my GitHub repository or the original paper for sample images.

Conclusion

We proposed ASMNet, a lightweight CNN architecture that uses multi-task learning for facial landmark points detection and pose estimation. We also proposed an ASM-assisted loss function that increases the network's accuracy. We built our network, ASMNet, using a small portion of MobileNetV2. The proposed architecture is about 2 times smaller than MobileNetV2, while the accuracy remains comparable.

Please cite this work as:

@inproceedings{fard2021asmnet,
  title={ASMNet: A Lightweight Deep Neural Network for Face Alignment and Pose Estimation},
  author={Fard, Ali Pourramezan and Abdollahi, Hojjat and Mahoor, Mohammad},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={1521--1530},
  year={2021}
}

References

[1] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Robust discriminative response map fitting with constrained local models. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3444–3451, 2013.

[2] D. Cristinacce and T. F. Cootes. Feature detection and tracking with constrained local models. In Bmvc, volume 1, page 3. Citeseer, 2006.

[3] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. In European conference on computer vision, pages 484–498. Springer, 1998.

[4] P. Martins, R. Caseiro, and J. Batista. Generative face alignment through 2.5 d active appearance models. Computer Vision and Image Understanding, 117(3):250–268, 2013.

[5] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2879–2886, June 2012.

[6] J. Zhang, S. Shan, M. Kan, and X. Chen. Coarse-to-fine autoencoder networks (cfan) for real-time face alignment. In European conference on computer vision, pages 1–16. Springer, 2014.

[7] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In European conference on computer vision, pages 94–108. Springer, 2014.

[8] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 397–403, 2013.

[9] W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou. Look at boundary: A boundary-aware face alignment algorithm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2129–2138, 2018.
