Detecting DeepFakes and other image forgery methods

Jun 10, 2021

Advances in image forgery methods such as deepfakes call for matching advances in forgery detection. This article summarizes the contributions of the FaceForensics++ paper on detecting manipulated facial images.

Problem

Synthetic image generation and manipulation have progressed to the point where they raise significant concerns about their implications for society. At best, this leads to a loss of trust in digital content; at worst, it could cause further harm by spreading false information or fake news.

Solutions and contributions

The paper examined the realism of state-of-the-art image manipulations, and how difficult it is to detect them, either automatically or by humans. In doing so, the following contributions were made:

1. FaceForensics++: A novel large-scale database of over 1.8 million images for forgery detection was generated.

2. An evaluation of state-of-the-art forgery detection (both human and automatic) tailored to facial manipulations was performed.

FaceForensics++

Classical forensics datasets have been created with significant manual effort under very controlled conditions. While several proposed datasets include image manipulations, only a few of them also address the important case of video footage. The most extensive dataset for generic image manipulation consists of 50,000 forged images (both local and global manipulations) and around 500 forged videos.

In contrast, FaceForensics++, an extension of the FaceForensics dataset, contains 1.8 million images from 4,000 fake videos. To imitate realistic scenarios, videos were collected in the wild, specifically from YouTube. After some manual screening, 1,000 video sequences containing 509,914 images were selected as the pristine data.

To generate forged data, two computer graphics-based approaches (Face2Face and FaceSwap) and two learning-based approaches (DeepFakes and NeuralTextures) were chosen. All four methods require source and target actor video pairs as input. The final output of each method is a video composed of generated images.

FaceSwap

FaceSwap is a graphics-based approach that transfers the face region from a source image to a target image. It takes source and target face images as input and projects the source face onto the target face by leveraging landmarks on each face. To generate forged videos, the 1,000 pristine videos are divided into source and target pairs, after which FaceSwap is applied to each video pair frame by frame.
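The landmark-based projection can be illustrated with a much simpler stand-in. The sketch below fits a 2D similarity transform (scale, rotation, translation) that maps source landmarks onto target landmarks in the least-squares sense. It is not the FaceSwap implementation: the function names are hypothetical, and reducing the projection to a 2D similarity transform is an assumption made only to keep the example short.

```python
def fit_similarity(src_pts, dst_pts):
    """Least-squares similarity transform mapping src landmarks onto dst.

    Points are (x, y) tuples. Each point is treated as a complex number, so
    the transform is z -> a * z + b, where `a` encodes scale and rotation
    and `b` encodes translation.
    """
    src = [complex(x, y) for x, y in src_pts]
    dst = [complex(x, y) for x, y in dst_pts]
    n = len(src)
    src_mean = sum(src) / n
    dst_mean = sum(dst) / n
    # Closed-form least-squares solution on the centered landmarks.
    num = sum((d - dst_mean) * (s - src_mean).conjugate()
              for s, d in zip(src, dst))
    den = sum(abs(s - src_mean) ** 2 for s in src)
    a = num / den
    b = dst_mean - a * src_mean
    return a, b


def apply_similarity(a, b, pts):
    """Apply z -> a * z + b to a list of (x, y) points."""
    out = []
    for x, y in pts:
        z = a * complex(x, y) + b
        out.append((z.real, z.imag))
    return out


# Made-up landmarks: the target is the source scaled by 2, rotated by
# 90 degrees, and shifted by (3, 4).
source_landmarks = [(0, 0), (1, 0), (0, 1)]
target_landmarks = [(3, 4), (3, 6), (1, 4)]
a, b = fit_similarity(source_landmarks, target_landmarks)
projected = apply_similarity(a, b, source_landmarks)
```

With the transform in hand, every pixel coordinate of the source face region could be mapped into the target frame; blending the warped face into the target image is a separate step not shown here.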

Face2Face

Face2Face is a facial reenactment system that transfers the expressions of a source video to a target video while maintaining the identity of the target person. The original implementation is based on two video input streams, with manual keyframe selection. To generate forged videos, the first frames of the pristine videos are used to construct a temporary face identity (i.e., a 3D model), and the expressions are tracked over the remaining frames. From the source videos, “expression parameters” that describe the expression in each frame are extracted and transferred onto the target pristine videos to obtain forged reenactments.
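The idea of keeping the target identity while swapping in source expression parameters can be sketched with a toy linear face model. Everything here is an assumption for illustration: the article does not specify the form of the 3D model, so the linear combination of identity and expression bases, the function names, and the tiny three-value "shape" are all hypothetical.

```python
def synthesize(mean, id_basis, expr_basis, id_coeffs, expr_coeffs):
    """Toy linear face model.

    shape = mean + sum_i id_coeffs[i] * id_basis[i]
                 + sum_j expr_coeffs[j] * expr_basis[j]
    """
    shape = list(mean)
    for coeff, basis in zip(id_coeffs, id_basis):
        for k, value in enumerate(basis):
            shape[k] += coeff * value
    for coeff, basis in zip(expr_coeffs, expr_basis):
        for k, value in enumerate(basis):
            shape[k] += coeff * value
    return shape


def reenact(mean, id_basis, expr_basis, target_id, source_expr_per_frame):
    """Keep the target identity fixed; drive it with source expressions."""
    return [
        synthesize(mean, id_basis, expr_basis, target_id, expr)
        for expr in source_expr_per_frame
    ]


# Made-up model with one identity direction and two expression directions.
mean = [0.0, 0.0, 0.0]
id_basis = [[1.0, 0.0, 0.0]]
expr_basis = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target_identity = [2.0]                       # estimated from the target video
source_expressions = [[0.5, 0.0], [0.0, 1.0]]  # one entry per source frame
frames = reenact(mean, id_basis, expr_basis, target_identity, source_expressions)
```

The identity coefficients never change across frames, which is what preserves the target person's appearance while the expression follows the source.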

DeepFakes

DeepFakes are deep-learning-based methods for face replacement. There are various public implementations of DeepFakes available, most notably FakeApp and the faceswap GitHub repository. With these methods, a face in a target video can be replaced by a face that has been observed in a source video. To construct forged videos, source-target pairs of pristine videos are fed as input to the faceswap GitHub implementation.

NeuralTextures

NeuralTextures uses the original video data to learn a neural texture of the target person, together with a rendering network. To generate forged videos, the model is trained with a patch-based GAN loss, as used in Pix2Pix. This approach relies on tracked geometry at both training and test time, and the tracking module of Face2Face is used to generate this information. Only the facial expressions corresponding to the mouth region are modified, i.e., the eye region stays unchanged.
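A patch-based GAN loss in the Pix2Pix style scores many local patches of an image instead of producing one score per image, then averages the per-patch losses. The following is a minimal sketch of that averaging step only, assuming the discriminator has already produced a 2D grid of logits. The function names are hypothetical, and the exact loss used for NeuralTextures is not given in this article.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def patch_gan_loss(patch_logits, is_real):
    """Average the real/fake loss over a grid of per-patch logits.

    patch_logits: 2D list, one discriminator logit per image patch.
    is_real: True if every patch should be classified as real.
    """
    target = 1.0 if is_real else 0.0
    losses = [bce_with_logits(logit, target)
              for row in patch_logits for logit in row]
    return sum(losses) / len(losses)


# Made-up 2x2 grid of patch logits for one image.
logits = [[2.0, -1.0], [0.5, 3.0]]
loss_if_real = patch_gan_loss(logits, is_real=True)
loss_if_fake = patch_gan_loss(logits, is_real=False)
```

Because each logit only sees a local patch, the loss penalizes local texture artifacts rather than global image structure.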

Human forgery detection

To evaluate the performance of humans in the task of forgery detection, a user study was conducted with 204 participants, mostly computer science university students. After a short introduction to the binary task, users were instructed to classify randomly selected images from the test set, which consisted of forged and pristine images. Users were shown each image for a few seconds, after which they were asked whether the image was forged or real. With 60 images per participant, 12,240 human decisions were collected.
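The decision count and the per-category accuracy of such a study can be tallied as follows. This is a sketch only: the record layout and function name are hypothetical, and the sample decisions are made up for illustration, not taken from the study.

```python
from collections import defaultdict

PARTICIPANTS = 204
IMAGES_PER_PARTICIPANT = 60
TOTAL_DECISIONS = PARTICIPANTS * IMAGES_PER_PARTICIPANT  # 12240


def accuracy_by_category(decisions):
    """Fraction of correct answers per image category.

    decisions: iterable of (category, is_fake, answered_fake) records.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, is_fake, answered_fake in decisions:
        total[category] += 1
        correct[category] += int(is_fake == answered_fake)
    return {category: correct[category] / total[category] for category in total}


# Made-up decisions, for illustration only.
sample = [
    ("FaceSwap", True, True),
    ("FaceSwap", True, False),
    ("Pristine", False, False),
    ("NeuralTextures", True, False),
]
per_category = accuracy_by_category(sample)
```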

The user study contained fake images from all four manipulation methods as well as pristine images. In this setting, Face2Face and NeuralTextures were particularly difficult for human observers to detect: unlike the face replacement methods, they do not introduce a strong semantic change, only subtle visual artifacts. NeuralTextures seems especially difficult to detect, as human detection accuracy is below random chance and only increases in the challenging low-quality task.

Automatic forgery detection

Since the goal was to detect forgeries of facial imagery, additional domain-specific information that can be extracted from the input is utilized. To this end, a state-of-the-art face tracking method is used to track the face in the video and to extract the face region of each image. A conservative crop is taken around the center of the tracked face, enclosing the reconstructed face.
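A conservative crop can be sketched as an enlarged square around the center of the tracked face box, clipped to the image borders. The function name, the box format, and the enlargement factor of 1.3 are assumptions for illustration; the article does not state the exact cropping rule.

```python
def conservative_crop_box(face_box, image_size, scale=1.3):
    """Enlarged square crop around the center of a tracked face box.

    face_box: (x0, y0, x1, y1) in pixels, as produced by a face tracker.
    image_size: (width, height) of the frame.
    scale: enlargement factor (assumed value, not taken from the article).
    Returns (left, top, right, bottom), clipped to the image.
    """
    x0, y0, x1, y1 = face_box
    width, height = image_size
    center_x = (x0 + x1) / 2
    center_y = (y0 + y1) / 2
    # Use the longer side so the whole face stays inside the square.
    half = max(x1 - x0, y1 - y0) * scale / 2
    left = max(0, int(round(center_x - half)))
    top = max(0, int(round(center_y - half)))
    right = min(width, int(round(center_x + half)))
    bottom = min(height, int(round(center_y + half)))
    return left, top, right, bottom


# Hypothetical tracker output on a 640x480 frame.
box = conservative_crop_box((100, 100, 200, 200), (640, 480))
```

Enlarging the box leaves a margin around the face, so the transition between the manipulated region and the untouched background stays inside the crop.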

The first method used for forgery detection employs handcrafted features. These features are then used to train a linear Support Vector Machine (SVM) classifier. This technique was the winning approach in the first IEEE Image Forensic Challenge. A 128 × 128 central crop of the face is provided as input to the method. While the handcrafted method outperforms human accuracy on raw images by a large margin, it struggles to cope with compression, which leads to an accuracy below human performance for low-quality videos.
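The two mechanical pieces of this pipeline, the central crop and the linear SVM, can be sketched in plain Python. The handcrafted features themselves are not described in this article, so the example trains on made-up two-dimensional feature vectors. The function names and the simple subgradient-descent training loop are assumptions, not the method used in the paper.

```python
def central_crop(image, size=128):
    """Return the size x size center patch of an image given as a list of rows."""
    height, width = len(image), len(image[0])
    if height < size or width < size:
        raise ValueError("image is smaller than the requested crop")
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def train_linear_svm(features, labels, epochs=200, lr=0.1, reg=0.01):
    """Train a linear SVM by subgradient descent on the regularized hinge loss.

    features: list of feature vectors; labels: +1 (forged) or -1 (pristine).
    """
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularization shrinks the weights on every step.
            w = [wi * (1 - lr * reg) for wi in w]
            if margin < 1:
                # Hinge loss is active: move the boundary toward this sample's side.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Classify a feature vector as +1 (forged) or -1 (pristine)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Made-up, linearly separable toy features (not real forensic features).
toy_features = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
toy_labels = [-1, -1, 1, 1]
w, b = train_linear_svm(toy_features, toy_labels)
```

In practice, the feature vectors would be computed from each 128 × 128 crop before training, and an established SVM library would replace the hand-written training loop.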

For detection from learned features, five neural network architectures known from the literature are evaluated on the classification task. These methods are based on convolutional neural networks (Cozzolino, Bayar and Stamm, Rahmouni, MesoInception-4, XceptionNet).

Binary detection accuracy of all evaluated architectures on the different forgery methods, using face tracking, when trained on each manipulation method separately.

The above figure shows the results of the binary forgery detection task for all network architectures, evaluated separately on all four forgery methods and at different video quality levels. All approaches achieve very high performance on raw input data, but performance drops for compressed videos. The neural networks handle compression better than the handcrafted approach, with XceptionNet achieving compelling results under weak compression while still maintaining reasonable performance on low-quality images.

Binary precision values of the baselines when trained on all four manipulation methods simultaneously.

To compare the results of the user study with the performance of the automatic detectors, the detection variants were also tested on a dataset containing images from all manipulation methods. The automated detectors outperform humans by a large margin. To evaluate the benefit of the initial face tracking and conservative cropping, an XceptionNet was also trained on the full image, which results in much lower accuracy.

Conclusion

While current state-of-the-art facial image manipulation methods produce visually stunning results, the paper demonstrates that they can be detected by trained forgery detectors. It is particularly encouraging that even the challenging case of low-quality video can be tackled by learning-based approaches, where humans and handcrafted features exhibit difficulties.