The gap between what a camera can theoretically measure and what it reliably measures in practice has always been defined by signal processing. For years, that processing was handcrafted: color channel decomposition, blind source separation, bandpass filtering. Those methods work, but they hit a ceiling when the conditions get messy. Bad lighting, head movement, diverse skin tones, cheap webcams. The real world, basically.
Deep learning changed the trajectory. Not overnight, and not without its own problems, but the shift from engineered pipelines to learned representations has pushed rPPG accuracy into territory that would have seemed unrealistic five years ago. The question now is less about whether neural networks can extract vital signs from video and more about which architectures generalize beyond the lab.
"The accuracy rate of deep learning rPPG for discriminating atrial fibrillation from normal sinus rhythm reached 90.0% in 30-second recordings and 97.1% in 10-minute recordings." — Yan et al., cross-dataset rPPG generalization study (2023)
The architecture timeline
rPPG deep learning did not start with transformers. It started with convolutional networks that could learn spatial features from face crops, and it evolved through a series of design choices that each solved a specific limitation.
DeepPhys (Chen and McDuff, 2018), developed at the MIT Media Lab and Microsoft Research, was among the first end-to-end approaches. It uses a two-branch 2D CNN: one branch processes the current frame's appearance, the other processes the normalized difference between consecutive frames (the motion representation). An attention mechanism weights facial regions by signal quality. DeepPhys showed that a learned model could outperform traditional methods on controlled datasets, though its 2D architecture processes frames independently and misses longer temporal dependencies.
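The motion representation is simple enough to sketch. The normalized frame difference below follows the form d(t) = (c(t+1) - c(t)) / (c(t) + c(t+1)) described for DeepPhys-style motion branches; the clipping and standardization step is a common stabilization choice, not necessarily the authors' exact pipeline.

```python
import numpy as np

def motion_representation(frames: np.ndarray, eps: float = 1e-7) -> np.ndarray:
    """Normalized difference between consecutive frames, as fed to a
    DeepPhys-style motion branch: d(t) = (c(t+1) - c(t)) / (c(t) + c(t+1)).

    frames: float array of shape (T, H, W, C) with pixel values in [0, 1].
    Returns an array of shape (T-1, H, W, C).
    """
    num = frames[1:] - frames[:-1]
    den = frames[1:] + frames[:-1] + eps
    d = num / den
    # Clip outliers and standardize before the CNN (a common stabilization step)
    d = np.clip(d, -1.0, 1.0)
    return d / (d.std() + eps)

# Toy usage: 4 frames of 2x2 single-channel video
frames = np.random.rand(4, 2, 2, 1)
print(motion_representation(frames).shape)  # (3, 2, 2, 1)
```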
PhysNet (Yu et al., 2019) at the University of Oulu introduced 3D convolutions to the problem. By processing short video clips as spatiotemporal volumes, PhysNet captures temporal correlations across frames directly in its feature extraction. The 3D approach proved more effective than frame-by-frame processing for heart rate estimation, and PhysNet became a standard benchmark model. On the UBFC-rPPG dataset, it consistently ranks among the top supervised methods.
TS-CAN (Liu et al., 2020) from the University of Washington took a different approach to the temporal problem. Instead of computationally expensive 3D convolutions, TS-CAN uses temporal shift modules that shift feature channels along the time axis within standard 2D convolutions. This captures temporal information at a fraction of the computational cost. On UBFC-rPPG, TS-CAN achieves a mean absolute error of 1.29 BPM and a mean absolute percentage error of 1.50%, according to benchmark results compiled by Saikevičius et al. (2025).
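The shift operation itself costs no multiplications, which is the whole point. Below is an illustrative NumPy version of channel shifting, not the TS-CAN authors' implementation; the `fold_div` split follows the convention popularized by the original temporal shift module.

```python
import numpy as np

def temporal_shift(x: np.ndarray, fold_div: int = 8) -> np.ndarray:
    """Zero-multiply temporal mixing in the spirit of TSM/TS-CAN (a sketch).

    x: features of shape (T, C, H, W). A fraction of channels is shifted
    forward in time, an equal fraction backward; the rest stay in place.
    The following 2D convolution then sees neighboring frames' features.
    """
    t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                   # shift forward in time
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]   # shift backward in time
    out[:, 2 * fold:] = x[:, 2 * fold:]              # untouched channels
    return out

feats = np.random.rand(5, 16, 4, 4)
print(temporal_shift(feats).shape)  # (5, 16, 4, 4)
```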
EfficientPhys (Liu and McDuff, 2023) pushed further toward deployability. Building on the temporal shift concept, it strips away handcrafted preprocessing and pairs efficient convolutions with lightweight self-attention to produce compact models that run on mobile devices. The model is small enough for real-time inference on a smartphone GPU while maintaining competitive accuracy on standard benchmarks.
PhysFormer (Yu et al., CVPR 2022) brought transformer architecture to rPPG. Developed by Zitong Yu at the University of Oulu and collaborators at Oxford and the University of Hong Kong, PhysFormer uses temporal difference transformers with global spatio-temporal attention. The self-attention mechanism lets the model weigh contributions from distant frames and spatial regions simultaneously, something convolutional architectures struggle with. PhysFormer outperformed CNN-based methods on several benchmarks, particularly in scenarios with head movement.
Architecture comparison for rPPG vital sign extraction
| Architecture | Type | Temporal modeling | Computational cost | UBFC-rPPG MAE (BPM) | Key strength |
|---|---|---|---|---|---|
| DeepPhys | 2D CNN + attention | Frame differencing | Low | ~2.5 | Attention-based ROI weighting |
| PhysNet | 3D CNN | 3D convolutions | High | ~1.5 | Direct spatiotemporal features |
| TS-CAN | 2D CNN + temporal shift | Channel shifting | Low | ~1.29 | Efficient temporal capture |
| EfficientPhys | Lightweight 2D CNN | Temporal shift | Very low | ~1.8 | Mobile deployment |
| PhysFormer | Transformer | Self-attention | Moderate-high | ~1.1 | Global context modeling |
| Spiking-PhysFormer | Hybrid SNN + Transformer | Spike-driven attention | Low | Under evaluation | Energy efficiency |
| PhySU-Net | Transformer + self-supervised | Long temporal context | Moderate | Under evaluation | Self-supervised pre-training |
Sources: rPPG-Toolbox benchmarks (Liu et al., NeurIPS 2023), Saikevičius et al. (Electronics, 2025), Yu et al. (CVPR 2022).
Note: MAE values are approximate and depend on training protocol, data splits, and preprocessing. Direct comparison across papers requires caution.
The generalization problem
Benchmark numbers on UBFC-rPPG or PURE look good. Sometimes suspiciously good. The real test is what happens when you train on one dataset and test on another — different cameras, different lighting, different demographics. This is the cross-dataset generalization problem, and it remains the hardest unsolved challenge in deep learning rPPG.
Vance and Flynn (2023) measured domain shifts in deep learning rPPG and found that performance drops substantially when models encounter conditions outside their training distribution. A model trained on PURE (recorded with a specific camera setup in controlled lighting at Ilmenau University of Technology) may fail when tested on UBFC-rPPG (recorded with different equipment at the University of Burgundy), even though both datasets contain seated subjects at rest.
The causes of domain shift in rPPG are well-characterized:
- Camera sensor characteristics (color response, noise profile, frame rate) differ across devices and directly affect the raw signal
- Lighting spectrum and intensity change the relationship between skin color variations and blood volume changes
- Subject demographics (skin tone, age, facial hair) affect how much rPPG signal reaches the camera
- Compression artifacts in stored video destroy subtle color information that the models rely on
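The last point is easy to underestimate: the pulse is a fraction of one 8-bit gray level, so per-pixel it barely survives quantization at all. It is recoverable only because sensor noise dithers it and spatial averaging over many skin pixels re-exposes it. A toy simulation (all amplitudes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 30.0
t = np.arange(0, 10, 1 / fs)
pulse = 0.4 * np.sin(2 * np.pi * 1.2 * t)  # ~0.4 gray-level pulse at 72 BPM
skin = 120.0 + pulse                        # mean skin intensity in [0, 255]

# 2000 skin pixels, each with sensor noise, stored as 8-bit integers
pixels = skin[None, :] + rng.normal(0.0, 2.0, (2000, len(t)))
quantized = np.round(np.clip(pixels, 0, 255))

# Averaging across the ROI recovers the sub-gray-level waveform
roi_mean = quantized.mean(axis=0)
corr = np.corrcoef(pulse, roi_mean - roi_mean.mean())[0, 1]
print(corr > 0.8)  # True: the average tracks a pulse no single pixel resolves
```

Heavier compression, which correlates the per-pixel errors instead of leaving them independent, breaks exactly this averaging trick.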
Domain adaptation approaches
Several research groups are tackling generalization head-on. Zhang et al. (2025) proposed integrating explicit and implicit prior knowledge into the learning pipeline, using physics-based constraints (like the known spectral properties of hemoglobin absorption) alongside data-driven feature learning. The idea is that models anchored to physical priors should generalize better than purely data-driven approaches.
PhysTTT (2025) introduced test-time training for rPPG, where the model adapts its parameters to each new test video using self-supervised objectives. This avoids the need for labeled data from the target domain entirely — the model adjusts itself on the fly using signal quality metrics and physiological constraints.
UDA-rPPG (2025) applies unsupervised domain adaptation with a geometric-physiological approach, learning to align feature distributions across domains while preserving the physiological signal structure. Their highest-peak priority learning strategy anchors the adaptation process to the dominant cardiac frequency, preventing the model from drifting toward noise during adaptation.
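The "highest peak" anchor is essentially a spectral prior: heart rate must live in a narrow physiological band, and the dominant peak there is the signal to protect during adaptation. A minimal version of that prior, assuming a predicted pulse waveform and a known frame rate (band edges are a common convention, not a specific paper's values):

```python
import numpy as np

def dominant_heart_rate(signal: np.ndarray, fs: float) -> float:
    """Estimate heart rate as the highest spectral peak inside the
    physiological band (0.7-3.0 Hz, i.e. 42-180 BPM)."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
    band = (freqs >= 0.7) & (freqs <= 3.0)
    peak_hz = freqs[band][np.argmax(power[band])]
    return peak_hz * 60.0  # Hz -> BPM

# Synthetic pulse at 1.2 Hz (72 BPM), 30 fps, 10-second window
rng = np.random.default_rng(0)
fs = 30.0
t = np.arange(0, 10, 1 / fs)
sig = np.sin(2 * np.pi * 1.2 * t) + 0.3 * rng.standard_normal(len(t))
print(round(dominant_heart_rate(sig, fs)))  # 72
```

Note the frequency resolution: a 10-second window at 30 fps resolves 0.1 Hz bins, i.e. 6 BPM, which is why longer windows or peak interpolation matter in practice.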
PhysLLM (2025) takes a more unconventional path, using large language models for cross-modal physiological signal reasoning. By encoding rPPG signal characteristics as text descriptions, the approach leverages the broad knowledge of LLMs to interpret and validate physiological measurements — essentially using language understanding to improve signal understanding.
What these models actually learn
One of the more interesting questions in rPPG deep learning is what the networks are actually attending to. Attention maps from PhysFormer and similar models show that trained networks learn to focus on skin regions with strong pulsatile signals — the forehead, cheeks, and nose — while ignoring eyes, hair, and background. This is roughly what signal processing engineers would choose manually, but the models arrive at it from data alone.
Hu et al. (2024) introduced Residual and Coordinate Attention (RCA) modules that make this region selection explicit and adaptive. During movement or partial occlusion, some facial regions lose their signal. The RCA modules dynamically reweight contributions so the model prioritizes whatever region currently has the cleanest signal. This is especially relevant for real-world deployment where subjects are not sitting perfectly still.
Cheng et al. (2024) at Hong Kong University of Science and Technology systematically studied how different facial regions contribute to SpO2 estimation accuracy. Their finding that multi-region fusion outperforms any single region confirms what the attention-based models learn implicitly: redundancy across regions improves robustness.
Multi-task learning and beyond heart rate
Heart rate was the entry point, but deep learning rPPG has expanded to multiple vital signs extracted simultaneously from the same video. A Scientific Reports study (2025) demonstrated multi-task learning for simultaneous rPPG and respiratory rate estimation using complex-valued neural networks. The complex-domain approach captures phase information that real-valued networks discard, improving robustness across diverse skin tones and lighting conditions.
The multi-task framing is practical for two reasons. First, extracting multiple signals from the same face video amortizes the computational cost of facial detection, tracking, and preprocessing. Second, the vital signs are physiologically correlated — heart rate and respiratory rate covary during stress, for example — and joint training lets the model exploit those correlations.
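In its simplest form, joint training is just a weighted sum of per-task losses on shared features, so one scalar gradient shapes a backbone both tasks use. The sketch below is generic, not any cited paper's objective; the weight `w_rr` and the MSE choice are illustrative.

```python
import numpy as np

def multitask_loss(pred_hr, true_hr, pred_rr, true_rr, w_rr=0.5):
    """Weighted sum of per-task errors. A shared-backbone model
    backpropagates this single scalar, so heart-rate and respiratory-rate
    supervision both shape the same learned features."""
    hr_loss = np.mean((np.asarray(pred_hr, float) - np.asarray(true_hr, float)) ** 2)
    rr_loss = np.mean((np.asarray(pred_rr, float) - np.asarray(true_rr, float)) ** 2)
    return float(hr_loss + w_rr * rr_loss)

# Perfect HR, imperfect RR: loss comes entirely from the down-weighted RR term
print(multitask_loss([72, 75], [72, 75], [15, 18], [14, 16]))  # 1.25
```

The weighting matters in practice: the hardest signal (typically SpO2) can dominate the gradient if its loss is not scaled down.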
Current multi-task architectures can estimate heart rate, respiratory rate, SpO2, and HRV metrics from a single video stream. The accuracy varies by signal type (heart rate is easiest, SpO2 remains hardest), but the trajectory points toward comprehensive contactless vital sign panels from a single camera.
Deployment realities
The gap between benchmark performance and production deployment is wide. Bouraffa et al. (WACV 2025) evaluated eleven rPPG models for automotive applications — measuring driver vital signs from an in-cabin camera — and found that performance degrades substantially compared to lab benchmarks. The combination of infrared illumination, variable ambient light, vibration, and natural head movement during driving creates conditions that no standard benchmark captures.
This mirrors the broader pattern. Lab benchmarks use controlled lighting, high-quality cameras, cooperative subjects, and constrained movement. Production environments have none of these luxuries. The rPPG-Toolbox (Liu et al., NeurIPS 2023) from the University of Washington helped standardize evaluation by providing consistent training and testing protocols across models and datasets, but standardized benchmarks can only go so far when the real problem is distributional mismatch between training and deployment.
Lightweight architectures like EfficientPhys and the recently proposed LightweightPhys (2025) address the computational side of deployment, using depthwise separable convolutions and attention-based noise suppression to run in real time on edge devices. PHASE-Net (2025) introduced zero-FLOPs modules that mix spatial features without additional computation, pushing the efficiency boundary further.
Circadify is developing deep learning-powered rPPG technology designed for deployment conditions rather than benchmark conditions. The focus is on robustness to the variables that make real-world measurement hard — diverse demographics, variable lighting, consumer cameras, and natural movement.
What comes next
Three directions seem likely to define the next phase of deep learning rPPG.
Self-supervised and unsupervised pre-training will reduce dependence on labeled data. Collecting ground-truth vital signs synchronized with video is expensive and logistically difficult. Models that can learn useful representations from unlabeled video — then fine-tune on small labeled sets — will scale better.
Physics-informed architectures that encode hemoglobin absorption spectra, blood volume pulse morphology, and cardiac cycle constraints into the network structure should generalize better than purely data-driven models. The physical relationships between light, skin, and blood do not change across datasets; networks that know about them should be more robust.
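One concrete form of such a prior is a regularizer that penalizes predicted-signal energy outside the plausible cardiac band. This is a sketch of the general idea, not a specific paper's loss; the 0.7-3.0 Hz band (42-180 BPM) is a conventional choice.

```python
import numpy as np

def band_power_penalty(pred: np.ndarray, fs: float,
                       lo: float = 0.7, hi: float = 3.0) -> float:
    """Physics-informed regularizer (a sketch): fraction of the predicted
    pulse signal's spectral power that falls outside the cardiac band.
    A training loop would add this term to the supervised loss."""
    freqs = np.fft.rfftfreq(len(pred), d=1.0 / fs)
    power = np.abs(np.fft.rfft(pred - pred.mean())) ** 2
    total = power.sum() + 1e-12
    in_band = power[(freqs >= lo) & (freqs <= hi)].sum()
    return float(1.0 - in_band / total)  # 0 when all energy is cardiac-band

fs = 30.0
t = np.arange(0, 10, 1 / fs)
clean = np.sin(2 * np.pi * 1.2 * t)           # 72 BPM pulse
noisy = clean + np.sin(2 * np.pi * 6.0 * t)   # plus out-of-band interference
print(band_power_penalty(clean, fs) < band_power_penalty(noisy, fs))  # True
```

Because the penalty is differentiable in frameworks with FFT support, the same construction also works as a self-supervised objective for test-time adaptation.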
Standardized cross-domain evaluation protocols will force the field to confront generalization honestly. The rPPG-Toolbox was a step, but the community needs benchmarks that explicitly test on out-of-distribution data — different cameras, different demographics, different environments — rather than random splits of the same dataset.
The deep learning rPPG field has produced impressive results in a short time. The architectures have gone from simple frame-differencing CNNs to transformer-based models that attend globally across space and time. The remaining work is less about squeezing another fraction of a BPM out of UBFC-rPPG and more about making these models work reliably when someone points a phone camera at a face in a living room, a clinic, or a car.
Frequently asked questions
What deep learning models are used for rPPG vital sign extraction?
Leading architectures include PhysNet (3D CNN), DeepPhys (2D CNN with attention), TS-CAN (temporal shift convolutional attention network), EfficientPhys (lightweight CNN), and PhysFormer (transformer-based). Each handles the tradeoff between spatial feature extraction and temporal modeling differently.
How accurate is deep learning rPPG compared to contact-based heart rate monitors?
On benchmark datasets like UBFC-rPPG, supervised models like TS-CAN achieve mean absolute errors around 1.29 BPM under controlled conditions. Performance degrades in cross-dataset testing due to domain shift from differences in lighting, cameras, and subject demographics.
What is the biggest challenge for deep learning rPPG models?
Generalization across domains remains the central problem. A model trained on one dataset with specific cameras, lighting, and demographics often performs poorly on data collected under different conditions. Domain adaptation and self-supervised pre-training are active research areas addressing this.
Can deep learning rPPG extract vital signs beyond heart rate?
Yes. Multi-task learning approaches can simultaneously estimate heart rate, respiratory rate, SpO2, and heart rate variability from facial video. Recent work by researchers at institutions including Tsinghua University and the University of Oulu has demonstrated multi-vital-sign extraction from single video streams.
Related Articles
- What is rPPG Technology — The foundational overview of remote photoplethysmography, covering how cameras extract physiological signals from skin color changes.
- rPPG Signal Processing: Raw Video to Vital Signs — A detailed look at the signal processing pipeline that deep learning models are progressively replacing.
- rPPG Accuracy and Clinical Validation Methods — How rPPG accuracy is measured and validated, including the benchmark datasets discussed in this post.