
Hand posture recognition technology makes human-computer interaction more natural and efficient. Existing hand posture recognition algorithms are mainly based on RGB images or depth data, each of which has its limitations: the former is susceptible to interference from lighting and background color, while the latter struggles to capture fine details, which reduces accuracy. To overcome these problems, the fusion of RGB images and depth data has become a research trend. However, traditional static fusion methods use fixed modal weights, which cannot adapt to the complex relationships between modalities and therefore degrade performance. To address this problem, this paper proposes a fusion module consisting of a Multi-Scale Gated Extraction (MSGE) module for multi-scale feature extraction with a gating mechanism, a Context Sensitive Dynamic Filtering (CSDF) module that dynamically adjusts weights according to modal importance, and an Importance Weighted Fusion (IWF) module for adaptive weighting. Building on these modules, this paper proposes a network that fuses RGB information and depth data, named the Dynamic Importance-Weighted Fusion Network (DIWFNet).
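To make the contrast with static fusion concrete, the core idea of importance-weighted fusion can be sketched in a few lines: each modality's features produce a content-dependent importance score, and a softmax turns the scores into per-input fusion weights. This is a minimal NumPy sketch under stated assumptions (global average pooling and linear gates as the scoring function); it is illustrative only and does not reproduce the paper's actual MSGE/CSDF/IWF design.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def importance_weighted_fusion(rgb_feat, depth_feat, w_rgb, w_depth):
    """Fuse two modality feature maps with content-dependent weights.

    Each modality's scalar importance score is derived from its own
    features (global average pooling followed by a hypothetical linear
    gate); a softmax converts the scores into fusion weights that sum
    to 1, so the weights adapt per input instead of being fixed.
    """
    # Global average pooling over spatial dims -> (batch, channels)
    g_rgb = rgb_feat.mean(axis=(2, 3))
    g_depth = depth_feat.mean(axis=(2, 3))
    # Scalar importance score per modality -> (batch, 1)
    s_rgb = g_rgb @ w_rgb
    s_depth = g_depth @ w_depth
    # Softmax across the two modalities -> dynamic weights per sample
    weights = softmax(np.concatenate([s_rgb, s_depth], axis=1), axis=1)
    a = weights[:, 0][:, None, None, None]
    b = weights[:, 1][:, None, None, None]
    return a * rgb_feat + b * depth_feat

# Example: batch of 2, 8 channels, 4x4 spatial feature maps
rng = np.random.default_rng(0)
rgb = rng.standard_normal((2, 8, 4, 4))
depth = rng.standard_normal((2, 8, 4, 4))
fused = importance_weighted_fusion(rgb, depth,
                                   rng.standard_normal((8, 1)),
                                   rng.standard_normal((8, 1)))
print(fused.shape)  # (2, 8, 4, 4)
```

Because the weights are non-negative and sum to one, the fused feature map is always a convex combination of the two modality maps, while a static method would fix `a` and `b` for all inputs.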
