
Author(s): Mayank Lovanshi, Vivek Tiwari


Address: International Institute of Information Technology (IIIT), Naya Raipur, Chhattisgarh, India.

*Corresponding Author:

Published In:   Volume - 36,      Issue - 1,     Year - 2023

Cite this article:
Lovanshi and Tiwari (2023). Human Part Semantic Segmentation Using CDGNET Architecture for Human Activity Recognition. Journal of Ravishankar University (Part-B: Science), 36(1), pp. 18-25.

Human Part Semantic Segmentation Using CDGNET Architecture for Human Activity Recognition

Mayank Lovanshi1*, Vivek Tiwari1

1International Institute of Information Technology (IIIT), Naya Raipur, 493661, Chhattisgarh, India.


Abstract: The segmentation of human body parts is a task that entails assigning labels to pixels in an image to identify the corresponding body part classes. To enhance accuracy, a technique known as sample class distribution was developed, considering the hierarchical structure of the human body and the unique positioning of each part. This technique involves gathering and applying primary human parsing labels in both vertical and horizontal dimensions to exploit the distribution of classes. By combining these guided features, a spatial guidance map is generated and incorporated into the backbone network. These semantic-guided features contribute to the effective recognition of human activity through semantic segmentation-enabled human pose. To assess the effectiveness of this approach, extensive experiments were performed on a large dataset called CIHP, using metrics such as mean IOU, pixel accuracy, and mean accuracy.

Keywords: Human Parsing, Human Semantic Part Segmentation, CIHP, ResNet-101

1. Introduction

Semantic segmentation is a technique used to divide an image into visually meaningful segments, allowing for further analysis and comprehension of the image. This approach finds applications in various fields, such as video understanding, medical analysis, human-robot interaction, and satellite-based semantic segmentation [4, 17, 10].

Human Part Semantic Segmentation, also known as Human Parsing, involves assigning semantic labels to each pixel of an image depicting a human body, such as arms, legs, dresses, and skirts [1]. Accurate identification of a human's semantic components is crucial for applications like human action analysis [15], augmented reality, human-computer interaction, and virtual reality. By separating the human body from the background, semantic segmentation [11,2] provides more detailed information about the person's movements and behaviors. It enables distinguishing between different bodily areas like the arms, legs, or torso, which in turn allows for identifying specific actions such as running, walking, or jumping [23,14].

Recent advancements in fully convolutional neural networks have contributed to the development of effective techniques for human parsing. These techniques typically employ an encoder-decoder architecture to extract features from the input image and multiple convolutional layers to generate pixel-wise predictions of the semantic categories. Researchers have also explored attention mechanisms, multi-scale features, and adversarial training to enhance the accuracy of human parsing. It is expected that ongoing research and development will lead to the creation of even more precise and reliable methods for human parsing.

Human parsing is a process that involves identifying and segmenting different components of the human body, including body parts and clothing, from an image or video. This information plays a crucial role in improving the accuracy of human activity recognition systems. By analyzing the movements and interactions of these body parts, algorithms can recognize specific actions like walking, running, and jumping. This capability has wide-ranging applications, including video surveillance, sports analysis, and healthcare monitoring, where tracking and analyzing human behavior are essential.

Furthermore, human parsing can provide valuable insights into individual characteristics, such as gender, clothing style, and age. This information finds utility in various domains such as marketing, virtual try-on experiences, and personalized services. For example, it can help tailor advertisements or recommendations based on a person's clothing preferences or suggest personalized fashion choices. The ability to extract such information through human parsing contributes to a range of applications that leverage understanding human behavior and preferences [3, 19].

Figure 1: Sample images: Human part semantic segmentation

The remainder of this paper is organized as follows. Section 2 describes the methodology, including the architecture and working of the proposed model. Section 3 presents the observations and experimental results, and Section 4 provides a discussion and directions for future work.



In this section, a novel approach is presented to convert the detailed structural information of the human body from a two-dimensional representation into a one-dimensional format along both the horizontal and vertical axes. This approach takes inspiration from CDGNet and attention-based models such as SENet, CBAM, and HANet.

While SENet and CBAM primarily focus on capturing the overall context of an image, and HANet considers height-driven attention maps for urban scene images, the proposed method extends this concept to incorporate both class and directional positions. Instead of relying solely on attention mechanisms, the proposed method incorporates classes and directional positions to generate essential signals that aid the network in effectively detecting human body parts.

Given the hierarchical structure of human bodies and their distinct spatial distributions, individual body parts exhibit unique distributions along both the vertical and horizontal dimensions. Building on this inspiration, the proposed strategy introduces a custom class distribution-guided network that predicts the class distribution in both the vertical and horizontal dimensions, guided by a class distribution loss. This innovative approach shows promise in accurately detecting and localizing human body parts in images, offering significant implications for computer vision and image analysis.
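As an illustration of what a class distribution along each dimension can look like, the following is a minimal sketch, not the authors' implementation. It assumes that the target for a row (or column) simply records whether each class occurs anywhere in that row (or column) of the parsing label map. The function name `class_distribution_targets` and the toy class ids are hypothetical.

```python
import numpy as np

def class_distribution_targets(label_map, num_classes):
    """Build per-row and per-column class-presence targets (assumed definition).

    label_map:  integer array of shape (H, W) with class ids in [0, num_classes).
    vertical:   (H, num_classes), entry [y, c] is 1 if class c occurs in row y.
    horizontal: (W, num_classes), entry [x, c] is 1 if class c occurs in column x.
    """
    one_hot = np.eye(num_classes, dtype=np.float32)[label_map]  # (H, W, N)
    vertical = one_hot.max(axis=1)    # collapse the width axis  -> (H, N)
    horizontal = one_hot.max(axis=0)  # collapse the height axis -> (W, N)
    return vertical, horizontal

# Toy 4 x 3 label map with hypothetical ids: 0 = background, 1 = head, 2 = torso.
toy = np.array([[0, 1, 0],
                [0, 2, 0],
                [2, 2, 2],
                [0, 0, 0]])
vertical, horizontal = class_distribution_targets(toy, num_classes=3)
print(vertical)    # one row of class-presence flags per image row
print(horizontal)  # one row of class-presence flags per image column
```

In this toy example the head class appears only in the top row, which is the kind of positional regularity the class distribution is meant to capture.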

The objective of custom CDGNet is to achieve accurate human parsing by training a deep learning model whose high-quality parsing results align with the class distribution of the training dataset. The custom CDGNet model includes a generator and a discriminator network, and it is trained using the CDG loss function.

In our proposed method for human parsing, we generate class distributions that indicate the location of each body part and utilize these distributions to guide the feature representation. Here's how our method is applied:

·       Input: We start with an input image frame and its associated feature tensor, which has dimensions W x H x C (width x height x channel size).

·       Spatial feature extraction: To extract information about spatial features, we perform separate operations on the input tensor in the vertical and horizontal directions. This process yields directional characteristics that can be used to generate labels.

·       Label generation: We generate labels by applying average pooling in orthogonal directions: average pooling along the vertical direction yields the horizontal features (Zh), and average pooling along the horizontal direction yields the vertical features (Zv).

·       Channel reduction: As the number of classes (body parts) is generally smaller than the channel size, we introduce a 1-dimensional convolution layer with a kernel size of (3 x 3) and a Batch Normalization layer after the feature extraction module. This reduces the number of channels in the feature tensor by half (to C/2), simplifying the representation and reducing computational complexity.
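The steps above can be sketched with NumPy as follows. This is a rough illustration under stated assumptions rather than the network's actual layers: the feature tensor is stored as (H, W, C), the 1-dimensional convolution is assumed to use a kernel of length 3 along the position axis, the weights are random and untrained, and a simple per-channel normalization stands in for a trained Batch Normalization layer. All function names are hypothetical.

```python
import numpy as np

def directional_pool(features):
    """features: (H, W, C). Returns Zh of shape (W, C) and Zv of shape (H, C)."""
    z_h = features.mean(axis=0)  # average over height -> horizontal features Zh
    z_v = features.mean(axis=1)  # average over width  -> vertical features Zv
    return z_h, z_v

def conv1d_same(z, weight):
    """1-D convolution along the position axis with zero padding.

    z: (L, C_in); weight: (K, C_in, C_out) with odd K. Returns (L, C_out).
    """
    k = weight.shape[0]
    pad = k // 2
    z_pad = np.pad(z, ((pad, pad), (0, 0)))  # zero-pad positions only
    length = z.shape[0]
    return sum(z_pad[j:j + length] @ weight[j] for j in range(k))

def normalize(z, eps=1e-5):
    """Per-channel normalization, a stand-in for Batch Normalization."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
H, W, C = 8, 6, 16                                   # toy sizes (assumed)
features = rng.standard_normal((H, W, C))
z_h, z_v = directional_pool(features)

# Separate random weights per direction; each maps C channels to C/2.
w_h = rng.standard_normal((3, C, C // 2)) * 0.1
w_v = rng.standard_normal((3, C, C // 2)) * 0.1
z_h_reduced = normalize(conv1d_same(z_h, w_h))       # shape (W, C/2)
z_v_reduced = normalize(conv1d_same(z_v, w_v))       # shape (H, C/2)
print(z_h_reduced.shape, z_v_reduced.shape)
```

Pooling first and convolving afterwards keeps the cost low, because the convolution runs over W or H positions instead of all W x H pixels.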

Figure 2: Overview of the architecture of the proposed network.

By following this approach, we obtain horizontal and vertical labels that capture the spatial distribution of body parts. Figure 2 illustrates the extracted horizontal and vertical labels using the proposed model.


3. Observation & Experimental Results

The proposed approach for human parsing is evaluated on the CIHP dataset [12] using mean IoU, pixel accuracy, mean accuracy, and classwise accuracy. The evaluation consists of quantitative and qualitative experiments, which indicate that the proposed approach performs better than state-of-the-art models.

3.1. Dataset Used

The CIHP (Crowd Instance-level Human Parsing) dataset [12] was created for various computer vision applications, including semantic segmentation, human parsing, and retrieving in-store clothing. The dataset includes more than 38,000 high-resolution photos of various body positions, garment types, and occlusions. A pixel-level human parsing mask comprising 20 categories, including hair, face, upper clothes, lower clothes, shoes, and background, is attached to each image in the collection. The collection also includes annotations for retrieving and trying on clothes in stores, which require locating the clothing items and predicting their characteristics.

3.2 Evaluation Metrics

The proposed model is evaluated using the metrics discussed below:

·       Mean IoU: The Intersection over Union (IoU) metric compares the ground truth with the predicted segmentation by dividing the area of their intersection by the area of their union. Mean IoU averages the IoU over all classes and is frequently used to evaluate both binary and multi-class segmentation [16].

·       Pixel accuracy: Pixel accuracy is the percentage of pixels in an image that are classified correctly. It is computed by dividing the number of correctly classified pixels by the total number of pixels in the image [16].

·       Mean accuracy: For each class, accuracy is the number of correctly classified pixels of that class divided by the total number of pixels belonging to that class. Mean accuracy is the average of these per-class accuracies [16].
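The three metrics can all be computed from a single confusion matrix. The sketch below shows one common way to do so and is not the evaluation script used in the experiments. The function names are hypothetical, and classes that never appear in the ground truth are skipped when averaging, which is an assumption.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """Rows index ground-truth classes, columns index predicted classes."""
    idx = gt.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def segmentation_metrics(cm):
    """Return (pixel accuracy, mean accuracy, mean IoU) from a confusion matrix."""
    tp = np.diag(cm).astype(np.float64)   # correctly classified pixels per class
    gt_pixels = cm.sum(axis=1)            # ground-truth pixels per class
    pred_pixels = cm.sum(axis=0)          # predicted pixels per class
    union = gt_pixels + pred_pixels - tp
    present = gt_pixels > 0               # assumption: skip absent classes
    pixel_acc = tp.sum() / cm.sum()
    mean_acc = np.mean(tp[present] / gt_pixels[present])
    mean_iou = np.mean(tp[present] / union[present])
    return pixel_acc, mean_acc, mean_iou

# Toy 2 x 3 example with three classes; one pixel of class 0 is mislabeled as 1.
gt = np.array([[0, 0, 1],
               [1, 1, 2]])
pred = np.array([[0, 1, 1],
                 [1, 1, 2]])
pixel_acc, mean_acc, mean_iou = segmentation_metrics(confusion_matrix(gt, pred, 3))
print(pixel_acc, mean_acc, mean_iou)
```

In the toy example a single wrong pixel lowers the IoU of both the class it belongs to and the class it was assigned to, which is why mean IoU is usually the strictest of the three metrics.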

Table 1: Quantitative results of Custom CDGNet and other benchmark methods on the CIHP dataset

Method            Pixel Acc   Mean Acc   Mean IoU
PGN [7]
Graphonomy [6]
M-CE2P [18]
CorrPM [22]
SNT [9]
PCNet [21]
CDGNet [13]
Custom CDGNet     91.77       75.22      65.69

3.3. Quantitative Analysis

To assess human parsing performance, we performed quantitative experiments on the CIHP dataset and compared the proposed model with well-known human parsing algorithms. The proposed model achieves a higher classwise mean IoU for many classes; Figure 3 shows the classwise mean IoU on the CIHP dataset. Furthermore, the proposed model outperforms existing state-of-the-art frameworks in terms of pixel accuracy, mean accuracy, and mean IoU. As reported in Table 1, it achieves a pixel accuracy of 91.77%, a mean accuracy of 75.22%, and a mean IoU of 65.69%.

 Figure 3: Quantitative result of the classwise mean IoU of custom CDGNet on CIHP dataset

4. Discussion

This paper introduces a novel method called Custom CDGNet for human part semantic segmentation, aiming to achieve accurate and effective segmentation of human body parts. The proposed technique utilizes pixel labeling to assign classes to each pixel, generating vertical and horizontal class distributions for all human components. This information greatly assists in labeling pixels accurately, even in scenarios involving multiple individuals in an image. Through comprehensive qualitative and quantitative analysis, Custom CDGNet demonstrates superior performance compared to existing methods for human part semantic segmentation. These findings provide valuable insights and inspire further research in the application of multi-modal human parsing and edge computing for various computer vision tasks. Moreover, the outcomes achieved by the Custom CDGNet model hold promise for advancing the recognition of human activity through semantic segmentation-enabled pose estimation. This suggests potential future directions where the proposed model's results can contribute to improving human activity recognition by leveraging semantic segmentation techniques.


I am grateful to Dr Vivek Tiwari, my supervisor, whose expertise and insights greatly contributed to the interpretations and conclusions presented in this paper. Additionally, I would like to express my appreciation to the Director, Dean, and HoD CSE of IIIT-NR for their unwavering support throughout this research endeavour. The AIDL Lab at the International Institute of Information Technology, NR was the site of the research.



[1] Badrinarayanan, V., Kendall, A., and Cipolla, R. (2017). Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495.

[2] Bose, K., Shubham, K., Tiwari, V., and Patel, K. S. (2022). Insect image semantic segmentation and identification using unet and deeplab v3+. In ICT Infrastructure and Computing: Proceedings of ICT4SD 2022, pages 703–711. Springer.

[3] Chen, L.-C., Yang, Y., Wang, J., Xu, W., and Yuille, A. L. (2016). Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3640–3649.

[4] Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. (2018). Encoder decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818.

[5] Choi, S., Kim, J. T., and Choo, J. (2020). Cars can’t fly up in the sky: Improving urban-scene segmentation via height-driven attention networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9373–9383.

[6] Gong, K., Gao, Y., Liang, X., Shen, X., Wang, M., and Lin, L. (2019). Graphonomy: Universal human parsing via graph transfer learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7450–7459.

[7] Gong, K., Liang, X., Li, Y., Chen, Y., Yang, M., and Lin, L. (2018). Instance-level human parsing via part grouping network. In Proceedings of the European conference on computer vision (ECCV), pages 770–785.

[8] Hu, J., Shen, L., and Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141.

[9] Ji, R., Du, D., Zhang, L., Wen, L., Wu, Y., Zhao, C., Huang, F., and Lyu, S. (2020). Learning semantic neural tree for human parsing. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIII 16, pages 205–221. Springer.

[10] Kashyap, R. and Tiwari, V. (2017). Energy-based active contour method for image segmentation. International Journal of Electronic Healthcare, 9(2-3):210–225.

[11] Kashyap, R. and Tiwari, V. (2018). Active contours using global models for medical image segmentation. International Journal of Computational Systems Engineering, 4(2-3):195–201.

[12] Li, P., Xu, Y., Wei, Y., and Yang, Y. (2020). Self-correction for human parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6):3260–3271.

[13] Liu, K., Choi, O., Wang, J., and Hwang, W. (2022). Cdgnet: Class distribution guided network for human parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4473–4482.

[14] Lovanshi, M. and Tiwari, V. (2023). Human pose estimation: Benchmarking deep learning-based methods. In Proceedings of the IEEE Conference on Interdisciplinary Approaches in Technology and Management for Social Innovation.

[15] Patel, A. S., Vyas, R., Vyas, O., Ojha, M., and Tiwari, V. (2022). Motion-compensated online object tracking for activity detection and crowd behavior analysis. The Visual Computer, pages 1–21.

[16] Rochan, M. et al. (2018). Future semantic segmentation with convolutional lstm. arXiv preprint arXiv:1807.07946.

[17] Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18, pages 234–241. Springer.

[18] Ruan, T., Liu, T., Huang, Z., Wei, Y., Wei, S., and Zhao, Y. (2019). Devil in the details: Towards accurate single and multiple human parsing. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 4814–4821.

[19] Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., and Cottrell, G. (2018). Understanding convolution for semantic segmentation. In 2018 IEEE winter conference on applications of computer vision (WACV), pages 1451–1460. IEEE.

[20] Woo, S., Park, J., Lee, J.-Y., and Kweon, I. S. (2018). Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19.

[21] Zhang, X., Chen, Y., Zhu, B., Wang, J., and Tang, M. (2020a). Part-aware context network for human parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8971–8980.

[22] Zhang, Z., Su, C., Zheng, L., and Xie, X. (2020b). Correlating edge, pose with parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8900–8909.

[23] Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., and Torralba, A. (2019). Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision, 127:302–321.

