Human Part Semantic Segmentation Using CDGNet Architecture for Human Activity Recognition
Mayank Lovanshi1*, Vivek Tiwari1
1International Institute of Information Technology (IIIT), Naya Raipur, 493661, Chhattisgarh, India.
*Corresponding Author: mayank@iiitnr.edu.in
Abstract: Human body part segmentation is the task of assigning a body-part class label to every pixel in an image. To enhance accuracy, a class-distribution technique was developed that considers the hierarchical structure of the human body and the distinct position of each part. The technique gathers and applies primary human parsing labels along both the vertical and horizontal dimensions to exploit the spatial distribution of classes. The resulting guided features are combined into a spatial guidance map that is incorporated into the backbone network. These semantically guided features contribute to effective human activity recognition through semantic segmentation-enabled human pose. To assess the effectiveness of this approach, extensive experiments were performed on the large CIHP dataset, using mean IoU, pixel accuracy, and mean accuracy as metrics.
Keywords: Human Parsing, Human Semantic Part Segmentation, CIHP, ResNet-101
1. Introduction
Semantic segmentation is
a technique used to divide an image into visually meaningful segments, allowing
for further analysis and comprehension of the image. This approach finds
applications in various fields, such as video understanding, medical analysis,
human-robot interaction, and satellite-based semantic segmentation [4, 17, 10].
Human Part Semantic
Segmentation, also known as Human Parsing, involves assigning semantic labels
to each pixel of an image depicting a human body, such as arms, legs, dresses,
and skirts [1]. Accurate identification of a human's semantic components is
crucial for applications like human action analysis [15], augmented reality,
human-computer interaction, and virtual reality. By separating the human body
from the background, semantic segmentation [11, 2] provides more detailed
information about the person's movements and behaviors. It enables
distinguishing between different bodily areas like the arms, legs, or torso,
which in turn allows for identifying specific actions such as running, walking,
or jumping [23, 14].
Recent advancements in
fully convolutional neural networks have contributed to the development of
effective techniques for human parsing. These techniques typically employ an
encoder-decoder architecture to extract features from the input image and
multiple convolutional layers to generate pixel-wise predictions of the
semantic categories. Researchers have also explored attention mechanisms,
multi-scale features, and adversarial training to enhance the accuracy of human
parsing. It is expected that ongoing research and development will lead to the
creation of even more precise and reliable methods for human parsing.
Human parsing is a
process that involves identifying and segmenting different components of the
human body, including body parts and clothing, from an image or video. This
information plays a crucial role in improving the accuracy of human activity
recognition systems. By analyzing the movements and interactions of these body
parts, algorithms can recognize specific actions like walking, running, and
jumping. This capability has wide-ranging applications, including video
surveillance, sports analysis, and healthcare monitoring, where tracking and
analyzing human behavior are essential.
Furthermore, human
parsing can provide valuable insights into individual characteristics, such as
gender, clothing style, and age. This information finds utility in various
domains such as marketing, virtual try-on experiences, and personalized
services. For example, it can help tailor advertisements or recommendations
based on a person's clothing preferences or suggest personalized fashion
choices. The ability to extract such information through human parsing
contributes to a range of applications that leverage understanding human
behavior and preferences [3, 19].
Figure 1: Sample images of human part semantic segmentation
The remainder of this paper is organized as follows: Section 2 details the methodology and experimental method, including an explanation of the architecture and working of the proposed model; Section 3 presents observations and experimental results; and Section 4 concludes with a discussion and future work.
2. Methodology
In
this section, a novel approach is presented to convert the detailed structural
information of the human body from a two-dimensional representation into a
one-dimensional format along both the horizontal and vertical axes. This
approach takes inspiration from CDGNet [13] and attention-based models such as SENet [8], CBAM [20], and HANet [5].
While
SENet and CBAM primarily focus on capturing the overall context of an image,
and HANet considers height-driven attention maps for urban scene images, the
proposed method extends this concept to incorporate both class and directional
positions. Instead of relying solely on attention mechanisms, the proposed
method incorporates classes and directional positions to generate essential
signals that aid the network in effectively detecting human body parts.
Given
the hierarchical structure of human bodies and their distinct spatial
distributions, individual body parts exhibit unique distributions along both
the vertical and horizontal dimensions. Building on this inspiration, the
proposed strategy introduces a custom class distribution-guided network that
predicts the class distribution in both the vertical and horizontal dimensions,
guided by a class distribution loss. This innovative approach shows promise in
accurately detecting and localizing human body parts in images, offering
significant implications for computer vision and image analysis.
The objective of the custom CDGNet is to achieve accurate human parsing by training a deep learning model that generates high-quality parsing results aligned with the class distribution of the training dataset. The custom CDGNet model includes a generator and a discriminator network, and it is trained using the CDG loss function.
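To make this supervision concrete, the following is a minimal sketch, assuming PyTorch, of how ground-truth vertical and horizontal class distributions can be built from a parsing mask and compared against the network's predictions. The exact normalization and loss form are assumptions here; a mean-squared error between predicted and ground-truth distributions is one common choice.

```python
import torch
import torch.nn.functional as F

def gt_class_distributions(labels: torch.Tensor, num_classes: int):
    """labels: (B, H, W) integer parsing mask -> per-axis class distributions."""
    one_hot = F.one_hot(labels.long(), num_classes).float()  # (B, H, W, K)
    horizontal = one_hot.mean(dim=1)  # (B, W, K): class frequency per column
    vertical = one_hot.mean(dim=2)    # (B, H, K): class frequency per row
    return horizontal, vertical

def cdg_loss(pred_h, pred_v, labels, num_classes):
    # Assumed form of the class distribution loss: MSE between predicted
    # and ground-truth horizontal/vertical distributions.
    gt_h, gt_v = gt_class_distributions(labels, num_classes)
    return F.mse_loss(pred_h, gt_h) + F.mse_loss(pred_v, gt_v)
```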
In our proposed method
for human parsing, we generate class distributions that indicate the location
of each body part and utilize these distributions to guide the feature
representation. Here's how our method is applied:
· Input: We start with an input image frame and its associated
feature tensor, which has dimensions W x H x C (width x height x channel size).
· Spatial feature extraction: To extract information about
spatial features, we perform separate operations on the input tensor in the
vertical and horizontal directions. This process yields directional
characteristics that can be used to generate labels.
· Label generation: We generate labels by applying average pooling in orthogonal directions: vertical average pooling produces the horizontal features (Zh), and horizontal average pooling produces the vertical features (Zv).
· Channel reduction: Since the number of classes (body parts) is generally smaller than the channel size, we introduce a 1-dimensional convolution layer with a kernel size of 3, followed by a Batch Normalization layer, after the feature extraction module. This halves the number of channels in the feature tensor (to C/2), simplifying the representation and reducing computational complexity (see the sketch after this list).
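The following is a minimal sketch of the pooling and channel-reduction steps above, assuming PyTorch and a backbone feature map in (B, C, H, W) layout; the module name, the two-branch structure, and the ReLU after batch normalization are illustrative assumptions rather than the paper's verbatim implementation.

```python
import torch
import torch.nn as nn

class DirectionalPooling(nn.Module):
    """Pools backbone features along each axis and halves the channels."""
    def __init__(self, channels: int):
        super().__init__()
        reduced = channels // 2
        # 1-D convolution (kernel size 3) + BatchNorm, one branch per direction.
        self.reduce_h = nn.Sequential(
            nn.Conv1d(channels, reduced, kernel_size=3, padding=1),
            nn.BatchNorm1d(reduced), nn.ReLU(inplace=True))
        self.reduce_v = nn.Sequential(
            nn.Conv1d(channels, reduced, kernel_size=3, padding=1),
            nn.BatchNorm1d(reduced), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor):
        # x: (B, C, H, W) feature tensor from the backbone.
        z_h = x.mean(dim=2)  # vertical average pooling   -> (B, C, W)
        z_v = x.mean(dim=3)  # horizontal average pooling -> (B, C, H)
        return self.reduce_h(z_h), self.reduce_v(z_v)
```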
Figure 2: Overview architecture of the proposed network.
By following this approach, we obtain horizontal and vertical labels that capture the spatial distribution of body parts. Figure 2 also illustrates how the horizontal and vertical labels are extracted by the proposed model.
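One plausible way to fuse the two directional outputs into the spatial guidance map mentioned in the abstract is an outer-product-style broadcast followed by gating, sketched below under the assumption that the attention tensors have been projected back to the backbone's channel count. This is an illustration of the idea, not necessarily the exact fusion used in the proposed network.

```python
import torch

def apply_guidance(features: torch.Tensor,
                   att_h: torch.Tensor,
                   att_v: torch.Tensor) -> torch.Tensor:
    """features: (B, C, H, W); att_h: (B, C, W); att_v: (B, C, H)."""
    # Broadcast the two 1-D attention vectors into a 2-D guidance map.
    guide = att_v.unsqueeze(3) * att_h.unsqueeze(2)  # (B, C, H, W)
    # Gate the backbone features with the sigmoid-squashed guidance map.
    return features * torch.sigmoid(guide)
```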
3. Observation & Experimental
Results
This article introduces a novel approach for human parsing, evaluated using mean IoU, pixel accuracy, mean accuracy, and classwise accuracy on the CIHP dataset [12]. The evaluation consists of quantitative and qualitative experiments demonstrating that the proposed approach performs better than state-of-the-art models.
3.1. Dataset Used
The CIHP (Crowd Instance-level Human Parsing) dataset [12] was created for computer vision applications including semantic segmentation and human parsing. The dataset includes more than 38,000 high-resolution photos covering a wide variety of body positions, garment types, and occlusions. Each image carries a pixel-level human parsing mask over 20 categories, including hair, face, upper clothes, lower clothes, shoes, and background. The dataset also provides instance-level annotations, which makes it suitable for parsing images that contain multiple people.
3.2. Evaluation Metrics
The proposed model is evaluated on the metrics discussed below; a short sketch computing them is given after the list:
· Mean IoU: The ground-truth and predicted segmentation results are compared using the Intersection over Union (IoU) metric, which measures overlap as the area of their intersection divided by the area of their union. Mean IoU, the average IoU across classes, is frequently used to evaluate both binary and multi-class segmentation [16].
· Pixel accuracy: Pixel Accuracy is a performance metric that shows how many of an image's pixels were correctly classified. It is computed by dividing the number of correctly classified pixels by the total number of pixels in the image [16].
· Mean Accuracy: Mean Accuracy is the number of correct predictions for a class divided by the total number of samples of that class, averaged across all classes [16].
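For reference, the following is a minimal sketch, assuming NumPy and integer label maps, that computes all three metrics from a single accumulated confusion matrix.

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    """Accumulate a (num_classes x num_classes) confusion matrix."""
    idx = num_classes * gt.reshape(-1).astype(int) + pred.reshape(-1).astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def segmentation_metrics(cm: np.ndarray):
    tp = np.diag(cm).astype(float)
    pixel_acc = tp.sum() / cm.sum()                          # correct pixels / all pixels
    mean_acc = (tp / np.maximum(cm.sum(axis=1), 1)).mean()   # per-class accuracy, averaged
    union = cm.sum(axis=1) + cm.sum(axis=0) - tp
    mean_iou = (tp / np.maximum(union, 1)).mean()            # per-class IoU, averaged
    return pixel_acc, mean_acc, mean_iou
```

In practice the confusion matrix is accumulated over the whole validation set before computing the metrics, so small images do not bias the averages.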
Table 1: Quantitative
results of Custom CDGNet with other benchmark methods on the CIHP dataset
Method | Backbone | Pixel Acc | Mean Acc | Mean IoU
PGN [7] | DeepLabV2 | - | 64.65 | 55.80
Graphonomy [6] | DeepLabV3+ | - | 65.73 | 58.58
M-CE2P [18] | ResNet101 | 63.77 | 45.31 | 59.50
CorrPM [22] | ResNet101 | - | - | 60.18
SNT [9] | ResNet101 | - | - | 60.87
PCNet [21] | ResNet101 | - | 67.05 | 61.05
CDGNet [13] | ResNet101 | 90.01 | 74.2 | 65.56
Custom CDGNet | ResNet101 | 91.23 | 75.01 | 65.57
3.3. Quantitative Analysis
To attain the highest
performance in human parsing, we performed quantitative experiments on the CIHP
dataset compared with well-known human parsing algorithms. The proposed model
outperforms in many places regarding classwise mean IOU in the experiment.
Figure 4 represents the classwise results of mean IoU on CIHP datasets.
Furthermore, the proposed model outperforms in terms of pixel accuracy, mean accuracy
and mean IoU compared to existing state-of-the-art frameworks. Table 1
represents pixel accuracy as 91.77%, mean accuracy as 75.22% and mean IOU as
65.69% compared to other state-of-the-art work.
Figure 3: Quantitative results of the classwise mean IoU of Custom CDGNet on the CIHP dataset
4. Discussion
This paper introduces a
novel method called Custom CDGNet for human part semantic segmentation, aiming
to achieve accurate and effective segmentation of human body parts. The proposed
technique utilizes pixel labeling to assign classes to each pixel, generating
vertical and horizontal class distributions for all human components. This
information greatly assists in labeling pixels accurately, even in scenarios
involving multiple individuals in an image. Through comprehensive qualitative
and quantitative analysis, Custom CDGNet demonstrates superior performance
compared to existing methods for human part semantic segmentation. These
findings provide valuable insights and inspire further research in the
application of multi-modal human parsing and edge computing for various
computer vision tasks. Moreover, the outcomes achieved by the Custom CDGNet
model hold promise for advancing the recognition of human activity through
semantic segmentation-enabled pose estimation. This suggests potential future
directions where the proposed model's results can contribute to improving human
activity recognition by leveraging semantic segmentation techniques.
Acknowledgement
I
am grateful to Dr Vivek Tiwari, my supervisor, whose expertise and insights
greatly contributed to the interpretations and conclusions presented in this
paper. Additionally, I would like to express my appreciation to the Director,
Dean, and HoD CSE of IIIT-NR for their unwavering support throughout this
research endeavour. This research was carried out at the AIDL Lab, International Institute of Information Technology, Naya Raipur.
References
[1] Badrinarayanan, V.,
Kendall, A., and Cipolla, R. (2017). Segnet: A deep convolutional
encoder-decoder architecture for image segmentation. IEEE transactions on
pattern analysis and machine intelligence, 39(12):2481–2495.
[2] Bose, K., Shubham,
K., Tiwari, V., and Patel, K. S. (2022). Insect image semantic segmentation and identification using UNet and DeepLab V3+. In ICT Infrastructure and Computing:
Proceedings of ICT4SD 2022, pages 703–711. Springer.
[3] Chen, L.-C., Yang,
Y., Wang, J., Xu, W., and Yuille, A. L. (2016). Attention to scale: Scale-aware
semantic image segmentation. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 3640–3649.
[4] Chen, L.-C., Zhu,
Y., Papandreou, G., Schroff, F., and Adam, H. (2018). Encoder decoder with
atrous separable convolution for semantic image segmentation. In Proceedings
of the European conference on computer vision (ECCV), pages 801–818.
[5] Choi, S., Kim, J.
T., and Choo, J. (2020). Cars can’t fly up in the sky:
Improving urban-scene segmentation via height-driven attention networks. In
Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, pages 9373–9383.
[6] Gong, K., Gao, Y.,
Liang, X., Shen, X., Wang, M., and Lin, L. (2019). Graphonomy: Universal human
parsing via graph transfer learning. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 7450–7459.
[7] Gong, K., Liang, X.,
Li, Y., Chen, Y., Yang, M., and Lin, L. (2018). Instance-level human parsing
via part grouping network. In Proceedings of the European conference on computer
vision (ECCV), pages 770–785.
[8] Hu, J., Shen, L.,
and Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 7132–7141.
[9] Ji, R., Du, D.,
Zhang, L., Wen, L., Wu, Y., Zhao, C., Huang, F., and Lyu, S. (2020). Learning
semantic neural tree for human parsing. In Computer Vision–ECCV 2020: 16th
European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIII
16, pages 205–221. Springer.
[10] Kashyap, R. and
Tiwari, V. (2017). Energy-based active contour method for image segmentation.
International Journal of Electronic Healthcare, 9(2-3):210–225.
[11] Kashyap, R. and
Tiwari, V. (2018). Active contours using global models for medical image
segmentation. International Journal of Computational Systems Engineering,
4(2-3):195–201.
[12] Li, P., Xu, Y.,
Wei, Y., and Yang, Y. (2020). Self-correction for human parsing. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 44(6):3260–3271.
[13] Liu, K., Choi, O.,
Wang, J., and Hwang, W. (2022). Cdgnet: Class distribution guided network for
human parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 4473–4482.
[14] Lovanshi, M. and
Tiwari, V. (2023). Human pose estimation: Bench- marking deep learning-based
methods. In proceedings of the IEEE Conference on Interdisciplinary Approaches
in Technology and Management for Social Innovation.
[15] Patel, A. S., Vyas, R., Vyas, O., Ojha, M., and Tiwari, V. (2022). Motion-compensated online object tracking for activity detection and crowd behavior analysis. The Visual Computer, pages 1–21.
[16] Rochan, M. et al. (2018). Future semantic segmentation with convolutional LSTM. arXiv preprint arXiv:1807.07946.
[17] Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18, pages 234–241. Springer.
[18] Ruan, T., Liu, T., Huang, Z., Wei, Y., Wei, S., and Zhao, Y. (2019). Devil in the details: Towards accurate single and multiple human parsing. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 4814–4821.
[19] Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., and Cottrell, G. (2018). Understanding convolution for semantic segmentation. In 2018 IEEE winter conference on applications of computer vision (WACV), pages 1451–1460. IEEE.
[20] Woo, S., Park, J., Lee, J.-Y., and Kweon, I. S. (2018). CBAM: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19.
[21] Zhang, X., Chen, Y., Zhu, B., Wang, J., and Tang, M. (2020a). Part-aware context network for human parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8971–8980.
[22] Zhang, Z., Su, C., Zheng, L., and Xie, X. (2020b). Correlating edge, pose with parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8900–8909.
[23] Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., and Torralba, A. (2019). Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 127:302–321.