Smart Phone Sensor Data Fusion: A Joint Learning Approach to Activity Recognition

Paper Snapshot

Published: June 2025
Keywords: Research Work

Overview

In this project, we developed a deep learning–based system to classify human activities using smartphone sensor data (accelerometer and gyroscope). We proposed and tested two architectures — a Full Transformer model and a Joint Learning model that fuses CNN-LSTM and Transformer components.

Our aim was to explore the fusion of spatial and temporal modeling techniques to achieve highly accurate activity classification from sensor signals, tackling the challenges of heterogeneity, noise, and subtle activity transitions.

Motivation

With the rising use of wearable devices and smartphones, activity recognition has become a key element in fitness tracking, elderly care, and mobile health. However, real-world data is noisy and multi-dimensional. Traditional models often fail to capture both local patterns and long-range dependencies effectively.

This research aimed to address these limitations with a hybrid architecture that integrates:

  • CNNs for spatial pattern extraction
  • Transformers for long-range dependencies
  • LSTMs for sequential modeling

Dataset

We used the UCI-HAR (Human Activity Recognition) dataset, a well-known benchmark containing:

  • 7352 records of six labeled activities: Walking, Walking Upstairs, Walking Downstairs, Sitting, Standing, and Laying
  • 561 sensor-derived features per record

The dataset was:

  • Scaled using StandardScaler
  • Split into 70% training and 30% test sets (see the preprocessing sketch below)
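The preprocessing pipeline can be reproduced with scikit-learn. The sketch below is a minimal example under the assumption that the 561-feature matrix and activity labels have already been loaded from the dataset files; the placeholder arrays only stand in for them.

```python
# Minimal preprocessing sketch (assumed setup, not the project's exact script).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholders for the real UCI-HAR feature matrix (7352 x 561) and labels (0-5);
# in practice these come from the dataset's feature files.
rng = np.random.default_rng(42)
X = rng.normal(size=(7352, 561)).astype("float32")
y = rng.integers(0, 6, size=7352)

# 70% / 30% split, stratified so each activity keeps its class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Standardize features: fit the scaler on the training split, apply to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```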

Architectures

1. Full Transformer Architecture

A deep model leveraging:

  • Positional encoding
  • Six parallel Conv1D branches for feature extraction
  • Multi-head self-attention layers to model global dependencies
  • LSTM layers to retain temporal patterns
  • Global average pooling and dropout regularization
  • Final Softmax classification over 6 activities
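The Keras sketch below illustrates this branch layout end to end. The layer widths, the six kernel sizes, and the reshape of the 561 features into a (561, 1) sequence are illustrative assumptions rather than the project's exact configuration.

```python
# Hedged Keras sketch of the Full Transformer branch (all sizes are assumptions).
import numpy as np
from tensorflow.keras import layers, models

SEQ_LEN, N_CLASSES, D_MODEL = 561, 6, 64

def positional_encoding(length, depth):
    """Standard sinusoidal positional encoding, shape (1, length, depth)."""
    positions = np.arange(length)[:, None]
    dims = np.arange(depth)[None, :]
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(depth))
    angles = positions * angle_rates
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    angles[:, 1::2] = np.cos(angles[:, 1::2])
    return angles[None, ...].astype("float32")

inputs = layers.Input(shape=(SEQ_LEN,))
x = layers.Reshape((SEQ_LEN, 1))(inputs)            # treat the 561 features as a 1-D sequence

# Six parallel Conv1D branches with different kernel sizes, then concatenated
branches = [layers.Conv1D(16, k, padding="same", activation="relu")(x)
            for k in (3, 5, 7, 9, 11, 13)]
x = layers.Concatenate()(branches)
x = layers.Dense(D_MODEL)(x)                        # project to the attention width

x = x + positional_encoding(SEQ_LEN, D_MODEL)       # inject positional information
attn = layers.MultiHeadAttention(num_heads=4, key_dim=D_MODEL // 4)(x, x)
x = layers.LayerNormalization()(x + attn)           # residual connection + norm

x = layers.LSTM(64, return_sequences=True)(x)       # retain temporal patterns
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(N_CLASSES, activation="softmax")(x)

full_transformer = models.Model(inputs, outputs, name="full_transformer_sketch")
full_transformer.compile(optimizer="adam",
                         loss="sparse_categorical_crossentropy",
                         metrics=["accuracy"])
```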

2. Joint Learning Architecture (Proposed)

Our novel fusion model combined:

  • A CNN-LSTM stream for local temporal feature learning
  • A Transformer stream for global attention-based modeling

These two branches were:

  • Merged and followed by dense layers
  • Trained jointly using the Adam optimizer for 50 epochs

This hybrid design allows the model to retain local detail while leveraging long-range dependencies — improving generalization across activities.
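A compact Keras sketch of this two-stream fusion is given below; the stream widths, kernel size, and dense head are assumptions chosen for illustration, and the commented fit call mirrors the 50-epoch joint training described above.

```python
# Hedged sketch of the Joint Learning model: CNN-LSTM stream + attention stream.
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, N_CLASSES = 561, 6

inputs = layers.Input(shape=(SEQ_LEN,))
seq = layers.Reshape((SEQ_LEN, 1))(inputs)

# Stream 1: CNN-LSTM for local temporal feature learning
cnn = layers.Conv1D(64, 5, padding="same", activation="relu")(seq)
cnn = layers.MaxPooling1D(2)(cnn)
cnn = layers.LSTM(64)(cnn)

# Stream 2: Transformer-style self-attention for global dependencies
proj = layers.Dense(64)(seq)
attn = layers.MultiHeadAttention(num_heads=4, key_dim=16)(proj, proj)
attn = layers.LayerNormalization()(proj + attn)
attn = layers.GlobalAveragePooling1D()(attn)

# Fuse the two streams, then classify with dense layers
merged = layers.Concatenate()([cnn, attn])
x = layers.Dense(128, activation="relu")(merged)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(N_CLASSES, activation="softmax")(x)

joint_model = models.Model(inputs, outputs, name="joint_learning_sketch")
joint_model.compile(optimizer=tf.keras.optimizers.Adam(),
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])

# Joint training for 50 epochs, as in the project setup:
# joint_model.fit(X_train, y_train, validation_split=0.1, epochs=50, batch_size=64)
```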

Results & Evaluation

  • Full Transformer Accuracy: 96%
  • Joint Learning Accuracy: 98%
  • Joint Learning F1 Score: 98% (outperforming most existing models in the literature)
  • Evaluation metrics: Confusion matrix, classification report, and ROC curves
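These metrics are straightforward to reproduce with scikit-learn once a model is trained; the sketch below assumes the joint model and the held-out test split from the preprocessing step, and uses a macro one-vs-rest ROC AUC as a compact summary of the per-class ROC curves.

```python
# Evaluation sketch: confusion matrix, classification report, and ROC summary.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

y_prob = joint_model.predict(X_test)     # class probabilities, shape (n_samples, 6)
y_pred = np.argmax(y_prob, axis=1)       # predicted activity per sample

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=4))

# Macro one-vs-rest ROC AUC as a single-number stand-in for per-class ROC curves
print("Macro ROC AUC:", roc_auc_score(y_test, y_prob, multi_class="ovr"))
```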

Our Joint Learning model compared favorably with several prior works, such as:

  • Multi-head CNN with LSTM (97.07%)
  • ConvAE-LSTM (98.14% accuracy, but a lower F1 Score of 97.67%)

My Contributions

  • Designed, implemented, and trained both architectures using TensorFlow/Keras
  • Preprocessed the UCI-HAR dataset, performing standardization and data analysis
  • Conducted benchmarking with state-of-the-art models and visualized evaluation metrics
  • Drafted model comparisons and handled performance tuning

Future Work

  • Experiment with GRU-Transformer hybrids
  • Investigate performance scaling with more CNN heads
  • Create or collect larger and more diverse datasets
  • Deploy model on mobile devices or edge environments for real-time activity tracking

Final Takeaway

This project demonstrates the power of deep fusion architectures in extracting both short-term features and long-term patterns from multi-sensor data. By combining CNNs, LSTMs, and Transformers, we achieved a high-performance, robust model suitable for real-world activity recognition applications.