Overview
In this project, we developed a deep learning-based system to classify human activities from smartphone sensor data (accelerometer and gyroscope). We proposed and tested two architectures: a Full Transformer model and a Joint Learning model that fuses CNN-LSTM and Transformer components.
Our aim was to explore the fusion of spatial and temporal modeling techniques to achieve highly accurate activity classification from sensor signals, tackling the challenges of heterogeneity, noise, and subtle activity transitions.
Motivation
With the rising use of wearable devices and smartphones, activity recognition has become a key element in fitness tracking, elderly care, and mobile health. However, real-world data is noisy and multi-dimensional. Traditional models often fail to capture both local patterns and long-range dependencies effectively.
This research aimed to address these challenges with a hybrid architecture that integrates:
CNNs for spatial pattern extraction
Transformers for long-range dependencies
LSTMs for sequential modeling
Dataset
We used the UCI-HAR (Human Activity Recognition) dataset, a well-known benchmark containing:
7352 records of six labeled activities: Walking, Walking Upstairs, Walking Downstairs, Sitting, Standing, and Laying
561 sensor-derived features per record
The dataset was:
Scaled using StandardScaler
Split into 70% training and 30% test sets (a minimal preprocessing sketch follows)
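Below is a minimal sketch of this preprocessing step. The file paths and variable names are illustrative assumptions, and the scaler is fit on the training split only (a common practice; the exact order used in the project is not specified above).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Assumed paths into the UCI-HAR download; X has shape (n_samples, 561).
X = np.loadtxt("UCI HAR Dataset/train/X_train.txt")
y = np.loadtxt("UCI HAR Dataset/train/y_train.txt").astype(int) - 1   # labels 1..6 -> 0..5

# 70/30 split, stratified by activity so class proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Standardize features; the scaler is fit on the training portion only.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```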
Architectures
1. Full Transformer Architecture
A deep model (sketched after this list) leveraging:
Positional encoding
Six parallel Conv1D branches for feature extraction
Multi-head self-attention layers to model global dependencies
LSTM layers to retain temporal patterns
Global average pooling and dropout regularization
Final Softmax classification over 6 activities
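The following is a minimal Keras sketch of this architecture, assuming the 561-dimensional feature vector is reshaped into a (timesteps, channels) sequence. The kernel widths, layer sizes, dropout rate, and attention-head count are illustrative assumptions, not the exact published configuration.

```python
import numpy as np
from tensorflow.keras import layers, models

def positional_encoding(length, depth):
    # Standard sinusoidal positional encoding (sin and cos halves concatenated).
    half = depth // 2
    positions = np.arange(length)[:, np.newaxis]        # (length, 1)
    dims = np.arange(half)[np.newaxis, :] / half        # (1, half)
    angles = positions / np.power(10000.0, dims)        # (length, half)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.astype("float32")                        # (length, depth)

def build_full_transformer(timesteps=561, channels=1, num_classes=6, d_model=64):
    inputs = layers.Input(shape=(timesteps, channels))

    # Six parallel Conv1D branches with different (assumed) kernel widths.
    branches = [layers.Conv1D(16, k, padding="same", activation="relu")(inputs)
                for k in (3, 5, 7, 9, 11, 13)]
    x = layers.Concatenate()(branches)
    x = layers.Dense(d_model)(x)

    # Inject positional information before self-attention.
    x = x + positional_encoding(timesteps, d_model)

    # Multi-head self-attention with residual connection and layer normalization.
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=d_model // 4)(x, x)
    x = layers.LayerNormalization()(x + attn)

    # LSTM layer to retain temporal patterns.
    x = layers.LSTM(64, return_sequences=True)(x)

    # Global average pooling, dropout regularization, softmax over 6 activities.
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```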
2. Joint Learning Architecture (Proposed)
Our novel fusion model combined:
A CNN-LSTM stream for local temporal feature learning
A Transformer stream for global attention-based modeling
These two branches were:
Merged and followed by dense layers
Trained jointly using Adam optimizer for 50 epochs
This hybrid design allows the model to retain local detail while leveraging long-range dependencies — improving generalization across activities.
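A corresponding sketch of the fusion model is shown below. It reuses the imports and the positional_encoding helper from the previous sketch; the layer sizes, kernel width, and dropout rate are again assumptions rather than the project's exact settings.

```python
def build_joint_model(timesteps=561, channels=1, num_classes=6, d_model=64):
    # Reuses layers/models and positional_encoding() from the previous sketch.
    inputs = layers.Input(shape=(timesteps, channels))

    # Stream 1: CNN-LSTM for local temporal feature learning.
    c = layers.Conv1D(64, 5, padding="same", activation="relu")(inputs)
    c = layers.MaxPooling1D(2)(c)
    c = layers.LSTM(64)(c)                      # final hidden state summarizes the stream

    # Stream 2: Transformer-style global attention-based modeling.
    t = layers.Dense(d_model)(inputs)
    t = t + positional_encoding(timesteps, d_model)
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=d_model // 4)(t, t)
    t = layers.LayerNormalization()(t + attn)
    t = layers.GlobalAveragePooling1D()(t)

    # Fusion: merge both streams, then dense classification layers.
    x = layers.Concatenate()([c, t])
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

# Joint training with the Adam optimizer for 50 epochs, as described above.
model = build_joint_model()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train[..., None], y_train, validation_split=0.1, epochs=50, batch_size=64)
```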
Results & Evaluation
Full Transformer Accuracy: 96%
Joint Learning Accuracy: 98%
Joint Learning F1 Score: 98% (outperforming most existing models in the literature)
Evaluation metrics: Confusion matrix, classification report, and ROC curves
Our Joint Learning model outperformed several prior works reported on this benchmark; a sketch of the evaluation step follows.
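The sketch below shows how these metrics can be computed, assuming the trained model and the test split from the earlier sketches; the macro-averaged one-vs-rest AUC is used here as a single summary of the per-class ROC curves.

```python
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

# Softmax scores over the six activities for each test record.
probs = model.predict(X_test[..., None])
preds = probs.argmax(axis=1)

print(confusion_matrix(y_test, preds))
print(classification_report(
    y_test, preds,
    target_names=["Walking", "Walking Upstairs", "Walking Downstairs",
                  "Sitting", "Standing", "Laying"]))

# Macro-averaged one-vs-rest AUC summarizing the per-class ROC curves.
print("OvR ROC AUC:", roc_auc_score(y_test, probs, multi_class="ovr", average="macro"))
```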
My Contributions
Designed, implemented, and trained both architectures using TensorFlow/Keras
Preprocessed the UCI-HAR dataset, performed standardization and data analysis
Conducted benchmarking with state-of-the-art models and visualized evaluation metrics
Drafted model comparisons and handled performance tuning
Future Work
Experiment with GRU-Transformer hybrids
Investigate how performance scales with additional parallel CNN branches and attention heads
Create or collect larger and more diverse datasets
Deploy the model on mobile devices or edge environments for real-time activity tracking (e.g., via TensorFlow Lite conversion, sketched below)
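One possible path for the on-device deployment item is TensorFlow Lite conversion; the sketch below is illustrative only and is not part of the current implementation.

```python
import tensorflow as tf

# Convert the trained Keras model to TensorFlow Lite for mobile/edge inference.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enable post-training quantization
tflite_model = converter.convert()

with open("har_joint_model.tflite", "wb") as f:
    f.write(tflite_model)
```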
Final Takeaway
This project demonstrates the power of deep fusion architectures in extracting both short-term features and long-term patterns from multi-sensor data. By combining CNNs, LSTMs, and Transformers, we achieved a high-performance, robust model suitable for real-world activity recognition applications.