1304 - A True Copycat: Using Artificial Intelligence to Create A Large Dataset From A Small Open Source Parkinson's Dataset

Thursday, February 27, 2025

5:00 PM - 6:30 PM MST

Presenting Author(s)

Akash Ramesh, BS

Medical Student
akash.skyler.ramesh@gmail.com
Artesia, California, United States

Co-Author(s)

Nisan Verma, BS

Medical Student
UQ Ochsner
Arcadia, California, United States

Objectives:

A significant challenge in medical research is obtaining large, high-quality datasets for specific diseases. Generating high-fidelity synthetic data with artificial intelligence methods such as Generative Adversarial Networks (GANs) offers a promising solution. This study aims to use a GAN to expand a small dataset into a larger one that closely mirrors real-world data from Parkinson's disease patients, both on and off L-Dopa.

Design:

An open dataset from the Laboratory of Biomechanics and Motor Control at the Federal University of ABC, Brazil, was used. The dataset contains kinetic and kinematic data from Parkinson's patients on and off L-Dopa. A modified Mode-Seeking GAN (MSGAN) was employed to generate synthetic data. The Kolmogorov-Smirnov (KS) statistic measured the similarity between the generated and real datasets for each variable, while Principal Component Analysis (PCA) and heat maps were used to identify key patterns.
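
The per-variable comparison described above can be illustrated with a minimal sketch of the two-sample KS statistic, which is the largest vertical gap between the empirical distribution functions of the real and synthetic samples. The function name `ks_statistic` and the sample values are hypothetical and are not taken from the study. In practice, a library routine such as `scipy.stats.ks_2samp` returns this statistic together with a p-value.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")

    largest_gap = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so it is enough to evaluate the gap at those values.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / n    # share of real values <= x
        cdf_synth = bisect_right(synth_sorted, x) / m  # share of synthetic values <= x
        largest_gap = max(largest_gap, abs(cdf_real - cdf_synth))
    return largest_gap


# Made-up values for one variable (not data from the study).
real_values = [1, 2, 3, 4]
synthetic_values = [3, 4, 5, 6]
print(ks_statistic(real_values, synthetic_values))  # 0.5
```

A value of 0 means the two empirical distributions coincide, and a value of 1 means the samples do not overlap at all.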

Results:

MSGAN achieved an average KS statistic of 0.27, with values ranging from 0.116 to 0.589 across variables. Thirty of the 45 variables had p-values > 0.05. Variables with p-values < 0.05 were Stroop errors (I, II, III), TMT errors, and Hoehn & Yahr scores. PCA showed that the first two principal components accounted for 0.93 (PC1) and 0.05 (PC2) of the variance.
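
As an illustration of how per-component values like those reported for PCA can be obtained, the sketch below computes explained-variance ratios from the eigenvalues of the covariance matrix. It assumes the reported PC1 and PC2 figures are explained-variance ratios, which the abstract does not state explicitly. The function name `explained_variance_ratio` and the toy data are hypothetical, and whether the authors standardized variables before PCA is not known.

```python
import numpy as np


def explained_variance_ratio(data):
    """Fraction of total variance captured by each principal component."""
    X = np.asarray(data, dtype=float)
    cov = np.cov(X, rowvar=False)                # rows are samples, columns are variables
    eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh is ascending; reverse to descending
    eigenvalues = np.clip(eigenvalues, 0.0, None)  # guard against tiny negative round-off
    return eigenvalues / eigenvalues.sum()


# Toy data (not from the study): three variables, two of them strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = np.column_stack([
    base,
    2.0 * base + 0.1 * rng.normal(size=200),
    0.2 * rng.normal(size=200),
])
print(explained_variance_ratio(toy))
```

The ratios sum to 1, so a large first value indicates that most of the variation lies along a single direction.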

Conclusions:

This study demonstrates the potential of GANs to generate synthetic data that closely approximate real-world datasets. An average KS statistic of 0.27 and non-significant differences in 30 of 45 variables suggest that the synthetic data closely resemble the original. The significant p-values for the error variables and Hoehn & Yahr scores are attributed to the narrow range of values these variables take, so that any generated value falling outside that range appears significant. The first two principal components accounted for most of the variance, and the heat map displayed favorable patterns. Future improvements could involve larger datasets, greater computational power, or algorithms refined to handle overfitting and mode-seeking. In the medical field, obtaining larger datasets and more computational power may be impractical, so we suggest using more efficient algorithms.