Synthetic data generation is becoming increasingly relevant in sport science, where real-world datasets are often limited by small sample sizes, class imbalance, privacy restrictions, and the absence of clear ground truth. This project responds to that challenge by proposing a structured framework to guide how synthetic data should be designed, evaluated, and deployed in sport. Rather than focusing on a single generative method, the study brings together evidence from peer-reviewed studies to show that synthetic data are most useful when their purpose is clearly defined and their design remains aligned with the real conditions of sport applications.
The framework is organised around six connected dimensions: objective of use, data structure, generation strategy, domain constraints, utility and fidelity evaluation, and deployment risk. Together, these dimensions provide a practical way to think about why synthetic data are needed, how they should be generated, what constraints should be respected, and how their value should be judged in context. The framework also emphasises that synthetic data are not intended to replace real data universally, but to support specific tasks such as augmentation, benchmarking, privacy-preserving sharing, and scenario exploration.
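The six dimensions can be pictured as a design checklist that a practitioner fills in before generating any data. The sketch below is purely illustrative: the class name, field names, and completeness rule are assumptions for this example, not part of the framework itself.

```python
from dataclasses import dataclass, field

# Illustrative sketch: the six framework dimensions as a design checklist.
# All names and fields here are hypothetical, chosen for readability only.
@dataclass
class SyntheticDataPlan:
    objective: str                  # why synthetic data are needed (e.g. "augmentation")
    data_structure: str             # e.g. "tabular", "time series", "video"
    generation_strategy: str        # e.g. "statistical model", "GAN", "simulation"
    domain_constraints: list = field(default_factory=list)  # sport-specific rules
    evaluation: list = field(default_factory=list)          # utility and fidelity checks
    deployment_risks: list = field(default_factory=list)    # e.g. disclosure risk

    def is_complete(self) -> bool:
        """A plan is complete only when every dimension has been considered."""
        return all([
            self.objective, self.data_structure, self.generation_strategy,
            self.domain_constraints, self.evaluation, self.deployment_risks,
        ])
```

Framed this way, a generation method is never chosen in isolation: a plan that leaves any dimension empty is, by construction, incomplete.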
A central message of the project is that synthetic data in sport should be treated as a sequence of design decisions rather than as a single technical step. This is especially important because sport data are highly diverse, and the success of synthetic generation depends on how well methods are matched to data structure, task requirements, and domain knowledge. The study also highlights the need for clearer evaluation standards and more careful attention to interpretability, disclosure risk, and responsible deployment.
The next step in this project is to apply the framework empirically, using machine learning and synthetic data generation methods.
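As a minimal illustration of what such an application could look like, the sketch below generates synthetic tabular data by fitting an independent Gaussian per column, applies a simple domain constraint (non-negative workloads), and runs a crude fidelity check on the column means. The dataset values, column names, and the specific generation and evaluation choices are all invented for this example; a real study would use richer models and stricter checks.

```python
import random
import statistics

# Hypothetical toy table of athlete workload metrics (values invented).
real = [
    {"distance_km": 8.2, "sprints": 14},
    {"distance_km": 9.1, "sprints": 18},
    {"distance_km": 7.5, "sprints": 11},
    {"distance_km": 8.8, "sprints": 16},
]

def fit_gaussians(rows):
    """Fit an independent Gaussian per column (a deliberately simple strategy)."""
    cols = rows[0].keys()
    return {c: (statistics.mean(r[c] for r in rows),
                statistics.stdev(r[c] for r in rows)) for c in cols}

def generate(params, n, rng):
    """Sample n synthetic rows, clipping at zero to respect a domain constraint."""
    return [{c: max(0.0, rng.gauss(mu, sd)) for c, (mu, sd) in params.items()}
            for _ in range(n)]

def mean_gap(real_rows, synth_rows, col):
    """Crude fidelity check: absolute difference in column means."""
    return abs(statistics.mean(r[col] for r in real_rows)
               - statistics.mean(r[col] for r in synth_rows))

rng = random.Random(0)          # fixed seed for reproducibility
params = fit_gaussians(real)
synth = generate(params, 200, rng)
gap = mean_gap(real, synth, "distance_km")
```

Even this toy pipeline touches four of the framework's dimensions: a generation strategy, a data structure, a domain constraint, and a fidelity evaluation, which is the kind of alignment the framework is meant to make explicit.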
Read more: