scraping and organizing large datasets to prepare for fine-tuning