5. Making Training Data
Overview
We now have cap signal data for the cap0-m6A class (data__cap_0m6A.csv). To retrain our classifier, we need cap signal data for all classes, not just the cap0-m6A class:
- Cap 0 (need data for this cap)
- Cap 1 (need data for this cap)
- Cap 2 (need data for this cap)
- Cap 2-1 (need data for this cap)
- Cap 0-m6A (you created this data, data__cap_0m6A.csv, in the previous steps)
Why Retrain?
Capfinder uses retraining instead of transfer learning. We believe retraining the classifier from scratch is preferable to transfer learning for the following reasons:
- It's a simpler approach
- It avoids potential biases from the pretrained model
- It allows the model to learn optimal representations for all classes simultaneously
Steps to Prepare Data
- Download cap signal data for the existing caps (Cap 0, Cap 1, Cap 2, and Cap 2-1) from the two links below:
- Create a data directory:
  Create a new directory (named, for example, caps_data_dir).
- Extract downloaded zipped data:
  Extract the downloaded files, capfinder_training_data_p1.zip and capfinder_training_data_p2.zip, into the caps_data_dir directory you created in the previous step. Ensure that this directory has all 21 parts of the data (data.tar.gz.part00.gpg through data.tar.gz.part20.gpg). These 21 parts are encrypted, and you need a password to decrypt them.
- Getting the password:
  For now, we only want to share data with people who want to collaborate. If you wish to collaborate, please send an email to Eivind Valen and you will be sent the password.
- Decrypting the data:
  To decrypt the data, use the following script:
    #!/bin/bash

    # Function to read password securely
    read_password() {
        read -s -p "Enter password for decryption: " password
        echo
    }

    # Function to get directory path
    get_directory() {
        read -p "Enter the path to the directory containing encrypted files: " directory
        if [ ! -d "$directory" ]; then
            echo "The specified path does not exist or is not a directory."
            exit 1
        fi
    }

    # Main script
    echo "Welcome to the Decrypt, Extract, and Cleanup script!"

    # Get directory path
    get_directory

    # Get password
    read_password

    # Decrypt files
    echo "Decrypting files..."
    for file in "$directory"/data.tar.gz.part*.gpg; do
        if [ -f "$file" ]; then
            gpg --batch --yes --passphrase "$password" --decrypt "$file" > "${file%.gpg}"
            if [ $? -eq 0 ]; then
                echo "Decrypted: $file"
                rm "$file"
            else
                echo "Failed to decrypt: $file"
                exit 1
            fi
        fi
    done

    # Extract files
    echo "Extracting files..."
    cat "$directory"/data.tar.gz.part* | tar xzvf - --transform='s|.*/||' -C "$directory"

    # Check if extraction was successful
    if [ $? -eq 0 ]; then
        echo "Extraction completed successfully."
        # Remove the decrypted compressed files
        echo "Removing decrypted compressed files..."
        rm "$directory"/data.tar.gz.part*
        if [ $? -eq 0 ]; then
            echo "Decrypted compressed files have been removed."
        else
            echo "Failed to remove some or all of the decrypted compressed files."
        fi
    else
        echo "Extraction failed. Decrypted files have not been removed."
        exit 1
    fi

    echo "Process completed successfully."
Download the decryption script and run it. It will ask for the path of the directory that currently holds the encrypted data, and then for the decryption password. The script decrypts the parts, combines them, and extracts the tar archive. Note that it deletes each encrypted part once that part has been decrypted.
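The decryption script concatenates whatever parts it finds in the directory, so it may be worth confirming beforehand that all 21 encrypted parts are in place. The following is a minimal sketch of such a pre-flight check. It is not part of Capfinder: the function name missing_parts is hypothetical, and the part file names are assumed to follow the data.tar.gz.part00.gpg through data.tar.gz.part20.gpg pattern given in the extraction step.

```shell
#!/bin/bash
# Hypothetical helper, not part of Capfinder: print the name of every
# encrypted part (part00 through part20) that is absent from a directory.
missing_parts() {
    local dir="$1"
    local i=0
    local part
    while [ "$i" -le 20 ]; do
        # Build the zero-padded file name, e.g. data.tar.gz.part07.gpg
        part=$(printf 'data.tar.gz.part%02d.gpg' "$i")
        if [ ! -f "$dir/$part" ]; then
            echo "$part"
        fi
        i=$((i + 1))
    done
}
```

For example, `missing_parts caps_data_dir` prints nothing when all 21 parts are present, and otherwise prints one missing file name per line.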
If decryption and extraction succeed, you should see the following files in your caps_data_dir:
data__cap_0_run1.csv
data__cap_0_run2.csv
data__cap_1_run1.csv
data__cap_2_run1.csv
data__cap_2-1_run1.csv
data__cap_2-1_run2.csv
- Add new cap0-m6A data:
  Place the data__cap_0m6A.csv file in the same data directory.
- Verify data:
  Ensure your caps_data_dir directory now contains CSV files for all cap classes:
data__cap_0_run1.csv
data__cap_0_run2.csv
data__cap_1_run1.csv
data__cap_2_run1.csv
data__cap_2-1_run1.csv
data__cap_2-1_run2.csv
data__cap_0m6A.csv
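This verification can also be scripted. The following is a minimal sketch, assuming exactly the seven file names listed above. The function name check_cap_files is hypothetical and not part of Capfinder, and treating an empty file as a failure is an added assumption rather than a documented requirement.

```shell
#!/bin/bash
# Hypothetical helper, not part of Capfinder: report every expected CSV
# file that is missing or empty, and return non-zero if any is.
check_cap_files() {
    local dir="$1"
    local f
    local rc=0
    for f in \
        data__cap_0_run1.csv \
        data__cap_0_run2.csv \
        data__cap_1_run1.csv \
        data__cap_2_run1.csv \
        data__cap_2-1_run1.csv \
        data__cap_2-1_run2.csv \
        data__cap_0m6A.csv
    do
        # -s is true only for files that exist and are non-empty
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f"
            rc=1
        fi
    done
    return "$rc"
}
```

Running `check_cap_files caps_data_dir` prints nothing and returns 0 when every file is present and non-empty.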
The suffixes run1 and run2 indicate that the data was acquired from two different sequencing runs. If you later acquire more data for the cap0-m6A class, you can rename the two files to data__cap_0m6A_run1.csv and data__cap_0m6A_run2.csv.
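The first half of that rename can be sketched as follows. This is an illustration only: the function name rename_to_run1 is hypothetical and not part of Capfinder, and refusing to overwrite an existing run1 file is an added safety assumption.

```shell
#!/bin/bash
# Hypothetical helper, not part of Capfinder: rename the single cap0-m6A
# file to its run1 name, refusing to overwrite an existing file.
rename_to_run1() {
    local dir="$1"
    local src="$dir/data__cap_0m6A.csv"
    local dst="$dir/data__cap_0m6A_run1.csv"
    if [ ! -f "$src" ]; then
        echo "not found: $src" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        echo "already exists: $dst" >&2
        return 1
    fi
    mv -- "$src" "$dst"
}
```

After `rename_to_run1 caps_data_dir`, the newly acquired file can be placed in the same directory as data__cap_0m6A_run2.csv.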
Next Steps
With all cap signal data prepared in a single directory, you're now ready to proceed with retraining the Capfinder classifier. Next, we will use a training pipeline that processes all these files in batches, performs hyperparameter tuning, and runs the final training.