API documentation
align
The module aligns a reference sequence to a read sequence using Parasail. The module also provides functions to generate alignment strings and chunks for pretty printing.
Author: Adnan M. Niazi Date: 2024-02-28
PairwiseAlignment
dataclass
Pairwise alignment with semi-global alignment allowing for gaps at the start and end of the query sequence.
Source code in src/capfinder/align.py
__init__(ref_start: int, ref_end: int, query_start: int, query_end: int, cigar_pysam: CigarTuplesPySam, cigar_sam: CigarTuplesSam)
Initializes a PairwiseAlignment object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ref_start | int | The starting position of the alignment in the reference sequence. | required |
ref_end | int | The ending position of the alignment in the reference sequence. | required |
query_start | int | The starting position of the alignment in the query sequence. | required |
query_end | int | The ending position of the alignment in the query sequence. | required |
cigar_pysam | CigarTuplesPySam | A list of tuples representing the CIGAR string in the Pysam format. | required |
cigar_sam | CigarTuplesSam | A list of tuples representing the CIGAR string in the SAM format. | required |
Source code in src/capfinder/align.py
align(query_seq: str, target_seq: str, pretty_print_alns: bool) -> Tuple[str, str, str, int]
Main function call to align two sequences and print the alignment.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query_seq | str | The query sequence. | required |
target_seq | str | The target/reference sequence. | required |
pretty_print_alns | bool | Whether to print the alignment in a pretty format. | required |
Returns:
Type | Description |
---|---|
Tuple[str, str, str, int] | A tuple containing: 1. The aligned query sequence with gaps. 2. The visual representation of the alignment with '\|' for matches, '/' for mismatches, and ' ' for gaps or insertions. 3. The aligned target sequence with gaps. 4. The alignment score. |
Source code in src/capfinder/align.py
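A minimal usage sketch of the documented signature above; the example sequences are placeholders, and the assumption that pretty_print_alns=True also prints the chunked alignment is taken from the description:

```python
from capfinder.align import align

# Align a short query against a reference and unpack the documented return tuple.
aln_query, aln_marks, aln_target, score = align(
    query_seq="ACGTTAGC",
    target_seq="ACGTAGCTT",
    pretty_print_alns=True,  # also prints the alignment in a pretty format
)
print(score)
```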
cigartuples_from_string(cigarstring: str) -> CigarTuplesPySam
Returns pysam-style list of (op, count) tuples from a cigarstring.
Source code in src/capfinder/align.py
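A small sketch of converting a SAM CIGAR string into pysam-style (operation, count) tuples; the example CIGAR string is hypothetical, and the expected output assumes pysam's standard operation codes (0 = M, 1 = I, 2 = D, ...):

```python
from capfinder.align import cigartuples_from_string

# "10M2I5M" is expected to yield [(0, 10), (1, 2), (0, 5)] in pysam's encoding.
tuples = cigartuples_from_string("10M2I5M")
print(tuples)
```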
make_alignment_chunks(target: str, query: str, alignment: str, chunk_size: int) -> str
Divide three strings (target, query, and alignment) into chunks of the specified length and print them as triplets with the specified prefixes and a one-line gap between each triplet.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
target | str | The target/reference string. | required |
query | str | The query string. | required |
alignment | str | The alignment string. | required |
chunk_size | int | The desired chunk size. | required |
Returns:
Name | Type | Description |
---|---|---|
aln_string | str | The aligned strings in chunks with the specified prefix. |
Source code in src/capfinder/align.py
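A minimal sketch of chunking a target/alignment/query triplet for printing; the three strings below are made-up, equal-length examples:

```python
from capfinder.align import make_alignment_chunks

# Wrap an 8-column alignment into 4-column blocks for pretty printing.
block = make_alignment_chunks(
    target="ACGTAGCT",
    query="ACGTTGCT",
    alignment="||||/|||",
    chunk_size=4,
)
print(block)
```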
make_alignment_strings(query: str, target: str, alignment: PairwiseAlignment) -> Tuple[str, str, str]
Generate alignment strings for the given query and target sequences based on a PairwiseAlignment object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | The query sequence. | required |
target | str | The target/reference sequence. | required |
alignment | PairwiseAlignment | An object representing the alignment between query and target sequences. | required |
Returns:
Type | Description |
---|---|
Tuple[str, str, str] | A tuple containing three strings: 1. The aligned target sequence with gaps. 2. The aligned query sequence with gaps. 3. The visual representation of the alignment with '\|' for matches, '/' for mismatches, and ' ' for gaps or insertions. |
Source code in src/capfinder/align.py
parasail_align(*, query: str, ref: str) -> Any
Semi-global alignment allowing for gaps at the start and end of the query sequence.
:param query: str
:param ref: str
:return: PairwiseAlignment
Source code in src/capfinder/align.py
trim_parasail_alignment(alignment_result: Any) -> PairwiseAlignment
Trim the alignment result to remove leading and trailing gaps.
Source code in src/capfinder/align.py
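A minimal sketch combining the two functions above: run the semi-global alignment and then trim terminal gaps. The sequences are placeholders; the attribute names on the result follow the PairwiseAlignment fields documented earlier:

```python
from capfinder.align import parasail_align, trim_parasail_alignment

# Semi-global alignment of a read (query) against a reference, then trim
# leading/trailing gaps to obtain a PairwiseAlignment object.
result = parasail_align(query="ACGTTAGC", ref="GGACGTAGCTT")
trimmed = trim_parasail_alignment(result)
print(trimmed.ref_start, trimmed.ref_end, trimmed.cigar_sam)
```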
attention_cnnlstm_model
CapfinderHyperModel
Bases: HyperModel
Hypermodel for the Capfinder CNN-LSTM with Attention architecture.
This model is designed for time series classification tasks, specifically for identifying RNA cap types. It combines Convolutional Neural Networks (CNNs) for local feature extraction, Long Short-Term Memory (LSTM) networks for sequence processing, and an attention mechanism to focus on the most relevant parts of the input sequence.
The architecture is flexible and allows for hyperparameter tuning of the number of layers, units, and other key parameters.
Attributes:
Name | Type | Description |
---|---|---|
input_shape | Tuple[int, ...] | The shape of the input data. |
n_classes | int | The number of classes for classification. |
encoder_model | Optional[Model] | Placeholder for a potential encoder model. |
Methods:
Name | Description |
---|---|
build | Constructs and returns a Keras model based on the provided hyperparameters. |
Source code in src/capfinder/attention_cnnlstm_model.py
bam
We can only read BAM records one at a time from a BAM file. PySAM does not allow random access of BAM records. The module prepares and yields the BAM record information for each read.
Author: Adnan M. Niazi Date: 2024-02-28
generate_bam_records(bam_filepath: str) -> Generator[pysam.AlignedSegment, None, None]
Yield each record from a BAM file. Also creates an index (.bai) file if one does not exist already.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
bam_filepath | str | Path to the BAM file. | required |
Yields:
Name | Type | Description |
---|---|---|
record | pysam.AlignedSegment | A BAM record. |
Source code in src/capfinder/bam.py
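A minimal usage sketch; the BAM path is a placeholder, and the printed attributes are standard pysam.AlignedSegment fields:

```python
from capfinder.bam import generate_bam_records

# Iterate over records in a sorted BAM file one at a time
# (an index file is created automatically if it does not exist).
for record in generate_bam_records("/path/to/sorted.bam"):
    print(record.query_name, record.reference_start)
```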
get_signal_info(record: pysam.AlignedSegment) -> Dict[str, Any]
Returns the signal info from a BAM record.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
record | pysam.AlignedSegment | A BAM record. | required |
Returns:
Name | Type | Description |
---|---|---|
signal_info | Dict[str, Any] | Dictionary containing signal info for a read. |
Source code in src/capfinder/bam.py
get_total_records(bam_filepath: str) -> int
Returns the total number of records in a BAM file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
bam_filepath | str | Path to the BAM file. | required |
Returns:
Name | Type | Description |
---|---|---|
total_records | int | Total number of records in the BAM file. |
Source code in src/capfinder/bam.py
process_bam_records(bam_filepath: str) -> Generator[Dict[str, Any], None, None]
Top level function to process a BAM file. Yields signal info for each read in the BAM file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
bam_filepath | str | Path to the BAM file to process. | required |
Yields:
Name | Type | Description |
---|---|---|
signal_info | Dict[str, Any] | Dictionary containing signal info for a read. |
Source code in src/capfinder/bam.py
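A minimal sketch tying the two functions above together; the BAM path is a placeholder, and the exact keys of each yielded dictionary are defined by capfinder, so the example only inspects them:

```python
from capfinder.bam import get_total_records, process_bam_records

bam_path = "/path/to/sorted.bam"
print("Total records:", get_total_records(bam_path))

# Each yielded dictionary holds the signal-related info for one read.
for signal_info in process_bam_records(bam_path):
    print(sorted(signal_info.keys()))
    break
```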
cli
add_cap(cap_int: int, cap_name: str) -> None
Add a new cap mapping or update an existing one.
Source code in src/capfinder/cli.py
cap_help() -> None
Display help information about cap mapping management.
Source code in src/capfinder/cli.py
create_train_config(file_path: Annotated[str, typer.Option(--file_path, -f, help='File path to save the JSON configuration file')] = '') -> None
Creates a dummy JSON configuration file at the specified path. Edit it to suit your needs.
Source code in src/capfinder/cli.py
extract_cap_signal(bam_filepath: Annotated[str, typer.Option(--bam_filepath, -b, help='Path to the BAM file')] = '', pod5_dir: Annotated[str, typer.Option(--pod5_dir, -p, help='Path to directory containing POD5 files')] = '', reference: Annotated[str, typer.Option(--reference, -r, help="Reference Sequence (5' -> 3')")] = 'GCTTTCGTTCGTCTCCGGACTTATCGCACCACCTATCCATCATCAGTACTGT', cap_class: Annotated[int, typer.Option(--cap_class, -c, help='\n\n Integer-based class label for the RNA cap type. \n\n - -99 represents an unknown cap(s). \n\n - 0 represents Cap_0 \n\n - 1 represents Cap 1 \n\n - 2 represents Cap 2 \n\n - 3 represents Cap2-1 \n\n You can use the capmap command to manage cap mappings and use additional interger label for additional caps. \n\n ')] = -99, cap_n1_pos0: Annotated[int, typer.Option(--cap_n1_pos0, -p, help='0-based index of 1st nucleotide (N1) of cap in the reference')] = 52, train_or_test: Annotated[str, typer.Option(--train_or_test, -t, help='set to train or test depending on whether it is training or testing data')] = 'test', output_dir: Annotated[str, typer.Option(--output_dir, -o, help=textwrap.dedent('\n Path to the output directory which will contain: \n\n ├── A CSV file (data__cap_x.csv) containing the extracted ROI signal data.\n\n ├── A CSV file (metadata__cap_x.csv) containing the complete metadata information.\n\n ├── A log file (capfinder_vXYZ_datatime.log) containing the logs of the program.\n\n └── (Optional) plots directory containing cap signal plots, if --plot-signal is used.\n\n \u200b ├── good_reads: Directory that contains the plots for the good reads.\n\n \u200b ├── bad_reads: Directory that contains the plots for the bad reads.\n\n \u200b └── plotpaths.csv: CSV file containing the paths to the plots based on the read ID.\n'))] = '', n_workers: Annotated[int, typer.Option(--n_workers, -n, help='Number of CPUs to use for parallel processing')] = 1, plot_signal: Annotated[Optional[bool], typer.Option(--plot - signal / --no - plot - signal, help='Whether to plot extracted cap signal or not')] = None, debug_code: Annotated[bool, typer.Option(--debug / --no - debug, help='Enable debug mode for more detailed logging')] = False) -> None
Extracts signal corresponding to the RNA cap type using BAM and POD5 files. Also, generates plots if required.
Example command (for training data): capfinder extract-cap-signal \ --bam_filepath /path/to/sorted.bam \ --pod5_dir /path/to/pod5_dir \ --reference GCTTTCGTTCGTCTCCGGACTTATCGCACCACCTATCCATCATCAGTACTGTNNNNNNCGATGTAACTGGGACATGGTGAGCAATCAGGGAAAAAAAAAAAAAAA \ --cap_class 0 \ --cap_n1_pos0 52 \ --train_or_test train \ --output_dir /path/to/output_dir \ --n_workers 10 \ --no-plot-signal \ --no-debug
Example command (for testing data): capfinder extract-cap-signal \ --bam_filepath /path/to/sorted.bam \ --pod5_dir /path/to/pod5_dir \ --reference GCTTTCGTTCGTCTCCGGACTTATCGCACCACCTATCCATCATCAGTACTGT \ --cap_class -99 \ --cap_n1_pos0 52 \ --train_or_test test \ --output_dir /path/to/output_dir \ --n_workers 10 \ --no-plot-signal \ --no-debug
Source code in src/capfinder/cli.py
list_caps() -> None
List all current cap mappings.
Source code in src/capfinder/cli.py
make_train_dataset(caps_data_dir: Annotated[str, typer.Option(--caps_data_dir, -c, help='Directory containing all the cap signal data files (data__cap_x.csv)')] = '', output_dir: Annotated[str, typer.Option(--output_dir, -o, help='A dataset directory will be created inside this directory automatically and the dataset will be saved there as CSV files.')] = '', target_length: Annotated[int, typer.Option(--target_length, -t, help='Number of signal points in cap signal to consider. If the signal is shorter, it will be padded with zeros. If the signal is longer, it will be truncated.')] = 500, dtype: Annotated[str, typer.Option(--dtype, -d, help="Data type to transform the dataset to. Valid values are 'float16', 'float32', or 'float64'.")] = 'float16', examples_per_class: Annotated[int, typer.Option(--examples_per_class, -e, help='Number of examples to include per class in the dataset')] = 1000, train_test_fraction: Annotated[float, typer.Option(--train_test_fraction, -tt, help='Fraction of data out of all data to use for training (0.0 to 1.0)')] = 0.95, train_val_fraction: Annotated[float, typer.Option(--train_val_fraction, -tv, help='Fraction of data out all the training split to use for validation (0.0 to 1.0)')] = 0.8, num_classes: Annotated[int, typer.Option(--num_classes, help='Number of classes in the dataset')] = 4, batch_size: Annotated[int, typer.Option(--batch_size, -b, help='Batch size for processing data')] = 1024, comet_project_name: Annotated[str, typer.Option(--comet_project_name, help='Name of the Comet ML project for logging')] = 'dataset', use_remote_dataset_version: Annotated[str, typer.Option(--use_remote_dataset_version, help='Version of the remote dataset to use. If not provided at all, the local dataset will be used/made and/or uploaded')] = '', use_augmentation: Annotated[bool, typer.Option(--use - augmentation / --no - use - augmentation, help='Whether to augment original data with time warped data')] = False) -> None
Prepares the dataset for training the ML model. This command can be run independently or is invoked automatically by the train-model command.
This command processes cap signal data files, applies necessary transformations, and prepares a dataset suitable for training machine learning models. It supports both local data processing and fetching from a remote dataset.
Example command: capfinder make-train-dataset \ --caps_data_dir /path/to/caps_data \ --output_dir /path/to/output \ --target_length 500 \ --dtype float16 \ --examples_per_class 1000 \ --train_test_fraction 0.95 \ --train_val_fraction 0.8 \ --num_classes 4 \ --batch_size 32 \ --comet_project_name my-capfinder-project \ --use_remote_dataset_version latest --use-augmentation
Source code in src/capfinder/cli.py
predict_cap_types(bam_filepath: Annotated[str, typer.Option(--bam_filepath, -b, help='Path to the BAM file')] = '', pod5_dir: Annotated[str, typer.Option(--pod5_dir, -p, help='Path to directory containing POD5 files')] = '', output_dir: Annotated[str, typer.Option(--output_dir, -o, help='Path to the output directory for prediction results and logs')] = '', n_cpus: Annotated[int, typer.Option(--n_cpus, -n, help=textwrap.dedent(" Number of CPUs to use for parallel processing.\n We use multiple CPUs during processing for POD5 file and BAM data (Step 1/5).\n For faster processing of this data (POD5 & BAM), increase the number of CPUs.\n For inference (Step 4/5), only a single CPU is used no matter how many CPUs you have specified.\n For faster inference, have a GPU available (it will be detected automatically) and set dtype to 'float16'."))] = 1, dtype: Annotated[str, typer.Option(--dtype, -d, help=textwrap.dedent(" Data type for model input. Valid values are 'float16', 'float32', or 'float64'.\n If you do not have a GPU, use 'float32' or 'float64' for better performance.\n If you have a GPU, use 'float16' for faster inference."))] = 'float16', batch_size: Annotated[int, typer.Option(--batch_size, -bs, help=textwrap.dedent(' Batch size for model inference.\n Larger batch sizes can speed up inference but require more memory.'))] = 128, custom_model_path: Annotated[Optional[str], typer.Option(--custom_model_path, -m, help='Path to a custom model (.keras) file. If not provided, the default pre-packaged model will be used.')] = None, plot_signal: Annotated[bool, typer.Option(--plot - signal / --no - plot - signal, help=textwrap.dedent(' "Whether to plot extracted cap signal or not.\n Saving plots can help you plot the read\'s signal, and plot the signal for cap and flanking bases(±5).'))] = False, debug_code: Annotated[bool, typer.Option(--debug / --no - debug, help='Enable debug mode for more detailed logging')] = False, refresh_cache: Annotated[bool, typer.Option(--refresh - cache / --no - refresh - cache, help='Refresh the cache for intermediate results')] = False) -> None
Predicts RNA cap types using BAM and POD5 files.
Example command
capfinder predict-cap-types \ --bam_filepath /path/to/sorted.bam \ --pod5_dir /path/to/pod5_dir \ --output_dir /path/to/output_dir \ --n_cpus 10 \ --dtype float16 \ --batch_size 256 \ --no-plot-signal \ --no-debug \ --no-refresh-cache
Source code in src/capfinder/cli.py
remove_cap(cap_int: int) -> None
Remove a cap mapping.
Source code in src/capfinder/cli.py
reset_caps() -> None
Reset cap mappings to default.
Source code in src/capfinder/cli.py
show_config() -> None
Show the location of the configuration file.
Source code in src/capfinder/cli.py
train_model(config_file: Annotated[str, typer.Option(--config_file, -c, help='Path to the JSON configuration file containing the parameters for the training pipeline.')] = '') -> None
Trains the model using the parameters in the JSON configuration file.
Source code in src/capfinder/cli.py
collate
The main workhorse which collates information from the BAM file and the POD5 files, aligns the OTE, and extracts the signal for the region of interest (ROI) for training or testing purposes. It also plots the ROI signal if requested.
Author: Adnan M. Niazi Date: 2024-02-28
DatabaseHandler
Source code in src/capfinder/collate.py
__init__(cap_class: int, num_processes: int, database_path: str, plots_csv_filepath: Union[str, None], output_dir: str) -> None
Initializes the index database handler
Source code in src/capfinder/collate.py
exit_func(worker_id: int, worker_state: Dict[str, Any]) -> None
Closes the database connection and the CSV files.
Source code in src/capfinder/collate.py
init_func(worker_id: int, worker_state: Dict[str, Any]) -> None
Opens the database connection and CSV files
Source code in src/capfinder/collate.py
merge_data() -> Tuple[str, str]
Merges the data and metadata CSV files.
Source code in src/capfinder/collate.py
FASTQRecord
dataclass
Simulates a FASTQ record object.
Attributes:
Name | Type | Description |
---|---|---|
id | str | Read ID. |
seq | str | Read sequence. |
Example
record = FASTQRecord(id="read1", seq="ATCG")
Source code in src/capfinder/collate.py
collate_bam_pod5(bam_filepath: str, pod5_dir: str, num_processes: int, reference: str, cap_class: int, cap0_pos: int, train_or_test: str, plot_signal: bool, output_dir: str) -> Tuple[str, str]
Collates information from the BAM file and the POD5 files, aligns the OTE, and extracts the signal for the region of interest (ROI) for training or testing purposes. It also plots the ROI signal if requested.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
bam_filepath | str | Path to the BAM file. | required |
pod5_dir | str | Path to the directory containing the POD5 files. | required |
num_processes | int | Number of processes to use for parallel processing. | required |
reference | str | Reference sequence. | required |
cap_class | int | Class label for the RNA cap. | required |
cap0_pos | int | Position of the cap N1 base in the reference sequence (0-based). | required |
train_or_test | str | Whether to extract ROI for training or testing. | required |
plot_signal | bool | Whether to plot the ROI signal. | required |
output_dir | str | Path to the output directory. | required |
Returns:
Type | Description |
---|---|
Tuple[str, str] | Paths to the data and metadata CSV files. |
Source code in src/capfinder/collate.py
collate_bam_pod5_worker(worker_id: int, worker_state: Dict[str, Any], pickled_bam_data: bytes, reference: str, cap_class: int, cap0_pos: int, train_or_test: str, plot_signal: bool, output_dir: str) -> None
Worker function that collates information from POD5 and BAM files, finds the FASTA coordinates of the region of interest (ROI), and extracts its signal.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
worker_id | int | Worker ID. | required |
worker_state | Dict[str, Any] | Dictionary containing the database connection and cursor. | required |
pickled_bam_data | bytes | Pickled dictionary containing the BAM record information. | required |
reference | str | Reference sequence. | required |
cap_class | int | Class label for the RNA cap. | required |
cap0_pos | int | Position of the cap0 base in the reference sequence. | required |
train_or_test | str | Whether to extract ROI for training or testing. | required |
plot_signal | bool | Whether to plot the ROI signal. | required |
output_dir | str | Path to the output directory. | required |
Returns:
Type | Description |
---|---|
None | None |
Source code in src/capfinder/collate.py
collate_bam_pod5_wrapper(bam_filepath: str, pod5_dir: str, num_processes: int, reference: str, cap_class: int, cap0_pos: int, train_or_test: str, plot_signal: bool, output_dir: str, debug_code: bool, formatted_command: Optional[str]) -> None
Wrapper function for collate_bam_pod5 that sets up logging and handles output.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
bam_filepath | str | Path to the BAM file. | required |
pod5_dir | str | Path to the directory containing the POD5 files. | required |
num_processes | int | Number of processes to use for parallel processing. | required |
reference | str | Reference sequence. | required |
cap_class | int | Class label for the RNA cap. | required |
cap0_pos | int | Position of the cap N1 base in the reference sequence (0-based). | required |
train_or_test | str | Whether to extract ROI for training or testing. | required |
plot_signal | bool | Whether to plot the ROI signal. | required |
output_dir | str | Path to the output directory. | required |
debug_code | bool | Whether to show debug information in logs. | required |
formatted_command | Optional[str] | Formatted command string for logging. | required |
Source code in src/capfinder/collate.py
generate_pickled_bam_records(bam_filepath: str) -> Generator[bytes, None, None]
Generate pickled BAM records from a BAM file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
bam_filepath | str | Path to the BAM file. | required |
Yields:
Name | Type | Description |
---|---|---|
bytes | bytes | Pickled BAM record. |
Source code in src/capfinder/collate.py
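A minimal sketch; the BAM path is a placeholder, and the assumption that each yielded item unpickles to a dictionary of BAM record information comes from the collate_bam_pod5_worker documentation above:

```python
import pickle

from capfinder.collate import generate_pickled_bam_records

# Each yielded item is a pickled BAM record dictionary, suitable for passing
# to worker processes.
for pickled in generate_pickled_bam_records("/path/to/sorted.bam"):
    bam_info = pickle.loads(pickled)
    print(type(bam_info))
    break
```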
constants
The module contains constants used in the capfinder package.
Author: Adnan M. Niazi Date: 2024-02-28
cyclic_learing_rate
CometLRLogger
Bases: Callback
A callback to log the learning rate to Comet.ml during training.
This callback logs the learning rate at the beginning of each epoch and at the end of each batch to a Comet.ml experiment.
Attributes:
Name | Type | Description |
---|---|---|
experiment | Experiment | The Comet.ml experiment to log to. |
Source code in src/capfinder/cyclic_learing_rate.py
__init__(experiment: Experiment) -> None
Initialize the CometLRLogger.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
experiment | Experiment | The Comet.ml experiment to log to. | required |
on_batch_end(batch: int, logs: Optional[Dict[str, Any]] = None) -> None
Log the learning rate at the end of each batch.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch | int | The current batch number. | required |
logs | Optional[Dict[str, Any]] | The logs dictionary. | None |
Source code in src/capfinder/cyclic_learing_rate.py
on_epoch_begin(epoch: int, logs: Optional[Dict[str, Any]] = None) -> None
Log the learning rate at the beginning of each epoch.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
epoch | int | The current epoch number. | required |
logs | Optional[Dict[str, Any]] | The logs dictionary. | None |
Source code in src/capfinder/cyclic_learing_rate.py
CustomProgressCallback
Bases: Callback
A custom callback to print the learning rate at the end of each epoch.
This callback prints the current learning rate after Keras' built-in progress bar for each epoch.
Source code in src/capfinder/cyclic_learing_rate.py
__init__() -> None
on_epoch_end(epoch: int, logs: Optional[Dict[str, Any]] = None) -> None
Print the learning rate at the end of each epoch.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
epoch | int | The current epoch number. | required |
logs | Optional[Dict[str, Any]] | The logs dictionary. | None |
Source code in src/capfinder/cyclic_learing_rate.py
CyclicLR
Bases: Callback
This callback implements a cyclical learning rate policy (CLR). The method cycles the learning rate between two boundaries with some constant frequency.
Arguments
base_lr: initial learning rate, which is the lower boundary in the cycle.
max_lr: upper boundary in the cycle. Functionally, it defines the cycle amplitude (max_lr - base_lr). The lr at any cycle is the sum of base_lr and some scaling of the amplitude; therefore max_lr may not actually be reached depending on the scaling function.
step_size: number of training iterations per half cycle. The authors suggest setting step_size to 2-8x the number of training iterations per epoch.
mode: one of {triangular, triangular2, exp_range}. Default 'triangular'. Values correspond to the policies detailed above. If scale_fn is not None, this argument is ignored.
gamma: constant in the 'exp_range' scaling function: gamma**(cycle iterations).
scale_fn: custom scaling policy defined by a single-argument lambda function, where 0 <= scale_fn(x) <= 1 for all x >= 0. The mode parameter is ignored.
scale_mode: one of {'cycle', 'iterations'}. Defines whether scale_fn is evaluated on cycle number or cycle iterations (training iterations since start of cycle). Default is 'cycle'.
Source code in src/capfinder/cyclic_learing_rate.py
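A minimal usage sketch of the callback; the hyperparameter values are illustrative, and the commented fit call uses placeholder model/dataset names:

```python
from capfinder.cyclic_learing_rate import CyclicLR  # module name as in the package

# Triangular policy: the learning rate cycles between base_lr and max_lr,
# completing one full cycle every 2 * step_size batches.
clr = CyclicLR(
    base_lr=1e-4,
    max_lr=1e-2,
    step_size=2000,  # roughly 2-8x the number of batches per epoch
    mode="triangular",
)
# model.fit(train_ds, epochs=10, callbacks=[clr])  # `model` and `train_ds` are placeholders
```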
on_batch_end(batch: int, logs: Optional[Dict[str, Any]] = None) -> None
Record previous batch statistics and update the learning rate.
Source code in src/capfinder/cyclic_learing_rate.py
on_train_begin(logs: Optional[Dict[str, Any]] = None) -> None
Initialize the learning rate to the base learning rate.
Source code in src/capfinder/cyclic_learing_rate.py
SGDRScheduler
Bases: Callback
Cosine annealing learning rate scheduler with periodic restarts.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
min_lr | float | The lower bound of the learning rate range for the experiment. | required |
max_lr | float | The upper bound of the learning rate range for the experiment. | required |
steps_per_epoch | int | Number of mini-batches in the dataset. | required |
lr_decay | float | Reduce the max_lr after the completion of each cycle. | 1.0 |
cycle_length | int | Initial number of epochs in a cycle. | 10 |
mult_factor | float | Scale epochs_to_restart after each full cycle completion. | 2.0 |
Source code in src/capfinder/cyclic_learing_rate.py
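A minimal usage sketch built from the documented constructor parameters; the values are illustrative and the commented fit call uses placeholder names:

```python
from capfinder.cyclic_learing_rate import SGDRScheduler

# Cosine annealing with warm restarts; the first cycle lasts `cycle_length` epochs
# and each subsequent cycle is stretched by `mult_factor`.
scheduler = SGDRScheduler(
    min_lr=1e-5,
    max_lr=1e-2,
    steps_per_epoch=500,  # number of mini-batches per epoch
    lr_decay=0.9,
    cycle_length=10,
    mult_factor=2.0,
)
# model.fit(train_ds, epochs=50, callbacks=[scheduler])  # placeholders
```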
clr() -> float
Calculate the learning rate.
Source code in src/capfinder/cyclic_learing_rate.py
get_lr() -> float
on_batch_end(batch: int, logs: Optional[Dict[str, Any]] = None) -> None
Record previous batch statistics and update the learning rate.
Source code in src/capfinder/cyclic_learing_rate.py
on_epoch_end(epoch: int, logs: Optional[Dict[str, Any]] = None) -> None
Check for end of current cycle, apply restarts when necessary.
Source code in src/capfinder/cyclic_learing_rate.py
on_train_begin(logs: Optional[Dict[str, Any]] = None) -> None
Initialize the learning rate to the maximum value at the start of training.
on_train_end(logs: Optional[Dict[str, Any]] = None) -> None
Set weights to the values from the end of the most recent cycle for best performance.
Source code in src/capfinder/cyclic_learing_rate.py
data_loader
combine_datasets(features_dataset: tf.data.Dataset, labels_dataset: tf.data.Dataset, batch_size: int, num_timesteps: int) -> tf.data.Dataset
Combine feature and label datasets with padded batching.
Parameters:
features_dataset : tf.data.Dataset
    The dataset containing features.
labels_dataset : tf.data.Dataset
    The dataset containing labels.
batch_size : int
    The size of each batch.
num_timesteps : int
    The number of time steps in each time series.
Returns:
tf.data.Dataset
    A combined dataset with features and labels, padded and batched.
Source code in src/capfinder/data_loader.py
load_datasets(train_x_path: str, train_y_path: str, val_x_path: str, val_y_path: str, batch_size: int, num_timesteps: int) -> Tuple[tf.data.Dataset, tf.data.Dataset]
Load and combine train and validation datasets.
Parameters:
train_x_path : str
    Path to the CSV file containing training features.
train_y_path : str
    Path to the CSV file containing training labels.
val_x_path : str
    Path to the CSV file containing validation features.
val_y_path : str
    Path to the CSV file containing validation labels.
batch_size : int
    The size of each batch.
num_timesteps : int
    The number of time steps in each time series.
Returns:
Tuple[tf.data.Dataset, tf.data.Dataset]
    A tuple containing the combined training dataset and validation dataset.
Source code in src/capfinder/data_loader.py
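A minimal usage sketch of the documented signature; the CSV paths and sizes are placeholders:

```python
from capfinder.data_loader import load_datasets

# Build batched, padded training and validation datasets from CSV files.
train_ds, val_ds = load_datasets(
    train_x_path="train_x.csv",
    train_y_path="train_y.csv",
    val_x_path="val_x.csv",
    val_y_path="val_y.csv",
    batch_size=1024,
    num_timesteps=500,
)
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # `model` is a placeholder
```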
load_feature_dataset(file_path: str, num_timesteps: int) -> tf.data.Dataset
Load feature dataset from a CSV file.
Parameters:
file_path : str
    The path to the CSV file containing features.
num_timesteps : int
    The number of time steps in each time series.
Returns:
tf.data.Dataset
    A TensorFlow dataset containing the parsed features.
Source code in src/capfinder/data_loader.py
load_label_dataset(file_path: str) -> tf.data.Dataset
Load label dataset from a CSV file.
Parameters:
file_path : str
    The path to the CSV file containing labels.
Returns:
tf.data.Dataset
    A TensorFlow dataset containing the parsed labels.
Source code in src/capfinder/data_loader.py
parse_features(line: tf.Tensor, num_timesteps: int) -> tf.Tensor
Parse features from a CSV line and reshape them.
Parameters:
line : tf.Tensor
    A tensor representing a single line from the CSV file.
num_timesteps : int
    The number of time steps in each time series.
Returns:
tf.Tensor
    A tensor of shape (num_timesteps, 1) containing the parsed features.
Source code in src/capfinder/data_loader.py
parse_labels(line: tf.Tensor) -> tf.Tensor
Parse labels from a CSV line.
Parameters:
line : tf.Tensor
    A tensor representing a single line from the CSV file.
Returns:
tf.Tensor
    A tensor containing the parsed label.
Source code in src/capfinder/data_loader.py
download_model
create_version_info_file(output_dir: str, version: str) -> None
Create a file to store the version information. If any file with a name starting with "v" already exists in the output directory, delete it before creating a new one.
Parameters:
output_dir (str): The directory where the version file will be created.
version (str): The version string to be written to the file.
Returns: None
Source code in src/capfinder/download_model.py
download_comet_model(workspace: str, model_name: str, version: str, output_dir: str = './', force_download: bool = False) -> None
Download a model from Comet ML using the official API.
Parameters:
workspace (str): The Comet ML workspace name.
model_name (str): The name of the model.
version (str): The version of the model to download (use "latest" for the most recent version).
output_dir (str): The local directory to save the downloaded model (default is the current directory).
force_download (bool): If True, download the model even if it already exists locally.
Returns: str: The path to the model file (either existing or newly downloaded), or None if the download failed.
Source code in src/capfinder/download_model.py
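A minimal usage sketch of the documented parameters; the workspace, model name, and output directory are placeholders:

```python
from capfinder.download_model import download_comet_model

# Fetch the latest registered model version from Comet ML into ./models.
download_comet_model(
    workspace="my-workspace",
    model_name="cap-classifier",
    version="latest",
    output_dir="./models",
    force_download=False,
)
```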
rename_downloaded_model(output_dir: str, orig_model_name: str, new_model_name: str) -> None
Renames the downloaded model file to a new name.
Parameters:
output_dir (str): The directory where the model file is located.
orig_model_name (str): The original name of the model file.
new_model_name (str): The new name to rename the model file to.
Returns: None
Source code in src/capfinder/download_model.py
encoder_model
CapfinderHyperModel
Bases: HyperModel
Custom HyperModel class to wrap the model building function for Capfinder.
This class defines the hyperparameter search space and builds the model based on the selected hyperparameters, including a variable number of MLP layers.
Attributes:
input_shape : Tuple[int, int]
    The shape of the input data.
n_classes : int
    The number of output classes for the classification task.
encoder_model : Optional[keras.Model]
    Stores the encoder part of the model, initialized during the build process.
Source code in src/capfinder/encoder_model.py
build(hp: HyperParameters) -> Model
Build and compile the model based on the hyperparameters.
Parameters:
hp : HyperParameters
    The hyperparameters to use for building the model.
Returns:
Model
    The compiled Keras model.
Source code in src/capfinder/encoder_model.py
build_model(input_shape: Tuple[int, int], head_size: int, num_heads: int, ff_dim: int, num_transformer_blocks: int, mlp_units: List[int], n_classes: int, dropout: float = 0.0, mlp_dropout: float = 0.0) -> Tuple[keras.Model, keras.Model]
Build a transformer-based neural network model and return the encoder output.
Parameters:
input_shape : Tuple[int, int]
    The shape of the input data.
head_size : int
    The size of the attention heads in the transformer encoder.
num_heads : int
    The number of attention heads in the transformer encoder.
ff_dim : int
    The dimensionality of the feed-forward network in the transformer encoder.
num_transformer_blocks : int
    The number of transformer encoder blocks in the model.
mlp_units : List[int]
    A list containing the number of units for each layer in the MLP.
n_classes : int
    The number of output classes (for classification tasks).
dropout : float, optional
    The dropout rate applied in the transformer encoder.
mlp_dropout : float, optional
    The dropout rate applied in the MLP.
Returns:
Tuple[keras.Model, keras.Model]
    A tuple containing the full model and the encoder model.
Source code in src/capfinder/encoder_model.py
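A minimal sketch of building the transformer model with the documented parameters; the hyperparameter values are illustrative, not recommended settings:

```python
from capfinder.encoder_model import build_model

# Build the full classifier and its encoder for 500-point, single-channel signals.
full_model, encoder = build_model(
    input_shape=(500, 1),
    head_size=64,
    num_heads=4,
    ff_dim=128,
    num_transformer_blocks=2,
    mlp_units=[128, 64],
    n_classes=4,
    dropout=0.1,
    mlp_dropout=0.2,
)
full_model.summary()
```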
transformer_encoder(inputs: keras.layers.Layer, head_size: int, num_heads: int, ff_dim: int, dropout: Optional[float] = 0.0) -> keras.layers.Layer
Create a transformer encoder block.
The transformer encoder block consists of a multi-head attention layer followed by layer normalization and a feed-forward network.
Parameters:
inputs : keras.layers.Layer
    The input layer or tensor for the encoder block.
head_size : int
    The size of the attention heads.
num_heads : int
    The number of attention heads.
ff_dim : int
    The dimensionality of the feed-forward network.
dropout : float, optional
    The dropout rate applied after the attention layer and within the feed-forward network. Default is 0.0.
Returns:
keras.layers.Layer
    The output layer of the encoder block, which can be used as input for the next layer in a neural network.
Source code in src/capfinder/encoder_model.py
find_ote_test
The module contains the code to find the OTE sequence in test data -- where we only know the context to the left of the NNNNNN region -- and its location with high confidence. The module can process one read at a time, or all reads in a FASTQ file or a folder of FASTQ files.
Author: Adnan M. Niazi Date: 2024-02-28
cnt_match_mismatch_gaps(aln_str: str) -> Tuple[int, int, int]
Takes an alignment string and counts the number of matches, mismatches, and gaps.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
aln_str | str | The alignment string. | required |
Returns:
Name | Type | Description |
---|---|---|
match_cnt | int | The number of matches in the alignment string. |
mismatch_cnt | int | The number of mismatches in the alignment string. |
gap_cnt | int | The number of gaps in the alignment string. |
Source code in src/capfinder/find_ote_test.py
dispatcher(input_path: str, reference: str, cap0_pos: int, num_processes: int, output_folder: str) -> None
Check if the input path is a file or folder, and call the appropriate function to process the input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_path | str | The path to the FASTQ file or folder. | required |
reference | str | The reference sequence to align the read to. | required |
num_processes | int | The number of processes to use for parallel processing. | required |
output_folder | str | The folder where worker output files will be stored. | required |
Returns:
Type | Description |
---|---|
None | None |
Source code in src/capfinder/find_ote_test.py
find_ote_test(input_path: str, reference: str, cap0_pos: int, num_processes: int, output_folder: str) -> None
Main function to process a FASTQ file or folder of FASTQ files to find OTEs in the reads. The function is suitable only for testing data where only the OTE sequence is known and the N1N2 cap bases and any bases 3' of them are unknown.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_path | str | The path to the FASTQ file or folder. | required |
reference | str | The reference sequence to align the read to. | required |
num_processes | int | The number of processes to use for parallel processing. | required |
output_folder | str | The folder where worker output files will be stored. | required |
Returns: None
Source code in src/capfinder/find_ote_test.py
has_good_aln_in_5prime_flanking_region(match_cnt: int, mismatch_cnt: int, gap_cnt: int) -> bool
Checks if the alignment in the flanking region before the cap is good.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
match_cnt | int | The number of matches in the flanking region. | required |
mismatch_cnt | int | The number of mismatches in the flanking region. | required |
gap_cnt | int | The number of gaps in the flanking region. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the alignment in the flanking region is good, False otherwise. |
Source code in src/capfinder/find_ote_test.py
has_good_aln_in_n_region(match_cnt: int, mismatch_cnt: int, gap_cnt: int) -> bool
Checks if the alignment in the NNNNNN region is good.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
match_cnt | int | The number of matches in the NNNNNN region. | required |
mismatch_cnt | int | The number of mismatches in the NNNNNN region. | required |
gap_cnt | int | The number of gaps in the NNNNNN region. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the alignment in the NNNNNN region is good, False otherwise. |
Source code in src/capfinder/find_ote_test.py
make_coordinates(aln_str: str, ref_str: str) -> List[int]
Walk along the alignment string and make an incrementing index wherever there is a match, mismatch, or deletion. For gaps in the alignment string, it outputs -1 in the index list.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
aln_str | str | The alignment string. | required |
ref_str | str | The reference string. | required |
Returns:
Name | Type | Description |
---|---|---|
coord_list | list | A list of indices corresponding to the alignment string. |
Source code in src/capfinder/find_ote_test.py
process_fastq_file(fastq_filepath: str, reference: str, cap0_pos: int, num_processes: int, output_folder: str) -> None
Process a single FASTQ file. The function reads the FASTQ file, and processes each read in parallel.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fastq_filepath | str | The path to the FASTQ file. | required |
reference | str | The reference sequence to align the read to. | required |
num_processes | int | The number of processes to use for parallel processing. | required |
output_folder | str | The folder where worker output files will be stored. | required |
Returns:
Type | Description |
---|---|
None | None |
Source code in src/capfinder/find_ote_test.py
process_fastq_folder(folder_path: str, reference: str, cap0_pos: int, num_processes: int, output_folder: str) -> None
Process all FASTQ files in a folder. The function reads all FASTQ files in a folder and feeds one FASTQ file at a time to a processing function that processes the reads in that file in parallel.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
folder_path | str | The path to the folder containing FASTQ files. | required |
reference | str | The reference sequence to align the read to. | required |
cap0_pos | int | The position of the first cap base (N1) in the reference sequence (0-indexed). | required |
num_processes | int | The number of processes to use for parallel processing. | required |
output_folder | str | The folder where worker output files will be stored. | required |
Returns:
Type | Description |
---|---|
None | None |
Source code in src/capfinder/find_ote_test.py
process_read(record: Any, reference: str, cap0_pos: int) -> Dict[str, Any]
Process a single read from a FASTQ file. The function aligns the read to the reference and checks whether the alignment in the NNNNNN region and the flanking regions is good. If the alignment is good, the function returns the read ID, alignment score, and the positions of the left flanking region, the cap0 base, and the right flanking region in the read's FASTQ sequence. If the alignment is bad, the function returns the read ID, alignment score, and the reason why the alignment is bad.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
record | SeqRecord | A single read from a FASTQ file. | required |
reference | str | The reference sequence to align the read to. | required |
cap0_pos | int | The position of the first cap base in the reference sequence (0-indexed). | required |
Returns:
Name | Type | Description |
---|---|---|
out_ds | dict | A dictionary containing the following keys: read_id (str): the identifier of the sequence read; read_type (str): the type of the read, 'good' or 'bad'; reason (str or None): the reason for the failed alignment, if available; alignment_score (float): the alignment score for the read; left_flanking_region_start_fastq_pos (int or None): the starting position of the left flanking region in the FASTQ file, if available; cap_n1_minus_1_read_fastq_pos (int or None): the position of the cap's N1 base in the FASTQ file (0-indexed), if available; right_flanking_region_start_fastq_pos (int or None): the starting position of the right flanking region in the FASTQ file, if available. |
Source code in src/capfinder/find_ote_test.py
write_csv(resutls_list: List[dict], output_filepath: str) -> None
Take a list of dictionaries and write them to a CSV file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
resutls_list | list | A list of dictionaries. | required |
output_filepath | str | The path to the output CSV file. | required |
Returns:
Type | Description |
---|---|
None | None |
Source code in src/capfinder/find_ote_test.py
find_ote_train
The module contains the code to find the OTE sequence in training data -- where we know both the left and right context of the NNNNNN region -- and its location with high confidence. The module can process one read at a time, or all reads in a FASTQ file or a folder of FASTQ files.
Author: Adnan M. Niazi Date: 2024-02-28
cnt_match_mismatch_gaps(aln_str: str) -> Tuple[int, int, int]
Takes an alignment string and counts the number of matches, mismatches, and gaps.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
aln_str | str | The alignment string. | required |
Returns:
Name | Type | Description |
---|---|---|
match_cnt | int | The number of matches in the alignment string. |
mismatch_cnt | int | The number of mismatches in the alignment string. |
gap_cnt | int | The number of gaps in the alignment string. |
Source code in src/capfinder/find_ote_train.py
dispatcher(input_path: str, reference: str, cap0_pos: int, num_processes: int, output_folder: str) -> None
Check if the input path is a file or folder, and call the appropriate function to process the input.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_path | str | The path to the FASTQ file or folder. | required |
reference | str | The reference sequence to align the read to. | required |
num_processes | int | The number of processes to use for parallel processing. | required |
output_folder | str | The folder where worker output files will be stored. | required |
Returns:
Type | Description |
---|---|
None | None |
Source code in src/capfinder/find_ote_train.py
find_ote_train(input_path: str, reference: str, cap0_pos: int, num_processes: int, output_folder: str) -> None
Main function to process a FASTQ file or folder of FASTQ files to find OTEs in the reads.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_path | str | The path to the FASTQ file or folder. | required |
reference | str | The reference sequence to align the read to. | required |
num_processes | int | The number of processes to use for parallel processing. | required |
output_folder | str | The folder where worker output files will be stored. | required |
Returns: None
Source code in src/capfinder/find_ote_train.py
has_good_aln_in_n_region(match_cnt: int, mismatch_cnt: int, gap_cnt: int) -> bool
Checks if the alignment in the NNNNNN region is good.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
match_cnt | int | The number of matches in the NNNNNN region. | required |
mismatch_cnt | int | The number of mismatches in the NNNNNN region. | required |
gap_cnt | int | The number of gaps in the NNNNNN region. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the alignment in the NNNNNN region is good, False otherwise. |
Source code in src/capfinder/find_ote_train.py
has_good_aln_ns_flanking_region(match_cnt: int, mismatch_cnt: int, gap_cnt: int) -> bool
Checks if the alignment in the flanking region before or after the NNNNNN region is good.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
match_cnt | int | The number of matches in the flanking region. | required |
mismatch_cnt | int | The number of mismatches in the flanking region. | required |
gap_cnt | int | The number of gaps in the flanking region. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the alignment in the flanking region is good, False otherwise. |
Source code in src/capfinder/find_ote_train.py
make_coordinates(aln_str: str, ref_str: str) -> List[int]
Walk along the alignment string and make an incrementing index wherever there is a match, mismatch, or deletion. For gaps in the alignment string, it outputs -1 in the index list.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
aln_str | str | The alignment string. | required |
ref_str | str | The reference string. | required |
Returns:
Name | Type | Description |
---|---|---|
coord_list | list | A list of indices corresponding to the alignment string. |
Source code in src/capfinder/find_ote_train.py
process_fastq_file(fastq_filepath: str, reference: str, cap0_pos: int, num_processes: int, output_folder: str) -> None
Process a single FASTQ file. The function reads the FASTQ file, and processes each read in parallel.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fastq_filepath | str | The path to the FASTQ file. | required |
reference | str | The reference sequence to align the read to. | required |
num_processes | int | The number of processes to use for parallel processing. | required |
output_folder | str | The folder where worker output files will be stored. | required |
Returns:
Type | Description |
---|---|
None | None |
Source code in src/capfinder/find_ote_train.py
process_fastq_folder(folder_path: str, reference: str, cap0_pos: int, num_processes: int, output_folder: str) -> None
Process all FASTQ files in a folder. The function reads all FASTQ files in a folder and feeds one FASTQ file at a time to a processing function that processes the reads in that file in parallel.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
folder_path | str | The path to the folder containing FASTQ files. | required |
reference | str | The reference sequence to align the read to. | required |
cap0_pos | int | The position of the first cap base (N1) in the reference sequence (0-indexed). | required |
num_processes | int | The number of processes to use for parallel processing. | required |
output_folder | str | The folder where worker output files will be stored. | required |
Returns:
Type | Description |
---|---|
None | None |
Source code in src/capfinder/find_ote_train.py
process_read(record: Any, reference: str, cap0_pos: int) -> Dict[str, Any]
Process a single read from a FASTQ file. The function aligns the read to the reference and checks whether the alignment in the NNNNNN region and the flanking regions is good. If the alignment is good, the function returns the read ID, alignment score, and the positions of the left flanking region, the cap0 base, and the right flanking region in the read's FASTQ sequence. If the alignment is bad, the function returns the read ID, alignment score, and the reason why the alignment is bad.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
record | SeqRecord | A single read from a FASTQ file. | required |
reference | str | The reference sequence to align the read to. | required |
cap0_pos | int | The position of the first cap base in the reference sequence (0-indexed). | required |
Returns:
Name | Type | Description |
---|---|---|
out_ds | dict | A dictionary containing the following keys: read_id (str): the identifier of the sequence read; read_type (str): the type of the read, 'good' or 'bad'; reason (str or None): the reason for the failed alignment, if available; alignment_score (float): the alignment score for the read; left_flanking_region_start_fastq_pos (int or None): the starting position of the left flanking region in the FASTQ file, if available; cap0_read_fastq_pos (int or None): the position of the cap's N1 base in the FASTQ file (0-indexed), if available; right_flanking_region_start_fastq_pos (int or None): the starting position of the right flanking region in the FASTQ file, if available. |
Source code in src/capfinder/find_ote_train.py
write_csv(resutls_list: List[dict], output_filepath: str) -> None
Take a list of dictionaries and write them to a CSV file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
resutls_list |
list
|
A list of dictionaries. |
required |
output_filepath |
str
|
The path to the output CSV file. |
required |
Returns:
Type | Description |
---|---|
None
|
None |
Source code in src/capfinder/find_ote_train.py
index
We cannot randomly access a record in a BAM file; we can only iterate through it. That is our starting point. For each record in the BAM file, we need to find the corresponding record in a POD5 file. For that we need a mapping between read_ids and POD5 files, which is why we build an index of POD5 files. This module builds that index and stores it in a SQLite database.
Author: Adnan M. Niazi Date: 2024-02-28
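For orientation, the kind of lookup this index enables can be sketched with the standard sqlite3 module. The table and column names below are purely illustrative and not necessarily the schema the module actually creates.

```python
import sqlite3

# Hypothetical schema: a filename-to-filepath mapping (not the module's actual schema).
conn = sqlite3.connect("database.db")
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE IF NOT EXISTS pod5_index "
    "(pod5_filename TEXT PRIMARY KEY, pod5_filepath TEXT)"
)
cursor.execute(
    "INSERT OR REPLACE INTO pod5_index VALUES (?, ?)",
    ("run1.pod5", "/data/pod5/run1.pod5"),
)
conn.commit()

# Later, given a filename referenced by a BAM record, recover the POD5 file path.
cursor.execute(
    "SELECT pod5_filepath FROM pod5_index WHERE pod5_filename = ?", ("run1.pod5",)
)
print(cursor.fetchone())  # ('/data/pod5/run1.pod5',)
conn.close()
```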
fetch_filepath_using_filename(conn: sqlite3.Connection, cursor: sqlite3.Cursor, pod5_filename: str) -> Any
Retrieve the pod5_filepath based on pod5_filename from the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
conn |
Connection
|
Connection object for the database. |
required |
cursor |
Cursor
|
Cursor object for the database. |
required |
pod5_filename |
str
|
The pod5_filename to be searched for. |
required |
Returns:
Name | Type | Description |
---|---|---|
pod5_filepath |
Any
|
The corresponding pod5_filepath if found, else None. |
Source code in src/capfinder/index.py
find_database_size(database_path: str) -> Any
Find the number of records in the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
database_path |
str
|
Path to the database. |
required |
Returns:
Name | Type | Description |
---|---|---|
size |
Any
|
Number of records in the database. |
Source code in src/capfinder/index.py
generate_pod5_path_and_name(pod5_path: str) -> Generator[Tuple[str, str], None, None]
Traverse the directory and yield the name (with extension) and full path of each POD5 file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pod5_path |
str
|
Path to a POD5 file/directory of POD5 files. |
required |
Yields:
Type | Description |
---|---|
str
|
Tuple[str, str]: Tuple containing the name+extension and full path of a POD5 file. |
Source code in src/capfinder/index.py
index(pod5_path: str, output_dir: str) -> None
Builds an index mapping read_ids to POD5 file paths.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pod5_path |
str
|
Path to a POD5 file or directory of POD5 files. |
required |
output_dir |
str
|
Path where database.db file is written to. |
required |
Returns:
Type | Description |
---|---|
None
|
None |
Source code in src/capfinder/index.py
initialize_database(database_path: str) -> Tuple[sqlite3.Cursor, sqlite3.Connection]
Initializes the database connection based on the database path.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
database_path |
str
|
Path to the database. |
required |
Returns:
Name | Type | Description |
---|---|---|
cursor |
Cursor
|
Cursor object for the database. |
conn |
Connection
|
Connection object for the database. |
Source code in src/capfinder/index.py
write_database(data: Tuple[str, str], cursor: sqlite3.Cursor, conn: sqlite3.Connection) -> None
Write the index to a database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data |
Tuple[str, str]
|
Tuples of fileroot and file |
required |
cursor |
Cursor
|
Cursor object for the database. |
required |
conn |
Connection
|
Connection object for the database. |
required |
Returns:
Type | Description |
---|---|
None
|
None |
Source code in src/capfinder/index.py
inference
batched_inference(dataset: tf.data.Dataset, model: keras.Model, output_dir: str, csv_file_path: str) -> str
Perform batched inference on a dataset using a given model and save predictions to a CSV file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset |
Dataset
|
The input dataset to perform inference on. |
required |
model |
Model
|
The Keras model to use for making predictions. |
required |
output_dir |
str
|
The directory where the output CSV file will be saved. |
required |
csv_file_path |
str
|
Path to the original CSV file used to create the dataset. |
required |
Returns:
Name | Type | Description |
---|---|---|
str |
str
|
The path to the output CSV file containing the predictions. |
Source code in src/capfinder/inference.py
collate_bam_pod5_wrapper(bam_filepath: str, pod5_dir: str, num_cpus: int, reference: str, cap_class: int, cap0_pos: int, train_or_test: str, plot_signal: bool, output_dir: str) -> tuple[str, str]
Wrapper for collating BAM and POD5 files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
bam_filepath |
str
|
Path to the BAM file. |
required |
pod5_dir |
str
|
Directory containing POD5 files. |
required |
num_cpus |
int
|
Number of CPUs to use for processing. |
required |
reference |
str
|
Reference sequence. |
required |
cap_class |
int
|
CAP class identifier. |
required |
cap0_pos |
int
|
Position of CAP0. |
required |
train_or_test |
str
|
Indicates whether data is for training or testing. |
required |
plot_signal |
bool
|
Flag to plot the signal. |
required |
output_dir |
str
|
Directory where output files will be saved. |
required |
Returns:
Type | Description |
---|---|
tuple[str, str]
|
tuple[str, str]: Paths to the data and metadata files. |
Source code in src/capfinder/inference.py
count_csv_rows(file_path: str) -> int
Quickly count the number of rows in a CSV file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path |
str
|
Path to the CSV file. |
required |
Returns:
Name | Type | Description |
---|---|---|
int |
int
|
Number of rows in the CSV file (excluding the header). |
Source code in src/capfinder/inference.py
custom_cache_key_fn(context: TaskRunContext, parameters: dict) -> str
Generate a custom cache key based on input parameters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
context |
TaskRunContext
|
Prefect context (unused in this function). |
required |
parameters |
dict
|
Dictionary of parameters used for cache key generation. |
required |
Returns:
Name | Type | Description |
---|---|---|
str |
str
|
The generated cache key. |
Source code in src/capfinder/inference.py
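A minimal sketch of how such a key could be derived, assuming it is simply a digest of the serialized parameters; the real implementation may hash different fields or use a different digest.

```python
import hashlib
import json

def sketch_cache_key(parameters: dict) -> str:
    # Serialize the parameters deterministically, then hash them.
    canonical = json.dumps(parameters, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

print(sketch_cache_key({"bam_filepath": "reads.bam", "batch_size": 32}))
```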
generate_report_wrapper(metadata_file: str, predictions_file: str, output_csv: str, output_html: str) -> None
Wrapper for generating the report.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
metadata_file |
str
|
Path to the metadata file. |
required |
predictions_file |
str
|
Path to the predictions file. |
required |
output_csv |
str
|
Path to save the output CSV. |
required |
output_html |
str
|
Path to save the output HTML. |
required |
Source code in src/capfinder/inference.py
get_model(model_path: Optional[str] = None, load_optimizer: bool = False) -> keras.Model
Load and return a model from the given model path or use the default model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_path |
Optional[str]
|
Path to the custom model file. If None, use the default model. |
None
|
load_optimizer |
bool
|
Whether to load the optimizer with the model. |
False
|
Returns:
Type | Description |
---|---|
Model
|
keras.Model: The loaded Keras model. |
Source code in src/capfinder/inference.py
predict_cap_types(bam_filepath: str, pod5_dir: str, num_cpus: int, output_dir: str, dtype: DtypeLiteral, reference: str = 'GCTTTCGTTCGTCTCCGGACTTATCGCACCACCTATCCATCATCAGTACTGT', cap0_pos: int = 52, train_or_test: str = 'test', plot_signal: bool = False, cap_class: int = -99, target_length: int = 500, batch_size: int = 32, custom_model_path: Optional[str] = None, debug_code: bool = False, refresh_cache: bool = False, formatted_command: Optional[str] = None) -> None
Predict CAP types by preparing the inference data and running the prediction workflow.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
bam_filepath |
str
|
Path to the BAM file. |
required |
pod5_dir |
str
|
Directory containing POD5 files. |
required |
num_cpus |
int
|
Number of CPUs to use for processing. |
required |
output_dir |
str
|
Directory where output files will be saved. |
required |
dtype |
DtypeLiteral
|
Data type for the features. |
required |
reference |
str
|
Reference sequence. |
'GCTTTCGTTCGTCTCCGGACTTATCGCACCACCTATCCATCATCAGTACTGT'
|
cap0_pos |
int
|
Position of CAP0. |
52
|
train_or_test |
str
|
Indicates whether data is for training or testing. |
'test'
|
plot_signal |
bool
|
Flag to plot the signal. |
False
|
cap_class |
int
|
CAP class identifier. |
-99
|
target_length |
int
|
Length of the target sequence. |
500
|
batch_size |
int
|
Size of the data batches. |
32
|
custom_model_path |
Optional[str]
|
Path to a custom model file. If None, use the default model. |
None
|
debug_code |
bool
|
Flag to enable debugging information in logs. |
False
|
refresh_cache |
bool
|
Flag to refresh cached data. |
False
|
formatted_command |
Optional[str]
|
The formatted command string to be logged. |
None
|
Source code in src/capfinder/inference.py
prepare_inference_data(bam_filepath: str, pod5_dir: str, num_cpus: int, output_dir: str, dtype: DtypeLiteral, reference: str = 'GCTTTCGTTCGTCTCCGGACTTATCGCACCACCTATCCATCATCAGTACTGT', cap0_pos: int = 52, train_or_test: str = 'test', plot_signal: bool = False, cap_class: int = -99, target_length: int = 500, batch_size: int = 32, custom_model_path: Optional[str] = None, debug_code: bool = False, refresh_cache: bool = False) -> tuple[str, str]
Prepare inference data by processing BAM and POD5 files, and generate features for the model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
bam_filepath |
str
|
Path to the BAM file. |
required |
pod5_dir |
str
|
Directory containing POD5 files. |
required |
num_cpus |
int
|
Number of CPUs to use for processing. |
required |
output_dir |
str
|
Directory where output files will be saved. |
required |
dtype |
DtypeLiteral
|
Data type for the features. |
required |
reference |
str
|
Reference sequence. |
'GCTTTCGTTCGTCTCCGGACTTATCGCACCACCTATCCATCATCAGTACTGT'
|
cap0_pos |
int
|
Position of CAP0. |
52
|
train_or_test |
str
|
Indicates whether data is for training or testing. |
'test'
|
plot_signal |
bool
|
Flag to plot the signal. |
False
|
cap_class |
int
|
CAP class identifier. |
-99
|
target_length |
int
|
Length of the target sequence. |
500
|
batch_size |
int
|
Size of the data batches. |
32
|
custom_model_path |
Optional[str]
|
Path to a custom model file. If None, use the default model. |
None
|
debug_code |
bool
|
Flag to enable debugging information in logs. |
False
|
refresh_cache |
bool
|
Flag to refresh cached data. |
False
|
Returns:
Type | Description |
---|---|
tuple[str, str]
|
tuple[str, str]: Paths to the output CSV and HTML files. |
Source code in src/capfinder/inference.py
reconfigure_logging_task(output_dir: str, debug_code: bool) -> None
Reconfigure logging settings for both application and Prefect.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output_dir |
str
|
Directory where logs will be saved. |
required |
debug_code |
bool
|
Flag to determine if code locations should be shown in logs. |
required |
Source code in src/capfinder/inference.py
inference_data_loader
create_dataset(file_path: str, target_length: int, batch_size: int, dtype: DtypeLiteral) -> tf.data.Dataset
Create a TensorFlow dataset from a CSV file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path |
str
|
Path to the CSV file. |
required |
target_length |
int
|
The desired length of the timeseries tensor. |
required |
batch_size |
int
|
The number of samples per batch. |
required |
dtype |
DtypeLiteral
|
The desired data type for the timeseries tensor as a string. |
required |
Returns:
Type | Description |
---|---|
Dataset
|
tf.data.Dataset: A TensorFlow dataset that yields batches of parsed and formatted data. |
Source code in src/capfinder/inference_data_loader.py
csv_generator(file_path: str, chunk_size: int = 10000) -> Generator[Tuple[str, str, str], None, None]
Generate rows from a CSV file in chunks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path |
str
|
Path to the CSV file. |
required |
chunk_size |
int
|
Number of rows to process in each chunk. Defaults to 10000. |
10000
|
Yields:
Type | Description |
---|---|
str
|
Tuple[str, str, str]: A tuple containing read_id, cap_class, and timeseries as strings. |
Source code in src/capfinder/inference_data_loader.py
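A rough sketch of chunked row streaming from a CSV using pandas; the column names read_id, cap_class, and timeseries are assumptions here and may differ from the actual file layout.

```python
from typing import Generator, Tuple

import pandas as pd

def sketch_csv_generator(
    file_path: str, chunk_size: int = 10000
) -> Generator[Tuple[str, str, str], None, None]:
    # Read the CSV in chunks so very large files never sit fully in memory.
    for chunk in pd.read_csv(file_path, chunksize=chunk_size, dtype=str):
        for row in chunk.itertuples(index=False):
            yield (row.read_id, row.cap_class, row.timeseries)
```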
get_dtype(dtype: str) -> tf.DType
Convert a string dtype to its corresponding TensorFlow data type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dtype |
str
|
A string representing the desired data type. |
required |
Returns:
Type | Description |
---|---|
DType
|
tf.DType: The corresponding TensorFlow data type. |
Raises:
Type | Description |
---|---|
ValueError
|
If an invalid dtype string is provided. |
Source code in src/capfinder/inference_data_loader.py
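A sketch of the string-to-dtype mapping with validation; the exact set of accepted strings is an assumption.

```python
import tensorflow as tf

_DTYPES = {"float16": tf.float16, "float32": tf.float32, "float64": tf.float64}

def sketch_get_dtype(dtype: str) -> tf.DType:
    # Reject anything outside the known mapping instead of silently defaulting.
    if dtype not in _DTYPES:
        raise ValueError(f"Invalid dtype string: {dtype!r}")
    return _DTYPES[dtype]

print(sketch_get_dtype("float32"))  # <dtype: 'float32'>
```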
parse_row(row: Tuple[str, str, str], target_length: int, dtype: tf.DType) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor]
Parse a row of data and convert it to the appropriate tensor format. Padding and truncation are performed equally on both sides of the time series.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
row |
Tuple[str, str, str]
|
A tuple containing read_id, cap_class, and timeseries as strings. |
required |
target_length |
int
|
The desired length of the timeseries tensor. |
required |
dtype |
DType
|
The desired data type for the timeseries tensor. |
required |
Returns:
Type | Description |
---|---|
Tensor
|
Tuple[tf.Tensor, tf.Tensor, tf.Tensor]: A tuple containing the parsed and formatted tensors for timeseries, cap_class, and read_id. |
Source code in src/capfinder/inference_data_loader.py
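The symmetric padding/truncation described above can be illustrated with a small, framework-free sketch (plain Python lists instead of tensors):

```python
def center_fit(series: list, target_length: int, pad_value: float = 0.0) -> list:
    n = len(series)
    if n >= target_length:
        # Truncate roughly equally from both ends.
        start = (n - target_length) // 2
        return series[start:start + target_length]
    # Pad roughly equally on both sides.
    left = (target_length - n) // 2
    right = target_length - n - left
    return [pad_value] * left + series + [pad_value] * right

print(center_fit([1, 2, 3], 7))        # [0.0, 0.0, 1, 2, 3, 0.0, 0.0]
print(center_fit(list(range(10)), 4))  # [3, 4, 5, 6]
```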
logger_config
PrefectHandler
Bases: Handler
A custom logging handler for Prefect that filters and formats log messages.
This handler integrates with Loguru, applies custom formatting, and prevents duplicate log messages.
Source code in src/capfinder/logger_config.py
__init__(loguru_logger: loguru.Logger, show_location: bool) -> None
Initialize the PrefectHandler.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
loguru_logger |
Logger
|
The Loguru logger instance to use for logging. |
required |
show_location |
bool
|
Whether to show the source location in log messages. |
required |
Source code in src/capfinder/logger_config.py
emit(record: logging.LogRecord) -> None
Emit a log record.
This method formats the log record, applies custom styling, and logs it using Loguru. It also filters out duplicate messages and HTTP status messages.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
record |
LogRecord
|
The log record to emit. |
required |
Source code in src/capfinder/logger_config.py
configure_logger(new_log_directory: str = '', show_location: bool = True) -> str
Configure the logger to log to a file in the specified directory.
Source code in src/capfinder/logger_config.py
configure_prefect_logging(show_location: bool = True) -> None
Configure Prefect logging with custom handler and settings.
This function sets up a custom PrefectHandler for all Prefect loggers, configures the root logger, and adjusts logging levels.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
show_location |
bool
|
Whether to show source location in log messages. Defaults to True. |
True
|
Source code in src/capfinder/logger_config.py
plot
The module helps in plotting the entire read signal, the signal for the ROI, and the base annotations. It also prints alignments. All this information is useful for understanding whether the OTE-finding algorithm is homing in on the correct region of interest (ROI).
The plot is saved as an HTML file.
Author: Adnan M. Niazi Date: 2024-02-28
append_dummy_sequence(fasta_sequence: str, num_left_clipped_bases: int, num_right_clipped_bases: int) -> str
Prepend/append 'H' to the left/right of the FASTA sequence based on soft-clipping counts
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fasta_sequence |
str
|
FASTA sequence |
required |
num_left_clipped_bases |
int
|
Number of bases soft-clipped from the left |
required |
num_right_clipped_bases |
int
|
Number of bases soft-clipped from the right |
required |
Returns:
Name | Type | Description |
---|---|---|
modified_sequence |
str
|
FASTA sequence with 'H' prepended/appended to the left/right |
Source code in src/capfinder/plot.py
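A minimal sketch of this padding behaviour, assuming one 'H' placeholder per soft-clipped base:

```python
def sketch_append_dummy_sequence(
    fasta_sequence: str, num_left_clipped_bases: int, num_right_clipped_bases: int
) -> str:
    # Placeholders mark bases that were soft-clipped by the aligner.
    return (
        "H" * num_left_clipped_bases
        + fasta_sequence
        + "H" * num_right_clipped_bases
    )

print(sketch_append_dummy_sequence("ACGT", 2, 1))  # HHACGTH
```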
process_pod5
Given a read_id and a POD5 file path, this module preprocesses the signal data and extracts the signal for a region of interest (ROI).
Author: Adnan M. Niazi Date: 2024-02-28
clip_extreme_values(z_normalized_data: npt.NDArray[np.float64], num_std_dev: float = 4.0) -> npt.NDArray[np.float64]
Clip extreme values in the Z-score normalized data.
Clips values outside the specified number of standard deviations from the mean. This function takes Z-score normalized data as input, along with an optional parameter to set the number of standard deviations.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
z_normalized_data |
NDArray[float64]
|
Z-score normalized data. |
required |
num_std_dev |
float
|
Number of standard deviations to use as the limit. Defaults to 4.0. |
4.0
|
Returns:
Type | Description |
---|---|
NDArray[float64]
|
npt.NDArray[np.float64]: Clipped data within the specified range. |
Example
z_normalized_data = np.array([-2.0, -1.0, 0.0, 1.0, 2.0]) clipped_data = clip_extreme_values(z_normalized_data, num_std_dev=3.0)
Source code in src/capfinder/process_pod5.py
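Because the input is already Z-score normalized, the clipping bounds are simply plus/minus num_std_dev; a one-line sketch:

```python
import numpy as np

def sketch_clip_extreme_values(
    z_normalized_data: np.ndarray, num_std_dev: float = 4.0
) -> np.ndarray:
    # Values beyond +/- num_std_dev are clamped to the boundary.
    return np.clip(z_normalized_data, -num_std_dev, num_std_dev)

print(sketch_clip_extreme_values(np.array([-6.0, 0.0, 6.0])))  # [-4.  0.  4.]
```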
extract_roi_signal(signal: np.ndarray, base_locs_in_signal: npt.NDArray[np.int32], fasta: str, experiment_type: str, start_base_idx_in_fasta: int, end_base_idx_in_fasta: int, num_left_clipped_bases: int) -> ROIData
Extracts the signal data for a region of interest (ROI).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
signal |
ndarray
|
Signal data. |
required |
base_locs_in_signal |
NDArray[int32]
|
Array of locations of each new base in the signal. |
required |
fasta |
str
|
Fasta sequence of the read. |
required |
experiment_type |
str
|
Type of experiment (rna or dna). |
required |
start_base_idx_in_fasta |
int
|
Index of the first base in the ROI. |
required |
end_base_idx_in_fasta |
int
|
Index of the last base in the ROI. |
required |
num_left_clipped_bases |
int
|
Number of bases clipped from the left. |
required |
Returns:
Name | Type | Description |
---|---|---|
ROIData |
ROIData
|
Dictionary containing the ROI signal and fasta sequence. |
Source code in src/capfinder/process_pod5.py
find_base_locs_in_signal(bam_data: dict) -> npt.NDArray[np.int32]
Finds the locations of each new base in the signal.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
bam_data |
dict
|
Dictionary containing information from the BAM file. |
required |
Returns:
Type | Description |
---|---|
NDArray[int32]
|
npt.NDArray[np.int32]: Array of locations of each new base in the signal. |
Source code in src/capfinder/process_pod5.py
preprocess_signal_data(signal: np.ndarray) -> npt.NDArray[np.float64]
Preprocesses the signal data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
signal |
ndarray
|
Signal data. |
required |
Returns:
Name | Type | Description |
---|---|---|
signal |
NDArray[float64]
|
Preprocessed signal data. |
Source code in src/capfinder/process_pod5.py
pull_read_from_pod5(read_id: str, pod5_filepath: str) -> Dict[str, Any]
Returns a single read from a pod5 file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
read_id |
str
|
The read_id of the read to be extracted. |
required |
pod5_filepath |
str
|
Path to the pod5 file. |
required |
Returns:
Name | Type | Description |
---|---|---|
dict |
Dict[str, Any]
|
Dictionary containing information about the extracted read. - 'sample_rate': Sample rate of the read. - 'sequencing_kit': Sequencing kit used. - 'experiment_type': Experiment type. - 'local_basecalling': Local basecalling information. - 'signal': Signal data. - 'signal_pa': Signal data for the positive strand. - 'end_reason': Reason for the end of the read. - 'sample_count': Number of samples in the read. - 'channel': Pore channel information. - 'well': Pore well information. - 'pore_type': Pore type. - 'writing_software': Software used for writing. - 'scale': Scaling factor for the signal. - 'shift': Shift factor for the signal. |
Source code in src/capfinder/process_pod5.py
z_normalize(data: np.ndarray) -> npt.NDArray[np.float64]
Normalize the input data using Z-score normalization.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data |
ndarray
|
Input data to be Z-score normalized. |
required |
Returns:
Type | Description |
---|---|
NDArray[float64]
|
npt.NDArray[np.float64]: Z-score normalized data. |
Note
Z-score normalization (or Z normalization) transforms the data to have a mean of 0 and a standard deviation of 1.
Source code in src/capfinder/process_pod5.py
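As a sketch, Z-score normalization subtracts the mean and divides by the standard deviation:

```python
import numpy as np

def sketch_z_normalize(data: np.ndarray) -> np.ndarray:
    return (data - np.mean(data)) / np.std(data)

x = sketch_z_normalize(np.array([1.0, 2.0, 3.0, 4.0]))
print(round(x.mean(), 6), round(x.std(), 6))  # approximately 0.0 and 1.0
```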
report
count_csv_rows(csv_file: str) -> int
create_database(db_path: str) -> sqlite3.Connection
create_table(conn: sqlite3.Connection, table_name: str, columns: List[str]) -> None
Create a table in the SQLite database.
Source code in src/capfinder/report.py
csv_to_sqlite(csv_file: str, db_conn: sqlite3.Connection, table_name: str, chunk_size: int = 100000) -> None
Import CSV data into SQLite database in chunks with progress bar.
Source code in src/capfinder/report.py
get_cap_type_counts(conn: sqlite3.Connection) -> DefaultDict[str, int]
Get cap type counts from the joined data.
Source code in src/capfinder/report.py
join_tables(conn: sqlite3.Connection, output_csv: str, chunk_size: int = 100000) -> None
Join metadata and predictions tables and save to CSV in chunks with progress bar.
Source code in src/capfinder/report.py
resnet_model
ResNetTimeSeriesHyper
Bases: HyperModel
A HyperModel class for building a ResNet-style neural network for time series classification.
This class defines a tunable ResNet architecture that can be optimized using Keras Tuner. It creates a model with an initial convolutional layer, followed by a variable number of ResNet blocks, and ends with global average pooling and dense layers.
Attributes:
Name | Type | Description |
---|---|---|
input_shape |
Tuple[int, int]
|
The shape of the input data (timesteps, features). |
n_classes |
int
|
The number of classes for classification. |
Methods:
Name | Description |
---|---|
build |
Builds and returns a compiled Keras model based on the provided hyperparameters. |
Source code in src/capfinder/resnet_model.py
build(hp: HyperParameters) -> Model
Build and compile a ResNet model based on the provided hyperparameters.
This method constructs a ResNet architecture with tunable hyperparameters including the number of filters, kernel sizes, number of ResNet blocks, dense layer units, dropout rate, and learning rate.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
hp |
HyperParameters
|
A HyperParameters object used to define the search space. |
required |
Returns:
Name | Type | Description |
---|---|---|
Model |
Model
|
A compiled Keras model ready for training. |
Source code in src/capfinder/resnet_model.py
train_etl
augment_example(x: tf.Tensor, y: tf.Tensor, dtype: tf.DType) -> tf.data.Dataset
Augment a single example by creating warped versions and combining them with the original.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x |
Tensor
|
The input tensor to be augmented. |
required |
y |
Tensor
|
The corresponding label tensor. |
required |
dtype |
DType
|
The desired data type for the augmented tensors. |
required |
Returns:
Type | Description |
---|---|
Dataset
|
tf.data.Dataset: A dataset containing the original and augmented examples with their labels. |
Source code in src/capfinder/train_etl.py
calculate_sizes(total_examples: int, train_fraction: float, batch_size: int) -> Tuple[int, int]
Compute the train and validation sizes based on the total number of examples.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
total_examples |
int
|
Total number of examples in the dataset. |
required |
train_fraction |
float
|
Fraction of data to use for training. |
required |
batch_size |
int
|
Size of each batch. |
required |
Returns:
Type | Description |
---|---|
Tuple[int, int]
|
Tuple[int, int]: Train size and validation size, both divisible by batch_size. |
Source code in src/capfinder/train_etl.py
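A sketch of the size calculation, assuming both splits are rounded down to the nearest multiple of the batch size:

```python
from typing import Tuple

def sketch_calculate_sizes(
    total_examples: int, train_fraction: float, batch_size: int
) -> Tuple[int, int]:
    # Round both splits down so each is an exact number of batches.
    train_size = int(total_examples * train_fraction) // batch_size * batch_size
    val_size = (total_examples - train_size) // batch_size * batch_size
    return train_size, val_size

print(sketch_calculate_sizes(1000, 0.8, 32))  # (800, 192)
```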
combine_datasets(datasets: List[tf.data.Dataset]) -> tf.data.Dataset
Combine datasets from different classes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets |
List[Dataset]
|
List of datasets to combine. |
required |
Returns:
Type | Description |
---|---|
Dataset
|
tf.data.Dataset: A combined dataset. |
Source code in src/capfinder/train_etl.py
count_examples_fast(file_path: str) -> int
Count lines in a file using fast bash utilities, falling back to Python if necessary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path |
str
|
Path to the file to count lines in. |
required |
Returns:
Name | Type | Description |
---|---|---|
int |
int
|
Number of lines in the file (excluding header). |
Source code in src/capfinder/train_etl.py
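A sketch of the fast-count idea: shell out to wc -l when it is available and fall back to a pure-Python count otherwise, excluding the header line in both paths.

```python
import subprocess

def sketch_count_examples(file_path: str) -> int:
    try:
        result = subprocess.run(
            ["wc", "-l", file_path], capture_output=True, text=True, check=True
        )
        return int(result.stdout.split()[0]) - 1  # exclude the header line
    except (OSError, subprocess.CalledProcessError, ValueError):
        with open(file_path) as fh:
            return sum(1 for _ in fh) - 1
```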
count_examples_python(file_path: str) -> int
Count lines in a file using Python (slower but portable).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path |
str
|
Path to the file to count lines in. |
required |
Returns:
Name | Type | Description |
---|---|---|
int |
int
|
Number of lines in the file (excluding header). |
Source code in src/capfinder/train_etl.py
create_class_dataset(file_paths: List[str], target_length: int, dtype: DtypeLiteral, examples_per_class: int, train_test_fraction: float) -> Tuple[tf.data.Dataset, tf.data.Dataset]
Create a dataset for a single class from multiple files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_paths |
List[str]
|
List of file paths for a single class. |
required |
target_length |
int
|
The desired length of the timeseries tensor. |
required |
dtype |
DtypeLiteral
|
The desired data type for the timeseries tensor as a string. |
required |
examples_per_class |
int
|
Number of examples to take per class. |
required |
train_test_fraction |
float
|
Fraction of data to use for training. |
required |
Returns:
Type | Description |
---|---|
Tuple[Dataset, Dataset]
|
Tuple[tf.data.Dataset, tf.data.Dataset]: Train and test datasets for the given class. |
Source code in src/capfinder/train_etl.py
create_dataset(file_path: str, target_length: int, dtype: DtypeLiteral) -> tf.data.Dataset
Create a TensorFlow dataset for a single class CSV file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path |
str
|
Path to the CSV file. |
required |
target_length |
int
|
The desired length of the timeseries tensor. |
required |
dtype |
DtypeLiteral
|
The desired data type for the timeseries tensor as a string. |
required |
Returns:
Type | Description |
---|---|
Dataset
|
tf.data.Dataset: A dataset for the given class. |
Source code in src/capfinder/train_etl.py
create_train_val_test_datasets_from_train_test_csvs(dataset_dir: str, batch_size: int, target_length: int, dtype: tf.DType, train_val_fraction: float, use_augmentation: bool = False) -> Tuple[tf.data.Dataset, tf.data.Dataset, tf.data.Dataset, int, int]
Load ready-made train, validation, and test datasets from CSV files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_dir |
str
|
Directory containing the CSV files. |
required |
batch_size |
int
|
Size of each batch. |
required |
target_length |
int
|
Target length of each time series. |
required |
dtype |
DType
|
Data type for the features. |
required |
train_val_fraction |
float
|
Fraction of training data to use for validation. |
required |
use_augmentation |
bool
|
Whether to augment original training examples with warped versions |
False
|
Returns:
Type | Description |
---|---|
Dataset
|
Tuple[tf.data.Dataset, tf.data.Dataset, tf.data.Dataset, int, int]: Train dataset, validation dataset, test dataset, steps per epoch, and validation steps. |
Source code in src/capfinder/train_etl.py
create_warped_examples(signal: tf.Tensor, max_warp_factor: float = 0.3, dtype: tf.DType = tf.float32) -> Tuple[tf.Tensor, tf.Tensor]
Create warped versions (squished and expanded) of the input signal.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
signal |
Tensor
|
The input signal to be warped. |
required |
max_warp_factor |
float
|
The maximum factor by which the signal can be warped. Defaults to 0.3. |
0.3
|
dtype |
DType
|
The desired data type for the output tensors. Defaults to tf.float32. |
float32
|
Returns:
Type | Description |
---|---|
Tuple[Tensor, Tensor]
|
Tuple[tf.Tensor, tf.Tensor]: A tuple containing the squished and expanded versions of the input signal. |
Source code in src/capfinder/train_etl.py
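A sketch of the warping idea using NumPy resampling instead of TensorFlow ops: the signal is resampled to a randomly shorter (squished) and a randomly longer (expanded) length.

```python
import numpy as np

def sketch_warp(signal: np.ndarray, max_warp_factor: float = 0.3):
    factor = np.random.uniform(0.0, max_warp_factor)
    n = len(signal)
    xs = np.linspace(0.0, 1.0, n)
    squish_len = max(2, int(round(n * (1.0 - factor))))
    expand_len = int(round(n * (1.0 + factor)))
    # Linear resampling onto the new, shorter and longer, time grids.
    squished = np.interp(np.linspace(0.0, 1.0, squish_len), xs, signal)
    expanded = np.interp(np.linspace(0.0, 1.0, expand_len), xs, signal)
    return squished, expanded

sq, ex = sketch_warp(np.sin(np.linspace(0, 10, 100)))
print(len(sq), len(ex))  # e.g. 85 123
```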
csv_generator(file_path: str) -> Generator[Tuple[str, str, str], None, None]
Generates rows from a CSV file one at a time.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path |
str
|
Path to the CSV file. |
required |
Yields:
Type | Description |
---|---|
str
|
Tuple[str, str, str]: A tuple containing read_id, cap_class, and timeseries as strings. |
Source code in src/capfinder/train_etl.py
get_class_from_file(file_path: str) -> int
Read the first data row from a CSV file and return the class ID.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path |
str
|
Path to the CSV file. |
required |
Returns:
Name | Type | Description |
---|---|---|
int |
int
|
Class ID from the first data row. |
Source code in src/capfinder/train_etl.py
get_local_dataset_version(dataset_dir: str) -> Optional[str]
Get the version of the local dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_dir |
str
|
The directory containing the dataset. |
required |
Returns:
Type | Description |
---|---|
Optional[str]
|
Optional[str]: The version of the local dataset, or None if not found. |
Source code in src/capfinder/train_etl.py
group_files_by_class(caps_data_dir: str) -> Dict[int, List[str]]
Group CSV files in the directory by their class ID.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
caps_data_dir |
str
|
Directory containing the CSV files. |
required |
Returns:
Type | Description |
---|---|
Dict[int, List[str]]
|
Dict[int, List[str]]: Dictionary mapping class IDs to lists of file paths. |
Source code in src/capfinder/train_etl.py
interleave_class_datasets(class_datasets: List[tf.data.Dataset], num_classes: int) -> tf.data.Dataset
Interleave datasets from different classes to ensure class balance.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
class_datasets |
List[Dataset]
|
List of datasets, one for each class. |
required |
num_classes |
int
|
The number of classes in the dataset. |
required |
Returns:
Type | Description |
---|---|
Dataset
|
tf.data.Dataset: An interleaved dataset with balanced class representation. |
Source code in src/capfinder/train_etl.py
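A sketch of strict round-robin interleaving with tf.data (assuming TensorFlow 2.7+, where Dataset.choose_from_datasets is available); the module's actual strategy may differ.

```python
import tensorflow as tf

def sketch_interleave(class_datasets, num_classes):
    # Cycle through class indices 0, 1, ..., n-1 so batches stay balanced.
    choices = tf.data.Dataset.range(num_classes).repeat()
    return tf.data.Dataset.choose_from_datasets(class_datasets, choices)

ds_a = tf.data.Dataset.from_tensor_slices([0, 0, 0])
ds_b = tf.data.Dataset.from_tensor_slices([1, 1, 1])
print(list(sketch_interleave([ds_a, ds_b], 2).as_numpy_iterator()))  # [0, 1, 0, 1, 0, 1]
```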
load_test_dataset_from_csvs(x_file_path: str, y_file_path: str, batch_size: int, target_length: int, dtype: DtypeLiteral) -> tf.data.Dataset
Load test dataset from CSV files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x_file_path |
str
|
Path to the features CSV file. |
required |
y_file_path |
str
|
Path to the labels CSV file. |
required |
batch_size |
int
|
Size of each batch. |
required |
target_length |
int
|
Target length of each time series. |
required |
dtype |
DtypeLiteral
|
Data type for the features as a string. |
required |
Returns:
Type | Description |
---|---|
Dataset
|
tf.data.Dataset: Test dataset. |
Source code in src/capfinder/train_etl.py
load_train_dataset_from_csvs(x_file_path: str, y_file_path: str, batch_size: int, target_length: int, dtype: tf.DType, train_val_fraction: float = 0.8, use_augmentation: bool = False) -> Tuple[tf.data.Dataset, tf.data.Dataset, int, int]
Load training dataset from CSV files and split into train and validation sets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x_file_path |
str
|
Path to the features CSV file. |
required |
y_file_path |
str
|
Path to the labels CSV file. |
required |
batch_size |
int
|
Size of each batch. |
required |
target_length |
int
|
Target length of each time series. |
required |
dtype |
DType
|
Data type for the features. |
required |
train_val_fraction |
float
|
Fraction of data to use for training. Defaults to 0.8. |
0.8
|
use_augmentation |
bool
|
Whether to augment original training examples with warped versions |
False
|
Returns:
Type | Description |
---|---|
Dataset
|
Tuple[tf.data.Dataset, tf.data.Dataset, int, int]: Train dataset, validation dataset, steps per epoch, and validation steps. |
Source code in src/capfinder/train_etl.py
parse_row(row: Tuple[str, str, str], target_length: int, dtype: tf.DType) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor]
Parse a row of data and convert it to the appropriate tensor format. Padding and truncation are performed equally on both sides of the time series.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
row |
Tuple[str, str, str]
|
A tuple containing read_id, cap_class, and timeseries as strings. |
required |
target_length |
int
|
The desired length of the timeseries tensor. |
required |
dtype |
DType
|
The desired data type for the timeseries tensor. |
required |
Returns:
Type | Description |
---|---|
Tensor
|
Tuple[tf.Tensor, tf.Tensor, tf.Tensor]: A tuple containing the parsed and formatted tensors for timeseries, cap_class, and read_id. |
Source code in src/capfinder/train_etl.py
read_dataset_version_info(dataset_dir: str) -> Optional[str]
Read the dataset version information from a file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_dir |
str
|
Directory containing the dataset version file. |
required |
Returns:
Type | Description |
---|---|
Optional[str]
|
Optional[str]: The dataset version if found, None otherwise. |
Source code in src/capfinder/train_etl.py
train_etl(caps_data_dir: str, dataset_dir: str, target_length: int, dtype: DtypeLiteral, examples_per_class: int, train_test_fraction: float, train_val_fraction: float, num_classes: int, batch_size: int, comet_project_name: str, use_remote_dataset_version: str = '', use_augmentation: bool = False) -> Tuple[tf.data.Dataset, tf.data.Dataset, tf.data.Dataset, int, int, str]
Process the data from multiple class files, create balanced datasets, perform train-test split, and upload to Comet ML.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
caps_data_dir |
str
|
Directory containing the class CSV files. |
required |
dataset_dir |
str
|
Directory to save the processed dataset. |
required |
target_length |
int
|
The desired length of each time series. |
required |
dtype |
DtypeLiteral
|
The desired data type for the timeseries tensor as a string. |
required |
examples_per_class |
int
|
Number of samples to use per class. |
required |
train_test_fraction |
float
|
Fraction of data to use for training. |
required |
train_val_fraction |
float
|
Fraction of training data to use for validation. |
required |
num_classes |
int
|
Number of classes in the dataset. |
required |
batch_size |
int
|
The number of samples per batch. |
required |
comet_project_name |
str
|
Name of the Comet ML project. |
required |
use_remote_dataset_version |
str
|
Version of the remote dataset to use, if any. |
''
|
use_augmentation |
bool
|
Whether to augment original training examples with warped versions |
False
|
Returns: Tuple[tf.data.Dataset, tf.data.Dataset, tf.data.Dataset, int, int, str]: The train, validation, and test datasets, steps per epoch, validation steps, and the dataset version.
Source code in src/capfinder/train_etl.py
write_dataset_to_csv(dataset: tf.data.Dataset, dataset_dir: str, train_test: str) -> None
Write a dataset to CSV files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset |
Dataset
|
The dataset to write. |
required |
dataset_dir |
str
|
The directory to write the CSV files to. |
required |
train_test |
str
|
Either 'train' or 'test' to indicate the dataset type. |
required |
Returns:
Type | Description |
---|---|
None
|
None |
Source code in src/capfinder/train_etl.py
write_dataset_version_info(dataset_dir: str, version: str) -> None
Write the dataset version information to a file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_dir |
str
|
Directory to write the version file. |
required |
version |
str
|
Version information to write. |
required |
Source code in src/capfinder/train_etl.py
training
InterruptCallback
Bases: Callback
Callback to interrupt training based on a global flag.
Source code in src/capfinder/training.py
on_epoch_end(epoch: int, logs: Optional[Dict[str, float]] = None) -> None
Checks the global stop_training flag at the end of each epoch. If True, interrupts training and logs a message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
epoch |
int
|
The current epoch index (integer). |
required |
logs |
Optional[Dict[str, float]]
|
Optional dictionary of training metrics at the end of the epoch (default: None). |
None
|
Returns:
Type | Description |
---|---|
None
|
None |
Source code in src/capfinder/training.py
on_train_batch_end(batch: int, logs: Optional[Dict[str, float]] = None) -> None
Checks the global stop_training flag at the end of each batch. If True, interrupts training and logs a message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch |
int
|
The current batch index (integer). |
required |
logs |
Optional[Dict[str, float]]
|
Optional dictionary of training metrics at the end of the batch (default: None). |
None
|
Returns:
Type | Description |
---|---|
None
|
None |
Source code in src/capfinder/training.py
count_batches(dataset: tf.data.Dataset, dataset_name: str) -> int
Count the number of individual examples in a dataset.
Args: dataset (tf.data.Dataset): The dataset to count examples from. dataset_name (str): The name of the dataset.
Returns: int: The number of examples in the dataset.
Source code in src/capfinder/training.py
count_examples(dataset: tf.data.Dataset, dataset_name: str) -> int
Count the number of individual examples in a dataset.
Args: dataset (tf.data.Dataset): The dataset to count examples from. dataset_name (str): The name of the dataset.
Returns: int: The number of examples in the dataset.
Source code in src/capfinder/training.py
generate_unique_name(base_name: str, extension: str) -> str
Generate a unique filename with a datetime suffix.
Parameters:
base_name: str The base name of the file. extension: str The file extension.
Returns:
str The unique filename with the datetime suffix.
Source code in src/capfinder/training.py
handle_interrupt(signum: Optional[int] = None, frame: Optional[object] = None) -> None
Handles interrupt signals (e.g., Ctrl+C) by setting a global flag to stop training.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
signum |
Optional[int]
|
The signal number (optional). |
None
|
frame |
Optional[object]
|
The current stack frame (optional). |
None
|
Returns:
Type | Description |
---|---|
None
|
None |
Source code in src/capfinder/training.py
initialize_tuner(hyper_model: CNNLSTMModel | EncoderModel, tune_params: dict, model_save_dir: str, model_type: ModelType) -> Union[Hyperband, BayesianOptimization, RandomSearch]
Initialize a Keras Tuner object based on the specified tuning strategy.
Parameters:
hyper_model: CapfinderHyperModel
    An instance of the CapfinderHyperModel class.
tune_params: dict
    A dictionary containing the hyperparameters for tuning.
model_save_dir: str
    The directory where the model should be saved.
comet_project_name: str
model_type: ModelType
    Type of the model to be trained.
Returns:
Union[Hyperband, BayesianOptimization, RandomSearch]: An instance of the Keras Tuner class based on the specified tuning strategy.
Source code in src/capfinder/training.py
kill_gpu_processes() -> None
Terminates processes running on the NVIDIA GPU and sets the Keras dtype policy to float16.
This function checks if the nvidia-smi command exists and, if found, attempts to terminate all Python processes utilizing the GPU. If no NVIDIA GPU is found, the function skips the termination step. It also sets the Keras global policy to mixed_float16 for faster training.
Returns:
Type | Description |
---|---|
None
|
None |
Source code in src/capfinder/training.py
save_model(model: keras.Model, base_name: str, extension: str, save_dir: str) -> str
Save the given model to a specified directory.
Parameters:
model: keras.Model
    The model to be saved.
base_name: str
    The base name for the saved model file.
extension: str
    The file extension for the saved model file.
save_dir: str
    The directory where the model should be saved.
Returns:
str The full path where the model was saved.
Source code in src/capfinder/training.py
select_lr_scheduler(lr_scheduler_params: dict, train_size: int) -> Union[keras.callbacks.ReduceLROnPlateau, CyclicLR, SGDRScheduler]
Selects and configures the learning rate scheduler based on the provided parameters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
lr_scheduler_params |
dict
|
Configuration parameters for the learning rate scheduler. |
required |
train_size |
int
|
Number of training examples, used for step size calculations. |
required |
Returns:
Type | Description |
---|---|
Union[ReduceLROnPlateau, CyclicLR, SGDRScheduler]
|
Union[keras.callbacks.ReduceLROnPlateau, CyclicLR, SGDRScheduler]: The selected learning rate scheduler. |
Source code in src/capfinder/training.py
set_data_distributed_training() -> None
Set JAX as the backend for Keras training, with distributed training if multiple CUDA devices are available.
This function checks for available CUDA devices and sets up distributed training only if more than one is found.
Returns:
None
Source code in src/capfinder/training.py
upload_download
CometArtifactManager
Manages the creation, uploading, and downloading of dataset artifacts using Comet ML.
Source code in src/capfinder/upload_download.py
__init__(project_name: str, dataset_dir: str) -> None
Initialize the CometArtifactManager.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
project_name |
str
|
The name of the Comet ML project. |
required |
dataset_dir |
str
|
The directory containing the dataset. |
required |
Source code in src/capfinder/upload_download.py
create_artifact() -> comet_ml.Artifact
Create and return a Comet ML artifact.
Returns:
Type | Description |
---|---|
Artifact
|
comet_ml.Artifact: The created Comet ML artifact. |
Source code in src/capfinder/upload_download.py
create_targz_chunks(chunk_size: int = 200 * 1024 * 1024) -> Tuple[List[str], str, int]
Create tar.gz chunks of the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_size |
int
|
The size of each chunk in bytes. Defaults to 200 MB. |
200 * 1024 * 1024
|
Returns:
Type | Description |
---|---|
Tuple[List[str], str, int]
|
Tuple[List[str], str, int]: A tuple containing the list of chunk files, the temporary directory path, and the total number of chunks. |
Source code in src/capfinder/upload_download.py
download_remote_dataset(version: str, max_retries: int = 3) -> None
Download a remote dataset from Comet ML.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
version |
str
|
The version of the dataset to download. |
required |
max_retries |
int
|
The maximum number of download attempts. Defaults to 3. |
3
|
Raises:
Type | Description |
---|---|
Exception
|
If the download fails after the maximum number of retries. |
Source code in src/capfinder/upload_download.py
end_comet_experiment() -> None
initialize_comet_ml_experiment() -> comet_ml.Experiment
Initialize and return a Comet ML experiment.
Returns:
Type | Description |
---|---|
Experiment
|
comet_ml.Experiment: The initialized Comet ML experiment. |
Raises:
Type | Description |
---|---|
ValueError
|
If the COMET_API_KEY environment variable is not set. |
Source code in src/capfinder/upload_download.py
log_artifacts_to_comet() -> Optional[str]
Log artifacts to Comet ML.
Returns:
Type | Description |
---|---|
Optional[str]
|
Optional[str]: The version of the logged artifact, or None if logging failed. |
Source code in src/capfinder/upload_download.py
make_comet_artifacts() -> None
Create and upload Comet ML artifacts.
Source code in src/capfinder/upload_download.py
store_artifact_version_to_file(version: str) -> None
Store the artifact version in a file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
version |
str
|
The version of the artifact to store. |
required |
Source code in src/capfinder/upload_download.py
upload_chunk(chunk_file: str, chunk_number: int, total_chunks: int) -> None
Upload a chunk of the dataset to the Comet ML artifact.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_file |
str
|
The path to the chunk file. |
required |
chunk_number |
int
|
The number of the current chunk. |
required |
total_chunks |
int
|
The total number of chunks. |
required |
Source code in src/capfinder/upload_download.py
calculate_file_hash(file_path: str) -> str
Calculate the SHA256 hash of a file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path |
str
|
The path to the file. |
required |
Returns:
Name | Type | Description |
---|---|---|
str |
str
|
The hexadecimal representation of the file's SHA256 hash. |
Source code in src/capfinder/upload_download.py
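A sketch of chunked SHA256 hashing so that large artifact files are never read into memory at once:

```python
import hashlib

def sketch_file_hash(file_path: str) -> str:
    digest = hashlib.sha256()
    with open(file_path, "rb") as fh:
        # Read in 1 MiB blocks and feed them to the running digest.
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()
```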
download_dataset_from_comet(dataset_dir: str, project_name: str, version: str) -> None
Download a dataset from Comet ML.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_dir |
str
|
The directory to download the dataset to. |
required |
project_name |
str
|
The name of the Comet ML project. |
required |
version |
str
|
The version of the dataset to download. |
required |
Source code in src/capfinder/upload_download.py
upload_dataset_to_comet(dataset_dir: str, project_name: str) -> str
Upload a dataset to Comet ML.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_dir |
str
|
The directory containing the dataset to upload. |
required |
project_name |
str
|
The name of the Comet ML project. |
required |
Returns:
Name | Type | Description |
---|---|---|
str |
str
|
The version of the uploaded dataset, or None if the upload failed. |
Source code in src/capfinder/upload_download.py
utils
The module contains some common utility functions used in the capfinder package.
Author: Adnan M. Niazi Date: 2024-02-28
ensure_config_dir() -> None
file_opener(filename: str) -> Union[IO[str], IO[bytes]]
Open a file for reading. If the file is compressed, use gzip to open it.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filename |
str
|
The path to the file to open. |
required |
Returns:
Type | Description |
---|---|
Union[IO[str], IO[bytes]]
|
file object: A file object that can be used for reading. |
Source code in src/capfinder/utils.py
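A sketch of a gzip-aware opener that decides by file extension; the real function may use a different detection method.

```python
import gzip
from typing import IO, Union

def sketch_file_opener(filename: str) -> Union[IO[str], IO[bytes]]:
    # Compressed files are opened in text mode through gzip; others with open().
    if filename.endswith(".gz"):
        return gzip.open(filename, "rt")
    return open(filename, "r")
```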
get_dtype(dtype: str) -> Type[np.floating]
Returns the numpy floating type corresponding to the provided dtype string.
If the provided dtype string is not valid, a warning is logged and np.float32 is returned as default.
Parameters: dtype (str): The dtype string to convert to a numpy floating type.
Returns: Type[np.floating]: The corresponding numpy floating type.
Source code in src/capfinder/utils.py
get_next_available_cap_number() -> int
Find the next available cap number in the sequence.
Returns: int: The next available cap number.
Source code in src/capfinder/utils.py
get_terminal_width() -> int
Get the width of the terminal.
Returns:
Name | Type | Description |
---|---|---|
int |
int
|
The width of the terminal in columns. Defaults to 80 if not available. |
initialize_cap_mapping() -> None
Initialize the cap mapping file if it doesn't exist.
Source code in src/capfinder/utils.py
initialize_comet_ml_experiment(project_name: str) -> Experiment
Initialize a CometML experiment for logging.
This function creates a CometML Experiment instance using the provided project name and the COMET_API_KEY environment variable.
Parameters:
project_name: str The name of the CometML project.
Returns:
Experiment: An instance of the CometML Experiment class.
Raises:
ValueError: If the project_name is empty or None, or if the COMET_API_KEY is not set. RuntimeError: If there's an error initializing the experiment.
Source code in src/capfinder/utils.py
is_cap_name_unique(new_cap_name: str) -> Optional[int]
Check if the given cap name is unique among existing cap mappings.
Args: new_cap_name (str): The new cap name to check for uniqueness.
Returns: Optional[int]: The integer label of the existing cap with the same name, if any. None otherwise.
Source code in src/capfinder/utils.py
load_custom_mapping() -> None
Load custom mapping from JSON file if it exists.
Source code in src/capfinder/utils.py
log_header(text: str) -> None
Log a centered header surrounded by '=' characters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text |
str
|
The text to be displayed in the header. |
required |
Returns:
Type | Description |
---|---|
None
|
None |
Source code in src/capfinder/utils.py
log_output(description: str) -> None
Log a step in a multi-step process.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
description |
str
|
A description of the current step. |
required |
Returns:
Type | Description |
---|---|
None
|
None |
Source code in src/capfinder/utils.py
log_step(step_num: int, total_steps: int, description: str) -> None
Log a step in a multi-step process.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
step_num |
int
|
The current step number. |
required |
total_steps |
int
|
The total number of steps. |
required |
description |
str
|
A description of the current step. |
required |
Returns:
Type | Description |
---|---|
None
|
None |
Source code in src/capfinder/utils.py
log_subheader(text: str) -> None
Log a centered subheader surrounded by '-' characters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text |
str
|
The text to be displayed in the header. |
required |
Returns:
Type | Description |
---|---|
None
|
None |
Source code in src/capfinder/utils.py
log_substep(text: str) -> None
Log a substep or bullet point.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text |
str
|
The text of the substep to be logged. |
required |
Returns:
Type | Description |
---|---|
None
|
None |
map_cap_int_to_name(cap_class: int) -> str
Map the integer representation of the CAP class to the CAP name.
open_database(database_path: str) -> Tuple[sqlite3.Connection, sqlite3.Cursor]
Open the database connection based on the database path.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
database_path |
str
|
Path to the database. |
required |
Returns:
Name | Type | Description |
---|---|---|
conn |
Connection
|
Connection object for the database. |
cursor |
Cursor
|
Cursor object for the database. |
Source code in src/capfinder/utils.py
save_custom_mapping(mapping: Dict[int, str]) -> None
Save the given mapping to JSON file.
Source code in src/capfinder/utils.py
update_cap_mapping(new_mapping: Dict[int, str]) -> None
visualize_alns
This module helps us to visualize the alignments of reads to a reference sequence. The module reads a FASTQ file or folder of FASTQ files, processes each read in parallel, and writes the output to a file. The output file contains the read ID, average quality, sequence, alignment score, and alignment string.
This module is useful in understanding the output of Parasail alignment.
Author: Adnan M. Niazi Date: 2024-02-28
calculate_average_quality(quality_scores: Sequence[Union[int, float]]) -> float
Calculate the average quality score for a read. Args: quality_scores (Sequence[Union[int, float]]): A list of quality scores for a read. Returns: average_quality (float): The average quality score for a read.
Source code in src/capfinder/visualize_alns.py
process_fastq_file(fastq_filepath: str, reference: str, num_processes: int, output_folder: str) -> None
Process a single FASTQ file. The function reads the FASTQ file, processes each read in parallel. The output is a file containing the read ID, average quality, sequence, alignment score, and alignment string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fastq_filepath |
str
|
The path to the FASTQ file. |
required |
reference |
str
|
The reference sequence to align the read to. |
required |
num_processes |
int
|
The number of processes to use for parallel processing. |
required |
output_folder |
str
|
The folder where the output file will be stored. |
required |
Returns:
Type | Description |
---|---|
None
|
None |
Source code in src/capfinder/visualize_alns.py
process_fastq_folder(folder_path: str, reference: str, num_processes: int, output_folder: str) -> None
Process all FASTQ files in a folder. The function reads all FASTQ files in a folder, processes each read in parallel.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
folder_path |
str
|
The path to the folder containing FASTQ files. |
required |
reference |
str
|
The reference sequence to align the read to. |
required |
num_processes |
int
|
The number of processes to use for parallel processing. |
required |
output_folder |
str
|
The folder where the output file will be stored. |
required |
Returns:
Type | Description |
---|---|
None
|
None |
Source code in src/capfinder/visualize_alns.py
process_fastq_path(path: str, reference: str, num_processes: int, output_folder: str) -> None
Process a FASTQ file or folder of FASTQ files based on the provided path.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path |
str
|
The path to the FASTQ file or folder. |
required |
reference |
str
|
The reference sequence to align the read to. |
required |
num_processes |
int
|
The number of processes to use for parallel processing. |
required |
output_folder |
str
|
The folder where the output file will be stored. |
required |
Returns:
Type | Description |
---|---|
None
|
None |
Source code in src/capfinder/visualize_alns.py
process_read(record: Any, reference: str) -> str
Process a single read from a FASTQ file. The function calculates average read quality, alignment score, and alignment string. The output is a string that can be written to a file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
record |
Any
|
A single read from a FASTQ file. |
required |
reference |
str
|
The reference sequence to align the read to. |
required |
Returns: output_string (str): A string containing the read ID, average quality, sequence, alignment score, and alignment string.
Source code in src/capfinder/visualize_alns.py
visualize_alns(path: str, reference: str, num_processes: int, output_folder: str) -> None
Main function to visualize alignments.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path |
str
|
The path to the FASTQ file or folder. |
required |
reference |
str
|
The reference sequence to align the read to. |
required |
num_processes |
int
|
The number of processes to use for parallel processing. |
required |
output_folder |
str
|
The folder where the output file will be stored. |
required |
Returns:
Type | Description |
---|---|
None
|
None |
Source code in src/capfinder/visualize_alns.py
write_ouput(output_list: List[str], output_filepath: str) -> None
Write a list of strings to a file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output_list |
list
|
A list of strings to write to a file. |
required |
output_filepath |
str
|
The path to the output file. |
required |
Returns:
Type | Description |
---|---|
None
|
None |