Algorithmic ideas of SegPE:
- Exact match search: The most direct way to locate artificial sequences that occur in a read without any errors.
- Regular expression matching and Hamming distance: Suitable for detecting and locating index sequences, especially when a limited number of mismatches must be tolerated.
- Process and classify PE/SE sequences: After removing the adapter and index sequences, classify the PE/SE sequences and create new PE/SE FASTQ files.
- Remove low-quality reads: This is a common step in bioinformatics to ensure that the data used for analysis is of high quality.
- Use multithreading, SIMD, and asynchronous I/O to handle large amounts of data.
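
The exact-match and Hamming-distance ideas above can be illustrated with a minimal Python sketch. It is not SegPE's implementation: the function names `hamming` and `find_index` and the `max_mismatches` parameter are hypothetical, and the sketch simply tries an exact substring search first and falls back to a sliding-window Hamming comparison.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_index(read: str, index: str, max_mismatches: int = 1):
    """Return (position, mismatches) of the best hit of `index` in `read`, or None.

    Hypothetical helper for illustration only.
    """
    # Fast path: exact substring search.
    pos = read.find(index)
    if pos != -1:
        return pos, 0
    # Fallback: slide a window over the read and keep the leftmost
    # window with the fewest mismatches, if it is within tolerance.
    best = None
    k = len(index)
    for i in range(len(read) - k + 1):
        d = hamming(read[i:i + k], index)
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (i, d)
    return best
```

For example, `find_index("TTACGAACGG", "ACGTA", max_mismatches=1)` reports a hit at position 2 with one mismatch, while the same call with `max_mismatches=0` finds nothing.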
Usage: segpe [OPTIONS] --five-art-fa <FIVE_ART_FA> --three-art-fa <THREE_ART_FA> --five-idx-fa <FIVE_IDX_FA> --pe1-fastq <PE1_FASTQ>
--five-art-fa <FIVE_ART_FA>
Path of 5' artificial fasta file
--three-art-fa <THREE_ART_FA>
Path of 3' artificial fasta file
--five-idx-fa <FIVE_IDX_FA>
Path of 5' index fasta file
--three-idx-fa <THREE_IDX_FA>
Path of 3' index fasta file
Location of index, 1: PE1 2: PE2 3: both
-s, --seed-len <SEED_LEN>
Seed length; must not exceed the index length
--error-tolerance <ERROR_TOLERANCE>
Number of errors tolerated
-m, --match-score <MATCH_SCORE>
Alignment match score
--error-score <ERROR_SCORE>
Alignment mismatch score
--gap-open-score <GAP_OPEN_SCORE>
Alignment gap open score
--gap-extend-score <GAP_EXTEND_SCORE>
Alignment gap extend score
Low-quality pruning threshold (Phred score; the ASCII offset is recognized automatically); no pruning if not set
--quality-ascii-offset <QUALITY_ASCII_OFFSET>
Whether to trim N/non-ACGT at both ends
--length-offset <LENGTH_OFFSET>
Minimum read length to retain after low-quality bases are removed; shorter reads are classified as low quality
Whether to trim poly-A/C/G/T at both ends
-n, --num-threads <NUM_THREADS>
Number of concurrent threads
Batch size: number of reads each thread handles at a time
Add trimming information to the read names
Path of output directory
Print help (see a summary with '-h')
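
As a rough illustration of how a Phred threshold and a minimum retention length can interact, here is a minimal Python sketch. It is an assumption-laden example, not SegPE's actual logic: the names `guess_offset` and `trim_low_quality` are hypothetical, the offset guess is a simple per-read heuristic, and trimming is done only from the two ends of the read.

```python
def guess_offset(qual: str) -> int:
    """Guess the Phred ASCII offset of a quality string (heuristic, assumed).

    Characters with an ASCII code below 59 cannot occur in Phred+64 data,
    so their presence implies Phred+33.
    """
    return 33 if min(map(ord, qual)) < 59 else 64


def trim_low_quality(seq, qual, threshold, min_len, offset=None):
    """Trim bases below `threshold` from both ends of a read.

    Returns (trimmed_seq, trimmed_qual, kept), where `kept` is False when
    the trimmed read is shorter than `min_len`.
    """
    if offset is None:
        offset = guess_offset(qual)
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(scores)
    # Advance past low-quality bases at the 5' end.
    while start < end and scores[start] < threshold:
        start += 1
    # Step back over low-quality bases at the 3' end.
    while end > start and scores[end - 1] < threshold:
        end -= 1
    kept = (end - start) >= min_len
    return seq[start:end], qual[start:end], kept
```

With `seq = "ACGTACGT"` and `qual = "##IIII##"`, a threshold of 20 keeps the four central bases; the read is retained with `min_len=4` and flagged as low quality with `min_len=5`.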