Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
131 views
in Technique[技术] by (71.8m points)

python - Find pattern that is present twice and allow <=2 mismatches on each

I have a fastq file of 400,000 reads (so speed is important). In the sequences there are barcodes integrated that should be present twice. Given a barcode, I want to find the sequences that have the barcode present twice with <= 2 mismatches. So, with a barcode 'ATTCGACCGATAGG', I would like to retrieve all of the following sequences-

TATCTTGTGGAAAGGACGAAACACCGAACACAAAGCATAGATGCGTTTAAGAGCTATGCTGGAAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTTATTCGACCGATAGGGGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAGATTCGACCGATAGGTGGCGTAACTAGATCTTGAGACAAA TATCTTGTGGAAAGGACGAAACACCGGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTTATTCGACCGATAGGGGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAGATTCGACCGATAGGTGGCGTAACTAGATCTTGAGACAAA TATCTTGTGGAAAGGACGAAACACCGAGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTTATTCGACCGATAGGGGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAGATTCGACCGATAGGTGGCGTAACTAGATCTTGAGACAAA TATCTTGTGGAAAGGACGAAACACCGAGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTTATTCGACGATAGGGGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAGATTCGACCGATAGGTGGCGTAACTAGATCTTGAGACAAA

Note that the first barcode in the fourth sequence is short of one character. I have tried with biopython and regex but it's just too slow given I have 5k barcodes. I am wondering if there is a fast solution available in python or in something like grep, awk or anything else. Thanks.

question from:https://stackoverflow.com/questions/65878893/find-pattern-that-is-present-twice-and-allow-2-mismatches-on-each

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Using GNU awk:

 awk '{ for (i=1;i<=NF;i++) { fnd=0;subs=$i;while (match(subs,"ATTCGACCGATAGG")) { subs=substr(subs,RSTART+RLENGTH);if (RSTART>0) { fnd++;print fnd } } if (fnd <=2) { print $i } } }' file

Explanation:

 awk '{ for (i=1;i<=NF;i++) {                           # Loop on each space delimited field
         fnd=0;                                         # Initialise fnd variable/counter
         subs=$i;                                       # Initialise substring variable
         while (match(subs,"ATTCGACCGATAGG")) { 
           subs=substr(subs,RSTART+RLENGTH);            # Check for multiple matches of "ATTCGACCGATAGG" in subs.
           if (RSTART>0) { 
              fnd++;                                    # Increment fnd if string found in subs
           } 
         } 
         if (fnd <=2) { 
            print $i                                    # If found twice or less than twice print the field
         }
        } 
       }' file

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

57.0k users

...