itertools - python enumerate out of range when looping through a file

Question

Welcome To Ask or Share your Answers For Others

itertools - python enumerate out of range when looping through a file

asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

itertools - python enumerate out of range when looping through a file

I have a file of paths called test.txt

/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz

Notice that the number of lines is even and always even, my final goal is to parse this file and create a new one looping through these paths on a two by two basis. I am trying enumerate function but this will not parse two by two. Furthermore, I'm going out of range because indexing the way I'm doing is wrong. It would also be great if someone could tell me how to index properly with enumerate.

 with open('./src/test.txt') as f:
    for index,line in enumerate(f):
        sample = re.search(r'pfg[dGT]+',line)
        sample_string = sample.group(0)
        #print(sample_string)
        print('{{"name":"{0}","readgroup":"{0}","platform_unit":"{0}","fastq_1":"{1}","fastq_2":"{2}","library":"{0}"}},'.format(sample_string,line,line[index+1]))

The result is something like this:

{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
","fastq_2":"g","library":"pfg001G"},
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","fastq_2":"r","library":"pfg001G"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pfg001T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","fastq_2":"o","library":"pfg001T"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pfg001T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","fastq_2":"u","library":"pfg001T"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pfg002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","fastq_2":"p","library":"pfg002G"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pfg002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","fastq_2":"s","library":"pfg002G"},
{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","fastq_2":"/","library":"pfg002T"},
{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz","fastq_2":"c","library":"pfg002T"},

Clearly the indexation is wrong since it's going through every element of my path that is g r etc instead of printing the next path. For the first iteration the next path printed should be: "fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz".

I believe the problem itself can be tackled with itertools more elegantly I just don't know how to do it. Would also be great if someone could tell me if an indexation with enumerate could also work.

question from:https://stackoverflow.com/questions/65913089/python-enumerate-out-of-range-when-looping-through-a-file

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-06T19:11:07+0000

One problem is that you are trying to access the data from the second line of the pair before you have read it. Additionally you can not access the second line with line[index + 1] because that refers to a character in the current line, not the next line which hasn't yet been read.

So you need to keep track of pairs of lines. You can use the index provided by enumerate() to determine whether the current line is the first (because it is an even number) or the second (because it's odd). Store the name and path for fastq_1 when you read the first line. Only write the output on the second line. Like this:

import re

with open('test.txt') as f:
    for index, line in enumerate(f):
        if index % 2 == 0:      # even, so this is the first line of a pair
            name = re.search(r'pfg[dGT]+',line).group(0)
            fastq_1 = line.rstrip()
        else:                   # odd, so second line. Emit result
            fastq_2 = line.rstrip()
            print('{{"name":"{0}","readgroup":"{0}","platform_unit":"{0}","fastq_1":"{1}","fastq_2":"{2}","library":"{0}"}},'.format(name, fastq_1, fastq_2))

line.rstrip() is required to remove the trailing new line character at the end of each line.

Categories

itertools - python enumerate out of range when looping through a file

itertools - python enumerate out of range when looping through a file

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags