Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
171 views
in Technique[技术] by (71.8m points)

python - How to use a sorted list as a filter?

I want to get all oneKmers that don't exist in in.txt.
in.txt is sorted in the same order as oneKmer at column 0.

It should be doable in O(N) instead of O(N^2) since both lists are in the same order.

How can I write this ?

import csv
import itertools

tsvfile = open('in.txt', "r")
tsvreader = csv.reader(tsvfile, delimiter=" ")

for i in itertools.product('ACTG', repeat = 18):
    oneKmer = ''.join(i)
    flag = 1
    with open(InFile) as tsvfile:
        tsvreader = csv.reader(tsvfile, delimiter=" ")
        for line in tsvreader:
            if line[0] == oneKmer:
                flag = 0
                break
    if flag:
        print(oneKmer)

in.txt:

AAAAAAAAAAAAAAAAAA 1400100
AAAAAAAAAAAAAAAAAC 37055
AAAAAAAAAAAAAAAAAT 70686
AAAAAAAAAAAAAAAAAG 192363
AAAAAAAAAAAAAAAACA 20042
AAAAAAAAAAAAAAAACC 12965
AAAAAAAAAAAAAAAACT 10596
AAAAAAAAAAAAAAAACG 1732
AAAAAAAAAAAAAAAATA 16440
AAAAAAAAAAAAAAAATC 18461
...

The whole in.txt file is 38,569,002,592 bytes with 1,836,020,688 lines.

The expected result should be (4^18 - 1,836,020,688) lines of strings. Of course I will further filter them later in the script.


For an easy example, say I want to print the integers <16 that don't exist in a given sorted list [3,5,6,8,10,11]. The result should be [1,2,4,7,9,12,13,14,15]. The given list is huge, so I want to read it one element at a time. So when I read 3, I know I can print out 1 and 2. Then skip 3, and read the next 5, now I can print out 4 and skip 5.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

A few solutions, all processing the supersequence and the subsequence in parallel, taking linear time and constant memory.

Using your easy example:

full = iter(range(1, 16))
skip = iter([3,5,6,8,10,11])

Solution 0: (the one I came up with last, but should've done first)

s = next(skip, None)
for x in full:
    if x == s:
        s = next(skip, None)
    else:
        print(x)

Solution 1:

from heapq import merge
from itertools import groupby

for x, g in groupby(merge(full, skip)):
    if len(list(g)) == 1:
        print(x)

Solution 2:

for s in skip:
    for x in iter(full.__next__, s):
        print(x)
for x in full:
    print(x)

Solution 3:

from functools import partial

until = partial(iter, full.__next__)
for s in skip:
    for x in until(s):
        print(x)
for x in full:
    print(x)

Solution 4:

from itertools import takewhile

for s in skip:
    for x in takewhile(s.__ne__, full):
        print(x)
for x in full:
    print(x)

Output of all solutions:

1
2
4
7
9
12
13
14
15

Solution 0 for your actual problem:

import csv
import itertools

with open('in.txt') as tsvfile:
    tsvreader = csv.reader(tsvfile, delimiter=' ')
    skip = next(tsvreader, [None])[0]
    for i in itertools.product('ACTG', repeat=18):
        oneKmer = ''.join(i)
        if oneKmer == skip:
            skip = next(tsvreader, [None])[0]
        else:
            print(oneKmer)

Slight variation:

import csv
from itertools import product
from operator import itemgetter

with open('in.txt') as tsvfile:
    tsvreader = csv.reader(tsvfile, delimiter=' ')
    skips = map(itemgetter(0), tsvreader)
    skip = next(skips, None)
    for oneKmer in map(''.join, product('ACTG', repeat=18)):
        if oneKmer == skip:
            skip = next(skips, None)
        else:
            print(oneKmer)

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...