Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


bash - find and replace substrings in a file which match strings in another file

I have two txt files: File1 is a tsv with 9 columns.

Following is its first row (SRR6691737.359236/0_14228//11999_12313 is the first column, and the text starting at Repeat is the 9th column):

SRR6691737.359236/0_14228//11999_12313  Censor  repeat  5       264     1169    +       .       Repeat BOVA2 SINE 1 260 9

File2 is also a tsv with 9 columns.

Following is its first row (the text starting at Read is the 9th column):

CM011822.1  reefer  discordance 63738705    63738727    .   +   .   Read SRR6691737.359236 11999 12313; Dup 277

File1 contains the read name (SRR6691737.359236), the read length (0_14228) and the coordinates (11999_12313), while File2 contains only the read name and the coordinates.
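To make the layout of that first column concrete, here is a minimal Python sketch that splits it into its parts. The function name parse_read_id is hypothetical, and the sketch assumes every ID has exactly the name/length//start_end form of the example row.

```python
def parse_read_id(read_id):
    """Split 'name/length//start_end' into (name, length, start, end).

    Assumes the layout of the example row; anything else raises ValueError.
    """
    left, sep, coords = read_id.partition("//")
    if not sep:
        raise ValueError(f"no '//' separator in {read_id!r}")
    name, sep, length = left.partition("/")
    if not sep:
        raise ValueError(f"no '/' between name and length in {read_id!r}")
    start, sep, end = coords.partition("_")
    if not sep:
        raise ValueError(f"no '_' between coordinates in {read_id!r}")
    return name, length, start, end


print(parse_read_id("SRR6691737.359236/0_14228//11999_12313"))
# ('SRR6691737.359236', '0_14228', '11999', '12313')
```

The name and the two coordinates are what File2 also carries, so together they can serve as the matching key; the length is the piece that has to be copied over.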

All read names and coordinates in File1 are present in File2, but File2 may also contain the same read names with different coordinates.

File2 also contains read names which are not present in File1.

I want to write a script which finds the read names and coordinates in File2 that match those in File1 and adds the read length from File1 to File2.

That is, it should change the last column of File2 from:

Read SRR6691737.359236 11999 12313; Dup 277

to:

Read SRR6691737.359236/0_14228//11999_12313; Dup 277

Any help?

  asked by Mani Ghani poor Samami (translated from Stack Overflow)


1 Answer


It is unclear what your input files look like.

You write:

I have two txt files: File1 is a tsv with 9 columns.

Following is its first row ( SRR6691737.359236/0_14228//11999_12313 is the first column and after Repeat is the 9th column):

 SRR6691737.359236/0_14228//11999_12313 Censor repeat 5 264 1169 + . Repeat BOVA2 SINE 1 260 9 

If I try to check the columns (and put them in 'Column,Value' pairs):

Column,Value
1,SRR6691737.359236/0_14228//11999_12313
2,Censor
3,repeat
4,5
5,264
6,1169
7,+
8,.
9,Repeat
10,BOVA2
11,SINE
12,1
13,260
14,9

That looks like 14 columns, but you specify 9 columns...
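The two counts can be reconciled if the 14 comes from splitting on any whitespace and the 9 from splitting on tabs only. A small Python sketch, assuming the row really is tab-separated as the question states (the tabs below are reconstructed from that statement, not copied from the real file):

```python
# Example row rebuilt with explicit tabs between the 9 stated columns.
row = "\t".join([
    "SRR6691737.359236/0_14228//11999_12313", "Censor", "repeat",
    "5", "264", "1169", "+", ".", "Repeat BOVA2 SINE 1 260 9",
])

by_whitespace = row.split()    # splits on tabs AND on the spaces in column 9
by_tab = row.split("\t")       # splits on tabs only

print(len(by_whitespace))  # 14
print(len(by_tab))         # 9
print(by_tab[8])           # Repeat BOVA2 SINE 1 260 9
```

Whether this explains the discrepancy depends on the real file; once tabs have been converted to spaces, as in the pasted row, the two cases can no longer be told apart.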

Can you edit your question and make this clear?

For example, give the row as csv: SRR6691737.359236/0_14228//11999_12313,Censor,repeat,5,.....

Added info, after feedback: file1 contains the following fields (tab-separated):

  1. SRR6691737.359236/0_14228//11999_12313
  2. Censor
  3. repeat
  4. 5
  5. 264
  6. 1169
  7. +
  8. .
  9. Repeat BOVA2 SINE 1 260 9

You want to convert this (using a script) to a tab-separated file:

  1. CM011822.1
  2. reefer
  3. discordance
  4. 63738705
  5. 63738727
  6. .
  7. +
  8. .
  9. Read SRR6691737.359236 11999 12313; Dup 277

More info is needed to solve this!

Field 1: How/where is the info for 'CM011822.1' coming from?

Fields 2 and 3: 'reefer'/'discordance'. Is this fixed text? Should these fields always contain these values, or are there exceptions?

Fields 4 and 5: Where are these values (63738705; 63738727) coming from?

OK, it's clear that there are more questions to be asked than answers I can give here …
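That said, if the question is taken at face value, the requested rewrite can be sketched in Python as below. The assumptions are that both files are tab-separated and that the last column of file2 always starts with "Read <name> <start> <end>;". The function names and the regular expression are illustrative and not given in the question. Rows of file2 without a match in file1 are passed through unchanged.

```python
import re

# Assumed layout of file2's last column: "Read <name> <start> <end>; ..."
READ_RE = re.compile(r"^Read (\S+) (\d+) (\d+)(?=;)")


def build_lookup(file1_lines):
    """Map (name, start, end) -> full read ID taken from column 1 of file1."""
    lookup = {}
    for line in file1_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        read_id = line.split("\t")[0]           # name/length//start_end
        left, _, coords = read_id.partition("//")
        name = left.partition("/")[0]
        start, _, end = coords.partition("_")
        lookup[(name, start, end)] = read_id
    return lookup


def annotate(file1_lines, file2_lines):
    """Return file2 rows with matching reads replaced by the full file1 ID."""
    lookup = build_lookup(file1_lines)
    out = []
    for line in file2_lines:
        cols = line.rstrip("\n").split("\t")
        m = READ_RE.match(cols[-1])
        if m and m.groups() in lookup:
            # Keep everything from the ';' onward (e.g. '; Dup 277').
            cols[-1] = "Read " + lookup[m.groups()] + cols[-1][m.end():]
        out.append("\t".join(cols))
    return out


# In-memory demo built from the two example rows in the question.
file1 = ["SRR6691737.359236/0_14228//11999_12313\tCensor\trepeat\t5\t264"
         "\t1169\t+\t.\tRepeat BOVA2 SINE 1 260 9\n"]
file2 = ["CM011822.1\treefer\tdiscordance\t63738705\t63738727\t.\t+\t."
         "\tRead SRR6691737.359236 11999 12313; Dup 277\n"]
print(annotate(file1, file2)[0].split("\t")[-1])
# Read SRR6691737.359236/0_14228//11999_12313; Dup 277
```

For real files, the two arguments can be open file objects, since both functions only iterate over lines. The lookup is keyed on name plus both coordinates, so a read that appears in file2 with different coordinates is left untouched, which is what the question asks for.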

