I am at the alignment stage of my RNA-seq workflow and have chosen STAR as the aligner. I installed it over an SSH session in PuTTY, since I am on Windows. The installation was successful, and the next step was to download the reference genome FASTA and GTF files.
I found the FASTA and GTF files on Ensembl; the links are, respectively: (1) ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz (2) ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz
All seemed well, but when I used the head command to check that the file is usable (i.e. that the first 10 lines or so show sequence as bases), it showed only the letter N repeated, which I take to mean the data is unknown or unreadable. This happened only for the FASTA file; the GTF file seems fine.
I'm not sure what to do next or how to fix this problem. Any help would be greatly appreciated. Thank you!
The code I used is the following:
wget --directory-prefix GENOME_DIR ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
#To download the ftp file
ls GENOME_DIR
#To check the respective files are in the directory indicated
gunzip GENOME_DIR/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
#To unzip the file
ls GENOME_DIR
#To check file has been unzipped
FASTA=GENOME_DIR/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
#To define the variable
head "$FASTA"
#To check the viability of the file.
It is after this point that the command simply prints a run of Ns, so I cannot continue. I also tried other links and other sources, and added the -m switch to the wget command, so I am at a loss. Again, any help would be greatly appreciated.
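Since head only shows the first 10 lines, and a sequence in a FASTA file can begin with a long run of N (unsequenced stretches are written that way), a check that looks further into the file may be more informative. The following is only a sketch: the function names first_base_line and count_records are hypothetical helpers of my own, not part of STAR or any other tool, and it assumes a grep that supports -m (GNU and BSD grep both do).

```shell
# Hypothetical helper: print the first sequence line that contains a real base.
# grep -v '^>' drops the header lines; -n prefixes each match with its line
# number within the remaining sequence lines (not within the whole file);
# -m 1 stops at the first match; -i also accepts lowercase (soft-masked) bases.
first_base_line() {
    grep -v '^>' "$1" | grep -n -m 1 -i '[ACGT]'
}

# Hypothetical helper: count the sequence records in the file.
# Every record starts with a header line beginning with '>'.
count_records() {
    grep -c '^>' "$1"
}
```

Running first_base_line "$FASTA" and count_records "$FASTA" would then show whether any A, C, G or T appears after the leading Ns, and how many sequences the file holds. If the first function prints nothing, the file contains no bases other than N.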