Here I’ll summarize some Linux commands that can help us work with the millions of DNA sequences produced by Next-Generation Sequencing (NGS).
A file storing biological sequences with the extension ‘.fastq’ or ‘.fq’ is a file in FASTQ format; if it is also compressed with GZIP, the suffix will be ‘.fastq.gz’ or ‘.fq.gz’. A FASTQ file usually contains millions of sequences and takes up tens of gigabytes of disk space. When these files are compressed with GZIP their size is reduced more than 10-fold (the ZIP format is less efficient).
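For reference, each FASTQ record spans exactly four lines. A minimal sketch with a made-up read shows the layout and the compression step:

```shell
# A FASTQ record spans exactly four lines:
#   1) '@' + read identifier   2) bases
#   3) '+' separator           4) per-base quality characters
cat > sample.fq <<'EOF'
@read1
ATGATGATG
+
IIIIIIIII
EOF
gzip -f sample.fq        # produces sample.fq.gz
```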
In the next lines I’ll show you some commands to deal with compressed FASTQ files; with minor changes they can also be used with uncompressed files and with FASTA format files.
To start, let’s compress a FASTQ file in GZIP format:
> gzip reads.fq
The resulting file will be named ‘reads.fq.gz’ by default.
If we want to check the contents of the file we can use the command ‘less’ or ‘zless’:
> less reads.fq.gz
> zless reads.fq.gz
And to count the number of sequences stored in the file we can count the number of lines and divide by 4:
> zcat reads.fq.gz | echo $((`wc -l`/4))
256678360
If the file is in FASTA format, we will count the number of sequences like this:
> grep -c "^>" reads.fa
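If the FASTA file is gzipped, zgrep does the same without an explicit zcat. A sketch with a made-up two-sequence file:

```shell
# Two made-up sequences in a gzipped FASTA file
printf '>seq1\nATG\n>seq2\nGGG\n' | gzip > reads_demo.fa.gz
# zgrep searches inside the compressed file directly
zgrep -c '^>' reads_demo.fa.gz    # prints 2
```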
We can also count how many reads contain a specific subsequence (note that grep -c counts matching lines, not total occurrences):
> zgrep -c 'ATGATGATG' reads.fq.gz
398065
> zcat reads.fq.gz | awk '/ATGATGATG/ {nlines = nlines + 1} END {print nlines}'
398065
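The line-versus-occurrence distinction matters when a motif repeats inside one read; a made-up example:

```shell
# One read containing the motif twice: grep -c still reports 1,
# because it counts matching lines, not occurrences
printf '@r1\nATGATGATGccATGATGATG\n+\nIIIIIIIIIIIIIIIIIIII\n' | gzip > motif_demo.fq.gz
zgrep -c 'ATGATGATG' motif_demo.fq.gz    # prints 1
```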
Or extract examples of sequences that contain the subsequence:
> zcat reads.fq.gz | head -n 20000 | grep --no-group-separator -B1 -A2 ATGATGATG
For FASTA files:
> zcat reads.fa.gz | head -n 10000 | grep --no-group-separator -B1 ATGATGATG
Or extract sequences whose header contains some pattern (for a 2-line-per-record FASTA file, -A1 keeps the sequence line that follows each matching header):
> zcat reads.fa.gz | grep --no-group-separator -A1 'OS=Homo sapiens'
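The -B/-A context flags are what rebuild the full record around the matching line. A sketch with made-up reads:

```shell
# Two reads, only the first contains the motif
printf '@r1\nATGATGATG\n+\nIIIIIIIII\n@r2\nCCCCCCCCC\n+\nIIIIIIIII\n' | gzip > ctx_demo.fq.gz
# -B1 recovers the header line, -A2 the '+' and quality lines
zcat ctx_demo.fq.gz | grep --no-group-separator -B1 -A2 ATGATGATG > hit.fq
```

Only the four lines of the matching record end up in hit.fq.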
Sometimes we will be interested in extracting a small part of the big file to use for testing our processing methods, e.g. the first 1000 sequences (4000 lines):
> zcat reads.fq.gz | head -4000 > test_reads.fq
> zcat reads.fq.gz | head -4000 | gzip > test_reads.fq.gz
Or extract a range of lines (1000001-1000004):
> zcat reads.fq.gz | sed -n '1000001,1000004p;1000005q' > lines.fq
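The same sed idiom generalizes to pulling out a single record: record N of a FASTQ file occupies lines 4*(N-1)+1 through 4*N. A sketch on a made-up three-record file:

```shell
# Toy gzipped FASTQ with 3 records (12 lines)
printf '@r1\nAAA\n+\nIII\n@r2\nCCC\n+\nIII\n@r3\nGGG\n+\nIII\n' | gzip > range_demo.fq.gz
# Record 2 occupies lines 5-8; the trailing '9q' makes sed quit
# early instead of reading the whole stream
zcat range_demo.fq.gz | sed -n '5,8p;9q' > record2.fq
```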
If we want to extract 1000 random sequences:
> cat reads.fq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("X#&X");} }' | shuf | head -1000 | sed 's/X#&X/\n/g' > reads.1000.fq
Or from a FASTA file:
> cat reads.fa | awk '{if ((NR%2)==0)print prev"X#&X"$0;prev=$0;}' | shuf | head -1000 | sed 's/X#&X/\n/g' > reads.1000.fa
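The trick in both one-liners is to fold each record onto a single line (joined by the arbitrary marker X#&X) so that shuf shuffles whole records, then unfold. It is easy to verify on a toy FASTA file (file names made up):

```shell
# Toy FASTA with 10 one-line sequences
for i in $(seq 1 10); do printf '>s%d\nACGT\n' "$i"; done > toy.fa
# Fold each record onto one line, shuffle, keep 3, unfold again
cat toy.fa \
  | awk '{if ((NR%2)==0) print prev"X#&X"$0; prev=$0;}' \
  | shuf | head -3 \
  | sed 's/X#&X/\n/g' > toy.3.fa
```

The sampled file keeps the header + sequence structure: 3 records, 6 lines.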
A more complex but sometimes useful Perl one-liner to get statistics about the lengths of the sequences:
> perl -ne '{ $c++; if(($c-2)%4==0){chomp; $lens{length($_)}++} }; END { print "len\tseqs\n"; while(my($key,$value)=each %lens){print "$key\t$value\n"} }' reads.fastq
Or with AWK:
> awk '{ c++; if((c-2)%4==0){ ++a[length()] } } END{ print "len\tseqs"; for (i in a) print i"\t"a[i]}' reads.fastq
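The table is easy to verify on a toy file (made-up reads, assuming the strict 4-lines-per-record layout):

```shell
# Two records with sequence lengths 3 and 5
printf '@r1\nATG\n+\nIII\n@r2\nATGCA\n+\nIIIII\n' > len_demo.fq
# Line (c-2)%4==0 is the sequence line of each 4-line record;
# length() with no argument measures the whole line
awk '{ c++; if((c-2)%4==0){ ++a[length()] } } END{ print "len\tseqs"; for (i in a) print i"\t"a[i] }' len_demo.fq > len_table.txt
```

The output table reports one sequence of length 3 and one of length 5.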
We may be interested in knowing how much disk space the compressed file is using:
> gzip --list reads.fq.gz
compressed uncompressed ratio uncompressed_name
18827926034 1431825024 -1215.0% reads.fq
It seems that GZIP expands the file instead of compressing it, but that is not true. The reason is that the ‘--list’ option doesn’t report sizes correctly for big files: GZIP stores the uncompressed size modulo 2^32, so it wraps around for files over 4 GB (indeed, 61561367168 mod 2^32 = 1431825024, the figure reported above). The solution is a slower but more reliable method to find the real uncompressed size:
> zcat reads.fq.gz | wc --bytes
61561367168
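On files under 4 GB both methods agree, which is an easy way to convince yourself that the zcat pipeline reports the true size. A sketch with a made-up file:

```shell
# 100 kB made-up file: small enough that gzip --list is also correct
head -c 100000 /dev/zero > size_demo.txt
gzip -f size_demo.txt
gzip --list size_demo.txt.gz          # 'uncompressed' column: 100000
zcat size_demo.txt.gz | wc --bytes    # prints 100000
```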
It can also be interesting to split the file into several smaller ones, e.g. with 1 million sequences each (4 million lines in a FASTQ file):
> zcat reads.fq.gz | split -d -l 4000000 - reads.split
> gzip reads.split*
> rename 's/\.split(\d+)/.$1.fq/' reads.split*
And later we can join the files again:
> zcat reads.*.fq.gz | gzip > reads.fq.gz
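After splitting and rejoining it is worth checking that no read was lost: the sequence counts before and after should match. A sketch on a synthetic file (names made up):

```shell
# Toy gzipped FASTQ: 6 records = 24 lines
for i in $(seq 1 6); do printf '@r%d\nACGT\n+\nIIII\n' "$i"; done | gzip > toy6.fq.gz
# Split into chunks of 2 records (8 lines), compress, rejoin;
# the numeric suffixes from 'split -d' keep the glob in order
zcat toy6.fq.gz | split -d -l 8 - toy6.split
gzip -f toy6.split*
zcat toy6.split*.gz | gzip > toy6.joined.fq.gz
```

Both toy6.fq.gz and toy6.joined.fq.gz should decompress to the same 24 lines.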
My original article in Spanish can be read here:
http://bioinfoperl.blogspot.com/2015/01/algunos-comandos-utiles-linux-ficheros-fastq-ngs.html
Some of these commands and extra ideas can be found here:
Maybe `gztool` can be useful (https://github.com/circulosmeos/gztool): it can extract parts of a gzip file without decompressing it from the start, as is usually the case with other commands. It does so at the expense of creating an index file (built while compression or decompression occurs, so there is no wasted time) which takes ~0.3% of the gzip size, or much less if gztool itself compresses the data.