Here I’ll summarize some Linux commands that help us work with millions of DNA sequences from Next Generation Sequencing (NGS).
A file storing biological sequences with the extension ‘.fastq’ or ‘.fq’ is in FASTQ format; if it is also compressed with GZIP, the suffix will be ‘.fastq.gz’ or ‘.fq.gz’. A FASTQ file usually contains millions of sequences and takes up dozens of gigabytes on disk. When these files are compressed with GZIP, their size can shrink by more than a factor of 10 (the ZIP format is less efficient).
In the next lines I’ll show you some commands for working with compressed FASTQ files; with minor changes they can also be used with uncompressed files and with FASTA format files.
To start, let’s compress a FASTQ file in GZIP format:
> gzip reads.fq
The resulting file will be named ‘reads.fq.gz’ by default.
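By default gzip removes the original file once the compressed copy is written. If we prefer to keep both, a minimal sketch like the following may help. The file name ‘example_reads.fq’ and its two records are made up for illustration, and everything happens in a scratch directory.

```shell
# Work in a scratch directory so no real data is touched.
tmp=$(mktemp -d)

# Hypothetical two-record FASTQ file, used only for illustration.
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$tmp/example_reads.fq"

# -c writes the compressed stream to stdout, so the input file is left in place.
gzip -c "$tmp/example_reads.fq" > "$tmp/example_reads.fq.gz"

# -t tests the integrity of the compressed file without extracting it.
gzip -t "$tmp/example_reads.fq.gz" && echo "compressed file is intact"

# Both the original and the compressed copy should be listed.
ls "$tmp"
```

Recent versions of gzip also accept a ‘-k’ (keep) option for the same purpose, but redirecting the output of ‘gzip -c’ should work on older systems too.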
If we want to check the contents of the file we can use the command ‘less’ or ‘zless’:
> less reads.fq.gz
> zless reads.fq.gz
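When we only want a quick look at the first few records instead of paging through the file, one possible approach is to decompress to standard output and keep just the first lines. The sketch below builds a tiny hypothetical file (‘example_reads.fq.gz’, three invented records) so that it can be run as is.

```shell
# Scratch directory with a hypothetical three-record compressed FASTQ file.
tmp=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n@read3\nGGCA\n+\nIIII\n' \
    | gzip -c > "$tmp/example_reads.fq.gz"

# gzip -dc decompresses to stdout (the same thing zcat does);
# head keeps the first 8 lines, i.e. the first two 4-line records.
first_records=$(gzip -dc "$tmp/example_reads.fq.gz" | head -n 8)
echo "$first_records"
```

Because head stops reading after 8 lines, the rest of the file does not need to be decompressed, which matters with files of many gigabytes.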
And to count the number of sequences stored in the file, we can count the number of lines and divide by 4, since each FASTQ record normally takes four lines:
> echo $(( $(zcat reads.fq.gz | wc -l) / 4 ))
256678360
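Dividing by 4 only gives the right answer when every record really occupies four lines. As a small sketch, assuming a hypothetical three-record file, we could check that the line count is a multiple of 4 before trusting the result. Counting lines that start with ‘@’ is not a safe shortcut, because ‘@’ is also a valid quality character and a quality line may begin with it.

```shell
# Scratch directory with a hypothetical three-record compressed FASTQ file.
tmp=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n@read3\nGGCA\n+\nIIII\n' \
    | gzip -c > "$tmp/example_reads.fq.gz"

# Count all lines in the decompressed stream.
total_lines=$(gzip -dc "$tmp/example_reads.fq.gz" | wc -l)

# Warn if the file cannot consist purely of 4-line records.
if [ $((total_lines % 4)) -ne 0 ]; then
    echo "warning: line count is not a multiple of 4" >&2
fi

# Integer division gives the number of records.
n_reads=$((total_lines / 4))
echo "$n_reads"
```

For the example file above this prints 3.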
If the file is in FASTA format, we will count the number of sequences like this:
> grep -c "^>" reads.fa
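If the FASTA file is compressed as well, the same count can be obtained by decompressing to standard output and piping into grep. This is only a sketch: ‘example_reads.fa.gz’ and its two sequences are invented for illustration.

```shell
# Scratch directory with a hypothetical two-sequence compressed FASTA file.
# The first sequence spans two lines, which is common in FASTA files.
tmp=$(mktemp -d)
printf '>seq1\nACGTACGT\nACGT\n>seq2\nTTGACA\n' | gzip -c > "$tmp/example_reads.fa.gz"

# Count header lines (those starting with '>') in the decompressed stream.
n_seqs=$(gzip -dc "$tmp/example_reads.fa.gz" | grep -c '^>')
echo "$n_seqs"
```

For the example file this prints 2. Counting headers, rather than dividing the line count, is what makes this work when a sequence is wrapped over several lines.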