Bonnie Berger Leighton - Newtonville MA, US; Deniz Yorukoglu - Cambridge MA, US; Yun William Yu - Cambridge MA, US; Jian Peng - Cambridge MA, US
International Classification:
G06F 16/174 G16B 30/00 G16C 99/00 G16B 50/00
Abstract:
This disclosure provides a highly efficient and scalable compression tool that compresses quality scores, preferably by capitalizing on sequence redundancy. In one embodiment, compression is achieved by smoothing a large fraction of quality score values based on the k-mer neighborhoods of their corresponding positions in the read sequences. The approach exploits the intuition that any divergent base in a k-mer likely corresponds to either a single-nucleotide polymorphism (SNP) or a sequencing error; thus, a preferred approach is to preserve quality scores only for probable variant locations and to compress the quality scores of concordant bases, preferably by resetting them to a default value. By viewing individual read datasets through the lens of k-mer frequencies in a corpus of reads, the approach herein ensures that compression “lossiness” does not deleteriously affect accuracy.
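The smoothing idea described in the abstract can be illustrated with a minimal Python sketch. It is a simplified illustration under stated assumptions, not the disclosed implementation: the function names, the frequency threshold `min_count`, the restriction to Hamming distance 1, and the default quality character `"I"` (Phred 40 in the common Phred+33 encoding) are all hypothetical choices made for this example.

```python
from collections import Counter

def build_kmer_dictionary(corpus_reads, k, min_count):
    """Collect k-mers that occur at least min_count times in the corpus (assumed threshold)."""
    counts = Counter(
        read[i:i + k] for read in corpus_reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count >= min_count}

def hamming_neighbors(kmer):
    """Yield (offset, neighbor) for every k-mer differing from kmer at exactly one base."""
    for offset, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                yield offset, kmer[:offset] + alt + kmer[offset + 1:]

def smooth_qualities(read, quals, dictionary, k, default_q="I"):
    """Reset quality scores of concordant bases; keep them at probable variant positions."""
    n = len(read)
    covered = [False] * n    # position lies in a k-mer that matches or nearly matches the dictionary
    divergent = [False] * n  # position differs from a frequent k-mer: possible SNP or error
    for start in range(n - k + 1):
        kmer = read[start:start + k]
        hit = kmer in dictionary
        for offset, neighbor in hamming_neighbors(kmer):
            if neighbor in dictionary:
                divergent[start + offset] = True
                hit = True
        if hit:
            for pos in range(start, start + k):
                covered[pos] = True
    # Smooth only positions that are covered and never flagged as divergent.
    return "".join(
        default_q if covered[pos] and not divergent[pos] else quals[pos]
        for pos in range(n)
    )

corpus = ["ACGTACGT"] * 3
kmers = build_kmer_dictionary(corpus, k=4, min_count=2)
concordant = smooth_qualities("ACGTACGT", "!!!!!!!!", kmers, k=4)
with_mismatch = smooth_qualities("ACGTACCT", "!!!!!!#!", kmers, k=4)
unrelated = smooth_qualities("TTTTTTTT", "########", kmers, k=4)
```

In this toy run, the read that matches the corpus has every quality score reset to the default, the read with one mismatching base keeps its original score only at that base, and the read with no dictionary support is left unchanged. After smoothing, the quality string contains long runs of one symbol, which a general-purpose compressor can encode compactly.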