Today I was asked to create a data retention policy to manage ~60TB of our generated sequencing data. This data is a mix: in-house and customer, collaborative and cross-institutional, generated at our facility and elsewhere. Surely no single policy can cover all of these cases, but nevertheless, I have to formulate something that will work.

The root of the issue is the cost of long-term storage versus the onslaught of data being generated. In an ideal world we could continue to throw resources into hosting these enormous genomic datasets. In reality, the cost of powering and cooling our massive data storage arrays eventually far outweighs the cost of just sequencing again.

A typical sequencing project goes from raw DNA to a FASTQ file (FASTA with quality scores). A record in this file looks something like this:

@UB-NGS-01:220:FC1:4:1101:14645:9090 1:N:0:CGATGT
CAAATTTCAGAGCATTGGCCATAGAATAACCCTGGTCGGTT
+
CCCFFFFFHHHHHJJJJIJJJJJJJJIJJJJJJJJHIJJHH

The first line tells me a lot about where this sequence came from. It tells me that it was sequenced on UB-NGS-01 (our primary sequencer here at UB), on Flowcell 1, as well as where on the flowcell the read came from (lane 4, followed by the tile and the x/y coordinates of the cluster). Line two is the actual DNA sequence. At this stage, I have no idea where this DNA belongs in the genome, but after some processing, I can figure that out.
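To make the header fields concrete, here is a minimal Python sketch that splits the first line into named parts. It assumes the header follows the common Illumina layout (instrument, run number, flowcell ID, lane, tile, x, y, then read number, filter flag, control number, and index sequence); the `FastqHeader` and `parse_fastq_header` names are my own, for illustration only, and headers from other platforms or older pipelines may be laid out differently.

```python
from typing import NamedTuple


class FastqHeader(NamedTuple):
    # Field names are illustrative; they follow the assumed Illumina layout.
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    read: int
    is_filtered: bool
    control_number: int
    index_sequence: str


def parse_fastq_header(line: str) -> FastqHeader:
    """Split an Illumina-style FASTQ header line into named fields."""
    if not line.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    # The header has two space-separated halves, each colon-delimited.
    location, description = line[1:].strip().split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, filtered, control, index = description.split(":")
    return FastqHeader(
        instrument, int(run), flowcell, int(lane), int(tile), int(x), int(y),
        int(read), filtered == "Y", int(control), index,
    )


header = parse_fastq_header("@UB-NGS-01:220:FC1:4:1101:14645:9090 1:N:0:CGATGT")
print(header.instrument, header.flowcell_id, header.lane)  # UB-NGS-01 FC1 4
```

For the example record, this reports instrument UB-NGS-01, flowcell FC1, and lane 4, which is the provenance information described above.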

Storing these FASTQ files can take up quite a lot of space. Even compressed, we are currently sitting on 30TB of FASTQ files from all of the experiments generated at our facility. These are the most essential pieces of raw data, since all downstream analysis can be performed from them.
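A number like that is easy to re-measure. The sketch below, assuming compressed files end in `.fastq.gz` or `.fq.gz` (a naming convention I am assuming, not one guaranteed by any sequencer), walks a directory tree and totals their size. The function names are hypothetical.

```python
from pathlib import Path


def total_fastq_bytes(root: str, patterns=("*.fastq.gz", "*.fq.gz")) -> int:
    """Sum the on-disk size of compressed FASTQ files under root."""
    total = 0
    for pattern in patterns:
        # rglob searches root and every subdirectory beneath it.
        for path in Path(root).rglob(pattern):
            if path.is_file():
                total += path.stat().st_size
    return total


def human_readable(num_bytes: int) -> str:
    """Format a byte count using binary (1024-based) units, capped at TB."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1024 or unit == "TB":
            return f"{size:.1f} {unit}"
        size /= 1024


if __name__ == "__main__":
    # "." is a placeholder; point this at the real data directory.
    print(human_readable(total_fastq_bytes(".")))
```

Running this per project directory rather than over the whole array would show which experiments dominate the total, which is the kind of breakdown a retention policy needs.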

So back to my problem: how long do I really need to keep these FASTQ files? Should I trust the average researcher to be responsible and hold on to their data? Do I care if they lose it? There is a delicate balance between being a safety net against catastrophic hard drive crashes and over-protecting to the point of significant economic loss.

It is an exciting time to be working in the field of Bioinformatics, but there are very real issues with regard to so-called "Big Data", and some tough decisions are going to have to be made about the long-term value and retention of what we've produced thus far.