There are articles everywhere talking about advances in genetic research, each finding the single gene that is the culprit for obesity, for diabetes, for cancer, and so on. In a sense, these articles take extremely complex research publications and break them down into a mainstream, digestible format for general consumption. Even so, it really isn't as simple as it sounds.

I work in a Next-Generation Sequencing Core Facility, and our workflow generally follows these steps:

  1. Researchers contact us with an experiment, and we work out the details of their design and system requirements.
  2. Researchers send us extracted and purified DNA or RNA.
  3. We take these samples and perform quality control on them. The quality-control step allows us to quantify the sample and check for degradation.
  4. After QC, we create sample libraries, which are essentially prepared samples that are ready to go on a sequencing machine.
  5. Once the libraries are validated, we place them onto one of our sequencing machines: the Illumina HiSeq 2000, Roche FLX454, or LifeTech Ion Torrent PGM.
  6. After the machines do their thing, a massive amount of data is pushed down to our main computers. This is where I take over from our lab technicians.
  7. The data are initially processed and transformed from company-specific formats to industry-standard formats. Generally this is into FASTQ format.
  8. Once data are in this format, they are placed onto a download server and made available to researchers.
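If you have never seen FASTQ, it helps to know that each read takes up four lines: an ID line starting with "@", the sequence itself, a "+" separator line, and a string of quality characters, one per base. Here is a minimal sketch of reading that layout in Python. It is only an illustration, not the software we run in the core; the function name `parse_fastq` is made up for this example, and it assumes the Phred+33 quality encoding and strictly four lines per record, which may not hold for every file.

```python
import io


def parse_fastq(handle):
    """Yield (read_id, sequence, qualities) from a four-line-per-record FASTQ stream.

    Assumption: qualities are Phred+33 encoded (ASCII code minus 33).
    """
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ for: " + header)
        yield header[1:], seq, [ord(c) - 33 for c in qual]


# Two toy reads, invented for the example.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n"
records = list(parse_fastq(io.StringIO(example)))
for read_id, seq, quals in records:
    print(read_id, seq, quals)
```

Higher quality numbers mean the machine was more confident in that base call, which is why the quality string travels with the sequence all the way through the analysis.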

Now that you have a brief overview of the workflow, the whole thing probably still seems pretty vague. Most of you probably understand what DNA and RNA are: our fundamental genetic code that makes us who we are. . . .

So what does sequencing do? Essentially, we take a person's genetic material and determine its actual code. We figure out the A-C-G-T order that makes us who we are. We generate billions and billions of these letters, in a seemingly random order. The trick to all of this is figuring out what it means. In order to do this, I can take the ACGTACGT information and "map" it back to a reference file. Think of it like having the last name "Bard." You open up your phone's contact list and scan first for the letter B and then A . . . eventually you will find a matching location. However, there is a small problem: there are many Bards. My parents, my siblings, my relatives, and even completely unrelated people all share the name. By sequencing longer stretches, we are able to increase the uniqueness of that information. Currently we are able to sequence fifty to one hundred letters at a time on one machine, and several hundred on another. This gives us a greater ability to look up where the sequence comes from. "BardJonathanE" would produce many fewer results than simply "Bard."
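The contact-list idea can be sketched in a few lines of Python. This is a toy: the reference string and the reads are invented, the helper name `find_all` is hypothetical, and it only finds exact matches by brute force. Real mapping software uses indexed references and tolerates mismatches, so treat this purely as an illustration of why longer reads land in fewer places.

```python
def find_all(reference, read):
    """Return every 0-based position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Resume one base later so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return positions


# A made-up "genome" of 19 letters.
reference = "ACGTACGTTTGACGTACCA"

# The same read, extended a few letters at a time.
hits_by_read = {read: find_all(reference, read)
                for read in ("ACGT", "ACGTAC", "ACGTACC")}

for read, hits in hits_by_read.items():
    print(read, "matches at", hits)
```

The four-letter read matches in three places, the six-letter read in two, and the seven-letter read in only one, which is the "Bard" versus "BardJonathanE" effect in miniature.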

Quality Control Step - Showing Degraded RNA vs Good RNA

But why am I fat? So this is all well and good. We know WHAT we sequenced in the genome, but understanding how that makes me fat is very complicated. We can count the number of times a specific location was seen, we can count the number of changes in a particular stretch of DNA, and there are a ton of other applications that help us figure out what the data mean. Sequencing is a pretty powerful tool, but the power really comes from understanding the data and being able to analyze them accurately. Combining many different types of data (RNA-Seq, DNA-Seq, ChIP-Seq, miRNA, exome, and the like) will truly unlock the potential of sequencing. Trying to understand all of the data is tough, and integrating it all is even more difficult.
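To make the two kinds of counting concrete, here is a small Python sketch under heavy simplifying assumptions: the reads are already mapped, each alignment is just a hypothetical (start position, read) pair with no gaps, and start positions are never negative. The function name `coverage_and_mismatches` and the data are invented for the example; this is not how a production pipeline stores alignments.

```python
def coverage_and_mismatches(reference, alignments):
    """Count, per reference position, how many reads cover it and how many disagree.

    `alignments` is a list of (start, read) pairs: ungapped, 0-based, start >= 0.
    """
    depth = [0] * len(reference)
    mismatches = [0] * len(reference)
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if pos >= len(reference):
                break  # read hangs off the end of the reference
            depth[pos] += 1
            if base != reference[pos]:
                mismatches[pos] += 1
    return depth, mismatches


reference = "ACGTACGT"
alignments = [
    (0, "ACGT"),
    (2, "GTAC"),
    (2, "GAAC"),  # differs from the reference at position 3 (A instead of T)
    (4, "ACGT"),
]

depth, mismatches = coverage_and_mismatches(reference, alignments)
print("depth:     ", depth)
print("mismatches:", mismatches)
```

A position seen many times with a consistent change is a candidate variant; a change seen once among many reads is more likely a sequencing error. Telling those apart at scale is where the hard part begins.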

So, next time you read a paper saying GeneX is the culprit for why we're fat, just remember that it most likely isn't acting alone. There is so much to the genome that we don't understand. The complex protein interactions, pathways of gene regulation, histone modifications, and chromatin accessibility are all extremely novel fields. And there are a dozen other specialized fields that I didn't even mention, some of which I had never even heard of . . . snoRNA, anyone? There is no magic bullet, no simple answer.

Besides, coming from the analysis perspective, you can never really be sure that what your programs say the biology means and what the true underlying biology actually is are really the same.