Life in the FASTQ Lane
March 25, 2022
While many “bench scientists” are familiar with the workflows of ChIP-Seq and ATAC-Seq, and even the preparation and analysis of the libraries, the steps between sequencing and fully analyzed data are sometimes thought of as a mystery known only to bioinformatics experts. Most of us have some understanding that the raw data is usually in a file format called FASTQ. But how do we get from FASTQ files to peaks on a genome browser? This article will provide a peek behind the curtain of the informatic analysis we perform at Active Motif as part of our end-to-end epigenetic services.
The Sequencer ‐ Where it Begins
To begin, the flow cell of the Illumina sequencer is loaded with multiplexed libraries, and the wet lab portion of our ChIP-Seq experiment is complete. The sequencer writes its output in BCL (binary base call) format, which contains data from all the multiplexed libraries on the flow cell. At this point, every individual library has been sequenced, but the reads from all of them are mixed together. Using the barcoding scheme that multiplexed the libraries together, we can now perform what is called demultiplexing, which sorts the reads so that each sample's reads end up in their own file. The first informatic step is to complete the Illumina Sample Sheet. This is a CSV (comma-separated values) file, editable in Excel, that tells the software 1) which library has which barcode sequence, 2) whether the run was single-end or paired-end, and 3) how many cycles were run (also called the read length). With the Sample Sheet in place, the Illumina software bcl2fastq performs the demultiplexing and generates an individual data file, called a FASTQ file, for each of the libraries. For our customers who choose to perform their own analysis, the FASTQ files are returned to them as their finished product. For those who take advantage of our end-to-end service, additional analysis is provided as described below.
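To make the idea of demultiplexing concrete, here is a minimal Python sketch of the sorting logic. It is not how bcl2fastq is implemented, and it works on already-decoded sequences rather than BCL data. The barcodes, sample names, reads, and the one-mismatch tolerance are all assumptions made up for illustration.

```python
from collections import defaultdict


def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, barcode_to_sample, max_mismatches=1):
    """Group (index_sequence, read_sequence) pairs by sample.

    A read is assigned to a sample only when exactly one barcode lies within
    `max_mismatches` of its index sequence; otherwise it is "Undetermined".
    """
    bins = defaultdict(list)
    for index_seq, read_seq in reads:
        matches = [
            sample
            for barcode, sample in barcode_to_sample.items()
            if len(barcode) == len(index_seq)
            and hamming(barcode, index_seq) <= max_mismatches
        ]
        sample = matches[0] if len(matches) == 1 else "Undetermined"
        bins[sample].append(read_seq)
    return dict(bins)


# Hypothetical barcode-to-sample mapping, as a Sample Sheet would provide.
barcodes = {"ATCACG": "sample_A", "CGATGT": "sample_B"}

# Hypothetical (index, read) pairs: an exact match, a one-mismatch match,
# and an index that matches neither barcode.
reads = [
    ("ATCACG", "GGGTTTAAAC"),
    ("CGATGA", "ACGTACGTAC"),
    ("TTTTTT", "NNNNNNNNNN"),
]

result = demultiplex(reads, barcodes)
```

Here `result` holds one list of reads per sample plus an "Undetermined" bin, which mirrors the one-FASTQ-per-library outcome described above.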
Quality Control ‐ Check Your Work!
The next step is to perform a quality control assessment of the project's FASTQ files. Using software called fastqc, we can assess metrics such as base quality distribution, adapter contamination, k-mer content, and sequence duplication levels. We additionally perform a multi-species comparison of the libraries, using the Babraham Bioinformatics fastq_screen software, to confirm that no reads from species other than the organism of interest are contaminating the sample’s data. These steps are a critical part of the workflow to ensure that the final data is of high quality and suitable for downstream analysis.
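For readers who have never opened a FASTQ file, each read occupies four lines: a header starting with "@", the sequence, a "+" separator, and a quality string with one character per base. The sketch below is not fastqc; it is a small illustration of how a per-read mean base quality could be computed, assuming the common Phred+33 encoding. The two reads are invented.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line carries no data we need
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@' from the name


def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


# Two made-up reads: 'I' encodes Phred 40 and '!' encodes Phred 0.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!II\n")

records = list(read_fastq(example))
scores = {name: mean_quality(qual) for name, _, qual in records}
```

A dedicated QC tool summarizes scores like these across millions of reads and by position in the read, which is what makes systematic quality drops visible.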
Following QC, we next align the raw data in the FASTQ file to an annotated genome. We have the sequence reads, but do not yet know which regions they correspond to on the genome of the organism from which the chromatin was isolated. The genomes of humans, mice, rats, zebrafish, fruit flies, yeast, and the worm C. elegans (among others) have all been sufficiently annotated to allow alignment of FASTQ reads. We use software called BWA (Burrows-Wheeler Aligner) to map sequencing reads to the annotated genome (Li and Durbin, 2009). This process produces BAM files from the raw FASTQ files. BAM files contain the sequences, but now also carry the coordinate information describing the specific position along the genome to which each read corresponds.
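BAM is the compressed binary form of the tab-delimited SAM text format, so a single alignment is easiest to picture as one SAM line. The sketch below pulls the coordinate information out of such a line. The read name, position, and other values are invented for illustration, and real pipelines would use a dedicated library rather than splitting strings by hand.

```python
def parse_sam_line(line):
    """Extract the core alignment fields from one SAM alignment line.

    SAM columns used here: 1 QNAME, 2 FLAG, 3 RNAME, 4 POS (1-based), 5 MAPQ.
    """
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "read": fields[0],
        "mapped": not (flag & 0x4),              # bit 0x4 set means unmapped
        "strand": "-" if flag & 0x10 else "+",   # bit 0x10 means reverse strand
        "chrom": fields[2],
        "pos": int(fields[3]),                   # 1-based leftmost position
        "mapq": int(fields[4]),                  # mapping quality
    }


# A made-up alignment: read1 maps to the reverse strand of chr1 at 10468.
line = "read1\t16\tchr1\t10468\t60\t4M\t*\t0\t0\tACGT\tIIII"
rec = parse_sam_line(line)
```

The chromosome and position fields are exactly the "coordinate information" that the unaligned FASTQ reads lacked.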
Peaks and Valleys
Next, we need to define peaks, and for this we use one of two software suites: MACS/MACS2 (Zhang et al., 2008) or SICER (Zang et al., 2009). These programs use the BAM files created by BWA to determine if and where the reads are enriched within each sample, across the genome. These regions of signal enrichment are referred to as “peaks” and serve as the functional unit of much of the analysis. It is important to note that we normalize the data so that the peak “calling” and subsequent observations are not driven by technical variation, but instead reflect the underlying biology at play. This process results in the generation of a BED file that contains, for every peak called, the chromosome, the bp start position, the end position, and a series of metadata associated with that peak.
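As a rough intuition for what "enriched" means, the toy sketch below counts read start positions in fixed-size bins, keeps bins that reach a minimum count, and merges neighboring bins into BED-style intervals. This is deliberately naive and is not the statistical model used by MACS2 or SICER. The bin size, read threshold, and read positions are all assumptions chosen for the example.

```python
from collections import Counter


def call_peaks(read_starts, bin_size=100, min_reads=3):
    """Toy peak caller: return (chrom, start, end) for runs of enriched bins.

    Coordinates follow the BED convention (0-based start, exclusive end).
    """
    counts = Counter((chrom, pos // bin_size) for chrom, pos in read_starts)
    enriched = sorted(key for key, n in counts.items() if n >= min_reads)

    peaks = []
    for chrom, bin_index in enriched:
        start, end = bin_index * bin_size, (bin_index + 1) * bin_size
        if peaks and peaks[-1][0] == chrom and peaks[-1][2] == start:
            # This bin touches the previous peak, so extend that peak.
            peaks[-1] = (chrom, peaks[-1][1], end)
        else:
            peaks.append((chrom, start, end))
    return peaks


# Made-up read start positions: two adjacent busy bins and one stray read.
reads = [("chr1", p) for p in (105, 120, 150, 210, 220, 230, 950)]

peaks = call_peaks(reads)
bed_lines = ["\t".join(map(str, peak)) for peak in peaks]
```

Each entry in `bed_lines` has the same first three columns as a BED record (chromosome, start, end). Real peak callers add the metadata columns, such as scores and significance values.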
We additionally generate a bigWig (.bw format) file that contains this same peak information and, at 100 to 200 megabytes in size, is a lot more portable than the original raw FASTQ file, which can be 1 to 2 gigabytes. In these bigWig files, we now have data that can be uploaded to a genome browser like the UCSC Genome Browser or a genome browser program like IGV (Integrative Genomics Viewer; Robinson et al., 2011). IGV allows researchers to simply drag and drop their data from the desktop and visualize regions of interest. Alternatively, they may link bigWig files, hosted on an FTP server, to the UCSC Genome Browser. Researchers can use the genome browsers to search for genes/loci, compare tracks, and make screenshots for publication.
After peak calling is complete, it’s time to perform downstream analysis. Because we’re working with an annotated genome, we can assign peaks to specific genomic features, such as their nearest gene or that gene’s promoter region. Often, our clients are interested in a differential analysis that compares one group of samples to another to identify regions where the signal is significantly different. Using the R package DESeq2, we can quantify differences between samples at specific peaks (Love et al., 2014). Because these peak regions have been annotated to nearby genomic features, it is at this point that researchers can ask questions such as which genes are differentially regulated as a function of condition.
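Assigning a peak to its nearest gene can be sketched in a few lines. The example below measures the distance from a peak's midpoint to each gene's transcription start site (TSS) on the same chromosome and keeps the closest one. The gene names and coordinates are hypothetical, and using the peak midpoint and the TSS is one simple convention among several that annotation tools may use.

```python
def nearest_gene(peak, genes):
    """Return (gene_name, distance) for the gene whose TSS is closest to the
    peak midpoint, or None when no gene lies on the peak's chromosome.

    `peak` is (chrom, start, end); `genes` is a list of (name, chrom, tss).
    """
    chrom, start, end = peak
    midpoint = (start + end) // 2
    candidates = [
        (abs(tss - midpoint), name)
        for name, gene_chrom, tss in genes
        if gene_chrom == chrom
    ]
    if not candidates:
        return None
    distance, name = min(candidates)
    return name, distance


# Hypothetical annotation table: gene name, chromosome, TSS position.
genes = [
    ("GENE_A", "chr1", 150),
    ("GENE_B", "chr1", 5000),
    ("GENE_C", "chr2", 210),
]

annotation = nearest_gene(("chr1", 100, 300), genes)
unplaced = nearest_gene(("chr3", 100, 300), genes)
```

Once every peak carries a gene label like this, the results of a differential analysis can be read gene by gene rather than coordinate by coordinate.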
The Final Analysis
As the final bioinformatic deliverable from Active Motif, researchers receive the FASTQ (raw unaligned reads), BAM (aligned reads), and bigWig (peak data) files for all of their analyzed samples. Beyond that, we deliver a suite of graphics, annotation files, and genome browser screenshots that, taken together, offer deep insight into the epigenetic question at hand.
About the author
Nick Pervolarakis, Ph.D.
Nick grew up in a town called Lake Orion, Michigan, and, as the name implies, spent a lot of time on the water. He graduated from the University of Michigan, Ann Arbor with a B.S. in Microbiology, where he studied lung microbial communities through metagenomics. After receiving his degree, he began graduate school at the University of California, Irvine in the Mathematics, Computational, and Systems Biology program. His work there centered on applying single cell technologies to explore the mammary gland in a healthy and cancer context. At Active Motif, Nick works as a Computational Biologist and enjoys connecting with customers and their data to understand underlying biology through epigenetics. Beyond work, Nick enjoys reading, watching international films, and eating as many different varieties of food as he can get his hands on.