A MapReduce Framework for DNA Sequencing Data Processing

Abstract—In genomics, Next Generation Sequencers (NGS) such as the Illumina HiSeq produce data on the order of 200 billion base pairs in a single one-week run for 60x human genome coverage. Data at this scale can only be handled with high performance computing (HPC) and specialized software tools called "short read aligners". This paper focuses on implementing DNA sequencing as a set of MapReduce programs that accept a DNA data set as a FASTQ file and ultimately generate a VCF (variant call format) file containing the variants for the given data set. MapReduce/Hadoop is used along with the Burrows-Wheeler Aligner (BWA) and the Sequence Alignment/Map (SAM) tools, which provide various utilities for manipulating alignments, including sorting, merging, indexing, and generating alignments. The Map-Sort-Reduce process is designed for a Hadoop framework in which each cluster is a traditional N-node Hadoop cluster, so that all Hadoop features, such as HDFS, program management, and fault tolerance, can be utilized. The Map step runs many parallel instances of the short read alignment algorithm (Bowtie) in Hadoop; the ordered list of sequence reads is used as the input tuples, and the output tuples are the alignments of the short reads. In the Reduce step, many parallel instances of the Short Oligonucleotide Analysis Package for SNP (SOAPsnp) algorithm run in the cluster; the input tuples are the sorted alignments for a partition, and the output tuples are SNP calls. Results are stored in HDFS and then archived in SOAPsnp format. The proposed framework enables extremely fast discovery of somatic mutations, inference of population genetic parameters, and association tests performed directly on sequencing data, without explicit genotyping or linkage-based imputation. We also demonstrate that this method achieves accuracy comparable to alternative methods for sequencing data processing.


I. INTRODUCTION
Today, genome sequencing machines (such as Illumina's HiSeq 4000) are able to generate thousands of gigabases of DNA and RNA sequencing data in a few hours for less than US$1,000 (a few years ago, the price was over US$100,000, and sequencing the first human genome cost about US$3 billion) [1,2].
Success in biology and the life sciences depends on our ability to properly analyze the big data sets that are generated by these technologies, which in turn requires us to adopt advances in informatics. MapReduce/Hadoop and Spark enable us to compute and analyze terabytes or even petabytes of data in hours (rather than days or weeks). For example, Spark was recently used to sort 100 TB of data using 206 machines in 23 minutes.
In simple terms, DNA sequencing is the sequencing of whole genomes (such as human genomes). According to [3], "if finding DNA was the discovery of the exact substance holding our genetic makeup information, DNA sequencing is the discovery of the process that will allow us to read that information." The main function of DNA sequencing is to find the precise order of nucleotides within a DNA molecule, that is, to determine the order of the four bases, adenine (A), guanine (G), cytosine (C), and thymine (T), in a strand of DNA.
The challenges of DNA sequencing are many [4], but the most important ones are:
1. There are several sequencing technologies for generating FASTQ files, and the lengths of the DNA sequences differ for each technology.
2. The input data (FASTQ data) size is big: a single DNA sequence sample can be up to 900 GB.
3. With a single powerful server, it takes too long (up to 80 hours) to process one DNA sequence and extract variants such as single nucleotide polymorphisms (SNPs).
4. Many algorithms and steps are involved in DNA sequencing, so selecting the proper combination of open source tools is a serious challenge. For example, there are quite a few mapping/alignment algorithms and parameters.
5. Scalability, that is, optimizing the number of mappers and reducers, is difficult to achieve.
A high-level DNA sequencing workflow is presented in Figure 1. As we mentioned above, our focus in this paper will be implementing DNA sequencing as a set of MapReduce programs that will accept a DNA data set as a FASTQ file and finally generate a VCF (variant call format) file, which has variants for a given DNA data set.
One of the major goals of DNA sequencing is to find variants, since most of our DNA is identical; only a very small percentage differs from person to person. One important example is the identification of single nucleotide polymorphisms (SNPs). The identification and extraction of SNPs from raw genetic sequences involves many algorithms and the application of a diverse set of tools. The DNA sequencing pipeline is illustrated in Figure 2 and includes the following key steps [6]:
1. Input data validation: performing quality control on input data such as FASTQ files
2. Alignment: mapping short reads to the reference genome
3. Recalibration: visualizing and post-processing the alignment, including base quality recalibration
4. Variant detection: executing the SNP calling procedure along with filtering SNP candidates
There is plenty of data to analyze and apply DNA sequencing to, and there are many open source algorithms for completing these four steps. The choice of open source tools will significantly affect the final results. A practical example of a FASTQ record (identifier line, bases, separator line, and quality scores) is given below:
@NCYC361-11a03.q1k bases 1 to 1576
GCGTGCCCGAAAAAATGCTTTTGGAGCCGCGC GTGAAAT...
+NCYC361-11a03.q1k bases 1 to 1576
!)))))****(((***%%((((*(((+,**(((+**+,-...
FASTQ data can be paired or non-paired. If it is paired, then our input for the DNA sequencing will be a pair of files: left_file.fastq and right_file.fastq. If it is nonpaired, there will be a single file: file.fastq.
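Since each FASTQ record spans exactly four lines (identifier, bases, separator, quality), validating an input file amounts to grouping its lines in fours and checking the marker lines. The following is a minimal Java sketch of that grouping; the class and method names are illustrative, not part of any existing pipeline code:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal FASTQ grouping sketch: every 4 consecutive lines form one record
// (identifier line starting with '@', bases, '+' separator, quality string).
public class FastqRecords {

    // Groups raw lines into records of 4 lines each; throws if the file is
    // truncated or a record's marker lines are malformed.
    public static List<String[]> group(List<String> lines) {
        if (lines.size() % 4 != 0) {
            throw new IllegalArgumentException("FASTQ line count must be a multiple of 4");
        }
        List<String[]> records = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += 4) {
            String[] rec = {lines.get(i), lines.get(i + 1), lines.get(i + 2), lines.get(i + 3)};
            if (!rec[0].startsWith("@") || !rec[2].startsWith("+")) {
                throw new IllegalArgumentException("Malformed FASTQ record at line " + i);
            }
            records.add(rec);
        }
        return records;
    }
}
```

For paired data the same grouping would simply be applied to both left_file.fastq and right_file.fastq.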

IV. BACKGROUND AND RELATED WORK FOR FRAMEWORKS USED FOR DNA SEQUENCING AND MAPREDUCE
McKenna, Aaron, et al. [7] have developed a framework, called the Genome Analysis Toolkit (GATK), to enable the development of analysis tools for NGS data using MapReduce. The aim of GATK is to provide a structured programming framework that simplifies the development of efficient, robust, and scalable NGS analysis tools. It aims to cover most analysis tool needs by providing a small but rich set of data access patterns. GATK decouples the infrastructure needed to access NGS data from the logic specific to each analysis tool. Separating analysis calculations from the data management infrastructure improves correctness, stability, and CPU and memory efficiency, and also enables automatic parallelization on distributed clusters and shared-memory machines. GATK removes much of the complexity of analyzing and manipulating the huge data sets generated and used in DNA sequencing, which lets developers and researchers focus on the algorithms and analysis methods they are designing and developing. All of this makes GATK one of the most important programming frameworks for developing NGS tools; the 1000 Genomes Project and The Cancer Genome Atlas are two prominent examples of efforts whose NGS tools were built on the GATK programming framework.
DePristo, Mark A., et al. [8] have developed another framework, a unified analytic one. The aim of this framework is to explore the differences between several samples, using five sequencing technologies and three experimental designs. Its data-processing pipeline can be summarized in the following phases:
1. Initial read mapping
2. Local realignment around indels
3. Base quality score recalibration
4. SNP discovery and genotyping
5. Machine learning
SNP discovery and genotyping finds all potential variants, while machine learning separates true segregating variation from machine artifacts common to NGS technologies. The framework provides an integrated approach to data processing and variation discovery from NGS data, designed to meet the required specifications. The experiments were comprehensive, applying several distinct sequencing technologies, with data generated by the Broad Institute and the 1000 Genomes Project. Results showed that the newly designed framework achieves more accurate results than processing without it. Moreover, the results showed that the following greatly increase the sensitivity and specificity of variant discovery from NGS data:
1. The enhanced calibration of base quality scores
2. Local realignment around indels
3. Evaluation of multiple samples from a population
The framework also provides modularity and scalability, resulting in the production of high-quality variant and genotype calls.
As mentioned by Schatz, Michael C., et al. [9], such parallel programming frameworks ease parallel computation for developers and researchers while providing efficiency and fault tolerance. One example is the Message Passing Interface (MPI), which enables developers to design and implement parallel programs; its disadvantage is the complexity of software development. Condor is another example, a batch processing system that is one of the most effective for running multiple independent parallel computations, but it does not work as efficiently with more complicated parallel algorithms. The MapReduce framework is considered one of the most efficient frameworks for most programs: it eases the programmer's work while handling job scheduling, fault tolerance, distributed aggregation, and other management tasks. MapReduce was initially developed by Google, where its main task was to streamline the analysis of huge numbers of webpages. However, Google's MapReduce implementation is proprietary and not open source, which is why Hadoop, an alternative managed by the Apache Software Foundation, is more popular. The main idea of MapReduce is to run a sequence of parallel computational steps, which can be summarized in three points:
1. Map: reads are mapped to the reference genome in parallel on multiple machines
2. Shuffle: alignments are aggregated so that those on the same chromosome are grouped together and sorted by position
3. Scan: the sorted alignments in each group are scanned to identify biological events
This parallelism makes a program work better on larger clusters. Hadoop works well with several genomic software packages while providing them with better scalability; however, not all genomic pipelines fit it.
In general, Hadoop and cloud computing work well with programs whose processes run independently for long periods and don't need to coordinate with each other.
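The three-step Map/Shuffle/Scan pattern described above can be illustrated with plain Java collections. This is only a toy simulation of the shuffle step (grouping alignments by chromosome and sorting each group by position), not actual Hadoop code, and the class names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy simulation of the MapReduce shuffle step for alignments: group records
// by chromosome (the key) and sort each group by alignment position, as a
// Hadoop sort/shuffle would before the reduce/scan step.
public class ShuffleSketch {

    // An alignment record reduced to the two fields the shuffle cares about.
    public static class Alignment {
        public final String chromosome;
        public final int position;
        public Alignment(String chromosome, int position) {
            this.chromosome = chromosome;
            this.position = position;
        }
    }

    public static Map<String, List<Alignment>> shuffle(List<Alignment> alignments) {
        Map<String, List<Alignment>> groups = new TreeMap<>();
        for (Alignment a : alignments) {
            groups.computeIfAbsent(a.chromosome, k -> new ArrayList<>()).add(a);
        }
        // Sort each chromosome's alignments by position.
        for (List<Alignment> g : groups.values()) {
            g.sort((x, y) -> Integer.compare(x.position, y.position));
        }
        return groups;
    }
}
```

In real Hadoop the grouping and sorting are performed by the framework between the map and reduce phases; the sketch only makes the data movement explicit.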

V. MAPREDUCE ALGORITHMS FOR DNA SEQUENCING
The first step in the DNA sequencing pipeline is validating the format of FASTQ files. With validation, we want to verify the quality of the input files. Input data validation tools enable the quality control checks on the FASTQ file format coming from high-throughput sequencing pipelines.
There are many open source tools for input data validation; for example, for FASTQ validation you have options such as FastQValidator and FastQC. In this section we will focus on mapping/alignment, recalibration, and variant detection algorithms using MapReduce.

A. DNA sequence alignment
Sequence alignment is the comparison of two or more DNA or protein sequences. The main purpose of sequence alignment is to highlight similarity between the sequences. For global sequence alignment, consider the following example with two input sequences over the same alphabet:
• Sequence 1: GCGCATGGATTGAGCGA
• Sequence 2: TGCGCCATTGATGACCA
Our output is a possible alignment of the two sequences:
• -GCGC-ATGGATTGAGCGA
• TGCGCCATTGAT-GACC-A
We can observe three elements in the possible alignment output:
• Perfect matches (shown in bold)
• Mismatches (underlined)
• Insertions and deletions (called indels, shown in italics)
For the alignment phase, we will use MapReduce/Hadoop along with the following open source tools:
• Burrows-Wheeler Aligner (BWA), an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome [5].
• Sequence Alignment/Map (SAM) tools, which provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing, and generating alignments in a per-position format [5]. We will be working with files in the BAM format, which is the binary form of a SAM file.
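The three elements above can be counted mechanically by walking the two aligned strings column by column. The following is a small illustrative Java sketch of those definitions (it is not part of BWA or SAMtools):

```java
// Classifies each column of a pairwise alignment as a match, a mismatch,
// or an indel (a gap '-' in either sequence).
public class AlignmentStats {

    // Returns {matches, mismatches, indels} for two equal-length aligned strings.
    public static int[] classify(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Aligned sequences must have equal length");
        }
        int matches = 0, mismatches = 0, indels = 0;
        for (int i = 0; i < a.length(); i++) {
            char x = a.charAt(i), y = b.charAt(i);
            if (x == '-' || y == '-') indels++;       // gap in either sequence
            else if (x == y)          matches++;      // identical bases
            else                      mismatches++;   // differing bases
        }
        return new int[] {matches, mismatches, indels};
    }
}
```

For the example alignment shown above, this classification yields 13 matches, 2 mismatches, and 4 indels.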
Typical DNA sequencing for a single data sample (about 400-900 GB in the FASTQ file format) might take 70+ hours even on a very powerful single server. The goal of the MapReduce algorithm is to find the answer in a few hours and make the solution scalable.
Since most open source tools for alignment, recalibration, and variant detection (such as BWA, SAMtools, and GATK) have Linux command-line interfaces, at each MapReduce phase our map() and reduce() functions will call Linux shell scripts and provide the proper parameters.
To execute these shell scripts, we will use the FreeMarker templating language, which merges Java objects and data structures with a template to create a proper shell script (see Figure 18-3). To distinguish one DNA sequence from another, for each analysis we will assign and utilize a unique GUID called the "analysis ID" (this helps to keep our input and output directories organized).
The MapReduce solution is presented in three steps, which correspond to steps 2-4 of the DNA sequencing pipeline described above. Data is partitioned and merged during each step.

A.1 Step 1: Alignment
Before the alignment step starts, we partition our DNA sequence FASTQ file(s) into sets of 8 million lines (or 2 million sequences; remember, in FASTQ format each group of four lines represents a single DNA sequence). As mentioned previously, if the FASTQ data is paired, our input is a pair of files: left_file.fastq and right_file.fastq. If it is non-paired, our input is a single file: file.fastq. For paired data, an alignment map() function will process a pair of partitions, left_file.fastq.NNNN and right_file.fastq.NNNN: left_file.fastq.0000 right_file.fastq.0000 left_file.fastq.0001 right_file.fastq.0001 …. For non-paired data, an alignment map() function will process a single partition, file.fastq.NNNN: file.fastq.0000 file.fastq.0001 file.fastq.0002 ….
Each partition (also known as a chunk) will be consumed by a map() function. The partition size used here (8 million lines, equal to 2 million sequences) is just for illustration purposes; the partition size you use should be determined by the size of your Hadoop cluster. That is, if you have a cluster of 50 nodes and each node can handle 4 mappers, then you should split your FASTQ file into 200 partitions.
For example, if the total input size is about 400 GB, then we should partition the input into 2 GB chunks (this way, we will maximize the usage of all your mappers).
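The sizing rule above is simple arithmetic: one partition per mapper slot. A minimal sketch (the class and method names are illustrative):

```java
// Computes how many partitions a FASTQ input should be split into, and how
// large each chunk is, from the cluster shape: one partition per mapper slot.
public class PartitionPlanner {

    // Total mapper slots = nodes * mappers per node.
    public static int partitions(int nodes, int mappersPerNode) {
        return nodes * mappersPerNode;
    }

    // Chunk size in bytes so that every mapper slot receives one chunk.
    public static long chunkSizeBytes(long totalInputBytes, int nodes, int mappersPerNode) {
        return totalInputBytes / partitions(nodes, mappersPerNode);
    }
}
```

With 50 nodes and 4 mappers per node, a 400 GB input yields 200 partitions of 2 GB each, matching the example above.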
The map() function will read the input file (one single chunk) and generate an aligned file in BAM format. Here, the map() function uses BWA to perform the alignment. Once the alignment is done, it will extract all chromosomes (1, 2, ..., 22, 23) and save them in the MapReduce filesystem (HDFS, for Hadoop).
For example, if we have 800 partitions, we will have generated 800 files per chromosome (23 * 800 = 18,400 files). There will be only 23 reducers (one per chromosome).
The reducer will concatenate (merge and sort) all chromosome files for a specific chromosome ID: all chromosome 1 files will be concatenated into a single file called chr1.bam, all chromosome 2 files into a single file called chr2.bam, and so on. Then each reducer will partition its merged BAM file into small files that will be used as input to the recalibration phase.

Reducer for the alignment phase
For the alignment phase, there will be exactly 23 reducers (one reducer per chromosome).
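The fan-in just described (for example, 800 partition files per chromosome collapsing into one chrN.bam per reducer) can be modeled as key grouping. This toy Java sketch uses the chromosome ID as the reduce key; the file naming scheme is illustrative, not the pipeline's actual one:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the alignment phase's reduce fan-in: each mapper emits
// (chromosomeId, partitionedBamFile) pairs, and the reducer for a given
// chromosome merges all of its files into a single chrN.bam.
public class ChromosomeFanIn {

    // Groups per-partition BAM file names by chromosome ID (the reduce key).
    public static Map<Integer, List<String>> groupByChromosome(int numPartitions, int numChromosomes) {
        Map<Integer, List<String>> byChromosome = new TreeMap<>();
        for (int chr = 1; chr <= numChromosomes; chr++) {
            List<String> files = new ArrayList<>();
            for (int p = 0; p < numPartitions; p++) {
                files.add(String.format("chr%d.partition%04d.bam", chr, p));
            }
            byChromosome.put(chr, files);
        }
        return byChromosome;
    }

    // The reducer's merged output file name for one chromosome.
    public static String reducerOutput(int chr) {
        return "chr" + chr + ".bam";
    }
}
```

With 800 partitions and 23 chromosomes this reproduces the 23 * 800 = 18,400 intermediate files mentioned above, one group per reducer.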
There are two important classes, ShellScriptUtil and TemplateEngine, that merit some discussion. The ShellScriptUtil.callProcess() method accepts a shell script file (first parameter), which it executes, and writes all logs from the script execution to a logfile (second parameter). Logging is asynchronous, meaning that the logfile becomes available as soon as the script starts executing.
The TemplateEngine class is defined in the following example. It implements the basic notion of a templating engine: it accepts a template (a text file with key placeholders) and key-value pairs (as a Java Map), and then creates a brand-new file in which all keys in the template are replaced by their values.
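The notion just described can be sketched in a few lines of Java. This is only an illustration of the idea (replace ${key} placeholders with values from a Map), not the actual TemplateEngine implementation, which uses FreeMarker:

```java
import java.util.Map;

// Minimal templating sketch: every ${key} placeholder in the template text
// is replaced by its value from the supplied map, producing, for example,
// a concrete shell script from a script template.
public class TemplateEngineSketch {

    public static String render(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> e : values.entrySet()) {
            result = result.replace("${" + e.getKey() + "}", e.getValue());
        }
        return result;
    }
}
```

For example, rendering the template "bwa mem ${ref} ${fastq}" with ref=hg19.fa and fastq=file.fastq.0000 yields the concrete command line for one chunk (the BWA arguments here are only illustrative).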

A.2 Step 2: Recalibration
Recalibration is the second phase of our MapReduce DNA sequencing pipeline. In the recalibration step, each map( ) function will work on a specific aligned chromosome.
The mapper will perform duplicate marking, local realignment, and recalibration. The goal of map() is to create a local recalibration table filled with covariates. These local covariates will be merged by a single reducer to create the final global file (the recalibration table) that will be used by the map() function of the third and final step of DNA sequencing, variant detection.
Following the alignment phase, we create special metadata to be used by the recalibration mappers. This metadata has the following format:
<counter><;><partitioned-bam-file><;><ref_genome><;><analysis_id>
The recalibration mapper is presented in the following example.
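The metadata format above is a simple semicolon-delimited record. A sketch of building and parsing such a line (the class name and example field values are illustrative):

```java
// Builds and parses the recalibration metadata line:
// <counter>;<partitioned-bam-file>;<ref_genome>;<analysis_id>
public class RecalMetadata {

    public static String build(int counter, String bamFile, String refGenome, String analysisId) {
        return counter + ";" + bamFile + ";" + refGenome + ";" + analysisId;
    }

    // Returns the four fields; throws if the line does not have exactly four.
    public static String[] parse(String line) {
        String[] fields = line.split(";");
        if (fields.length != 4) {
            throw new IllegalArgumentException("Expected 4 fields, got " + fields.length);
        }
        return fields;
    }
}
```

Each recalibration mapper can then recover its input BAM partition, reference genome, and analysis ID from a single metadata line.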