The rest of this section discusses each script in more detail and provides instructions on how to execute pipeline scripts separately i.e. without running the HiCUP control script.

The HiCUP pipeline comprises the following scripts:

HiCUP (pipeline control script)

The hicup Perl script controls the other programs in the HiCUP pipeline

Synopsis

hicup [OPTIONS]... [Configuration FILE]...

Function

The HiCUP pipeline comprises the scripts ‘hicup_truncater’, ‘hicup_mapper’, ‘hicup_filter’ and ‘hicup_deduplicator’ (‘hicup_digester generates the genome_digest file used by hicup_filter). The pipeline takes FASTQ files and generates Hi-C di-tag paired reads, aligned to a specified reference genome. The HiCUP script regulates the pipeline, executing each script in turn and passing output from one stage of the program to the next.

The designated configuration file sets the parameters for the whole pipeline. The configuration file lists the names of the FASTQ file pairs to be processed.

Configuration File Example:

#Directory to which output files should be written (optional parameter)
#Set to current working directory by default 
Outdir:

#Number of threads to use
Threads: 1

#Suppress progress updates (0: off, 1: on)
Quiet:0

#Retain intermediate pipeline files (0: off, 1: on)
Keep:0

#Compress outputfiles (0: off, 1: on)
Zip:1

#Path to the alignment program (Bowtie or Bowtie2)
#Remember to include the executable Bowtie/Bowtie2 filename.
#Note: ensure you specify the correct aligner i.e. 
#Bowtie when using Bowtie indices, or Bowtie2 when using Bowtie2 indices.
#In the example below Bowtie2 is specified.
Bowtie2: /usr/local/bowtie2/bowtie2

#Path to the genome digest file produced by hicup_digester
Digest: Digest_Mouse_genome_HindIII_None_12-32-06_17-02-2012.txt

#FASTQ format (valid formats: 'Sanger', 'Solexa_Illumina_1.0', 'Illumina_1.3' or 'Illumina_1.5')
#If not specified, HiCUP will try to determine the format automatically by analysing
#one of the FASTQ files. All input FASTQ will assumed to be in this format
Format: Sanger 

#Maximum di-tag length (optional parameter)
Longest: 800

#Minimum di-tag length (optional parameter)
Shortest: 150

#FASTQ files to be analysed, placing paired files on adjacent lines
s_1_1_sequence.txt
s_1_2_sequence.txt

s_2_1_sequence.txt
s_2_2_sequence.txt

s_3_1_sequence.txt.gz
s_3_2_sequence.txt.gz

Command Line Example:

hicup --zip --bowtie /usr/local/bowtie/bowtie --index /data/public/Genomes/Mouse/NCBIM37/Mus_musculus.NCBIM37 --digest Digest_Mouse_genome_HindIII_None_12-32-06_17-02-2012.txt --format Sanger --longest 800 --shortest 150 s_1_1_sequence.txt s_1_2_sequence.txt     s_2_1_sequence.txt s_2_2_sequence.txt s_3_1_sequence.txt.gz s_3_2_sequence.txt.gz

This configuration instructs the pipeline to process and pair the files s_1_1_sequence.txt with s_1_2_sequence.txt; and s_2_1_sequence.txt with s_2_2_sequence.txt; and s_3_1_sequence.txt with s_3_2_sequence.txt. Remember, a file pair generates one output file.)

HiCUP also requires the paths to Bowtie and the genome digest file.

Command Line Options

--bowtie        Specify the path to Bowtie
--bowtie2       Specify the path to Bowtie 2
--config        Specify the configuration file
--digest        Specify the digest file listing restriction 
                fragment co-ordinates
--example       Produce an example configuration file
--format        Specify FASTQ format
                Options: Sanger, Solexa_Illumina_1.0, 
                Illumina_1.3, Illumina_1.5
--help          Print help message and exit
--index         Path to the relevant reference genome 
                Bowtie/Bowtie2 indices
--keep          Keep intermediate pipeline files
--longest       Maximum allowable insert size (bps)
--nofill        Hi-C protocol did NOT include a fill-in of
                sticky ends prior to ligation step and 
                therefore FASTQ reads shall be truncated 
                at the Hi-C restriction enzyme cut site 
                (if present) sequence is encountered
--outdir        Directory to write output files
--quiet         Suppress progress reports (except 
                warnings)
--shortest      Minimum allowable insert size (bps)
--temp          Write intermediate files (i.e. all except 
                summaryfiles and files generated by HiCUP 
                Deduplicator) to a specified directory
--threads       Specify the number of threads, allowing 
                simultaneous
                processing of multiple files
--version       Print the program version and exit
--zip           Compress output

HiCUP Truncater

The HiCUP Truncater Perl script terminates sequence reads at specified Hi-C ligation junctions

SYNOPSIS

truncate [OPTIONS]... --config [Configuration FILE]...
truncate [OPTIONS]... [FASTQ FILES]...

Function

Valid Hi-C pairs comprise two DNA fragments from different regions of the genome ligated together. Typically, a forward read maps to one ligation fragment; the reverse read maps to the other. However, this is not always true since the Hi-C ligation junction may be found within the sequenced region. Such reads will most likely be removed from the Hi-C pipeline during the mapping process, thereby losing potentially valid data. The hicup_truncater script helps remedy this by identifying ligation junctions within reads and deleting sequence downstream of the restriction enzyme recognition site.

Truncation Explanation Diagram

The names of the files to be processed and the restriction site may be passed to the scrip using a configuration file or command line arguments. The configuration file contains: i) the recognition sequence of the first (or only) restriction enzyme used in the Hi-C protocol and ii) the sequence files to be processed by the HiCUP Truncater.

Configuration File Example:

re1: A^GATCT
s_6_1_sequence.txt_CCTT.fastq.gz
s_6_2_sequence.txt_CCTT.fastq.gz

Command Line Example:

hicup_truncater --re1 A^GATCT,BglII s_6_1_sequence.txt_CCTT.fastq.gz s_6_2_sequence.txt_CCTT.fastq.gz

(The caret symbol (‘^’) denotes the cut position in the restriction enzyme recognition sequence.)

This configuration instructs the script to process the files ‘s_6_1_sequence.txt_CCTT.txt’ and ‘s_6_2_sequence.txt_CCTT.gz’. The script identifies any reads containing the Hi-C ligation sequence ‘AGATCGATCT’ (this sequence is not found in the original genomic sequence but is generated by restriction digestion with BglII, removal of sticky ends followed by blunt-ended ligation) and discards sequence downstream of the restriction cut site.

The program creates sequence files named the same as the input files, only suffixed with ‘trunc.fastq’. The script also produces a date-stamped summary file (e.g. ‘hicup_truncater_summary_08-59-17_30-01-2015.txt’) listing the number of reads truncated or not truncated for each input sequence file, along with accompanying SVG format charts.

Command Line Options

--config            Name of the optional configuration file
--help              Print program help and exit
--nofill            Hi-C protocol did NOT include a fill-in of 
                    sticky ends prior to re-ligation and 
                    therefore reads shall be truncated at the 
                    restriction site sequence
--outdir            Directory to write output files
--quiet             Suppress all progress reports, except
                    warnings
--re1               Restriction enzyme recognition sequence
--sequences         Instead of specifying a restriction enzyme 
                    recognition sequence, specify the ligation 
                    sequences directly
--threads           Number of threads to use, allowing 
                    simultaneous processing  of different files
--version           Print the program version and exit
--zip               Compress output

HiCUP Mapper

The HiCUP Mapper script aligns paired reads independently to a reference genome and retains reads where both partners align

Synopsis

hicup_mapper [OPTIONS]... --config [Configuration FILE]...
hicup_mapper [OPTIONS]... [FASTQ FILES]...

Function

Valid Hi-C ligation products comprise two restriction fragments from different regions of the genome ligated together. This program maps Hi-C di-tags against a reference genome to determine from where each restriction fragment is derived. Following mapping the forward and reverse reads are paired i.e. two input files result in one output file.

HiCUP Mapper uses the sequence alignment programs Bowtie or Bowtie2 to perform the mapping.

Bowtie mapping parameters:

-p 1 -n 1 -m 1 –best

-p 1: launches Bowtie with only one search thread. Although limiting the search to one processor is slower on multi-core machines, it ensures the order of the returned mapped reads is the same as found in the input file (ignoring omitted unmapped sequences), essential for the correct functioning of HiCUP. (Bowtie actually defaults to -p 1, but this has been made explicit in the Perl script due to the importance of preserving the read order.)

-n 1: alignments may have no more than 1 mismatch in the first 28 bases (seed) and the sum of the Phred quality values at all mismatched positions (not just in the seed) may not exceed 70. Bowtie rounds quality values to the nearest 10, saturating at 30.

-m 1: report only unique alignments

–best: reports alignments in best-to-worst order

The configuration file sets the parameters for HiCUP Mapper, it contains: i) names of files to be mapped; ii) local path to Bowtie; iii) path to the relevant reference genome Bowtie indices; iv) the sequence format.

Bowtie2 Mapping Parameters:

–very-sensitive: slower but a more sensitive and more accurate option

–no-unal: suppress SAM records for reads that failed to align.

–threads: number of threads specified by the user divided by the number of files processed

–reorder: ensure the read output order is the same as the input order when multi-threading

Bowtie2 does not have a direct equivalent of the -m 1 option available in the original Bowtie. Therefore, to identify and filter out multi-mapping reads, HiCUP processes the SAM file generated by Bowtie2. A reads is considered as uniquely mapping if the quality score is greater than or equal to 30 and either i) the read cannot be mapped to another location or ii) if the read can be mapped to other locations, then the difference in quality score between this hit and the next-best match should be at least 10 (as reported in the Bowtie2 SAM tags “AS” and “XS”). Before HiCUP version 0.6.1, a unique-mapping read was defined simply as having no next-best hit when using Bowtie2 as the aligner.

Configuration file example:

Bowtie: /usr/local/bowtie/bowtie
Index: /data/public/Genomes/Mouse/NCBIM37/Mus_musculus.NCBIM37
Format: Sanger
Threads: 4
s_1_1_sequence.txt_CCTT_trunc
s_1_2_sequence.txt_CCTT_trunc
s_1_1_sequence.txt_AAGT_trunc
s_1_2_sequence.txt_AAGT_trunc

Command line example

hicup_mapper --bowtie /usr/local/bowtie/bowtie --index /data/public/Genomes/Mouse/NCBIM37/Mus_musculus.NCBIM37 --format Sanger --threads 4 s_1_1_sequence.txt_CCTT_trunc s_1_2_sequence.txt_CCTT_trunc s_1_1_sequence.txt_AAGT_trunc s_1_2_sequence.txt_AAGT_trunc

The above example sends four files to Bowtie for mapping against the mouse NCBIM37 genome. Any text not preceded with a flag is assumed to be a sequence filename.

The FASTQ format options are:

  • Sanger
  • Solexa_Illumina_1.0
  • Illumina_1.3
  • Illumina_1.5

The output filenames will be based on the input filenames but suffixed with ‘.pair.sam’ or ‘.pair.bam’. The script also produces a date-stamped file (e.g. ‘hicup_mapper_summary_08-59-17_30-01-2015.txt’) and summarises the results in SVG graphical format.

Command Line Options

--bowtie            Specify the path to Bowtie
--bowtie2           Specify the path to Bowtie 2
--config            Specify the configuration file
--format            Specify FASTQ format
                    Options: Sanger, Solexa_Illumina_1.0, 
                    Illumina_1.3, Illumina_1.5
--help              Print help message and exit
--index             Path to the relevant reference genome 
                    Bowtie/Bowtie2 indices
--outdir            Directory to write output files
--quiet             Suppress progress reports (except 
                    warnings)
--threads           Specify the number of threads, allowing 
                    simultaneous processing of different files 
                    (default: 1)
--version           Print the program version and exit
--zip               Compress output

HiCUP Filter

The HiCUP Filter Perl script classifies read pairs, identifying valid Hi-C di-tags

Synopsis

hicup_filter [OPTIONS] --config [Configuration FILE]...
hicup_filter [OPTIONS] [hicup_mapper output FILE]...

Function

The majority of reads generated by the HiCUP Mapper script are most likely valid Hi-C products, but a substantial minority are probably not and should be removed. The HiCUP Filter script processes paired reads together with the file created by HiCUP Digester to identify valid Hi-C pairs.

The names of the files to be processed and other parameters may be passed to the script using a configuration file or by command line arguments. As a minimum requirement the script requires: i) a list of HiCUP Mapper output file(s) and ii) a digested genome produced by HiCUP Digester.

Configuration File Example:

Digest: Digest_Mouse_Genome_BglII_AluI_11-12-29_08-02-2012.txt
Longest: 800
Shortest: 150
S1_R1_R2_sequence.pair.bam
S2_R1_R2_sequence.pair.bam
S2_R1_R2_sequence.pair.bam

Command Line Example

hicup_filter --digest Digest: Digest_Mouse_Genome_BglII_AluI_11-12-29_08-02-2012.txt --longest 800 --shortest 150 

The program writes valid Hi-C read pairs to a file named the same as the original, but suffixed with ‘.filt.sam’ or ‘.filt.bam’. Rejected paired sequences are written to different files in a separate date-stamped folder e.g. hicup_filter_ditag_rejects_08-59-17_30-01-2015.

The script also creates a date-stamped file providing an overview of the types of ligation products created e.g. hicup_filter_summary_08-59-17_30-01-2015.txt and further summarises this in SVG-format charts.

Rejected paired reads (Hi-C Experimental Artefacts):

A) Sonication protocol:

-Same circularized: DNA fragment cut with the restriction enzyme circularizes, ligating to itself, and is then linearised by sonication

-Same dangling ends: di-tag pairs map to the same restriction fragment and at least one end overlaps the restriction fragment cut site

-Same internal: di-tag maps to a single restriction fragment but neither end of the di-tag overlaps the restriction fragment cut site

-Re-ligation fragments: di-tag pairs map to adjacent restriction fragments which have re-ligated in the same orientation as found in the genome

-Wrong size: calculated di-tag length is not within the limits set by the size-selection step in the experimental protocol

-Contiguous: the di-tag could theoretically represent a contiguous DNA strand spanning several restriction fragments

Hi-C Artefacts

The sonication protocol above is by far the most commonly used procedure, however there is a Hi-C variant where a second restriction enzyme is used to shorten the Hi-C fragments instead of sonicating the samples.

B) Double-digest protocol

-Unmapped: di-tags did not map to expected genomic restriction sites.

-No ligation: read pair members both map to a single restriction enzyme 1 / restriction enzyme 2 double-digest fragment.

-Wrong size: calculated di-tag length is not within the limits set by the size-selection step in the experimental protocol.

-Re-ligation: the original restriction enzyme 1 cut re-anneals. The resulting di-tags map to both the adjacent fragments; one read mapping to one fragment, the other member of the pair mapping to the adjacent fragment.

-Self-ligation: a DNA fragment cut with restriction enzyme 1 circularises, ligating to itself, and is then linearised by the action of restriction enzyme 2.

-Internal restriction enzyme 2 fragments: DNA fragments cut only by restriction enzyme 2 i.e. contain no Hi-C junction or restriction 1 cut sites. Unclassified: while the ends of the fragments map to expected locations within the genome, the orientation of the cut sites do not correspond to any of the aforementioned categories.

Command Line Options

--config        Specify the optional configuration file
--digest        Specify the genome digest file (created by 
                hicup_digester)
--help          Print program help and exit
--longest       Maximum allowable insert size (bps)
--outdir        Directory to write output files
--quiet         Suppress all progress reports
--shortest      Minimum allowable insert size (bps)
--threads       Specify the number of threads, allowing 
                simultaneous processing of multiple files
--version       Print the program version and exit
--zip           Compress final output files using gzip, 
                or if SAMtools is installed, to BAM format

HiCUP Deduplicator

The HiCUP Deduplicator script removes duplicated di-tags (retaining one copy of each) from the data set

Synopsis

hicup_deduplicator [OPTIONS]... --config [Configuration FILE]...
hicup_deduplicator [OPTIONS]... [SAM/BAM FILES]...

Function

The Hi-C experimental protocol involves a PCR amplification step to generate enough material for sequencing. Consequently, the dataset generated by HiCUP Filter may contain PCR copies of the same di-tag. These PCR duplicates could result in incorrect inferences being drawn regarding the genomic conformation and so should be removed from the data set.

The names of the files to process can be passed to the script either by using a configuration file or command line arguments.

Example:

hicup_deduplicator --zip sample_544_PC_FL_500_lane2.sam

The program creates SAM/BAM files named the same as the input files, only suffixed with ‘.dedup.bam’ or ‘dedup.sam’. If running the whole HiCUP pipeline (not solely the deduplicator script), the final HiCUP file will be end ‘.hicup.bam’ or ‘.hicup.sam’.

De-duplication step only Input: sample_544_PC_FL_500_lane2.filt.sam Output: sample_544_PC_FL_500_lane2.dedup.bam

Input: sample_545_PC_TAM_4_lane3.filt.bam Output: sample_545_PC_TAM_4_lane3.dedup.bam

Whole Pipeline Input: sample_544_PC_FL_500_lane2.filt.bam Output: sample_544_PC_FL_500_lane2.hicup.bam

Input: sample_545_PC_TAM_4_lane3.filt.bam Output: sample_545_PC_TAM_4_lane3.filt.bam

The script also produces a date-stamped summary file (e.g. hicup_deduplicator_summary_08-59-17_30-01-2015.txt) reporting the number of unique di-tags present found in the data set and then classifies those unique di-tags as either cis di-tags (in which both reads are derived from the same chromosome) or trans di-tags (different chromosomes). In addition, this stage of the pipeline produces SVG charts summarising the results.

Command Line Options

--config        Specify the configuration file
--help          Print help message and exit
--outdir        Directory to write output files
--quiet         Suppress progress reports (except warnings)
--threads       Specify the number of threads, allowing simultaneous
                processing of multiple files
--version       Print the program version and exit
--zip           Compress output

HiCUP Digester

The HiCUP Digester Perl script cuts throughout a selected genome at one or two specified restriction sites

Synopsis

hicup_digester [OPTIONS] --config [Configuration FILE]...
hicup_digester [OPTIONS] [FASTA FILES]...

Function

The Perl script HiCUP Mapper generates a file of paired mapped reads. While the majority of those reads are expected to be valid Hi-C ligation products, a substantial minority probably will not and should be removed.

The script HiCUP Filter removes many of those invalid pairs, but before it can do this it requires a digested reference genome as input, along with the paired sequence files. The HiCUP Digester program cuts a selected reference genome with one or two specified Type II restriction enzymes that recognise single undivided palindromic sequences. The script prints the results to file for subsequent processing by HiCUP Filter.

The names of the files to be processed and the digestion parameters may be passed to the script by a configuration file or command line arguments. The configuration file contains: i) restriction site 1; ii) restriction site 2 (optional and an atypical choice of protocol); iii) the name of the genome to be processed (optional) and iv) list of FASTA files to be processed.

For example:

re1: A^GATCT,BglII
genome: Mouse
/Genomes/Mouse/NCBIM37/Mus_musculus.NCBIM37.52.dna.chromosome.1.fa
/Genomes/Mouse/NCBIM37/Mus_musculus.NCBIM37.52.dna.chromosome.2.fa
/Genomes/Mouse/NCBIM37/Mus_musculus.NCBIM37.52.dna.chromosome.3.fa
.
.
.
/Genomes/Mouse/NCBIM37/ Mus_musculus.NCBIM37.52.dna.chromosome.Y.fa

The re1 flag refers to the sequence at which the first restriction enzyme used in the Hi-C protocol cuts the genome. The caret symbol (‘^’) marks the position where the enzyme cuts the DNA. As an option the name of the restriction enzyme may be included after the sequence, using a comma to separate the two. Some Hi-C protocols may use two enzymes at this stage to create Hi-C ligation junctions.

Specify two enzymes as follows:

--re1 A^GATCT,BglII:A^AGCTT,HindIII

Restriction site 2 refers to the second, optional (other DNA shearing techniques such as sonication may be used) enzymatic digestion. This restriction site does NOT form a Hi-C ligation junction. This is the restriction enzyme that is used when the Hi-C sonication protocol is not followed. Typically the sonication protocol is followed.

Specify a restriction enzyme to shorten the di-tags instead of sonication (double digest protocol):

--re1 A^GATCT,BglII --re2 AG^CT,AluI

The program creates a tab-delimited file listing:

-column 1: chromosome name (as named in the header row of the FASTA file) -column 2: restriction fragment start position -column 3: restriction fragment end position -column 4: restriction fragment number -column 5: the restriction fragment number if the genome underwent a single digest with restriction enzyme 1 -column 6: the restriction site at the five-prime end of the restriction fragment -column 7: the restriction site at the three-prime end of the restriction fragment

The output file reports the first base for each chromosome as 1 (i.e. NOT 0). The restriction fragment number also starts at 1.

Important note: use the same FASTA files to generate the digested reference genome as to generate the Bowtie indices.

Command Line Options

--re1       Restriction enzyme used to digest the genome (the enzyme that 
            forms the ligation junction) e.g. A^GATCT,BglII.  Some Hi-C protocols may use two enzymes at this stage.  To specify two enzymes: -1 A^GATCT,BglII:A^AGCTT,HindIII.
--re2       To specify a restriction enzyme instead of sonication to
            shorten di-tags. This restriction site does NOT form a Hi-C ligation junction. 2 .g. AG^CT,AluI. Typically the sonication protocol is followed.
--config    Specify the name of the optional configuration file
--genome    Name of the genome to be digested (not the path to the genome 
            file or files, but the genome name to include in the output file)
--help      Print program help and exit
--outdir    Specify the directory to which the output files should be 
            written
--quiet     Suppress all progress reports
--version   Print the program version and exit
--zip       Print the results to a gzip file

References

  1. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).