Bismark
| Function | A tool to map bisulfite converted sequence reads and determine cytosine methylation states |
|---|---|
| Language | Perl |
| Requirements | A functional version of Bowtie or Bowtie2 is required |
| Code Maturity | Stable (for Bowtie and Bowtie2) |
| Code Released | Yes, under GNU GPL v3 or later. |
| Initial Contact | Felix Krueger |
Bismark is a program to map bisulfite-treated sequencing reads to a genome of interest and perform methylation calls in a single step. The output can be easily imported into a genome viewer, such as SeqMonk, and enables a researcher to analyse the methylation levels of their samples straight away. Its main features are:
- Bisulfite mapping and methylation calling in one single step
- Supports single-end and paired-end read alignments
- Supports ungapped and gapped alignments
- Alignment seed length, number of mismatches etc. are adjustable
- Output discriminates between cytosine methylation in CpG, CHG and CHH context
This link will take you to the Bismark publication.
This link will take you to our review about primary data analysis in BS-Seq.
Here you can access the Bismark documentation Bismark User Guide (pdf).
Here is an overview of the alignment modes that are currently supported by Bismark: Bismark alignment modes (pdf).
Changelog
- 05-10-13: Version 0.7.12 released
-
- Bismark: Removed a rogue sleep(1) command that would slow down single-end Bowtie 2 alignments for a single lane of HiSeq (200M sequences) from ~1 day to 6 years and 4 months (roughly)
- bismark2bedGraph: now keeps track of the temp files it just created in a session instead of using all files in the output folder ending in ".methXtractor.temp". This lets you kick off the bedGraph conversion step from already sorted, individual methXtractor.temp files if desired
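As a rough sanity check of the figure in the first item above: assuming the rogue sleep(1) fired once per sequence (the changelog does not state this explicitly), 200M sequences add 200M seconds of waiting. The function name below is made up for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25

def sleep_overhead_years(n_sequences, sleep_seconds=1.0):
    """Total time spent sleeping, in years, if every sequence costs one sleep."""
    return n_sequences * sleep_seconds / SECONDS_PER_DAY / DAYS_PER_YEAR

years = sleep_overhead_years(200_000_000)  # one lane of HiSeq, as quoted above
whole_years = int(years)
months = (years - whole_years) * 12
print(f"{whole_years} years and {months:.0f} months")  # 6 years and 4 months
```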
- 04-22-13: Version 0.7.11 released
-
- Bismark: Fixed non-functional single-end alignments with Bowtie2 which were accidentally broken by introducing the option '--pbat' in v0.7.10 (an evil 'if' instead of 'elsif'...)
- For paired-end alignments with Bowtie 1, the option '--non_bs_mm' would accidentally confuse the number of mismatches of read 1 and read 2 whenever the first read aligned in reverse orientation, i.e. for OB and CTOT alignments. This has now been corrected
- Previously, the option '--non_bs_mm' would potentially output non-integer values for Bowtie 2 alignments if the read (or reference) contained 'N' characters. Alignment scores from 'N's are now adjusted so that they count as mismatches similar to what Bowtie 1 does. This works fine for reads with up to and including 5 N's (which is quite a lot...)
- Methylation extractor: To avoid duplication and keep code modular, the bedGraph conversion step invoked by the option '--bedGraph' has now been farmed out to the module 'bismark2bedGraph'. This script is independent of the methylation extractor and also works as a stand-alone tool from the methylation extractor output (uncompressed or gzip compressed files). To work well from within the methylation extractor this script (which is now included in the Bismark package) needs to reside in the same folder as the 'bismark_methylation_extractor' itself
- bismark2bedGraph: Temporary chromosome files now have an input file name included in their file name to enable parallel processing of several files in the same directory at the same time
- To avoid duplication and keep code modular, the bedGraph to genome-wide cytosine methylation report step invoked by the option '--cytosine_report' has now been split out to the module 'bedGraph2cytosine'. This script is independent of the methylation extractor and also works as a stand-alone tool from the Bismark bedGraph '--counts' output (uncompressed or gzip compressed files). To work well from within the methylation extractor this script (which is now included in the Bismark package) needs to reside in the same folder as the 'bismark_methylation_extractor' itself
- Deduplication script: Fixed some warnings that were thrown if '--bam' was not specified
- 04-18-13: Version 0.7.10 released
-
- Bismark: Added new option '--gzip' that causes temporary bisulfite conversion files to be written out in a GZIP compressed form to save disk space. This option is available for most alignment modes with the exception of paired-end FastA files
- Added new option '--bam' that causes the output file to be written out in BAM format instead of the default SAM format. Bismark will attempt to use the path to Samtools that was specified with '--samtools_path', or, if it hasn't been specified explicitly, attempt to find Samtools in the PATH. If no installation of Samtools can be found the SAM output will be compressed with GZIP instead (yielding a .sam.gz output file)
- Added new option '--samtools_path' to point Bismark to your Samtools installation, e.g. /home/user/samtools/. Does not need to be specified explicitly if Samtools is in the PATH
- Added new option '--pbat' which is to be used for PBAT-Seq libraries (Post-Bisulfite Adapter Tagging; Kobayashi et al., PLoS Genetics, 2012). This is essentially the exact opposite of alignments in 'directional' mode, as it will only launch two alignment threads to the CTOT and CTOB strands instead of the normal OT and OB ones. The option '--pbat' works currently only for single-end and paired-end FastQ files for use with Bowtie1 and uncompressed temporary files only (there are no plans to extend this to other alignment modes at present)
- Methylation extractor: The methylation extractor does now also read BAM files, however this requires a working copy of Samtools. The new option '--samtools_path' may point the methylation extractor to your Samtools installation, e.g. /home/user/samtools/. This does not need to be specified explicitly if Samtools is in the PATH
- Added new option '--gzip' to write out the primary methylation extractor files (CpG_OT_..., CpG_OB_... etc) in a GZIP compressed form to save disk space. This option does not work on bedGraph and genome-wide cytosine reports as they are 'tiny' anyway
- The methylation extractor does now treat InDel free reads differently than before which leads to a ~60% increase in extraction speed for ungapped alignments in SAM format!
- When sorting methylation calls for the bedGraph step, the methylation extractor does now use the output directory to store temporary sort files instead of the default /tmp/ directory
- Deduplication script: The deduplication script does now also read BAM files, however this requires a working copy of Samtools. The new option '--samtools_path' may point the script to your Samtools installation, e.g. /home/user/samtools/. This does not need to be specified explicitly if Samtools is in the PATH
- The deduplication script also received the new option '--bam' to write out deduplicated files directly in BAM format. If no installation of Samtools can be found the SAM output will be compressed with GZIP instead (yielding a .sam.gz output file)
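The Samtools fallback described for '--bam' can be pictured roughly as follows. This is only a Python sketch of the behaviour described above, not Bismark's actual (Perl) code; the function names, the 'samtools' executable name inside the given folder, and the handling of an unusable explicit path are all assumptions:

```python
import os
import shutil

def resolve_samtools(samtools_path=None, which=shutil.which):
    """Return a usable samtools executable, or None if none can be found."""
    if samtools_path:
        # Explicit folder given, e.g. /home/user/samtools/
        candidate = os.path.join(samtools_path, "samtools")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
        return None  # assumption: an unusable explicit path is not retried via PATH
    # Nothing specified explicitly: look for samtools in the PATH
    return which("samtools")

def output_suffix(samtools):
    """BAM if samtools is available, otherwise GZIP compressed SAM."""
    return ".bam" if samtools else ".sam.gz"

print(output_suffix(resolve_samtools()))
```

The `which` parameter exists only so the lookup can be swapped out when trying the sketch on a machine without Samtools.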
- 03-01-13: Version 0.7.8 released
-
- Bismark: Added new option '--non_bs_mm' which prints an extra column at the end of SAM files showing the number of non-bisulfite mismatches of a read. This option is not available in '--vanilla' format. Format for single-end reads: "XA:Z:mismatches". Format for paired-end reads: read 1: "XA:Z:mismatches", read 2: "XB:Z:mismatches"
- Bismark: The mapping report file names were changed to _bismark_(SE/PE)_report.txt (Bowtie 1) or bt2_bismark_(SE/PE)_report.txt (Bowtie 2) to keep it more uniform
- Methylation extractor: The input file(s) may now be specified with a file path which abrogates the need to be in the same directory as the input file(s) when calling the methylation extractor
- Methylation extractor: Added new function '--buffer_size' to increase the physical memory used for sorting the output by chromosomal positions (only needed for bedGraph output)
- Methylation extractor: Reference sequence files containing pipe ('|') characters were found to crash the methylation extractor as the chromosome name was used for filenames. These characters are now replaced with underscores when the reads are sorted during the bedGraph step
- Updated the Bismark User Guide with sections for the bedGraph and genome-wide methylation report outputs, and Appendix IV is now showing alignment stats for the test data
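To make the '--non_bs_mm' format concrete, here is a small Python sketch that pulls the XA:Z or XB:Z value out of a SAM line. The example record is invented for illustration and the function name is hypothetical; only the tag layout is taken from the description above:

```python
def non_bs_mismatches(sam_line):
    """Return (tag, mismatches) for the XA:Z or XB:Z field, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # SAM has 11 mandatory columns; optional TAG:TYPE:VALUE fields follow
    for field in fields[11:]:
        tag, value_type, value = field.split(":", 2)
        if tag in ("XA", "XB") and value_type == "Z":
            return tag, int(value)
    return None

# Invented single-end record, not real Bismark output
line = "read1\t0\tchr1\t100\t255\t10M\t*\t0\t0\tTTTTGGAATT\tIIIIIIIIII\tNM:i:1\tXA:Z:1"
print(non_bs_mismatches(line))  # ('XA', 1)
```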
- 02-10-12: Version 0.7.7 released
-
- When reading in the genome file, Bismark does now automatically remove \r line ending characters as well. This sometimes caused problems when genome files had been edited on Windows machines.
- Added support for the Bowtie 2 options '--rdg int1,int2' and '--rfg int1,int2' to adjust the gap open and extension penalties for both read and reference sequence. This might be useful in very specialised circumstances (e.g. when handling PacBio data...)
- The methylation extractor received a fairly extensive overhaul:
- Renamed methylation_extractor to bismark_methylation_extractor
- Added new function '-o/--output' to specify an output directory. This became necessary for integration into Galaxy
- Added new function '--no_header' to suppress the Bismark version header in the output files if plain alignment data is more desirable
- Added option '--bedGraph' to produce a bedGraph output file once the methylation extraction has finished; this reports the genomic location of a cytosine and its methylation state (in %). By default, only cytosines in CpG context will be sorted/reported
- Implemented option '--cutoff threshold' to set the minimum number of times a methylation state has to be seen for that nucleotide before its methylation percentage is reported
- Implemented option '--counts' which adds two additional columns to the bedGraph output file to enable further calculations:
Column 5: count of methylated calls per position
Column 6: count of unmethylated calls per position
- Implemented option '--CX_context' so that the sorted bedGraph output file contains information on every single cytosine that was covered in the experiment irrespective of its sequence context
- Added option '--cytosine_report' which produces a genome-wide methylation report for all cytosines. By default, the output uses 1-based chromosome coordinates and reports CpG context only. The output considers all Cs on both forward and reverse strands and reports their position, strand, trinucleotide content and methylation state
- Option '--CX_context' applies to the cytosine report as well. The output file will contain information on every single cytosine in the genome irrespective of its context. This applies to both forward and reverse strands
- Implemented option '--zero_based' to use zero-based coordinates, as used in e.g. BED files, instead of 1-based coordinates
- Implemented option '--genome_folder PATH' to specify the genome folder from which sequences are extracted. Accepted formats are FastA files ending with '.fa' or '.fasta'
- Added an option '--split_by_chromosome' which writes the cytosine report output to individual chromosome files instead of to one single very large file
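The interplay of '--bedGraph', '--cutoff' and '--counts' described above can be sketched in a few lines of Python. This is an illustration only, not the extractor's code. Three things are assumptions: the input is a list of (chromosome, 1-based position, methylated?) calls, the cutoff is the minimum total number of calls at a position, and the start/end columns follow the usual bedGraph convention of a zero-based start:

```python
from collections import defaultdict

def bedgraph_rows(calls, cutoff=1, counts=False):
    """Summarise per-cytosine calls into bedGraph-like rows."""
    tally = defaultdict(lambda: [0, 0])  # (chrom, pos) -> [methylated, unmethylated]
    for chrom, pos, methylated in calls:
        tally[(chrom, pos)][0 if methylated else 1] += 1

    rows = []
    for (chrom, pos), (meth, unmeth) in sorted(tally.items()):
        if meth + unmeth < cutoff:
            continue  # too few calls at this position to report a percentage
        percent = 100.0 * meth / (meth + unmeth)
        row = [chrom, pos - 1, pos, percent]
        if counts:
            row += [meth, unmeth]  # columns 5 and 6
        rows.append(row)
    return rows

calls = [("chr1", 10, True), ("chr1", 10, True), ("chr1", 10, False),
         ("chr1", 10, True), ("chr1", 25, False)]
print(bedgraph_rows(calls, cutoff=2, counts=True))
# [['chr1', 9, 10, 75.0, 3, 1]]
```

Position 25 is dropped here because it was seen only once, which is below the cutoff of 2.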
- 23-08-12: Update to genome_methylation2bedGraph script
-
- Added an option '--split_by_chromosome' to enable sorting of very large files. The methylation extractor output is first written into temporary files chromosome by chromosome. These temporary files are then sorted by position and deleted afterwards
- Added an option '--counts' which adds 2 more columns to the output file to enable further calculations (technically no longer in bedGraph format then...):
Column 5: count of methylated calls per position
Column 6: count of unmethylated calls per position
- 31-07-12: Version 0.7.6 released
-
- Reworked the way in which SAM files (both single and paired-end) are handled in the methylation extractor so that reads containing InDels, which may be generated by Bismark using Bowtie 2, are now handled as intended. Bismark users employing Bowtie 2 for alignments are strongly encouraged to upgrade to this version
- Changed the way in which the methylation extractor identifies the read and genome conversion flags in SAM output. This might become relevant if the Bismark SAM mapping output was compressed/decompressed with CRAM or Goby at some point, since these tools may change the order of optional tags in a SAM entry. Thanks to Z. Zeno for pointing this out and contributing a patch
- 16-07-12: Version 0.7.5 released
-
- Trailing read ID segment numbers (e.g. /1,/2 or /3) are now removed internally for Bowtie 2 alignments in paired-end mode as this might have caused no reads to align at all if the segment number was not 1 or 2. As of Bowtie 2 version 2.0.0-beta7 this behavior has been disabled for unpaired reads
- The Bowtie 2 option -M is now deprecated (as of Bowtie 2 version 2.0.0-beta7). What used to be called -M mode is still the default mode, but adjusting the -M setting is deprecated. The options -D and -R should be used to adjust the effort expended to find valid alignments
- Changed the default seed mismatch parameter (controlled by -n) to 1 (down from 2). This increases alignment speed noticeably and typically produces very similar results for good quality read data
- Fixed a bug where the chromosomal sequence could not be extracted for very short genomic sequences for alignments with Bowtie 2
- The methylation extractor and the Bismark alignment output deduplication script do now read both raw and gzipped (.gz) Bismark mapping files
- Manual updated accordingly
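The read ID clean-up mentioned in the first item of this release amounts to something like the following Python snippet. The exact pattern Bismark uses is not given here, so the regular expression and the function name are assumptions:

```python
import re

def strip_segment_number(read_id):
    """Remove a trailing /1, /2, /3 ... segment number from a read ID."""
    return re.sub(r"/\d+$", "", read_id)

print(strip_segment_number("SRR020138.1/3"))  # SRR020138.1
```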
- 26-04-12: Version 0.7.4 released
-
- Introduced a new option '--temp_dir <dir>' to which the C-to-T or G-to-A transcribed temporary files can be written instead of using the same folder that contains the input files. This might become useful for implementation into Galaxy.
- The input files to be aligned may now contain path information, e.g. /home/user/file.fq or ../temp/file.fq, and one no longer has to call Bismark from within the directory containing the input files.
- 05-04-12: Version 0.7.3 released
-
- Corrected a bug for the TLEN field in paired-end SAM output. This value was occasionally calculated incorrectly if both reads were overlapping almost entirely with a difference of only a single bp between the end of one read and the start of the second read. This did not affect the output of the methylation extractor but merely the display of the read alignment itself
- Removed a potential source of crashes with gzipped input files and the option -u/--qupto
- methylation_extractor: Corrected a potential flaw for the 'remove overlap' option for paired-end alignments in --vanilla mode when the first read aligned in a reverse orientation
- methylation_extractor: file endings of all files generated by the methylation extractor will be only a single '.txt' if the file was called .txt before
- 14-03-12: Version 0.7.2 released
-
- methylation_extractor: changed the file endings of all files generated by the methylation extractor to '.txt'; this is to avoid confusing these files with SAM formatted Bismark output files
- deduplicate_Bismark_alignment_output.pl: Fixed a bug for paired-end deduplication mode in SAM format, which only printed the second read alignment of a pair to the deduplicated file
- trim_galore: Updated so that non-RRBS FastQ files are adapter and quality trimmed in a single pass
- trim_galore: added an option --fastqc_args "..." to pass extra arguments to FastQC for easier integration into pipelines
- trim_galore: Added some more documentation and trim_galore can now be downloaded separately here
- validate_paired_end_files: Updated so that one can optionally write out unpaired single-end reads should a read-pair fail to be considered a valid paired-end read pair
- 29-02-12: Version 0.7.1 released
-
- Adjusted Bismark so that white spaces or tab characters in the read IDs get replaced with underscores on the fly. This was necessary because some ID checks would fail as Bowtie2 truncates read IDs if it encounters spaces in the read ID (causing errors with the latest RTA version), whereas Bowtie 1 only truncates read IDs if 'tab' characters were found. More information about this can be found in the RELEASE_NOTES.
- An RRBS QC pack is now available for download which contains a brief guide to RRBS, the Cutadapt-wrapping script trim_galore as well as a validate_paired_end_files script to remove read pairs for which at least one of the reads has become too short after quality and/or adapter trimming.
- 24-02-12: Version 0.7.0 released
-
- Changed Bismark's behavior for "--directional" mode (default) to run only 2 parallel instances of Bowtie 1/2 to the original top (OT) and bottom (OB) strands, instead of 4 instances to all possible bisulfite strands. This change might result in somewhat faster alignment speed and mapping efficiency. It is still possible to run the 4-alignment strand mode for any combination of input file(s) and choice of aligner by specifying --non_directional.
- Changed the --score_min default function for Bowtie 2 alignments to a more stringent setting of "L,0,-0.2" instead of using the Bowtie2 default function (which was "L,0,-0.6")
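Bowtie 2 reads a string such as "L,0,-0.2" as a linear function of the read length, f(x) = 0 + -0.2 * x, and uses the result as the minimum score an alignment needs to be considered valid. The Python sketch below (a hypothetical helper, with a read length of 50 bp picked arbitrarily) shows how much stricter the new default is:

```python
def linear_min_score(spec, read_length):
    """Evaluate a Bowtie 2 style linear function string such as 'L,0,-0.2'."""
    kind, constant, coefficient = spec.split(",")
    if kind != "L":
        raise ValueError("this sketch only handles linear ('L') functions")
    return float(constant) + float(coefficient) * read_length

for spec in ("L,0,-0.6", "L,0,-0.2"):
    print(spec, linear_min_score(spec, 50))
```

For a 50 bp read the threshold moves from about -30 to about -10, so far fewer penalties are tolerated before an alignment is rejected.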
- 06-02-12: Version 0.6.4 released
-
- Adjusted the options -u and -s so that only the non-skipped part of the input file will be transcribed and analysed. This allows splitting up very large files into smaller chunks to allow parallel processing, e.g. -s 10000000 -u 20000000 would analyse sequences 10000001 to 20000000. The alignment report will be based on this reduced number of reads analysed
- In paired-end mode, the options --unmapped and --ambiguous do now output unaligned or multiply aligned reads, respectively, to their correct output files as intended
- Sequences in FastA format do now receive Phred score qualities of 40 throughout (ASCII 'I') to prevent the SAM to BAM conversion in SAMtools from failing
- If a genomic sequence could not be extracted it will now also be counted and reported for use with Bowtie 1
- Suppressed debugging warning messages that were printed in error for Bowtie2 alignments (single-end mode only)
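The chunking use case for -s and -u from the first item of this release can be planned with a small helper. The Python sketch below only computes the (skip, upto) pairs, following the example above where -s 10000000 -u 20000000 covers sequences 10000001 to 20000000. The function name and the file size are made up, and whether the first chunk should pass -s 0 or simply omit -s is left open:

```python
def chunk_options(total_reads, chunk_size):
    """Yield (skip, upto) pairs; each pair covers sequences skip+1 .. upto."""
    for skip in range(0, total_reads, chunk_size):
        yield skip, min(skip + chunk_size, total_reads)

for skip, upto in chunk_options(25_000_000, 10_000_000):
    print(f"-s {skip} -u {upto}  (sequences {skip + 1} to {upto})")
```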
- 04-01-12: Version 0.6.3 released
-
- The methylation extractor does now also work with Bismark SAM output files
- Fixed a bug caused when a read was called 0 (zero)
- Changed the XX:Z mismatch field in the SAM output to display mismatching nucleotides of the reference sequence (instead of the read sequence ones)
- 15-12-11: Version 0.6.beta2 released
-
- Added a parallelization option for Bowtie 2 alignments ('-p'). Since it makes use of the option '--reorder' this option requires a Bowtie 2 version of 2.0.0-beta5 or higher. This option is still experimental and is only recommended for use on very powerful hardware setups (i.e. lots of cores and memory).
- 08-12-11: Version 0.6.beta1 released
-
- Bismark does now also support gapped alignments with Bowtie 2 (when specifying the option '--bowtie2')
- The bismark_genome_preparation does now also generate Bowtie 2 bisulfite indexes
- The Bismark default output has been changed to SAM format (for both Bowtie 1 and Bowtie 2)
- The 'old' output is still available via the option '--vanilla'
- Slightly increased the alignment efficiencies for Bowtie 1 alignments
- Changed the default mapping behavior to the former option '--directional' ('--non_directional' re-enables four-strand output)
- Changed the default maximum insert size parameter (-X/--maxins) for paired-end alignments to 500bp (up from 250bp)
- The methylation extractor works currently only on the 'vanilla' Bismark output
- The bismark2SAM script will now reverse qualities and methylation calls when reads were reverse-complemented
- 17-10-11: Version 0.5.4 released
-
- Bismark will now accept input files in either normal (uncompressed) or gzipped format
- Added the option -o/--output_dir <dir> to Bismark which lets you specify the folder for all Bismark output files instead of writing into the same folder as the input file(s). If the output directory does not exist already it will be created first
- The path to the genome folder can now be absolute or relative (e.g. ../genomes/mouse/)
- Changed the way unmapped or ambiguous reads are reported so that one unmapped read file (and/or ambiguous read file) is generated per input file. Their names will be derived from the input file name. For paired-end samples, the unmapped or ambiguous filenames can be discriminated by _1 and _2 in their file names
- Added the number of sequences analysed in total to the paired-end report file (was only printed on screen previously)
- Fixed a bug for the FastQ output for ambiguous reads where quality scores were not followed by a new line
- 20-09-11: Update to bismark2SAM script
-
- The bismark2SAM script does now also report the methylation calls in a custom field (XM) for easier downstream processing. In addition, the second read of a paired-end alignment has a 2 at the end of the ID field to reflect the paired-end nature. Thanks to T. McBryan for implementing these new features.
- 13-09-11: Version 0.5.3 released
-
- Increased the 'chunkmbs' default value to 512 MB (up from 256 MB)
- Corrected a mix-up of the strand names of the complementary strands in the alignment report for single-end alignments (see release notes)
- Fixed a bug in the genome_methylation_bismark2bedGraph script that was introduced during the 1-based (Bismark) to 0-based (bedGraph) coordinate adaptation in June 2011. Thanks to M.A. Bentley for his contributions to the new version.
- Improved the bismark2SAM script to more accurately describe the origin of a bisulfite strand in the bitwise FLAG field. Thanks to E. Vidal for his contributions to the new version.
- 16-08-11: Version 0.5.2 released
-
- Increased the 'chunkmbs' default value to 256 MB (up from 64 MB)
- Bismark will now accept input files in both comma and space separated format
- Fixed a bug in the methylation extractor which resulted in offset positions for reverse reads when the option '--ignore' was used (single-end only)
- Included a check (and warning) whether the read IDs in the input files contain tab characters, as this will cause Bowtie to truncate the reads and result in no alignments
- 16-06-11: Version 0.5.1 released
-
- The genome folder for the bismark_genome_preparation can now be specified either as absolute or relative path
- Fixed a bug where a newline character was missing after the quality values in the unmapped reads FastQ output
- Fixed a bug which prevented paired-end alignments in FastA format
- Input files for the methylation extractor can now also have a relative path
- 21-04-11: Version 0.5.0 released
-
- Bismark alignments should now also support FastQ files produced by Casava v1.8 which will be available soon
- The Bismark output will now have an additional column (2 extra columns for paired-end data) with the basecall qualities (in Phred33/ Sanger format; left blank for FastA data)
- A bug was fixed for the reporting of paired-end alignments whereby alignments to the CTOT strand were assigned to CTOB strand and vice versa
- 10-02-11: Version 0.4.1 released
-
- Bisulfite genomes are now written into a multi-FastA file by default. This allows indexing of new genomes with tens of thousands of contigs or scaffolds
- The internal reporting of paired-end alignments was changed, so that sequences which produce two identical alignments are preferentially assigned to the original strands as intended
- 04-02-11: Version 0.4.0 released
-
- The option --directional is now also available for paired-end libraries. This will ignore alignments to strands which should theoretically not be sequenced
- Fixed a strand confusion in the alignments summary report for paired-end alignments (this only affected the report but not any alignments as such)
- 26-01-11: Version 0.3.0 released
-
- The Bismark User Guide replaces the previous documentation (INSTALL.txt and README.txt). It is easy to follow and contains many more details about BS-Seq and Bismark
- A BS-Seq test dataset is now available for download. It contains 10K sequences (human, shotgun) in FastQ format, taken from the SRR020138 data set (Lister et al, 2009).
- Both bismark and bismark_genome_preparation will now recognise the reference genome sequences with either .fa or .fasta file extensions
- 18-01-11: Version 0.2.6 released
-
- Fixed a bug which might have been caused by specifying very lax alignment parameters (allowing 10+ non-BS mismatches)
- 22-12-10: Version 0.2.5 released
-
- Added the option '--un <filename>' to write out unaligned reads to <filename>
- Added the option '--ambiguous <filename>' to write out ambiguously aligned reads to <filename>
- 18-11-10: Version 0.2.4 released
-
- Added the option '-I/--minins <int>' to modify the minimum valid insert size for paired-end alignments
- Added the option '-X/--maxins <int>' to modify the maximum valid insert size for paired-end alignments
- Changed the remove_tree command in the genome preparation script to rm_tree to be compatible with older versions of Perl (thanks to S. Cooper for spotting this)
- 04-11-10: Version 0.2.3 released
-
- Added the option '--directional' to Bismark to only report alignments to the original strands if the library was generated in a strand-specific manner
- The alignment option '--best' will now be selected by default to ensure the best possible alignments
- All Bismark output files will now end in .txt so they can be viewed or imported more easily
- Changed the reporting format slightly to increase readability
- 13-09-10: Version 0.2.2 released
-
- Fixed a bug whereby the methylation positions of certain reverse mapped reads were offset by a few bp (in the methylation extractor output)
- 08-09-10: Version 0.2.1 released
-
- The Bismark aligner will now handle Multi-Fasta-Files (MFA) as intended.
- 07-09-10: Version 0.2.0 released
-
- Non-CpG context methylation will now be subdivided into CHG and CHH context
- Added the option '--chunkmbs <int>' to counteract Bowtie best-first memory chunk exhaustion warnings in --best and paired-end alignment mode
- Added the option '--quiet' so that bowtie warnings can be suppressed if desired
- FastA files no longer need the file extension '.fa' in order to work
- Bismark will no longer tolerate non-unique chromosome names when reading the genome into memory
- Fixed an issue with paired-end report files
- The methylation extractor will by default produce individual output files for CpG, CHG and CHH context
- The methylation extractor can optionally merge CHG and CHH context into 'non-CpG' context if desired
- The methylation extractor will ensure that its version matches the Bismark version used to generate the Bismark mapping results file
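For readers new to the terminology used in this release: H stands for any base other than G (A, C or T), so the three contexts can be told apart from the two bases following a cytosine. Below is a minimal Python sketch for the forward strand of an upper-case sequence, with a made-up function name. It illustrates the definitions and is not the Bismark implementation:

```python
def cytosine_context(sequence, index):
    """Classify the C at `index` (forward strand) as 'CpG', 'CHG' or 'CHH'.

    Returns None when the context cannot be determined, e.g. at the very
    end of the sequence or next to an N.
    """
    if sequence[index] != "C":
        raise ValueError("position does not hold a cytosine")
    downstream = sequence[index + 1:index + 3]
    if downstream[:1] == "G":
        return "CpG"
    if len(downstream) < 2 or "N" in downstream:
        return None
    return "CHG" if downstream[1] == "G" else "CHH"

for seq in ("ACGT", "ACAGT", "ACATT", "AC"):
    print(seq, cytosine_context(seq, 1))
```

Cytosines on the reverse strand would be handled the same way after reverse-complementing the sequence.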
- 09-08-10: Version 0.1.5 released
-
- Fixed a bug whereby specifying '-n 0' as alignment parameter would not work correctly
- 06-08-10: Version 0.1.4 released
-
- Bismark will no longer stop during the methylation call process when it encounters ambiguity bases in the reference genome
- Fixed a strand-specificity mix-up in the single-end methylation extractor output
- 03-08-10: Version 0.1.3 released
-
- The genome indexer will now properly (and recursively) remove any pre-existing bisulfite genome directory before creating a new one
- The genome indexer will now convert ambiguity codes for DNA (anything other than C, A, T or G) into N's
- The genome indexer does now also accept FastA files with multiple sequence entries
- Fixed a strand-specificity mix-up in the single-end methylation extractor output
- The option to ignore bases in the methylation extractor does now correctly alter the position of the remaining methylation calls
- Added an option to the methylation extractor to score overlapping methylation calls for paired-end alignments only once
- 17-06-10: Version 0.1.2 released
-
- Both single-end and paired-end alignments have a new and final output format (see README.txt for more details)
- Bismark and the Methylation Extractor will include their version info in the first line of the output file
- Fixed a bug with the chromosome name resolution for paired-end alignments
- Reads aligning to the very edges of chromosomes previously produced several error messages when trying to extract one additional bp to determine if Cs are in CpG context. These reads will be excluded.
- The Bismark and Methylation Extractor --help option will give info about their output file format
- 15-06-10: Version 0.1.1 released
-
- Bismark now also handles genome FastA files in formats other than Ensembl format
- Fixed a runtime bug with first alignments
- 14-06-10: Version 0.1 released
-
- Initial release
- All basic functions working