Deduplication Filter

The deduplication filter removes redundancy from a probe list. It works in one of two ways: by comparing probe names and merging probes whose names share a common prefix, or by looking for positional overlaps between probes and merging those that overlap by at least a specified amount.

Where duplicates are detected, you can choose which probe to keep, based either on probe length or on the annotated value each probe has in the probe list you are deduplicating.

The most common use of this filter is to reduce a set of transcript probes to a minimal set by removing additional splice variants, leaving only one variant per gene. The default name filter is set up to do this using the naming convention used by Ensembl (gene-123, where 123 is a transcript identifier).
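
As an illustration of the name-based approach, the following Python sketch strips a trailing transcript suffix from each probe name and keeps one probe per remaining name. It is not the filter's own implementation: the function name, the (name, length) tuple layout, the example probe names and the pattern `-\d+$` are all assumptions made for the example, and the pattern the filter actually uses by default may differ.

```python
import re

def deduplicate_by_name(probes, pattern=r"-\d+$", keep_longest=True):
    """Keep one probe per base name.

    probes is a list of (name, length) tuples (a hypothetical layout).
    pattern is removed from each name; probes whose remaining names
    match are treated as duplicates of each other.
    """
    stripper = re.compile(pattern)
    kept = {}
    for name, length in probes:
        base = stripper.sub("", name)
        current = kept.get(base)
        if current is None:
            kept[base] = (name, length)
            continue
        # Replace the stored probe only if the new one is strictly better.
        better = length > current[1] if keep_longest else length < current[1]
        if better:
            kept[base] = (name, length)
    return list(kept.values())

# Hypothetical transcript probes: two splice variants of one gene, plus one other gene.
probes = [("Actb-001", 1200), ("Actb-002", 1850), ("Gapdh-001", 900)]
print(deduplicate_by_name(probes))
```

With these inputs the two Actb variants collapse to the longer one, and the single Gapdh probe is kept unchanged.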

Options

  1. You can choose whether to deduplicate based on probe names or on positional overlap.
  2. For name-based deduplication you can specify a pattern which will be removed from each probe name. Probes whose names are identical after the pattern has been removed will be merged. The pattern can be a simple string or can use full regular expression syntax.
  3. For overlap-based deduplication you can specify the minimum percentage overlap required between two probes. The overlap is always calculated as a percentage of the shorter of the two probes.
  4. Where duplicates are detected, you can keep the probe with either the highest or the lowest value of either the probe length or the annotated value for that probe in the list being filtered.
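
The overlap calculation in option 3 can be sketched in Python as below. This is a rough illustration rather than the filter's actual code: it assumes both probes are on the same chromosome and use inclusive start and end coordinates, and the function names are hypothetical.

```python
def overlap_percent(a_start, a_end, b_start, b_end):
    """Percentage overlap between two probes, measured against the shorter one.

    Coordinates are assumed to be inclusive and on the same chromosome.
    """
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return 0.0  # the probes do not overlap at all
    shorter = min(a_end - a_start + 1, b_end - b_start + 1)
    return 100.0 * overlap / shorter

def are_duplicates(probe_a, probe_b, min_percent):
    """True if two (start, end) probes overlap by at least min_percent."""
    return overlap_percent(*probe_a, *probe_b) >= min_percent

# A 100 bp probe sharing 50 bp with a 200 bp probe overlaps by 50%.
print(overlap_percent(100, 199, 150, 349))
# A probe lying entirely inside a longer one overlaps by 100%.
print(overlap_percent(100, 199, 50, 500))
```

Measuring against the shorter probe means that a short probe contained entirely within a much longer one counts as a full overlap, whatever the length of the longer probe.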