Quickstart¶
SAM Alignment File¶
trcls requires a pre-aligned SAM file. The program has been used successfully with bowtie2 and HISAT2 aligned files. SAM alignment files from other sequence alignment programs may not work if they represent alignments (particularly, the CIGAR and MD fields) differently from bowtie2 or HISAT2.
The SAM alignment file should be trimmed to the region of interest. This is most
commonly performed with samtools. If you intend to use the annotated SAM
alignment file downstream, which is almost always the case, be sure to include
the headers with the -h
option.
For example, to extract the alignments mapping to FLNA to a file flna.sam from the pre-aligned file pre-aligned.bam.
samtools view -h pre-aligned.bam -o flna.sam chrX:154348524-154374638
Obtaining a GTF Annotation File¶
trcls additionally requires a GTF annotation file of known transcripts from the region of interest. Most typically, these annotations originate from a single gene. There are some requirements that must be met with regard to the contents of this file.
There must be feature lines with a feature field of value ‘exon’. These are, in fact, the only feature lines which trcls reads.
Each feature line with feature field of value ‘exon’ must have an attribute field with tag-value pair with tag ‘transcript_id’ and value corresponding to a unique identifier (usually the accession number) corresponding to the transcript with which this exon belongs.
For examples of valid GTF files, please see the file FLNA.gtf from the project repository. These files can be generated from the UCSC Table Browser. For GTF file specifications, please refer to this page.
Output¶
The annotated SAM alignment is output via stdout. Log messages are sent to stderr.
python3 trcls flna.sam flna.gtf > flna-annotated.sam 2> flna.log
Annotations¶
Annotations are added to each alignment via an optional field tagged TR
of
type string, i.e. TR:Z
. If an alignment fails to annotate, the value for
this field is set as a single asterisk character (*
); otherwise, it is a
comma separated string of possible transcripts which this sequence read may have
originated from. Possible pre-mRNA is always represented by value ‘pre-mRNA’.
For example:
Tag |
Meaning |
---|---|
|
No annotation information available. |
|
Originates from pre-mRNA only. |
|
Originates from NM_001456 only. |
|
Originates from either pre-mRNA, or NM_0001110556. |