***************************************************************************************
NOTE:  the executable is named "run_LA"  not "run_TA"; this is very old documentation.
- EJW
***************************************************************************************


Software Development Group : closure
Faculty Point of Contact : TBA
Software Developer : Granger Sutton
Description : TIGR_Assembler is a program for shotgun fragment assembly. It
takes as input a set of sequence fragments and outputs a set of consensus
sequences.
Keywords : shotgun fragment assembly
mailto : grange@tigr.org
------------------------------------------------------------------------------


This file contains information about how to assemble shotgun sequencing
projects using TIGR_Assembler. Included are instructions for running
three programs: pull_contig_js.sp, run_TA, and TIGR_Assembler. pull_contig_js.sp
is a perl program for pulling down files used as input for TIGR_Assembler.
run_TA is a csh shell script for running TIGR_Assembler without having to
explicitly provide all of the input parameters. TIGR_Assembler should rarely
if ever need to invoked directly. Also included is some information on input
and output file formats. Please look for items labelled ***NEW*** to keep
up with important changes.

Examples of how to use run_TA:

For a brand new project,

run_TA '-s -q proj.qual' proj.seq >& proj.run_log &

For a project already editted the old way,

run_TA '-s -q proj.qual -O proj.order' proj.seq >& proj.run_log &

For a project already editted the new way,

run_TA '-s -q proj.qual -C proj.contig' proj.seq >& proj.run_log &


where you can leave out the -s if you don't want singletons included in
the asmbl_link and assembly tables. proj is just the name of the project
you're working on. proj.qual is the quality values file. proj.seq is the
sequences file. proj.run_log is just an error log file. proj.order is the
sequence merge order file constructed from the current set of assemblies.
proj.contig is the contig file representing the current set of assemblies.

The .seq, .qual, .order, and .contig files can be generated using a perl
script pull_contig_js.sp which is also installed. Some examples:

New project

pull_contig_js.sp -p passwd_file -o proj -n seq.sql >& proj.pull_log &

editted old way

pull_contig_js.sp -p passwd_file -o proj -n seq.sql -t ord.sql >& proj.pull_log &

editted new way

pull_contig_js.sp -p passwd_file -o proj -n seq.sql -a asm.sql >& proj.pull_log &

where passwd_file is a sybase passwd_file of the form
user_login
database
sybase_password
sybase_server

proj is still the name of the project which the script will use as the prefix
for the files it generates.

seq.sql is a file with sql to select JUST the seq_names to be used for this
project such as

select seq_name from sequence
where trash is null and ed_ln > 0

ord.sql is a file with sql to select the asmbl_ids to jumpstart the order such
as

select asmbl_id from assembly
where seq# > 1
and asmbl_id >= 1202
and asmbl_id <= 1456

asm.sql is a file with sql to select the asmbl_ids to jumpstart with contigs such
as

select asmbl_id from assembly
where seq# > 1
and asmbl_id >= 1202
and asmbl_id <= 1456


I haven't really finalized how I'm doing clone length constraints so for the time
being the .seq file needs to be modified.

Use:

sed -e '/^>/s/ / minimum maximum median /' < proj.seq > proj.tmp
mv proj.tmp proj.seq

where minimum maximum median need to be replaced in the above by three numbers
representing your best guess of the minimum, maximum and median clone lengths
for the plasmid library used for this project. (If there is more than one
library with different values the sed command can be modified.)
Example:

sed -e '/^>/s/ / 500 3000 2000 /' < proj.seq > proj.tmp
mv proj.tmp proj.seq

The -c 'min_clone_len max_clone_len med_clone_len' option has been added to
specify the clone length constraints to be put in the .seq file without
having to run sed. WARNING: this still does not solve the problem of multiple
libraries with different clone length constraints.

Here is the help page if you run
pull_contig_js.sp -h

usage: pull_contig_js.sp options
        -U sybase user [default = unix id]
        -P password
        -D database
        -S server [default = DSQUERY]
              OR
        -p password_file

        -o output prefix [default = tmp]

	-c 'min_clone_len max_clone_len med_clone_len'
        -x delete seq_name SQL file
        -X delete seq_name list file
        -n seq_name SQL file
        -N seq_name list file
        -a asmbl_id SQL file (for contig jumpstart)
        -A asmbl_id list file (for contig jumpstart)
        -t asmbl_id SQL file (for order jumpstart)
        -T asmbl_id list file (for order jumpstart)
        -h (help)


Some options that were not mentioned above are the capital letter options
versus lower case letter options. For lower case (n,a,t,x) the file following
the option contains an sql query returning seq_names (n,x) or asmbl_ids (a,t).
For upper case (N,A,T,X) the file following the option contains a list of
seq_names (N,X) or asmbl_ids (A,T). The lower case and upper case options
can be combined in any combination and the program will take the union of
the two options. The x/X options stand for "eXclude" and will cause the
specified seq_names to be excluded from the output of the program in all
four output files (.seq,.qual,.order,.contig). The x/X seq_names override
all of the other options in terms of these seq_names definately not being
in the output. The relationship of options to output files is as follows:

n/N seq_names for the .seq and .qual files
a/A asmbl_ids for the .contig file and all seq_names in these assemblies
    in the .seq and .qual files
t/T asmbl_ids for the .order file and all seq_names in these assemblies
    in the .seq and .qual files
x/X seq_names to exclude from all of the output files

You can have pull_contig_js.sp generate multiple contigs for a
single assembly with regions removed by using the following format in
the file specified by the -A option:

asmbl_id lend1-rend1:lend2-rend2:...

3636 13425-14567:19876-20678:33333-39876

sequences which are COMPLETELY inside these ranges as determined by asm_lend
and asm_rend in the asmbl_link table will be excluded from the output files
and 3 or 4 contigs will be generated depending on whether the length of
assembly 3636 is greater than 39876.

Here is the help page if you run
TIGR_Assembler -h

The most often modified parameters for changing the merging frequency
are p,e,g,l. For jumpstarting you can use C and/or O. A new one which
can be helpfull is R for reverse complimenting an assembly.

Usage: TIGR_Assembler [options] scratch_file < input_file > output_file
  scratch_file - is the name of temporary data file TIGR_Assembler initially creates and
    later deletes at the end of execution.
  input_file - is the name of the multiple fasta formatted input file of DNA
    sequences. The format of the fasta description line should be:
    >seq_name minimum_clone_length maximum_clone_length median_clone_length clear_end5 clear_end3
    or
    >db|seq_name minimum_clone_length maximum_clone_length median_clone_length clear_end5 clear_end3
    e.g.
    >GHIBF57F 500 3000 1750 33 587
  output_file - is the name of the lassie formatted file which TIGR_Assembler generates
    to describe the set of assemblies TIGR_Assembler constructs. This is basically a flat
    file representation of the assembly and asmbl_link tables.
options:
  -a alignment_directory - is the name of a directory that TIGR_Assembler creates to store
    alignment files in GDE flat format.
  -A ace_output_file - tells TIGR_Assembler to generate "ace" output in the file
    name given.
  -c coverage_filename - output coverage file using the coverage_filename
  -C contig_file - is for "jumpstarting" the alignment process with a previously
    aligned set of contigs in the same format as the GDE alignment files.
  -d - is a flag saying to use the fasta description line as the description line in the
    "ace" file output.
  -D phd_dir - specifies the directory for .phd files for the description line in the
    "ace" file output.
  -e maximum_end - is the maximum length at the end of a DNA fragment that does
    not match another overlapping DNA fragment (sometimes referred to as
    overhang) that will not disqualify a DNA fragment from becoming part of an
    assembly.
  -f fasta_file - is the multiple fasta formatted file TIGR_Assembler creates of the
    consensus sequences of all assemblies including singletons.
  -g max_err_32 - is the maximum number + 1 of alignment errors (mismatches
    or gaps) allowed within any contiguous 32 base pairs in the overlap region
    between two DNA fragments in the same assembly. This is meant to split apart
    splice variants which have short splice differences and would not be
    disqualified by the -p minimum_percent parameter.
  -l minimum_length - is the minimum length two DNA fragments must overlap to
    be considered as a possible assembly.
  -L - is a flag which causes even very LOW pairwise scores to be considered - caution
    using this flag may cause longer run time and a worse assembly.
  -n asm_prefix - is the prefix for assembly names in the fasta file and file
    names in the alignment directory - default is "asm".
  -N - is a flag which indicates the DNA fragments in the input_file should not
    be treated as random genomic fragments for the purpose of determining repeat
    regions.
  -O order_file - specifies which sequences should be merged first in what order.
    The format is three fields per line: # seq_name1 seq_name2  - but seq_name2 is
    optional. Larger numbers in the first field are merged first. If seq_name2 is not
    given then all sequences matching seq_name1 will be merged first.
  -p minimum_percent - is the minimum percent identity that two DNA fragments
    must achieve over their entire region of overlap in order to be considered
    as a possible assembly. Adjustments are made by the program to take into
    account that the ends of sequences are lower quality and doubled base calls
    are the most frequent sequencing error.
  -P - is a flag which indicates the contig_file is in "ace" format.
  -q quality_file - is a multiple fasta format file of quality values corresponding
    to the sequences in the input_file. The quality values must be in the same order
    as the sequences - however sequences without quality values may be skipped.
  -r resort# - specifies how many sequences should be merged before resorting the
    possible merges based on clone constraints.
  -R - is a flag which causes no new merging to take place but just generates a
    reverse compliment of the contig_file.
  -s - is a flag which indicates that singletons (assemblies made up of a single
    DNA fragment) should be included in the lassie output_file - the default is
    not to include singletons.
  -t - is a flag which causes tandem 32mers (a tandem 32mer is a 32mer which occurs
    more than once in at least one sequence read) to be ignored (this is now the default
    behavior and this flag is for backward compatability).
  -T - is a flag which causes no new merging to take place but just a translation of
    the contig_file from one format to another.
  -u - is a flag which causes tandem 32mers to be used for pairwise comparison opposite
    of the -t flag which is now the default).
  -X - is a flag which causes merging to stop when only sequences which appear to be
    repeats are left and these cannot be merged based on clone length constraints.


This is the shell script run_TA which mainly saves some typing.
The optional parameters you specify to run_TA in the following way

run_TA 'optional parameters go inside single or double quotes' proj.seq >& proj.run_log

are now placed after the parameters specified in the run_TA script. This means
that your optional parameters will take precedence including those for stringency
criteria. You can do this instead of making your own copy of run_TA.

#!/bin/csh -f
#Copyright 1995 The Institute for Genomic Research.  All rights reserved.
unset noclobber
if (! -e "${1}") then
	set asmg_options="${1}"
	shift
else
	set asmg_options=""
endif
foreach file (${argv[*]})
	set file_tail=${file:t}
	set file_root=${file_tail:r}
	(TIGR_Assembler -n ${file_root} -g 8 -l 40 -e 15 -p 97.5 -a ${file_root}.align -f ${file_root}.fasta ${asmg_options} ${file_root}.scratch < ${file} >! ${file_root}.asm) >&! ${file_root}.error
end

***NEW***

File Formats

Sequence input file for TIGR_Assembler

Basically multiple FASTA format with some extra information on the description
line about clone length constraints and the clear range. TIGR_Assembler attempts
to use information about sequences from different ends of the same clone to
help place sequence reads in the correct place. In order to do this, the
minimum, maximum and median clone length for a given clone must be specified.
This is done on the description line for each sequence. In addition, now that
the entire sequence read is given in the sequence file TIGR_Assembler needs to
know the clear range for each sequence read. The clear range is the contiguous
region of the sequence read which is free of vector and of reasonable quality.
The clear range is specified by two numbers which indicate an inclusive range.
Coordinates range from 1 to the length of the sequence read. At TIGR these
coordinates are stored as a CLR feature in the feature table of the project
database for each sequence read. The clear range is usually determined by some
vector and quality finding programs - at TIGR we use the ethyl package.

Example:

>seq_name1 min_len max_len med_len clear_left clear_right
sequence
.
.
.
>seq_name2 min_len max_len med_len clear_left clear_right
.
.
.
>seq_nameN min_len max_len med_len clear_left clear_right
.
.
.

where seq_name is the sequence name of the read, min_len is the minimum length
of the clone, max_len is the maximum length of the clone, med_len is the median
length of the clone, clear_left is the left coordinate of the clear range, and
clear_right is the right coordinate of the clear range. Unfortunately for those
not at TIGR the naming convention is very important: the first 7 characters of
the seq_name is the clone name and then the following characters F, TF, S, T3,
T646, and TH indicate "forward" reads and R, TR, V, T7, T644, and TV indicate
"reverse" reads. Reads with the same clone name and opposite directions form
clone pairs which have the clone length constraints applied to them.

The minimum, maximum, and median clone lengths can initially be chosen based
on the size selection process for the clone library. An assembly should then
be done and clone lengths collected for those clone pairs whose members are
in the same assembly. A rule of thumb is too chose minimum and maximum so
that approximately 5% of clone lengths fall below the minimum and 5% fall above
the maximum.

Concrete Example:

>GBBAA01F 1700 2500 2100 33 578
ACGT...
ACGT...
.
.
.
>GBBAA01R 1700 2500 2100 22 435
ACGT...
ACGT...
.
.
.