*************************************************************************************** NOTE: the executable is named "run_LA" not "run_TA"; this is very old documentation. - EJW *************************************************************************************** Software Development Group : closure Faculty Point of Contact : TBA Software Developer : Granger Sutton Description : TIGR_Assembler is a program for shotgun fragment assembly. It takes as input a set of sequence fragments and outputs a set of consensus sequences. Keywords : shotgun fragment assembly mailto : grange@tigr.org ------------------------------------------------------------------------------ This file contains information about how to assemble shotgun sequencing projects using TIGR_Assembler. Included are instructions for running three programs: pull_contig_js.sp, run_TA, and TIGR_Assembler. pull_contig_js.sp is a perl program for pulling down files used as input for TIGR_Assembler. run_TA is a csh shell script for running TIGR_Assembler without having to explicitly provide all of the input parameters. TIGR_Assembler should rarely if ever need to invoked directly. Also included is some information on input and output file formats. Please look for items labelled ***NEW*** to keep up with important changes. Examples of how to use run_TA: For a brand new project, run_TA '-s -q proj.qual' proj.seq >& proj.run_log & For a project already editted the old way, run_TA '-s -q proj.qual -O proj.order' proj.seq >& proj.run_log & For a project already editted the new way, run_TA '-s -q proj.qual -C proj.contig' proj.seq >& proj.run_log & where you can leave out the -s if you don't want singletons included in the asmbl_link and assembly tables. proj is just the name of the project you're working on. proj.qual is the quality values file. proj.seq is the sequences file. proj.run_log is just an error log file. proj.order is the sequence merge order file constructed from the current set of assemblies. proj.contig is the contig file representing the current set of assemblies. The .seq, .qual, .order, and .contig files can be generated using a perl script pull_contig_js.sp which is also installed. Some examples: New project pull_contig_js.sp -p passwd_file -o proj -n seq.sql >& proj.pull_log & editted old way pull_contig_js.sp -p passwd_file -o proj -n seq.sql -t ord.sql >& proj.pull_log & editted new way pull_contig_js.sp -p passwd_file -o proj -n seq.sql -a asm.sql >& proj.pull_log & where passwd_file is a sybase passwd_file of the form user_login database sybase_password sybase_server proj is still the name of the project which the script will use as the prefix for the files it generates. seq.sql is a file with sql to select JUST the seq_names to be used for this project such as select seq_name from sequence where trash is null and ed_ln > 0 ord.sql is a file with sql to select the asmbl_ids to jumpstart the order such as select asmbl_id from assembly where seq# > 1 and asmbl_id >= 1202 and asmbl_id <= 1456 asm.sql is a file with sql to select the asmbl_ids to jumpstart with contigs such as select asmbl_id from assembly where seq# > 1 and asmbl_id >= 1202 and asmbl_id <= 1456 I haven't really finalized how I'm doing clone length constraints so for the time being the .seq file needs to be modified. Use: sed -e '/^>/s/ / minimum maximum median /' < proj.seq > proj.tmp mv proj.tmp proj.seq where minimum maximum median need to be replaced in the above by three numbers representing your best guess of the minimum, maximum and median clone lengths for the plasmid library used for this project. (If there is more than one library with different values the sed command can be modified.) Example: sed -e '/^>/s/ / 500 3000 2000 /' < proj.seq > proj.tmp mv proj.tmp proj.seq The -c 'min_clone_len max_clone_len med_clone_len' option has been added to specify the clone length constraints to be put in the .seq file without having to run sed. WARNING: this still does not solve the problem of multiple libraries with different clone length constraints. Here is the help page if you run pull_contig_js.sp -h usage: pull_contig_js.sp options -U sybase user [default = unix id] -P password -D database -S server [default = DSQUERY] OR -p password_file -o output prefix [default = tmp] -c 'min_clone_len max_clone_len med_clone_len' -x delete seq_name SQL file -X delete seq_name list file -n seq_name SQL file -N seq_name list file -a asmbl_id SQL file (for contig jumpstart) -A asmbl_id list file (for contig jumpstart) -t asmbl_id SQL file (for order jumpstart) -T asmbl_id list file (for order jumpstart) -h (help) Some options that were not mentioned above are the capital letter options versus lower case letter options. For lower case (n,a,t,x) the file following the option contains an sql query returning seq_names (n,x) or asmbl_ids (a,t). For upper case (N,A,T,X) the file following the option contains a list of seq_names (N,X) or asmbl_ids (A,T). The lower case and upper case options can be combined in any combination and the program will take the union of the two options. The x/X options stand for "eXclude" and will cause the specified seq_names to be excluded from the output of the program in all four output files (.seq,.qual,.order,.contig). The x/X seq_names override all of the other options in terms of these seq_names definately not being in the output. The relationship of options to output files is as follows: n/N seq_names for the .seq and .qual files a/A asmbl_ids for the .contig file and all seq_names in these assemblies in the .seq and .qual files t/T asmbl_ids for the .order file and all seq_names in these assemblies in the .seq and .qual files x/X seq_names to exclude from all of the output files You can have pull_contig_js.sp generate multiple contigs for a single assembly with regions removed by using the following format in the file specified by the -A option: asmbl_id lend1-rend1:lend2-rend2:... 3636 13425-14567:19876-20678:33333-39876 sequences which are COMPLETELY inside these ranges as determined by asm_lend and asm_rend in the asmbl_link table will be excluded from the output files and 3 or 4 contigs will be generated depending on whether the length of assembly 3636 is greater than 39876. Here is the help page if you run TIGR_Assembler -h The most often modified parameters for changing the merging frequency are p,e,g,l. For jumpstarting you can use C and/or O. A new one which can be helpfull is R for reverse complimenting an assembly. Usage: TIGR_Assembler [options] scratch_file < input_file > output_file scratch_file - is the name of temporary data file TIGR_Assembler initially creates and later deletes at the end of execution. input_file - is the name of the multiple fasta formatted input file of DNA sequences. The format of the fasta description line should be: >seq_name minimum_clone_length maximum_clone_length median_clone_length clear_end5 clear_end3 or >db|seq_name minimum_clone_length maximum_clone_length median_clone_length clear_end5 clear_end3 e.g. >GHIBF57F 500 3000 1750 33 587 output_file - is the name of the lassie formatted file which TIGR_Assembler generates to describe the set of assemblies TIGR_Assembler constructs. This is basically a flat file representation of the assembly and asmbl_link tables. options: -a alignment_directory - is the name of a directory that TIGR_Assembler creates to store alignment files in GDE flat format. -A ace_output_file - tells TIGR_Assembler to generate "ace" output in the file name given. -c coverage_filename - output coverage file using the coverage_filename -C contig_file - is for "jumpstarting" the alignment process with a previously aligned set of contigs in the same format as the GDE alignment files. -d - is a flag saying to use the fasta description line as the description line in the "ace" file output. -D phd_dir - specifies the directory for .phd files for the description line in the "ace" file output. -e maximum_end - is the maximum length at the end of a DNA fragment that does not match another overlapping DNA fragment (sometimes referred to as overhang) that will not disqualify a DNA fragment from becoming part of an assembly. -f fasta_file - is the multiple fasta formatted file TIGR_Assembler creates of the consensus sequences of all assemblies including singletons. -g max_err_32 - is the maximum number + 1 of alignment errors (mismatches or gaps) allowed within any contiguous 32 base pairs in the overlap region between two DNA fragments in the same assembly. This is meant to split apart splice variants which have short splice differences and would not be disqualified by the -p minimum_percent parameter. -l minimum_length - is the minimum length two DNA fragments must overlap to be considered as a possible assembly. -L - is a flag which causes even very LOW pairwise scores to be considered - caution using this flag may cause longer run time and a worse assembly. -n asm_prefix - is the prefix for assembly names in the fasta file and file names in the alignment directory - default is "asm". -N - is a flag which indicates the DNA fragments in the input_file should not be treated as random genomic fragments for the purpose of determining repeat regions. -O order_file - specifies which sequences should be merged first in what order. The format is three fields per line: # seq_name1 seq_name2 - but seq_name2 is optional. Larger numbers in the first field are merged first. If seq_name2 is not given then all sequences matching seq_name1 will be merged first. -p minimum_percent - is the minimum percent identity that two DNA fragments must achieve over their entire region of overlap in order to be considered as a possible assembly. Adjustments are made by the program to take into account that the ends of sequences are lower quality and doubled base calls are the most frequent sequencing error. -P - is a flag which indicates the contig_file is in "ace" format. -q quality_file - is a multiple fasta format file of quality values corresponding to the sequences in the input_file. The quality values must be in the same order as the sequences - however sequences without quality values may be skipped. -r resort# - specifies how many sequences should be merged before resorting the possible merges based on clone constraints. -R - is a flag which causes no new merging to take place but just generates a reverse compliment of the contig_file. -s - is a flag which indicates that singletons (assemblies made up of a single DNA fragment) should be included in the lassie output_file - the default is not to include singletons. -t - is a flag which causes tandem 32mers (a tandem 32mer is a 32mer which occurs more than once in at least one sequence read) to be ignored (this is now the default behavior and this flag is for backward compatability). -T - is a flag which causes no new merging to take place but just a translation of the contig_file from one format to another. -u - is a flag which causes tandem 32mers to be used for pairwise comparison opposite of the -t flag which is now the default). -X - is a flag which causes merging to stop when only sequences which appear to be repeats are left and these cannot be merged based on clone length constraints. This is the shell script run_TA which mainly saves some typing. The optional parameters you specify to run_TA in the following way run_TA 'optional parameters go inside single or double quotes' proj.seq >& proj.run_log are now placed after the parameters specified in the run_TA script. This means that your optional parameters will take precedence including those for stringency criteria. You can do this instead of making your own copy of run_TA. #!/bin/csh -f #Copyright 1995 The Institute for Genomic Research. All rights reserved. unset noclobber if (! -e "${1}") then set asmg_options="${1}" shift else set asmg_options="" endif foreach file (${argv[*]}) set file_tail=${file:t} set file_root=${file_tail:r} (TIGR_Assembler -n ${file_root} -g 8 -l 40 -e 15 -p 97.5 -a ${file_root}.align -f ${file_root}.fasta ${asmg_options} ${file_root}.scratch < ${file} >! ${file_root}.asm) >&! ${file_root}.error end ***NEW*** File Formats Sequence input file for TIGR_Assembler Basically multiple FASTA format with some extra information on the description line about clone length constraints and the clear range. TIGR_Assembler attempts to use information about sequences from different ends of the same clone to help place sequence reads in the correct place. In order to do this, the minimum, maximum and median clone length for a given clone must be specified. This is done on the description line for each sequence. In addition, now that the entire sequence read is given in the sequence file TIGR_Assembler needs to know the clear range for each sequence read. The clear range is the contiguous region of the sequence read which is free of vector and of reasonable quality. The clear range is specified by two numbers which indicate an inclusive range. Coordinates range from 1 to the length of the sequence read. At TIGR these coordinates are stored as a CLR feature in the feature table of the project database for each sequence read. The clear range is usually determined by some vector and quality finding programs - at TIGR we use the ethyl package. Example: >seq_name1 min_len max_len med_len clear_left clear_right sequence . . . >seq_name2 min_len max_len med_len clear_left clear_right . . . >seq_nameN min_len max_len med_len clear_left clear_right . . . where seq_name is the sequence name of the read, min_len is the minimum length of the clone, max_len is the maximum length of the clone, med_len is the median length of the clone, clear_left is the left coordinate of the clear range, and clear_right is the right coordinate of the clear range. Unfortunately for those not at TIGR the naming convention is very important: the first 7 characters of the seq_name is the clone name and then the following characters F, TF, S, T3, T646, and TH indicate "forward" reads and R, TR, V, T7, T644, and TV indicate "reverse" reads. Reads with the same clone name and opposite directions form clone pairs which have the clone length constraints applied to them. The minimum, maximum, and median clone lengths can initially be chosen based on the size selection process for the clone library. An assembly should then be done and clone lengths collected for those clone pairs whose members are in the same assembly. A rule of thumb is too chose minimum and maximum so that approximately 5% of clone lengths fall below the minimum and 5% fall above the maximum. Concrete Example: >GBBAA01F 1700 2500 2100 33 578 ACGT... ACGT... . . . >GBBAA01R 1700 2500 2100 22 435 ACGT... ACGT... . . .