Changes

Jeffrey Zuber · 512f9073
--- a/RNAstructure.md
+++ b/RNAstructure.md
@@ -19,6 +19,7 @@ make all
 ```

 ## Running Programs
+Each program in the RNAstructure suite prints its help text when run with the `-h` or `--help` flag.  For example, the command `Fold -h` shows the documentation for the program _Fold_.  Help is also available [online](http://rna.urmc.rochester.edu/Text/index.html).

 After compilation, the program executables will be in the _exe_ directory within the RNAstructure folder.  You should ensure that the executables are available in the PATH environment variable.  To add a directory to the PATH environment variable, edit the _.bashrc_ file in your home directory with the following line:

@@ -37,3 +38,259 @@ Note that this command needs to be run for every shell session.  To eliminate th
 ```
 export DATAPATH=path_to_data_tables
 ```
+
+## File Formats
+### Sequence File Formats:  FASTA and SEQ
+Nucleotide sequences can be provided to RNAstructure in either FASTA or SEQ format.
+
+#### FASTA Files
+In FASTA files, each nucleotide sequence begins with a single-line description that must start with the greater-than symbol (>). Subsequent lines should contain only the sequence itself.  The sequence may be formatted with whitespace, which is ignored; however, blank lines are not allowed in the middle of FASTA input. FASTA files should have a ".fasta" extension.
+
+SAMPLE FASTA File
+```
+>Title of Sequence
+AAAGCGGUUUGUUCCACUCGCGUUCAGUUAGCGUUUGCAGUUCUGGGCUCGUCCAUGGAAGCG
+```
+Important notes regarding sequences in RNAstructure:
+*  Sequences are case-sensitive and should generally be in CAPITAL letters.  Lowercase letters indicate a base that cannot form basepairs (i.e. is constrained to be single-stranded/unpaired) in the predicted structure.
+*  Nucleotide sequences can contain U or T interchangeably.  These will be interpreted based on the context of the desired operation (i.e. as U in RNA calculations or as T in DNA calculations).
+*  Whitespace characters (spaces, tabs, and line-breaks) are allowed in sequences (for formatting), and will simply be ignored.
+*  The special letters X and N can be used (interchangeably) to represent an unknown base or a base that cannot interact with other bases (i.e. it can neither pair nor stack).
+*  Placing at least three X or N bases next to each other allows them to act like an unstructured loop, which can represent a section of unknown identity or a section that has been purposely left out of the prediction.
+
+FASTA File showing special RNAstructure features.
+```
+; This FASTA file has (optional) spaces for formatting.
+; It also has unpaired and non-interacting bases.
+>Title of Sequence
+AAAGCGGU UUGUUCCA CUCaGCGU XXXXUCAG
+UUAGCuuGU UUGCAGUU CUGGGCUC
+```
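The sequence conventions above can be sketched in a few lines of Python. This is an illustrative reader, not RNAstructure's own parser: the function names `parse_fasta` and `unpaired_positions` are hypothetical, and treating lines that start with `;` as comments is an assumption based only on the sample above. It also handles files with more than one `>` title line.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (title, sequence) tuples.

    Assumptions (not taken from the RNAstructure source): lines starting
    with ';' are comments, and whitespace inside sequence lines is dropped.
    """
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip comments; this sketch is lenient about blank lines
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        else:
            if title is None:
                raise ValueError("sequence data before the first '>' title line")
            chunks.append("".join(line.split()))  # remove formatting spaces
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


def unpaired_positions(sequence):
    """Return 1-based positions of lowercase (forced unpaired) bases."""
    return [i for i, base in enumerate(sequence, start=1) if base.islower()]


sample = """; This FASTA file has (optional) spaces for formatting.
>Title of Sequence
AAAGCGGU UUGUUCCA CUCaGCGU XXXXUCAG
UUAGCuuGU UUGCAGUU CUGGGCUC
"""
title, seq = parse_fasta(sample)[0]
print(title, len(seq), unpaired_positions(seq))
```

The case of each base is preserved on purpose, because lowercase letters carry the single-stranded constraint.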
+
+#### SEQ Files
+SEQ files have the following format:
+
+*  Comment lines must be at the beginning of the file and must each start with a semicolon.  At least one comment line is required. Additional comment lines are allowed as long as each starts with a semicolon.
+*  The title of the sequence must be provided on a single line immediately following the comment line(s).
+*  The sequence must start on the line after the title. It should be entered from 5' to 3' and can include spaces and line breaks for formatting.
+*  Finally, the sequence must end in "1" (the character representing the number one).
+
+The important [notes](https://rna.urmc.rochester.edu/Text/File_Formats.html#seqnotes) about sequences in RNAstructure also pertain to SEQ files.
+
+SAMPLE SEQ File
+```
+; Comments must start with a semicolon.
+; There can be any number of comments, but at least one is required.
+Sequence Title  — A single-line title must immediately follow the comment(s).
+AAAGCGGC UUAUUGUU UUCAuuuG GUUCCACU
+CGCGUUCA AUCUXXXX UCAGGUUA GUUAGCGA 1
+```
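As a rough illustration of the SEQ rules listed above (leading comment lines, a one-line title, then a sequence terminated by "1"), the following sketch checks each rule in turn. The function name `parse_seq` is hypothetical, and the error handling is an assumption rather than RNAstructure's actual behavior.

```python
def parse_seq(text):
    """Parse SEQ-format text into (comments, title, sequence)."""
    lines = text.splitlines()
    comments = []
    i = 0
    # Rule 1: one or more comment lines, each starting with a semicolon.
    while i < len(lines) and lines[i].startswith(";"):
        comments.append(lines[i][1:].strip())
        i += 1
    if not comments:
        raise ValueError("a SEQ file needs at least one ';' comment line")
    # Rule 2: the title is the single line right after the comments.
    if i >= len(lines) or not lines[i].strip():
        raise ValueError("missing title line after the comments")
    title = lines[i].strip()
    # Rule 3: the rest is the sequence; spaces and line breaks are ignored.
    body = "".join("".join(lines[i + 1:]).split())
    # Rule 4: the sequence must end with the character "1".
    if not body.endswith("1"):
        raise ValueError("sequence must end with the terminator '1'")
    return comments, title, body[:-1]


sample = """; Comments must start with a semicolon.
Sample title
AAAGCGGC UUAUUGUU UUCAuuuG GUUCCACU
CGCGUUCA AUCUXXXX UCAGGUUA GUUAGCGA 1
"""
comments, title, seq = parse_seq(sample)
print(title, len(seq))
```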
+
+#### Multi-Sequence FASTA File
+Some RNAstructure programs (e.g. TurboFold) can accept a FASTA file that contains multiple sequences as input.
+These are similar to single-sequence FASTA files, except that additional sequences can be listed, each preceded by a title line starting with ">" (the greater-than symbol).
+The important [notes](https://rna.urmc.rochester.edu/Text/File_Formats.html#seqnotes) about sequences in RNAstructure also pertain to multi-sequence FASTA files.
+
+SAMPLE Multi-Sequence FASTA File
+```fasta
+>Title of Sequence 1
+AAAGCGGUUUGUUCCACUCGCGUUCAGUUAGCGUUUGCAGUUCUGGGCUC
+>Title of Sequence 2
+UUGAUAUUGGAUGGAAAUGGGUGGGAAGAUGGAAAUUGAGAAUGGAGGGU
+>Title of Sequence 3
+AGUUAUGAAUUAGAUUUGUAGGAAAGUGUUAAGUGUAAAUAGGAUGUGUG
+(...)
+```
+
+#### CT File Format
+A CT (Connectivity Table) file contains secondary structure information for a sequence. These files are saved with a CT extension. When entering a structure to calculate the free energy, the following format must be followed.
+
+1.  Start of first line: number of bases in the sequence
+2.  End of first line: title of the structure
+3.  Each of the following lines provides information about a given base in the sequence. Each base has its own line, with these elements in order:
+    *  Base number: index n
+    *  Base (A, C, G, T, U, X)
+    *  Index n-1
+    *  Index n+1
+    *  Number of the base to which n is paired. No pairing is indicated by 0 (zero).
+    *  Natural numbering. RNAstructure ignores the actual value given in natural numbering, so it is easiest to repeat n here.
+
+The CT file may hold multiple structures for a single sequence. This is done by repeating the format for each structure without any blank lines between structures.
+
+The CT file format is such that files generated by RNAstructure are compatible with mfold/UNAFold (available from Michael Zuker), and many other software packages.
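To make the column layout concrete, here is a small sketch that writes and reads CT text following the six columns described above. The helper names `format_ct` and `parse_ct` are hypothetical, the hairpin is a made-up example, and writing 0 in the "n+1" column of the last base is an assumption about the usual convention rather than something stated in this document.

```python
def format_ct(title, sequence, pairs):
    """Build CT text from a sequence and a list of 1-based (i, j) pairs."""
    n = len(sequence)
    partner = [0] * (n + 1)  # index 0 unused; 0 means "unpaired"
    for i, j in pairs:
        partner[i], partner[j] = j, i
    lines = [f"{n} {title}"]  # first line: base count, then the title
    for idx, base in enumerate(sequence, start=1):
        # Assumption: the last base writes 0 in the "n+1" column.
        next_idx = idx + 1 if idx < n else 0
        # Columns: n, base, n-1, n+1, pairing partner, natural numbering.
        lines.append(f"{idx} {base} {idx - 1} {next_idx} {partner[idx]} {idx}")
    return "\n".join(lines) + "\n"


def parse_ct(text):
    """Parse one or more structures; returns (title, sequence, partners)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    structures = []
    pos = 0
    while pos < len(lines):
        header = lines[pos].split(None, 1)
        n = int(header[0])
        title = header[1].strip() if len(header) > 1 else ""
        rows = lines[pos + 1: pos + 1 + n]
        if len(rows) != n:
            raise ValueError("CT structure is shorter than its declared length")
        bases = [row.split()[1] for row in rows]
        partners = [int(row.split()[4]) for row in rows]
        structures.append((title, "".join(bases), partners))
        pos += n + 1  # the next structure follows with no blank line
    return structures


ct_text = format_ct("hairpin", "GGGAAACCC", [(1, 9), (2, 8), (3, 7)])
for title, seq, partners in parse_ct(ct_text + ct_text):
    print(title, seq, partners)
```

Concatenating `ct_text` with itself mimics a file that holds two structures for the same sequence.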
+
+#### Dot Bracket File Format
+*  Dot bracket files are plain text. They encode a sequence and secondary structure.
+*  Common file extensions are .dot, .bracket, and .dbn
+*  The first line is a title and starts with a ">" character.
+*  The second line contains the sequence.
+*  The third line contains structure information in dot-bracket notation:
+  *  The dot/period "." represents an unpaired nucleotide.
+  *  An open-parenthesis "(" represents the 5'-nucleotide in a pair, and the matching closing parenthesis ")" represents the 3'-nucleotide in the pair.
+  *  Other "bracket"-type symbols can be used to represent basepairs, thereby allowing pseudo-knots to be encoded.
+    Bracket Characters: ()   <>   {}
+
+SAMPLE Dot-Bracket Files
+```
+>A stem-loop structure (with a bulge)
+GGGCAAUCCUCUUCGGGCCC
+((((...((.....))))))
+ 
+>A pseudo-knot structure
+GAUGGCACUCCCAUCAAUUGGAGC
+(((((..<<<))))).....>>>.
+```
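The bracket rules above amount to keeping one stack per bracket type, which is what allows crossing (pseudo-knotted) pairs to be decoded. A minimal sketch, with the hypothetical function name `dot_bracket_to_pairs`:

```python
# Map each closing symbol to its opening partner.
BRACKETS = {")": "(", ">": "<", "}": "{"}


def dot_bracket_to_pairs(structure):
    """Return sorted 1-based (5' index, 3' index) pairs for a structure line."""
    stacks = {opener: [] for opener in BRACKETS.values()}
    pairs = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == ".":
            continue  # unpaired nucleotide
        if symbol in stacks:
            stacks[symbol].append(pos)  # 5' side of a pair
        elif symbol in BRACKETS:
            stack = stacks[BRACKETS[symbol]]
            if not stack:
                raise ValueError(f"unmatched {symbol!r} at position {pos}")
            pairs.append((stack.pop(), pos))  # 3' side closes the latest opener
        else:
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    leftover = [p for stack in stacks.values() for p in stack]
    if leftover:
        raise ValueError(f"unclosed brackets at positions {sorted(leftover)}")
    return sorted(pairs)


print(dot_bracket_to_pairs("((((...((.....))))))"))
print(dot_bracket_to_pairs("(((((..<<<))))).....>>>."))
```

Because each bracket type has its own stack, the `<>` pairs in the second sample can cross the `()` helix without confusing the matching.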
+
+#### Constraint File Format
+Folding constraints are saved in plain text with a CON extension and can be edited by hand. When there are multiple entries of one constraint type, each entry is listed on a separate line. RNAstructure expects every specifier to be present, each terminated by "-1" (or "-1 -1" for two-argument specifiers), even when it has no entries. For all specifiers that take two arguments, it is assumed that the first argument is the 5' nucleotide. Nucleotide positions are specified from the 5' end, where the first nucleotide in the sequence is in position 1. The file format is as follows:
+
+```
+DS:
+XA
+-1
+SS:
+XB
+-1
+Mod:
+XC
+-1
+Pairs:
+XD1 XD2
+-1 -1
+FMN:
+XE
+-1
+Forbids:
+XF1 XF2
+-1 -1
+```
+*  XA: Nucleotides that will be double-stranded
+*  XB: Nucleotides that will be single-stranded (unpaired)
+*  XC: Nucleotides accessible to chemical modification
+*  XD1, XD2: Forced base pairs
+*  XE: Nucleotides accessible to FMN cleavage (a U that must be in a GU pair)
+*  XF1, XF2: Prohibited base pairs
+
+SAMPLE
+```
+DS:
+15
+25
+76
+-1
+SS:
+17
+18
+20
+35
+-1
+Mod:
+2
+15
+-1
+Pairs:
+16 26
+-1 -1
+FMN:
+-1
+Forbids:
+15 27
+-1 -1
+```
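The layout above (a specifier line ending in a colon, then entries, then a "-1" or "-1 -1" terminator) can be read with a short loop. This is only a sketch; the function name `parse_constraints` and the returned dictionary shape are assumptions made for illustration.

```python
def parse_constraints(text):
    """Parse CON-format text into {specifier: [entries]}.

    One-argument entries become ints; two-argument entries become tuples.
    """
    constraints = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            section = line[:-1]  # e.g. "DS", "Pairs", "Forbids"
            constraints[section] = []
            continue
        if section is None:
            raise ValueError(f"entry {line!r} appears outside any specifier")
        values = [int(token) for token in line.split()]
        if all(value == -1 for value in values):
            section = None  # "-1" or "-1 -1" closes the current specifier
            continue
        entry = values[0] if len(values) == 1 else tuple(values)
        constraints[section].append(entry)
    return constraints


sample = """DS:
15
25
76
-1
SS:
17
18
20
35
-1
Mod:
2
15
-1
Pairs:
16 26
-1 -1
FMN:
-1
Forbids:
15 27
-1 -1
"""
print(parse_constraints(sample))
```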
+
+#### SHAPE Data File Format
+The file format for SHAPE reactivity comprises two columns. The first column is the nucleotide number, and the second is the reactivity.
+
+Nucleotides for which there is no SHAPE data can either be left out of the file, or the reactivity can be entered as less than -500. Columns are separated by any white space.
+
+Note that the file has no header information. In the sample below, nucleotides 1 through 10 have no reactivity information (1 through 8 are left out of the file, and 9 and 10 are entered as -999). Nucleotide 11 has a normalized SHAPE reactivity of 0.042816. Nucleotide 12 has a normalized SHAPE reactivity of 0, which is NOT the same as having no reactivity data when using the pseudo-energy constraints.
+
+By default, RNAstructure looks for SHAPE data files to have the file extension SHAPE, but any plain text file can be read.
+
+SAMPLE
+```
+9 -999
+10 -999
+11 0.042816
+12 0
+13 0.15027
+14 0.16201
+```
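The distinction between "no data" and a reactivity of 0 is easy to lose when loading these files, so the sketch below keeps missing values as `None`. The function name `read_shape` and the need to pass the sequence length are assumptions for this example; the -500 cutoff comes from the description above.

```python
def read_shape(text, length):
    """Return a list of reactivities for nucleotides 1..length.

    Entry i-1 holds the value for nucleotide i, or None when there is no data
    (nucleotide omitted from the file, or reactivity below -500).
    """
    reactivities = [None] * length
    for line in text.splitlines():
        fields = line.split()  # columns may be separated by any whitespace
        if not fields:
            continue
        index, value = int(fields[0]), float(fields[1])
        if value < -500:
            continue  # explicit "no data" marker
        reactivities[index - 1] = value
    return reactivities


sample = """9 -999
10 -999
11 0.042816
12 0
13 0.15027
14 0.16201
"""
shape = read_shape(sample, length=14)
print(shape)
```

Here nucleotide 12 is stored as `0.0`, while nucleotides 1 through 10 stay `None`.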
+
+#### List File Format
+List files have a LIS extension. This file contains any number of sequences of any length or nucleic acid, each on its own line.
+
+SAMPLE
+```
+CUGAGCCAAG
+GGGCUCAACG
+GGCGUGAGAAAC
+```
+
+#### Offset File Format
+Offset files are plain text. The files contain two columns: the nucleotide number followed by the offset value in kcal.
+
+SAMPLE
+```
+1 -0.336512
+2 -1.4448
+3 -1.74126
+4 -3.82745
+```
+
+#### Experimental Pair Bonus File Format
+Bonus files are plain text. They are formatted as an nxn matrix of bonus values, where n is the length of the sequence.
+
+SAMPLE
+```
+0.0 1.0 0.0 1.0 0.0
+0.0 1.0 0.0 1.0 0.0
+0.0 1.0 0.0 1.0 0.0
+0.0 1.0 0.0 1.0 0.0
+0.0 1.0 0.0 1.0 0.0
+```
+
+#### Alignment File Format
+Alignment files are plain text. Each line gives a nucleotide in the first sequence followed by the nucleotide in the second sequence that it is aligned to, separated by a space. Only one alignment pair can be on each line, and the last line of the file must be "-1 -1".
+
+SAMPLE
+```
+10 12
+11 13
+-1 -1
+```
+
+#### NMR File Format
+An NMR file provides experimental NMR constraints to NAPSS.
+
+SAMPLE
+```
+666
+66
+665(+RAY)6
+67
+65(+YAR)7
+665(-RAY)65(-RGY)5(+YGR)5(+YAY)6
+57
+```
+
+#### FASTA Alignment Format
+This is the expected format for sequence alignments used by Multifind.
+
+SAMPLE
+```
+>NZ_GG697986.1 44776 45325
+TCGT-----------TCTTTCCCTTGAATCTCTATGATTAGAACACTATCGTCCAACTGG-------------------AAATGATAATTTAATAATGTACACTTTTTATTTTGTAAGAA
+>NC_002953.3 2279812 2280398
+TTTTCGTCCCGTAG-TTCTTCCATTGAGCCTCTATGATTAGAACACAATCGTCCGGTTATCATACGGCCTCCGCAAGCTAAATGATAATTTAATAATGGACACTTTTGATTGTTTAAGCA
+>NC_017337.1 2316381 2316961
+TCGT-----------TCTTTCCCTTGAATATCTATGATTAGAACACTATCGTGCGTTTATCGTCCAGCCTCCGCAAGCTAAATGACAATTTAATAATGTACACTTTTGATTGTGTAAGCA
+>NC_003923.1 2300707 2301293
+TTTTCGTCCCGTAG-TTCTTCCATTGAGCCTCTATGATTAGAACACAATCGTCCGGTTATCATACGGCCTCCGCAAGCTAAATGATAATTTAATAATGGACACTTTTGATTGTTTAAGCA
+>NC_017338.1 2290333 2290748
+TCGT-----------TCTTTCCCTTGAATATCTATGATTAGAACAC-----------------------------------------------TAATGTACACTTTTGATTGTGTAAACA
+>NZ_JH806555.1 46083 46649
+TCGT-----------TCTTTCCCTTGAACCACTATGATTAGAACACAATCGTCTGGTTATCGTCCACCCTCCGCAAGCTAAATGACAATTTAATAATGTACACTTTTGATTGTGTAAACA
+>NC_002951.2 2282533 2283102
+TTATCATCCGATAGCTCTTTCCCTTGAATATCTATTATTAGAACACTATCGTACGGT-----------CTCCGCAAGCTAAATGACAATTTAATAATGTACACTTTTGATTGTGTAAGCA
+>NC_017340.1 2295263 2295829
+TCGT-----------TCTTTCCCTTGAACCACTATGATTAGAACACAATCGTCTGGTTATCGTCCACCCTCCGCAAGCTAAATGACAATTTAATAATGTACACTTTTGATTGTGTAAACA
+```
\ No newline at end of file