|
|
# SQL Workshop
|
|
|
|
|
|
## Introduction
|
|
|
|
|
|
The purpose of this workshop is to introduce SQL commands and to provide examples of how to leverage the power of databases to make storing and analyzing data easier.
|
|
|
|
|
|
|
|
|
|
|
|
## Required Software
|
|
|
|
|
|
This workshop uses SQLite3 as an example implementation of SQL. This workshop will use SQLiteStudio as an easy-to-use graphical interface for SQLite3. SQLiteStudio can be found [here](https://sqlitestudio.pl/index.rvt). It is available as an installed package or as a portable application that does not require installation. Alternatively, command line only utilities for SQLite can be found [here](https://www.sqlite.org/index.html).
|
|
|
|
|
|
This workshop will also walk through how to connect to a SQLite database from python. The sqlite3 module is installed by default when python is installed. Python can be downloaded [here](https://www.python.org/downloads/).
|
|
|
|
|
|
|
|
|
|
|
|
## The Example Database
|
|
|
|
|
|
A real world database has been provided as an example in this workshop. This database records the accuracy scores for different sequences when their secondary structures are predicted using different thermodynamic parameter sets. These parameter sets are generated by perturbing the optical melting experimental data that were used to derive the nearest neighbor parameters.
|
|
|
|
|
|
There are three tables in the database. The first table describes the sequences used, with the RNA family name and sequence length. The second table describes the parameter sets, including set name, and the perturbations (in terms of standard deviations) that were applied to each optical melting experiment to generate the parameter set. The last table describes the accuracy of secondary structure prediction for each sequence and parameter set.
|
|
|
|
|
|
|
|
|
|
|
|
## Using SQLiteStudio
|
|
|
|
|
|
### Connecting to an SQLite database
|
|
|
|
|
|
SQLite is file-based, where all the information in the database is stored in a single file. To connect to a SQLite database, run SQLiteStudio. Once in SQLiteStudio, select 'Add a database' from the 'Database' menu (Ctrl+O). Select the sqlite file provided for this workshop using the file manager (the database name should populate automatically). Click 'Ok'. The database should now be visible on the pane on the left side of the application window.
|
|
|
|
|
|
|
|
|
|
|
|
### Exploring the structure of the database
|
|
|
|
|
|
If the database browser is not expanded in the left pane, double click the database name to do so. Double click 'Tables' under the database name to show all the tables in the database. There should be three tables in the database ('Accuracy', 'Sequences', and 'Sets'). Double-clicking a table name will open a window for the table in the right window pane. In that pane, you can explore the structure of the table, the raw data in the table, and other attributes of the table. Note that the data tab does not necessarily show all of the data in the table.
|
|
|
|
|
|
|
|
|
|
|
|
### Opening the SQL editor
|
|
|
|
|
|
To perform any SQL commands, you can use the SQL editor, which can be accessed from the 'Tools' menu or using the shortcut 'Alt+E'. The editor has two panes. The top pane is where you can type in an SQL command. The bottom pane displays the results.
|
|
|
|
|
|
|
|
|
|
|
|
### Running a simple SQL command
|
|
|
|
|
|
Try running the following code. Simply copy and paste the code into the SQL editor window. To run the code, press the green triangle 'play' button in the editor toolbar.
|
|
|
|
|
|
```sqlite
|
|
|
SELECT Sequence, Length
|
|
|
FROM Sequences;
|
|
|
```
|
|
|
|
|
|
The SELECT command is used to retrieve data from a database. The words following the SELECT command are typically column names. The FROM clause specifies which table should be queried.
|
|
|
|
|
|
Note that the SQL command can be split among multiple lines and ends with a semicolon. Also note that the SELECT command can be followed by a '*' character instead of a list of column names to retrieve all columns.
|
|
|
|
|
|
|
|
|
|
|
|
### Sorting Results
|
|
|
|
|
|
The results of a SELECT command can be sorted using the 'ORDER BY' clause followed by a column name. For example, the following command is the same as the previous command, except it sorts by sequence length:
|
|
|
|
|
|
```sqlite
|
|
|
SELECT Sequence, Length
|
|
|
FROM Sequences
|
|
|
ORDER BY Length;
|
|
|
```
|
|
|
|
|
|
If you run the command in the SQL editor, you will notice that the results are sorted in ascending order, which is the default behavior. If you want to reverse the order, add a DESC clause after the column name.
|
|
|
|
|
|
You can sort on multiple columns by adding the column names to the 'ORDER BY' clause like:
|
|
|
|
|
|
```sqlite
|
|
|
ORDER BY Length DESC, Sequence DESC
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### Filtering Unique Records
|
|
|
|
|
|
You can limit the output to only unique values using the DISTINCT clause. For example, to get the number of unique parameter sets in the Accuracy table:
|
|
|
|
|
|
```sqlite
|
|
|
SELECT DISTINCT Parameter_Set
|
|
|
FROM Accuracy;
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### Filtering data
|
|
|
|
|
|
There are a couple of ways to reduce the amount of records returned by a SELECT command.
|
|
|
|
|
|
#### WHERE Clause
|
|
|
|
|
|
The results of a SELECT command can be filtered using the WHERE clause, which is added after the FROM clause. The WHERE clause limits the results to those where the expression following the WHERE clause is true. For example, the following code only returns results for sequences shorter than 100 nucleotides:
|
|
|
|
|
|
```sqlite
|
|
|
SELECT Sequence, Length
|
|
|
FROM Sequences
|
|
|
WHERE Length < 100;
|
|
|
```
|
|
|
|
|
|
The WHERE clause can use several different operators:
|
|
|
|
|
|
| Operator | Description |
|
|
|
| --------------- | --------------------------- |
|
|
|
| = | equal to |
|
|
|
| <> | not equal to |
|
|
|
| != | not equal to |
|
|
|
| < | less than |
|
|
|
| <= | less than or equal to |
|
|
|
| !< | not less than |
|
|
|
| > | greater than |
|
|
|
| >= | greater than or equal to |
|
|
|
| !> | not greater than |
|
|
|
| BETWEEN X AND Y | between the numbers X and Y |
|
|
|
| IS NULL | is a null value |
|
|
|
| IN (X, Y, ...) | in a subset of values |
|
|
|
|
|
|
|
|
|
When using strings in the WHERE clause, enclose the strings in either single or double quotation marks.
|
|
|
|
|
|
Write a SQL command to display sequences whose length is greater than 800? How many are there?
|
|
|
|
|
|
##### Combining multiple clauses
|
|
|
|
|
|
The WHERE clause can also include Boolean logic with AND and OR operations. Parenthesis can also be used to combine different logic operations.
|
|
|
|
|
|
Write a SQL command to display tmRNA sequences less than 200 nucleotides long.
|
|
|
|
|
|
#### LIMIT clause
|
|
|
|
|
|
You can reduce the number of rows returned with a LIMIT clause. The LIMIT clause is typically added at the end of the SQL command. It is particularly useful when combined with sorted results. For example, the following code returns the 10 longest sequences.
|
|
|
|
|
|
```sqlite
|
|
|
SELECT Sequence, Length
|
|
|
FROM Sequences
|
|
|
ORDER BY Length DESC
|
|
|
LIMIT 10;
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### Functions
|
|
|
|
|
|
Functions in SQL databases tend to be very implementation specific. Most SQL implementations support both built-in and custom functions. Custom functions are out of the scope of this workshop. The information about functions supported by SQLite3 can be found [here](https://www.sqlite.org/lang_aggfunc.html) and [here](https://www.sqlite.org/lang_corefunc.html). Here are a few useful functions supported by SQLite3
|
|
|
|
|
|
| Function | Description |
|
|
|
| -------- | --------------------------- |
|
|
|
| avg(X) | Returns the average value |
|
|
|
| count(X) | Returns the number of items |
|
|
|
| max(X) | Returns the maximum value |
|
|
|
| min(X) | Returns the minimum value |
|
|
|
| sum(X) | Returns the sum |
|
|
|
|
|
|
The following code will return the average sequence length for all the sequences in the Sequences table.
|
|
|
|
|
|
```sqlite
|
|
|
SELECT avg(Length)
|
|
|
FROM Sequences;
|
|
|
```
|
|
|
|
|
|
Can you write an SQL command that will return the average Sensitivity and PPV for all the scores in the Accuracy table?
|
|
|
|
|
|
#### Naming Columns
|
|
|
|
|
|
Columns can be renamed for the output. This is particularly useful for putting a name to a calculation or a sub-query. This is done using the AS operator. For example the previous example can be written as:
|
|
|
|
|
|
```sqlite
|
|
|
SELECT avg(Length) as 'Average Length'
|
|
|
FROM Sequences;
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### Grouping Results
|
|
|
|
|
|
You can combine similar records using the 'GROUP BY' clause. If you give the 'GROUP BY' clause a column, it will combine results with the same column value.
|
|
|
|
|
|
```sqlite
|
|
|
SELECT Family, avg(Length)
|
|
|
FROM Sequences
|
|
|
GROUP BY Family;
|
|
|
```
|
|
|
|
|
|
Write an SQL command to calculate the average Sensitivity and PPV for each parameter set from the Accuracy table. Sort based on average Sensitivity and print out the 10 best performing parameter sets.
|
|
|
|
|
|
It might be obvious, but the 'GROUP BY' clause will only group data that is in the table. It does not check for completeness of the data. You should verify that all the data you expect is in the table. One way to do that is to use the count function to make sure that every average uses the same amount of data.
|
|
|
|
|
|
#### Filtering Groups
|
|
|
|
|
|
You can filter the groups shown by using the HAVING clause. This clause is used similarly to the WHERE clause to filter the groups.
|
|
|
|
|
|
Use the HAVING clause to verify that no parameter set is missing any sequences.
|
|
|
|
|
|
|
|
|
|
|
|
### Joining Tables
|
|
|
|
|
|
One of the more powerful features of databases is the ability to combine information from multiple tables. There are three main types of table joins: Cross, Inner, and Outer. Cross simply creates all combinations of the joined column values. In our example, a cross join of the Sequences.Sequence and Sets.Parameter_Set columns will generate a row for each row in the Accuracy table. Inner and Outer joins differ on how they treat column values that appear in one table but not the other.
|
|
|
|
|
|
![joins](Images\sql-joins.jpg)
|
|
|
|
|
|
You can perform an INNER JOIN by putting 'INNER JOIN' between two table names, followed by 'ON' and then an equality between two columns from the two tables.
|
|
|
|
|
|
|
|
|
|
|
|
```sqlite
|
|
|
SELECT Parameter_Set, Sequence, Family
|
|
|
FROM Accuracy INNER JOIN Sequences ON Accuracy.Sequence = Sequences.Sequence
|
|
|
ORDER BY Parameter_Set;
|
|
|
```
|
|
|
|
|
|
Write a SQL command that performs an INNER JOIN between Accuracy and Sequence. Then calculate the average Sensitivity and PPV for each RNA family across all parameter sets, as well as the average length for each family.
|
|
|
|
|
|
Write a SQL command that performs an INNER JOIN between Accuracy and Sequence. Then calculate the average Sensitivity and PPV for 5S rRNA for each parameter set.
|
|
|
|
|
|
|
|
|
|
|
|
### Adding or Updating Data
|
|
|
|
|
|
For examples of how to add or update data, look at the python section of this workshop.
|
|
|
|
|
|
|
|
|
|
|
|
## SQLite and Python
|
|
|
|
|
|
To work through the python section of the workshop, open the accompanying python file and follow the direction inside that file.
|
|
|
|
|
|
|
|
|
|
|
|
## Other SQL implementations
|
|
|
|
|
|
Most of the syntax here is relatively consistent across SQL implementations. Some of the syntax for retrieving the database schema (the tables and the fields in the table) can be implementation specific. The functions supported by each implementation can also differ. |
|
|
\ No newline at end of file |