Grouping data using Galaxy
Galaxy has numerous tools to analyse tables and the variables they contain.
Here, we are going to use the tool Group data by a column and perform aggregate
operation on other columns (Galaxy Version 2.1.4) to rapidly extract information
from a complex GTF file.
The input data
The input data is a GTF annotation file for the dmel 6.59 D. melanogaster genome. It contains
only exon annotations for chromosomes X, 2, 3 and 4.
This file is available on the GitHub repository ARTbio/AnalyseGenome at this URL:
- Create a new history named Grouping exercise, then open the Upload menu (top-left corner, just below the Galaxy logo). Click the Paste/Fetch data button and paste the above URL in the central field of the panel. Instead of the pre-filled New File, type dmel.gtf, and click the Start button.
- Once the upload is complete, you will notice that the dataset's data type is gtf, not gtf.gz. This is because Galaxy uncompresses most uploaded data on the fly. Notable exceptions to this default behavior are fastq.gz and fasta.gz files. Since these files are often very large, they are kept in their compressed form to save disk space on the server and can be processed directly.
- Have a look at the content of the file. It is a standard GTF file with 9 columns. The last column contains 4 types of annotations: gene_id, gene_symbol, transcript_id and transcript_symbol.
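To make the 9-column layout concrete, here is a minimal Python sketch that splits one GTF line into its columns and turns the last column into a dictionary. The sample lines, the column names in `GTF_COLUMNS` and the helper names are assumptions made for illustration; they are not taken from dmel.gtf or from any Galaxy tool.

```python
# Two made-up GTF lines (tab-separated, 9 columns); all values are illustrative only.
SAMPLE_GTF = (
    'X\tFlyBase\texon\t100\t200\t.\t+\t.\t'
    'gene_id "FBgn0000001"; gene_symbol "geneA"; '
    'transcript_id "FBtr0000001"; transcript_symbol "geneA-RA";\n'
    '2L\tFlyBase\texon\t500\t900\t.\t-\t.\t'
    'gene_id "FBgn0000002"; gene_symbol "geneB"; '
    'transcript_id "FBtr0000002"; transcript_symbol "geneB-RA";\n'
)

# Conventional names for the 9 GTF columns (assumed labels, used here as dict keys).
GTF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attributes",
]


def parse_attributes(field):
    """Turn 'key "value"; key "value";' into a {key: value} dictionary."""
    attributes = {}
    for chunk in field.strip().split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after the final ';'
        key, _, value = chunk.partition(" ")
        attributes[key] = value.strip('"')
    return attributes


def parse_gtf_line(line):
    """Split one GTF line into its 9 columns and parse the attributes column."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["attributes"] = parse_attributes(record["attributes"])
    return record


records = [parse_gtf_line(line) for line in SAMPLE_GTF.splitlines()]
for record in records:
    print(record["seqname"], record["feature"], sorted(record["attributes"]))
```

Each record carries the same four annotation keys in its last column, which is why the next step can reduce that column to the gene_id alone.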
- The first thing we are going to do is simplify the 9th column, keeping only the gene_id information.
  - Select the Column Regex Find And Replace tool in the left tool bar.
  - Select column 9 for the dataset 1: dmel.gtf
  - Click Insert Check
  - In the Find Regex field, enter the regular expression
  - In the Replacement field, do not enter anything
  - Click the Run button.
  - In the resulting dataset 2: Column Regex Find And Replace on data 1, check column 9 (Attributes). What do you see?
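The same find-and-replace idea can be sketched outside Galaxy with Python's `re` module. The pattern below is a hypothetical one chosen for illustration, not necessarily the expression used in this exercise: it assumes that gene_id is the first annotation in column 9 and that values contain no semicolon, and it deletes everything from the first semicolon to the end of the field (an empty replacement, as in the tool form).

```python
import re

# Hypothetical pattern (an assumption, not the tutorial's expression):
# match from the first ';' to the end of the field.
FIND_REGEX = re.compile(r";.*$")

# One made-up GTF line; the values are illustrative only.
line = (
    'X\tFlyBase\texon\t100\t200\t.\t+\t.\t'
    'gene_id "FBgn0000001"; gene_symbol "geneA"; '
    'transcript_id "FBtr0000001"; transcript_symbol "geneA-RA";'
)


def simplify_attributes(gtf_line):
    """Apply the find/replace to column 9 only and leave columns 1 to 8 untouched."""
    fields = gtf_line.rstrip("\n").split("\t")
    fields[8] = FIND_REGEX.sub("", fields[8])  # empty replacement string
    return "\t".join(fields)


simplified = simplify_attributes(line)
print(simplified.split("\t")[8])
```

Restricting the substitution to a single column mirrors the tool's column selector, so a semicolon elsewhere in the line would not be affected.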
- First grouping operation.
  Now, we are going to group the data on the 9th column (we collapse lines by unique gene_id), while counting the "events" in column 3 (exon) and randomly picking a value from column 1 (the chromosome name, which is expected to be the same for all exons of a given gene).
  - Select the Group data by a column and perform aggregate operation on other columns tool in the left tool bar.
  - The Select data field should already be set to 2: Column Regex Find And Replace on data 1
  - In the Group by column menu, select the 9th column.
  - Click once on Insert Operation
  - In the Type menu, select Count
  - In the On column menu, select Column: 3
  - Click Insert Operation one more time
  - In the Type menu, select Randomly pick (bottom of the menu)
  - In the On column menu, select Column: 1
  - Click the Run button.
  - What do you see in the generated dataset? → It is a 3-column table. The first column contains unique gene_ids. The second column contains the computed number of exons for this gene. The third column contains the chromosome name for this gene.
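For readers who want to see the logic of this grouping step, here is a minimal sketch in plain Python. It is an illustration of the operation, not the code of the Galaxy tool; the rows, gene identifiers and the function name `group_count_and_pick` are made up, and column positions are given as 0-based indexes.

```python
import random
from collections import defaultdict


def make_row(chrom, gene_id):
    """Build a made-up 9-column exon row whose last column holds only the gene_id."""
    return [chrom, "FlyBase", "exon", "1", "100", ".", "+", ".", f'gene_id "{gene_id}"']


# Illustrative data: 2 exons for the first gene, 1 for the second, 3 for the third.
rows = [
    make_row("X", "FBgn0000001"),
    make_row("X", "FBgn0000001"),
    make_row("2L", "FBgn0000002"),
    make_row("2L", "FBgn0000003"),
    make_row("2L", "FBgn0000003"),
    make_row("2L", "FBgn0000003"),
]


def group_count_and_pick(rows, group_col, count_col, pick_col, rng):
    """Group rows on one column, count values of a second, pick one value of a third."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row)
    table = []
    for key in sorted(groups):
        members = groups[key]
        count = len([row[count_col] for row in members])    # "Count" on column 3
        picked = rng.choice([row[pick_col] for row in members])  # "Randomly pick" on column 1
        table.append((key, count, picked))
    return table


# Indexes 8, 2 and 0 correspond to columns 9, 3 and 1 of the table.
gene_table = group_count_and_pick(rows, 8, 2, 0, random.Random(0))
for gene_id, exon_count, chrom in gene_table:
    print(gene_id, exon_count, chrom, sep="\t")
```

Because every exon of a given gene sits on the same chromosome, the random pick always returns the same value within a group, whatever the seed.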
- Second grouping operation.
  Finally, we are going to group the data again. This time, we group on the third column (chromosome names) while counting the variable (gene_id) in the first column.
  - Select the Group data by a column and perform aggregate operation on other columns tool in the left tool bar.
  - The Select data field should already be set to 3: Group on data 2
  - In the Group by column menu, select the 3rd column.
  - Click once on Insert Operation
  - In the Type menu, select Count
  - In the On column menu, select Column: 1
  - Click the Run button.
  - What do you see in the generated dataset? → It is a 2-column table. The first column contains unique chromosome arms. The second column contains the computed number of genes for the chromosome arm.
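The second grouping can be sketched the same way. The 3-column table below is made up for illustration (gene_id, exon count, chromosome arm) and stands in for the output of the first grouping; `collections.Counter` is used here as a convenient stand-in for the Count operation, not as a description of how Galaxy implements it.

```python
from collections import Counter

# Made-up 3-column table: (gene_id, number of exons, chromosome arm).
gene_table = [
    ('gene_id "FBgn0000001"', 2, "X"),
    ('gene_id "FBgn0000002"', 1, "2L"),
    ('gene_id "FBgn0000003"', 3, "2L"),
]

# Group on the third column and count the entries of the first column.
# Each gene_id appears on exactly one line, so counting lines counts genes.
genes_per_chromosome = Counter(chrom for _gene_id, _exon_count, chrom in gene_table)

for chrom in sorted(genes_per_chromosome):
    print(chrom, genes_per_chromosome[chrom], sep="\t")
```

Chaining the two groupings is what turns an exon-level file into a gene count per chromosome arm: the first pass produces one line per gene, and the second pass counts those lines.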