Configuration
Configuration step is required for setting the path the required datasets and environmental variables.
Setting the environmental variables
Inside the CWAS-Plus workspace directory, there is a configuration file configuration.txt
. This file contains a few crucial environmental variables that will be used through CWAS-Plus. The description of each variable is as below:
ANNOTATION_DATA_DIR: This is the path of the directory, which contains annotation datasets, such as bed files.
GENE_MATRIX: This is the path of the gene matrix, which is a text file. The first column should be gene ID, and the second column should be gene name. The other columns will represent each gene list and show whether each row (=gene) are matched to the gene list or not by a binary code (0, 1). 1 if the gene is matched to a gene list, 0 if not.
ANNOTATION_KEY_CONFIG: This is the path of the annotation key file, which is a yaml file. This file contains the name of the annotation datasets inside the annotation dataset directory and the key names that will be used to represent the dataset. All details should be written in yaml syntax. Also, to split the category group to functional score and functional annotation, the users should type each annotation dataset under the matched group dictionary. Below is an example of this file. The format should be (name): (key) with a uniform indentation for each row. Be aware that the name of the annotations should not contain ‘_’. As domains will combined with ‘_’ as a delimiter, using ‘_’ in the annotation name will cause errors.
VEP: This is the path of VEP. If there is a pre-installed VEP, this line would be written in advance when the users typed the command
cwas start
.VEP_CACHE_DIR: This is the path of the directory, which contains cache files and overall resources for VEP.
VEP_CONSERVATION_FILE: This is the path of the conservation file (loftee.sql), which will be used for variant classification.
VEP_LOFTEE: This is the path of the directory of loftee plugin, which will be used for variant classification.
VEP_HUMAN_ANCESTOR_FA: This is the path of the human ancestor fasta file, which will be used for variant classification.
VEP_GERP_BIGWIG: This is the path of the GERP bigwig file, which will be used for variant classification.
VEP_MIS_DB: This is the path of the database in vcf format. This will be used for variant classification. Users can manually prepare this file to classify damaging missense variants.
VEP_MIS_INFO_KEY: The name of the score in the missense classification database. It must be present in the INFO field of the database. The score must be specified by this name in the field. For example, if the user is using MPC score in the database, the database will look like below.
#CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO
chr1
69094
.
G
A
.
.
MPC=2.73403
chr1
69094
.
G
C
.
.
MPC=2.29136
chr1
69094
.
G
T
.
.
MPC=2.29136
chr1
69095
.
T
A
.
.
MPC=4.31666
VEP_MIS_THRES: The cutoff that will be used for the missense classification. The missense variants scoring equal to or above VEP_MIS_THRES will be classified as damaging missense mutations.
When preparing the ANNOTATION_KEY_CONFIG yaml file, please avoid using underscores (‘_’) in the annotation name. Underscores are used for distinguishing different domains within a single category.
For example, check below.
functional_score:
bed1.bed.gz: annot1
bed2.bed.gz: annot2
functional_annotation:
bed3.bed.gz: annot_3 # Do not use underscores like this. Users can use 'annot3' instead.
bed4.bed.gz: annot4
After filling the configuration file, type the below command for configuration. This process will create a symlink to the annotation dataset directory, gene matrix and the annotation key file to the user’s workspace. Also, based on the annotation key file, a category domain file and a redundant category file will be created. The category domain file contains all the inferior category groups that will be used for CWAS-Plus. The redundant category file contains the combination of categories that will be excluded in CWAS-Plus. This is for removing duplicated categories (for example, coding variants with all genes and coding variants with coding genes) and nonsense categories (for example, missense variants that are indels).
To force configuration (overwrite previous configurations), use -f
option.
cwas configuration
After configuration, a file .cwas_env
that contains environmental variables for CWAS-Plus will be created in the home directory.
Data preparation
For efficient annotation process, the users should merge all bed files by typing the below command. During this process, all bed files will be split into their intersected or non-intersected intervals with numbers that indicate which annotation datasets are matched to the interval in binary scale.
The parameters of the command are as below:
p: The number of processors.
cwas preparation -p 8
After preparation, the merged bed file (merged_annotation.bed.gz
) looks like below:
#ANNOT=ChmE1|ChmE2|ChmE3 |
|||
#chrom |
start |
end |
annot_int |
chr1 |
10000 |
10600 |
1 |
chr1 |
79200 |
80000 |
4 |
chr1 |
610420 |
612020 |
2 |
chr1 |
631820 |
632020 |
8 |
The line starts with #ANNOT
indicates the annotation datasets merged in the bed file. It also indicates the order of the datasets matched to the annot_int
.
The column annot_int
represents the decimal number converted from binary code. The binary code consists of 0 and 1, but the representation is different from ordinary binary numbers. For example, when an interval from 1,000 to 1,010 base overlaps with ChmE1 and ChmE2 region, the binary code for CWAS-Plus will be 110
(1 if the region overlaps, and 0 if not.). CWAS-Plus then converts it to decimal numbers. Here, the 1st position refers to 20, the 2nd position refers to 21, and the 3rd position refers to 22. Therefore, the decimal number would be 1*20 + 1*21 + 0*22 = 3. Using this algorithm, CWAS-Plus merges genomic intervals efficiently.