Risk score analysis
CWAS-Plus utilizes categorized results to estimate the optimal predictor for the phenotype. It trains a Lasso regression model using the number of variants within each category across samples. After training the model with a subset of samples, the remaining test set is employed to calculate the R2. The significance of the R2 value is determined by calculating it from samples with a randomly shuffled phenotype. The number of regressions (-n_reg) can be set to obtain the average R2 value from all regressions.
-i, –input_file: Path to the categorized zarr directory, resulted from categorization process.
-o_dir, –output_directory: Path to the directory where the output files will be saved. By default, outputs will be saved at
$CWAS_WORKSPACE
.-s, –sample_info: Path to the txt file containing the sample information for each sample. This file must have three columns (
SAMPLE
,FAMILY
,PHENOTYPE
) with the exact name.SAMPLE
FAMILY
PHENOTYPE
11000.p1
11000
case
11000.s1
11000
ctrl
11002.p1
11002
case
11002.s1
11002
ctrl
-a, –adjustment_factor: Path to the txt file containing the adjust factors for each sample. This is optional. With this option, CWAS-Plus multiplies the number of variants (or carriers, in -u option) with the adjust factor per sample.
SAMPLE
AdjustFactor
11000.p1
0.932
11000.s1
1.082
11002.p1
0.895
11002.s1
1.113
-c_info, –category_info: Path to a text file category information (*.category_info.txt).
-d, –domain_list: Domain list to filter categories based on GENCODE domain. If ‘run_all’ is given, all available options will be tested. Available options are run_all,all,coding,noncoding,ptv,missense,damaging_missense,promoter,noncoding_wo_promoter,intron,intergenic,utr,lincRNA. By default, all.
-t, –tag: Tag used for the name of the output files. By default, None.
–do_each_one: Use each annotation from functional annotation to calculate risk score. By default, False.
–leave_one_out: Calculate risk score while excluding one annotation from functional annotation. This option is not used when the –do_each_one flag is enabled. By default, False.
-u, –use_n_carrier: Enables the sample-level analysis (the use of the number of samples with variants in each category for burden test instead of the number of variants). With this option, CWAS-Plus counts the number of samples that carry at least one variant of each category.
-thr, –threshold: The number of variants in controls (or the number of control carriers) used to select rare categories. For example, if set to 3, categories with less than 3 variants in controls will be used for training. By default, 3.
-tf, –train_set_fraction: The fraction of the training set. For example, if set to 0.7, 70% of the samples will be used as training set and 30% will be used as test set. By default, 0.7.
-n_reg, –num_regression: Number of regression trials to calculate a mean of R squares. By default, 10.
-f, –fold: Number of folds for cross-validation.
-n, –n_permute: The number of permutations used to calculate the p-value. By default, 1,000.
–predict_only: If set, only predict the risk score and skip the permutation process. By default, False.
-S, –seed: Seed of random state. By default, 42.
-p, –num_proc: Number of worker processes that will be used for the permutation process. By default, 1.
cwas risk_score -i INPUT.categorization_result.txt.gz \
-o_dir OUTPUT_DIR \
-s SAMPLE_LIST.txt \
-a ADJUST_FACTOR.txt \
-c_info CATEGORY_SET.txt \
-thr 3 \
-tf 0.7 \
-n_reg 10 \
-f 5 \
-n 1000 \
-p 8
Users can perform two types of risk score analyses in a loop to identify annotations with the best predictive performance and composition within the annotation set.
Risk score analysis for categories containing a single annotation within a specific domain
cwas risk_score -i INPUT.categorization_result.txt.gz -o_dir OUTPUT_DIR -s SAMPLE_LIST.txt -a ADJUST_FACTOR.txt -c_info CATEGORY_SET.txt -thr 3 -tf 0.7 -n_reg 10 -f 5 -n 1000 -p 8 –do_each_one
Risk score analysis for categories with one annotation excluded from the total annotations
cwas risk_score -i INPUT.categorization_result.txt.gz -o_dir OUTPUT_DIR -s SAMPLE_LIST.txt -a ADJUST_FACTOR.txt -c_info CATEGORY_SET.txt -thr 3 -tf 0.7 -n_reg 10 -f 5 -n 1000 -p 8 –leave_one_out