close
close

Development of a polygenic score predicting drug resistance and patient outcome in breast cancer

Discovery pipeline for identification of expression biomarkers associated with drug response

To systematically discover the biomarkers that are predictive of anti-cancer drug response we followed the computational framework shown in Fig. 1A. The GDSC database includes IC50 of 192 drugs on 809 cancer cell lines, while the COSMIC database provides the microarray-based Z-score expression profile of approximately 16,000 genes in 970 cancer cell lines. 777 cancer cell lines were common between these databases. Higher IC50 values indicate greater resistance. We hypothesized that if high expression of a gene is correlated positively with drug response (IC50) then the gene could be associated with drug resistance. In contrast, if high expression of a gene is negatively correlated with drug response, then the expression of the gene could be a marker of sensitivity to the drug.

Fig. 1: Workflow to identify genes whose expression positively correlated with anti-cancer drug resistance in cancer cell lines.
figure 1

A Framework for discovering biomarkers that predict response to anti-cancer drugs. B Spearman correlation of IC50s of anti-cancer drugs with z-score normalized expression levels of FAM129B across cancer cell lines (n = 777) in the GDSC database. C Schematic shows the workflow by which FAM129B, and other drug resistance genes were identified in cancer cell lines for 51 FDA approved drugs from the GDSC and 34 FDA approved drugs from the CTRP database. D 36 genes in GDSC that were among the top 50 genes for ≥20 drugs, when ranked by spearman correlation coefficient (SCC) of gene expression with IC50 for each drug. Bar plot showing number of drugs for which the expression vs IC50 correlation coefficient was among the top 50 SCCs. FAM129B predicted resistance for the largest number of drugs. E Bar plot showing number of drugs for each of the 36 genes identified from the GDSC database is significantly (adjusted P

Pairwise correlation analysis identifies NIBAN2/FAM129B as a candidate marker for chemoresistance in cancer cell lines

We demonstrate the results in pan-cancer for one of the top genes whose expression showed among the highest correlation with IC50 of several drugs, FAM129B. Temozolomide (DNA alkylating agent), Cytarabine (Cytidine antimetabolite) and Gemcitabine (Cytidine antimetabolite) all lead to DNA damage and halt DNA replication. The plots of IC50 for the three drugs versus expression of FAM129B, in all 777 cancer cell lines are shown along with the correlation coefficients and the p-value (Fig. 1B). These correlation values for 16,248 genes were ranked from highest to lowest. FAM129B ranked as the best correlated gene (adjusted P = 7.66e−47) for Temozolomide resistance (Supplementary Fig. 1A). In addition, FAM129B ranked 7 (adjusted P = 3.96e−34) and 13 (adjusted P = 3.97e−24) for resistance against Cytarabine and Gemcitabine, respectively. To validate this finding, we performed correlation analysis of expression of 18,541 genes with drug sensitivity of Temozolomide, Cytarabine and Gemcitabine in the CTRP dataset. The correlation of FAM129B expression with IC50 of the drugs was consistently within the top 2% of the genes (Supplementary Figure 1B). Although in the CTRP dataset, FAM129B ranked lower for Temozolomide and Gemcitabine, there were other drugs where it ranked higher, e.g. it was the third candidate for Sorafenib resistance (rho = 0.28, adjusted P = 6.53e−12).

Identification of a set of genes predictive of drug resistance

To identify genes most predictive of resistance to multiple drugs, we followed the strategy described in the schematic Fig. 1C. For each of 51 FDA approved drugs we ranked the correlation values of each gene (from GDSC) giving the lowest rank to the highest correlation value. We kept genes which were significantly correlated (adjusted P 1D). FAM129B predicted resistance to the highest number of drugs (39 of the 51 FDA approved drugs), with diverse mechanism of actions which included alkylating agents, anti-inflammatory agents, anti-metabolites, kinase inhibitors, PARP inhibitors, NAE inhibitors, anti-estrogens, BCL-2 inhibitors, cytoskeleton inhibitors, topoisomerase inhibitors, mTOR inhibitors, HDAC inhibitors and proteosome inhibitors (as shown in Supplementary Data 1). Experimentally, silencing of FAM129B has been shown to produce sensitivity to Oxaliplatin in breast cancer cells23.

We next validated this finding in the independent CTRP dataset, where there were 34 FDA approved drugs in common with the GDSC. We repeated the workflow and performed correlation analysis of 18,541 gene expression with AUC values of 34 drugs. The genes were ranked from the most correlated to the least correlated. Although the rank of a gene for a particular drug is well correlated between GDSC and CTRP, the exact ranks in GDSC and CTRP are randomly scattered around zero, as can be seen in the residual plots (Supplementary Fig. 2). Thus, despite the general correlation, the exact rank of a gene in predicting resistance to a drug varies between the two databases and we decided against using concordance of rank as the method to pick the list of predictive genes. Instead, we simply selected the 36 drug-resistance genes in GDSC and tested whether these 36 genes predicted resistance to multiple drugs in CTRP. The 36 genes predictive of resistance in ≥20 drug in GDSC were included in the top 2% of the genes that were significantly (adjusted P 1E). FAM129B showed resistance for 17 of the 34 drugs and was among the best predictors of resistance in the CTRP dataset.

Identifying drug resistance of cancer cell lines based on a polygenic score of 36 genes

Once we had identified 36 genes that predicted resistance to a substantial number of drugs, we wondered whether a polygenic score specifically derived from these 36 genes, UAB36, has a higher correlation with the IC50 to a specific drug than that of the best gene, FAM129B (see Methods). We calculated the SCCs of UAB36 vs. IC50 (of all drugs, on all cell lines) and polygenic score vs. IC50 (of all drugs on all cell lines). The distribution of the raw correlation coefficient values and their conversion to Z-score using Fisher’s Z-transformation is described in methods and is shown in Supplementary Figure 3. The distribution of correlation coefficients (expressed as Z-score) of UAB36 score with IC50 is clearly much higher than that of FAM129B with IC50s (p −12 in Wilcoxon Rank Sum test) in all cell lineages (Fig. 2A).

Fig. 2: Correlation of IC50 with UAB36 score is higher than with FAM129B expression alone across cancer cell lineages and against different drugs.
figure 2

A Violin plot shows the distribution of Fisher transformed z-score obtained from the spearman correlation of IC50s for FDA approved drugs with expression of FAM129B (colored in purple) and with UAB36 (colored in yellow). In every cancer type, the violin plot for UAB36 was significantly different from that for FAM129B (p −12 in Wilcoxon Rank Sum test). B Heatmap shows spearman correlation of IC50s of FDA-approved drugs in specific lineages versus expression of FAM129B or UAB36 score. Pink box: Positive SCC and FDR 

We also performed the analysis after separating the drugs in different lineages and determining the correlation coefficient of UAB36 score with the IC50 of any drug. After correcting for FDR, the UAB36 score showed a significantly higher correlation with the IC50 than FAM129B’s correlation with IC50, both in terms of cell-lineage and drug used (Fig. 2B). The spearman correlation coefficient values and FDR corrected p values of FAM129B with IC50 of drug are provided in the Supplementary Data 2. Similarly, the values for UAB36 are provided in the Supplementary Data 3. Because we selected the genes for the polygenic score using IC50 of multiple drugs in GDSC, we expected UAB36 based polygenic scores to be effective against different drugs in different lineages even in CTRP. Indeed, in the CTRP database, UAB36 showed significant positive correlation with 5-Fluorouracil in the large intestine (rho: 0.5; p value: 0.0003726), Crizotinib in the lung cancer (rho: 0.19; p value: 0.02585), Axitinib in kidney cancer (rho: 0.71; p value: 0.0008528) and Gemcitabine in the ovarian cancer (rho: 0.67; p value: 8.22e−05) as shown in Supplementary Fig. 4.

UAB36 outperforms established gene signatures in predicting tamoxifen resistance in breast cancer cells

Turning to a specific cancer, the analysis in Fig. 2B showed that in the GDSC dataset breast cancer cells the UAB36 score had a very high correlation to IC50 to tamoxifen. In support of this, the UAB36 score was significantly (P = 8.63e−03) higher in tamoxifen resistant subclones of MCF7 cell lines than in tamoxifen sensitive subclones from the GSE2645924 (Supplementary Fig. 5). The UAB36 score is highly correlated with IC50 of tamoxifen (rho = 0.55, P = 7.1e−05) than FAM129B expression alone with the IC50 (rho = 0.3, P = 3.80e−02), in 48 breast cancer cell lines (Fig. 3A, B). The polygenic score was next tested in the breast cancer cells (n = 35) from the CTRP dataset. We could not include KRT8 gene expression into the model as its expression was not reported in the CTRP dataset. With the same polygenic score formula, the score obtained from the 35 genes remained significantly associated with tamoxifen resistance in this independent CTRP dataset and here again, the polygenic score is better correlated with IC50 than FAM129B alone (Fig. 3E, F).

Fig. 3: UAB36 is better correlated with IC50 of Tamoxifen in breast cancer cells compared to FAM129B, ENDORSE or PAM50.
figure 3

AD Breast cancer cell lines (n = 48) from the GDSC database. EH Breast cancer cell lines (n = 35) from the CTRP database. Scatter plot shows spearman correlation of IC50 (or AUC) to Tamoxifen with expression level of FAM129B, Polygenic score of UAB36 gene set, Polygenic score of ENDORSE gene set and Polygenic score of PAM50 gene set, as indicated. The IC50 values from the GDSC database are expressed in micromolar concentrations. The AUC values from the CTRP database range from 0 to 29.4.

To assess the efficacy of UAB36 in comparison to published gene signatures in breast cancer cells, we downloaded the following gene sets: ENDORSE which includes 63 genes linked to ER+ breast cancer patient survival treated with endocrine therapy25, and PAM50, a widely accepted classifier consisting of 50 genes for molecular subtyping of breast cancer. A study by Chia et al.26 showed in NCIC CTG MA.12 clinical trial, where premenopausal women with primary breast cancer treated with adjuvant tamoxifen, PAM50 subtyping could predict response to tamoxifen treatment. We then calculated the spearman correlation of the expression of these genes with the IC50 of tamoxifen in the breast cancer cells from the GDSC and applied the same polygenic score formula to create the polygenic score for ENDORSE genes and PAM50 genes. Intriguingly, the polygenic scores derived from these sets exhibited lower correlation strengths, as depicted in Fig. 3C, D, when compared with the UAB36 score (Fig. 3B).

Similarly, we created the polygenic score based on gene sets in the CTRP, and here again UAB36 demonstrated a stronger positive correlation with the IC50 of tamoxifen than the ENDORSE or PAM50 gene sets (Fig. 3F–H).

No genes from UAB36 overlapped with those in ENDORSE and PAM50 sets.

Enrichment of tamoxifen resistance gene sets in breast cancer patients with a high UAB36 score

With the promising results observed in the breast cancer cells, we next asked in breast cancer patients whether high UAB36 score was correlated with the expression of genes known to be associated with tamoxifen resistance, a surrogate marker of such resistance.

We performed GSEA on z-score normalized RNA-Seq data from patient tumors in the TCGA breast cancer (TCGA-BRCA), specifically focusing on ER+/HER2 tumors. We used the same polygenic score formula from the 36 genes, using the correlation coefficients with IC50 for tamoxifen on the breast cancer cell lines in the GDSC database and the expression levels of the genes in the tumors. Subsequently, patients were stratified into risk groups based on q1 and q4 quartile cut-off. We found a significant enrichment of MASSARWEH_TAMOXIFEN_RESISTANCE_UP gene set (NES: 2.17 and FDR q value 4A. Upregulation of these genes was associated with resistance to tamoxifen in MCF-7 mouse xenografts20. A similar result was obtained for CREIGHTON_ENDOCRINE_THERAPY_RESISTANCE_5 gene set (NES: 2.03 and FDR q value 4B), whose expression is associated with resistance to endocrine therapy in MCF-7 mouse xenografts expressing ESR121. The robustness of these associations was validated in an independent METABRIC cohort of ER+/HER2 breast cancer tumors that received adjuvant endocrine therapy27,28. Consistent with the TCGA cohort, the high-UAB36-score group demonstrated significant enrichment for the same two gene sets with NES 1.81 and 1.86, respectively (Fig. 4C, D).

Fig. 4: Tumors with high UAB36 score are enriched for Tamoxifen resistance genes in breast cancer (ER+/HER2−) patients.
figure 4

Enrichment plot shows enrichment of MASSARWEH_TAMOXIFEN_RESISTANCE_UP gene set (A) and CREIGHTON_ENDOCRINE_THERAPY_RESISTANCE_5 gene set (B) in high UAB36 expressing tumors from the TCGA-BRCA (ER+/HER2−) cohort. The TPM expression values of each gene is normalized to z-scores across samples. NES normalized enrichment score, FDR false discovery rate. C, D Same as A, B except in the high UAB36 expressing tumors from the METABRIC breast cancer (ER+/HER2−) cohort. E Venn diagram showing the overlap of MASSARWEH_TAMOXIFEN_RESISTANCE_UP gene set which are significantly (FDR F Bar plot showing the spearman correlation coefficients of the top 50 MASSARWEH_TAMOXIFEN_RESISTANCE_UP genes with significant positive correlation with UAB36 score in the TCGA-BRCA (ER+/HER2−) cohort. G Same as F except in the METABRIC (ER+/HER2−) cohort. Here, bars colored in dark green are the genes which are in common between TCGA and METABRIC cohort. H Scatter plot showing positive correlation of UAB36 score with the IC50 for tamoxifen in the TCGA-BRCA (ER+/HER2−) cohort predicted by Li et al.16.

The GSEA suggest that the UAB36 polygenic score should be correlated with genes known to be involved in tamoxifen resistance. We calculated the spearman correlation of expression of each of the genes in the MASSARWEH_TAMOXIFEN_RESISTANCE_UP gene set (n = 582) with the UAB36 score in the ER+/HER2 TCGA-BRCA tumors. 357 of the 576 genes were significantly (FDR +/HER2 METABRIC tumors showed that 209 of the 561 genes were significantly positively correlated with UAB36 score. 40% of the genes significantly positively correlated with UAB36 score in the TCGA cohort were also similarly correlated in the METABRIC cohort (Fig. 4E). We extracted the top 50 positively correlated genes from both the cohorts (Fig. 4F, G) and identified 14 genes that were in common between the two cohorts within the top 50 gene lists, highlighting a subset of tamoxifen resistance associated genes that are most strongly and reproducibly correlated with UAB36 score. Thus, there is indeed a significant and conserved correlation of genes known to be associated with tamoxifen resistance with the UAB36 score. In addition, the high UAB36 score group of patients from the TCGA showed enrichment of expression of gene sets associated with multiple drug resistance, gefitinib resistance, fluorouracil resistance, response to cisplatin and response to oxidative stress as shown in Supplementary Fig. 6. The UAB36 genes themselves are enriched in genes involved in cellular response to stress and several cancer survival pathways as shown in Supplementary Table 2.

There have been previous attempts to predict resistance to drugs using gene expression datasets. For example, Li et al.16 used machine learning approaches on GDSC cancer cell lines to predict IC50 values of 272 drugs for TCGA tumors. We downloaded the predicted IC50 values of tamoxifen for TCGA-BRCA cohort and correlated these predicted IC50 values of the tumors with the UAB36 scores of the same tumors. The predicted IC50s to tamoxifen are positively correlated (rho = 0.45, P = 1.4e−14) with UAB36 score (Fig. 4H), again suggesting that the UAB36 score performs well to predict resistance/sensitivity to tamoxifen in patient tumors.

UAB36 predicts poor outcome in ER+/HER2 breast cancer patient treated with tamoxifen

If a high UAB36 score predicts high IC50 of breast cancer cells to tamoxifen, we should expect that high UAB36 score should predict poor outcome in breast cancer patients treated with tamoxifen. To elucidate the prognostic value of UAB36 on ER+ cancers treated with tamoxifen, we performed Kaplan–Meier survival analysis in the 448 patients of the METABRIC cohort. Compared with tumors in the low UAB36-score group, those with a high-score were significantly associated with tumor recurrence and patient death (Fig. 5A–C). It is noteworthy, that the separation between two curves in Fig. 5A–C is wider than with FAM129B alone (Fig. 5B–D) for relapse free survival and overall survival, supporting that the polygenic drug resistance risk score derived from 36 genes is better than FAM129B alone for predicting outcome post-tamoxifen treatment.

Fig. 5: UAB36 is strongly associated with poor outcome in breast cancer (ER+/HER2−) patients treated with Tamoxifen.
figure 5

AD In the METABRIC (ER+/HER2−) patient tumors treated with tamoxifen. A Relapse free survival Kaplan–Meier curves of tumors stratified using UAB36. B Relapse free survival Kaplan–Meier curve of tumors stratified using expression of FAM129B. C Overall survival Kaplan–Meier curves of tumors stratified using UAB36. D Overall survival Kaplan–Meier curve of tumors stratified using expression of FAM129B. Patients were stratified into two groups based on quartiles (Q1 and Q4) threshold. The P values for the survival analysis were obtained using a Mantel log-rank test (two-sided). E In the GSE9195 breast cancer (ER+) patients treated with tamoxifen. Relapse free survival Kaplan–Meier curves of tumors stratified using UAB36. Patients were stratified into two groups based on median threshold. F In the TCGA-BRCA cohort treated with tamoxifen. Overall survival Kaplan–Meier curves of tumors stratified using UAB36. Patients were stratified into two groups based on quartiles (Q1 and Q4) threshold.

Similarly, we tested prognostic association of UAB36 in an independent cohort GSE919529,30, where 77 ER+ breast cancer patients were treated with tamoxifen in the adjuvant setting. Using the same UAB36 score formula, the high-score group correlated with tumor relapse (Fig. 5E). Due to a smaller sample size of this cohort, we employed a median cut-off for risk stratification. Lastly, we examined the 252 TCGA-BRCA patients for which we have information about outcome following treatment with tamoxifen31. Using a quartile-based stratification, patients with high UAB36 scores were again found to have worse survival outcomes, though the level of significance was marginal (Fig. 5F). Thus, in three different cohorts, METABRIC, GSE9195 and TCGA-BRCA, a high UAB36 score was associated with poor outcome.

UAB36 outperforms the prognostic ability of ENDORSE and PAM50 and is a predictor of outcome independent of known clinical risk factors in breast cancer patients

To evaluate the prognostic potential of UAB36 against the established gene signatures ENDORSE and PAM50, we performed Cox regression analysis for overall survival in the METABRIC ER+/HER2 breast cancer cohort treated with tamoxifen. Additionally, the impact of clinical risk factors, including age, tumor stage and tumor grade, on patient survival was assessed. As depicted in Table 1, we first stratify the patients into two groups based on median threshold of UAB36. In the univariate analyses, patients with high UAB36 score exhibited significant hazard ratio (HR) of 1.33 (p-value: 1.11e−03), indicating patients with high UAB36 score have a higher risk of death after tamoxifen treatment compared to those with a low UAB36 score. In comparison a high score in the ENDORSE or PAM50 gene sets were also associated with increased risk of poor outcome as demonstrated by significant higher hazard ratio. Age and patient tumor grade and tumor stage showed significant association with poor survival, as expected from the Literature25,32,33.

Table 1 Cox regression analysis of UAB36 with established gene signatures and clinical risk factors for overall survival in the tamoxifen treated METABRIC ER+/HER2− cohort (n = 895)

To account for potential confounding factors, we then performed the multivariate Cox regression analysis using the three polygenic scores (together) along with the three clinical co-variates (age, grade and stage). The association between UAB36 and patient survival remained significant (HR = 1.28, p-value = 2.34E−02) even after adjusting for the clinical co- variates and the other polygenic scores. In contrast, ENDORSE and PAM50 lost their significance in predicting patient survival outcome when the clinical co-variates were included with (Table 1) or without (Supplementary Table 3) taking UAB36 into consideration. Age remained a significant factor in the model, but with a small effect size, but tumor stage maintained its association with poor survival. Tumor grade lost its significance in the multivariate model.

A similar multivariate Cox model in another dataset, GSE9195, showed that high UAB36 score was significant and an independent predictor of patient tumor relapse even with age and tumor grade as co-variates (Table 2).

Table 2 Cox regression analysis of UAB36 with established clinical risk factors for relapse free survival in the tamoxifen treated GSE9195 ER+ cohort (n = 77)

These results collectively indicate UAB36 polygenic score holds promise as a therapeutic response biomarker and an independent predictor of drug resistance risk and poor survival in ER+/HER2 breast cancer patients undergoing tamoxifen treatment.

All genes in UAB36 are required in the UAB36

To confirm that a set of 36 genes is essential for the UAB36 gene signature, we constructed a competing 35-gene signatures (UAB36-minus-1) by deleting one gene in turn from the set and asked whether systematic removal of a gene altered the ability to predict patient death probability, as measured by increase or decrease of HR. In the METABRIC cohort, we performed multivariate Cox regression analysis for each of the “36-minus-1” gene signatures, taking ENDORSE, PAM50 gene signatures as well as age, tumor grade and tumor stage as confounding factor, and compared the hazard ratio with the original hazard ratio of UAB36 (HR = 1.28). Removal of most of the genes, decreased HR, thus decreasing predictive ability. However, single removal of a small number of genes, ABCC3, ADAM9, ANXA2, ARHGAP29, CCN1, EPHA2, GIPC1, GNG12, KIAA152, LMNA, MYOF or S100A10 gene, significantly increased the HR (Supplementary Data 4). Similarly, we repeated the above approach of “UAB36-1” in the GSE9195 cohort. Here, we found removal of either TM4SF1, MET, GNG12, TNFRSF12A, CAPN2, and TUFT1 significantly increased the HR (Supplementary Data 5). Between the two cohorts GNG12 was the only gene whose removal consistently increased the HR in both cohorts, as shown in Table 3.

Table 3 Systematic removal of genes from the UAB36

Next, we prepared 36 competing 34-gene signatures, i.e. “UAB36-minus-2”, by deleting any two genes from the set. Removal of two sets, DCBLD2 + TM4SF1 or ARHGAP29 + TM4SF1 increased the HR in both METABRIC and GSE9195.

Further, we proceeded with 33-gene signatures “UAB36-3” and found that removal of a set containing FAM129B, ANXA2 and S100A10 increased the HR.

Although the removal of selected 1, 2 or 3 genes increased the HR (Table 3), the increase was minor in one cohort or the other. Therefore, for the current study, we decided to keep all genes in UAB36 as essential. In the future, as new studies publish new breast cancer cohorts treated with tamoxifen, we will test if removal of these genes from UAB36 significantly improves the HR by an even larger extent.