# Network-based prediction of drug combinations

### Building the human protein–protein interactome

We assembled 15 commonly used databases, focusing on high-quality PPIs with five types of evidence: (1) binary, physical PPIs tested by high-throughput yeast-two-hybrid (Y2H) screening systems, combining binary PPIs tested from two publicly available high-quality Y2H datasets38,39 and one unpublished dataset, available at http://ccsb.dana-farber.org/interactome-data.html; (2) literature-curated PPIs identified by affinity purification followed by mass spectrometry (AP-MS), Y2H, and literature-derived low-throughput experiments; (3) binary, physical PPIs derived from protein three-dimensional structures; (4) kinase-substrate interactions from literature-derived low-throughput and high-throughput experiments; and (5) signaling networks from literature-derived low-throughput experiments. The protein-coding genes were mapped to their official gene symbols based on GeneCards (http://www.genecards.org/) and their Entrez ID. Computationally inferred interactions rooted in evolutionary analysis, gene expression data, and metabolic associations were excluded. The updated human interactome includes 243,603 PPIs connecting 16,677 unique proteins, and is 40% larger than our previously used human interactome14. The human protein–protein interactome is provided in Supplementary Data 1.

### Construction of the drug–target network

We collected high-quality physical drug–target interactions on FDA-approved or clinically investigational drugs from six commonly used data sources, and defined a physical drug–target interaction using reported binding affinity data: inhibition constant/potency (Ki), dissociation constant (Kd), median effective concentration (EC50), or median inhibitory concentration (IC50) ≤ 10 µM. Drug–target interactions were acquired from the DrugBank database (v4.3)40, the Therapeutic Target Database (TTD, v4.3.02)41, and the PharmGKB database (December 30, 2015)42. Specifically, bioactivity data of drug–target pairs were collected from three widely used databases: ChEMBL (v20, accessed in December 2015)43, BindingDB (downloaded in December 2015)44, and IUPHAR/BPS Guide to PHARMACOLOGY (downloaded in December 2015)45. After extracting the bioactivity data related to drugs from these databases, we retained only the drug–target interactions that meet the following four criteria: (i) binding affinities, including Ki, Kd, IC50, or EC50, each ≤ 10 µM; (ii) proteins can be represented by a unique UniProt accession number; (iii) proteins are marked as “reviewed” in the UniProt database46; and (iv) proteins are from Homo sapiens. In total, 15,051 drug–target interactions connecting 4428 drugs and 2256 unique human targets were built, including 1978 drugs that have at least two experimentally validated targets (Supplementary Data 2).
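The four retention criteria can be sketched as a simple record filter. This is an illustrative sketch only: the `Bioactivity` record layout, the field names, and the choice to store affinities in nM are assumptions made for the example, not the schema of any of the databases named above.

```python
from dataclasses import dataclass

AFFINITY_CUTOFF_NM = 10_000.0  # 10 µM expressed in nM (unit choice is an assumption)
AFFINITY_MEASURES = {"Ki", "Kd", "IC50", "EC50"}


@dataclass(frozen=True)
class Bioactivity:
    """Hypothetical flattened bioactivity record."""
    drug_id: str
    uniprot_acc: str   # empty string when no unique accession could be assigned
    measure: str       # "Ki", "Kd", "IC50" or "EC50"
    value_nm: float
    reviewed: bool     # UniProt "reviewed" flag
    organism: str


def keep(record: Bioactivity) -> bool:
    """Apply criteria (i) to (iv) to a single record."""
    return (
        record.measure in AFFINITY_MEASURES
        and record.value_nm <= AFFINITY_CUTOFF_NM   # (i) affinity <= 10 µM
        and bool(record.uniprot_acc)                # (ii) unique UniProt accession
        and record.reviewed                         # (iii) reviewed entry
        and record.organism == "Homo sapiens"       # (iv) human protein
    )


def build_drug_target_pairs(records):
    """Return the de-duplicated set of (drug, target) pairs that pass the filter."""
    return {(r.drug_id, r.uniprot_acc) for r in records if keep(r)}
```

Returning a set means that a drug–target pair supported by several measurements is counted once, which matches the idea of counting interactions rather than assay records.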

### Collecting gold-standard pairwise drug combinations

In this study, we focused on pairwise drug combinations by assembling clinical data from multiple data sources (Supplementary Note 3). Each drug in a combination was required to have experimentally validated target information: EC50, IC50, Ki, or Kd ≤ 10 µM. The compound name, generic name, or commercial name of each drug was standardized by MeSH and UMLS vocabularies47 and further mapped to a DrugBank ID from the DrugBank database (v4.3)40. Duplicated drug pairs were removed. In total, 681 unique pairwise drug combinations connecting 362 drugs were retained (Supplementary Data 3).

### Collecting adverse drug–drug interactions

We compiled clinically reported adverse drug–drug interaction (DDI) data from the DrugBank database (v4.3)40. Here, we focused on adverse drug interactions where each drug has experimentally validated target information. The compound name, generic name, or commercial name of each drug was standardized by MeSH and UMLS vocabularies47 and further mapped to a DrugBank ID from the DrugBank database (v4.3)40. In total, 13,397 clinically reported adverse DDIs connecting 658 unique drugs were retained (Supplementary Data 4). In addition, we collected cardiovascular event-specific adverse DDIs from the TWOSIDES database35. TWOSIDES includes over 868,221 significant associations connecting 59,220 drug pairs and 1301 adverse events35. In this study, we focused on four types of cardiovascular events: arrhythmia (MeSH ID: D001145), heart failure (MeSH ID: D006333), myocardial infarction (MeSH ID: D009203), and hypertension (MeSH ID: D006973).

### Chemical similarity analysis of drug pairs

We downloaded chemical structure information (SMILES format) from the DrugBank database (v4.3)40 and computed MACCS fingerprints of each drug using Open Babel v2.3.1 (ref. 48). If two drug molecules have a and b bits set in their MACCS fragment bit-strings, with c of these bits being set in the fingerprints of both drugs, the Tanimoto coefficient (T) of a drug–drug pair is defined as:

$$T = \frac{c}{a + b - c}$$

(3)

T is widely used in drug discovery and development49, offering a value in the range of zero (no bits in common) to one (all bits are the same).
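As a minimal sketch of Eq. (3), each fingerprint can be represented as the set of its "on" bit positions. The set representation and the convention of returning zero for two empty fingerprints are assumptions of this example; fingerprint generation itself (done with Open Babel in the text) is not reproduced here.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient T = c / (a + b - c) for two sets of on-bit positions."""
    a = len(bits_a)            # bits set in the first fingerprint
    b = len(bits_b)            # bits set in the second fingerprint
    c = len(bits_a & bits_b)   # bits set in both fingerprints
    denominator = a + b - c
    if denominator == 0:
        return 0.0  # assumed convention when both fingerprints are empty
    return c / denominator
```

For instance, fingerprints with on-bits {1, 2, 3} and {2, 3, 4} share two bits out of four distinct ones, so T = 2/(3 + 3 − 2) = 0.5.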

### Protein sequence similarity (identity) analysis

We downloaded the canonical protein sequences of drug targets (proteins) in Homo sapiens from the UniProt database (http://www.uniprot.org/). We calculated the protein sequence similarity SP(a, b) of two drug targets a and b using the Smith–Waterman algorithm50. The Smith–Waterman algorithm performs local sequence alignment by comparing segments of all possible lengths and optimizing the similarity measure to identify similar regions between the canonical sequences of two drug targets. The overall sequence similarity of the targets binding two drugs A and B is determined by Eq. (4) by averaging over all pairs of proteins a and b with \(a \in A\) and \(b \in B\) under the condition \(a \ne b\). This condition ensures that, for drugs with common targets, we do not take into account pairs in which a target would be compared to itself.

$$\langle S_p \rangle = \frac{1}{n_{\mathrm{pairs}}} \sum_{a \in A,\, b \in B,\, a \ne b} S_p(a,b)$$

(4)
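The following sketch illustrates the two ingredients described above: a Smith–Waterman local alignment score and the average over target pairs with a ≠ b. The scoring scheme (match = 2, mismatch = −1, linear gap = −1) and the normalization by the smaller self-alignment score are assumptions chosen to keep the example short; the text does not specify the substitution matrix, gap penalties, or normalization that were actually used.

```python
from itertools import product


def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between s and t (linear gap penalty)."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def normalized_similarity(s, t):
    """Alignment score scaled to [0, 1] by the smaller self-alignment score (assumed)."""
    denom = min(smith_waterman_score(s, s), smith_waterman_score(t, t))
    return smith_waterman_score(s, t) / denom if denom else 0.0


def mean_pairwise(targets_a, targets_b, similarity):
    """Average similarity over all (a, b) with a in A, b in B and a != b."""
    pairs = [(a, b) for a, b in product(targets_a, targets_b) if a != b]
    if not pairs:
        return None  # undefined when the only possible pair is a shared target
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)
```

With toy sequences `{"P1": "ACDE", "P2": "ACDF", "P3": "WWWW"}`, calling `mean_pairwise({"P1"}, {"P1", "P2", "P3"}, ...)` skips the self-pair (P1, P1) and averages only the two remaining pairs.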

### Gene co-expression analysis

We downloaded the RNA-seq data (RPKM values) across 32 tissues from the GTEx V6 release (accessed in April 2016, https://gtexportal.org/). For each tissue, we regarded genes with RPKM ≥ 1 in more than 80% of samples as tissue-expressed genes. To measure the extent to which drug target-coding genes (a and b) associated with the drug-treated diseases are co-expressed, we calculated the Pearson’s correlation coefficient (PCC(a, b)) and the corresponding P-value via F-statistics for each pair of drug target-coding genes a and b across 32 human tissues. In order to reduce the noise of the co-expression analysis, we mapped PCC(a, b) onto the human protein–protein interactome network (Supplementary Methods 2) to build a co-expressed protein–protein interactome network as described previously51. The co-expression similarity of the drug target-coding genes associated with two drugs A and B is computed by averaging the absolute value of PCC(a, b) over all pairs of targets a and b with \(a \in A\) and \(b \in B\) as below:

$$\langle S_{\mathrm{co}} \rangle = \frac{1}{n_{\mathrm{pairs}}} \sum_{a \in A,\, b \in B} |PCC(a,b)|$$

(5)
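A small sketch of Eq. (5) follows, assuming each gene is represented by a vector of expression values with one entry per tissue. The `expression` dictionary and the gene names in the usage note are hypothetical; the tissue-expression filter and the mapping onto the interactome described above are not reproduced.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumed convention for a constant expression profile
    return sxy / sqrt(sxx * syy)


def coexpression_similarity(targets_a, targets_b, expression):
    """Mean |PCC(a, b)| over all pairs with a in A and b in B."""
    values = [
        abs(pearson(expression[a], expression[b]))
        for a in targets_a
        for b in targets_b
    ]
    return sum(values) / len(values) if values else None
```

Because the absolute value is averaged, strongly anti-correlated target pairs contribute as much as strongly correlated ones: with profiles G1 = [1, 2, 3, 4], G3 = [4, 3, 2, 1], and G4 = [1, 3, 2, 4], the pair (G1, G3) contributes 1 and (G1, G4) contributes 0.8.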

### Gene Ontology (GO) similarity analysis

The Gene Ontology (GO) annotation for all drug target-coding genes was downloaded from the website http://www.geneontology.org/. We used three types of experimentally validated or literature-derived evidence: biological process (BP), molecular function (MF), and cellular component (CC), excluding annotations inferred computationally. The semantic comparison of GO annotations offers quantitative ways to compute similarities between genes and gene products. We computed the GO similarity SGO(a, b) for each pair of drug target-coding genes a and b using a graph-based semantic similarity measure algorithm52 implemented in the R package GOSemSim53. The overall GO similarity of the drug target-coding genes binding to two drugs A and B was determined by Eq. (6), averaging over all pairs of drug target-coding genes a and b with \(a \in A\) and \(b \in B\).

$$\langle S_{\mathrm{GO}} \rangle = \frac{1}{n_{\mathrm{pairs}}} \sum_{a \in A,\, b \in B} S_{\mathrm{GO}}(a,b)$$

(6)

### Clinical similarity analysis

Clinical similarities of drug pairs derived from the drug Anatomical Therapeutic Chemical (ATC) classification system codes have been commonly used to predict new drug targets54. The ATC codes for all FDA-approved drugs used in this study were downloaded from the DrugBank database (v4.3)40. The kth level drug clinical similarity (Sk) of drugs A and B is defined via the ATC codes as below.

$$S_k(A,B) = \frac{|ATC_k(A) \cap ATC_k(B)|}{|ATC_k(A) \cup ATC_k(B)|}$$

(7)

where ATCk represents all ATC codes at the kth level. A score Satc(A, B) is used to define the clinical similarity between drugs A and B:

$$S_{atc}(A,B) = \frac{\sum_{k=1}^{n} S_k(A,B)}{n}$$

(8)

where n represents the five levels of ATC codes (ranging from 1 to 5). Note that drugs can have multiple ATC codes. For example, nicotine (a potent parasympathomimetic stimulant) has four different ATC codes: N07BA01, A11HA01, C04AC01, C10AD02. For a drug with multiple ATC codes, the clinical similarity was computed for each ATC code, and then the average clinical similarity was used54.
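Eqs. (7) and (8) can be sketched as follows, using the standard ATC code layout in which levels 1 to 5 correspond to the first 1, 3, 4, 5, and 7 characters of a code. How the per-code similarities are combined for drugs with several codes is only summarized in the text, so averaging over all code pairs, as done here, is an assumed reading.

```python
ATC_PREFIX_LENGTHS = (1, 3, 4, 5, 7)  # characters covered by ATC levels 1 to 5


def level_similarity(codes_a, codes_b, k):
    """Eq. (7): Jaccard overlap of the level-k ATC prefixes of two code lists."""
    length = ATC_PREFIX_LENGTHS[k - 1]
    set_a = {code[:length] for code in codes_a}
    set_b = {code[:length] for code in codes_b}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def code_pair_similarity(code_a, code_b):
    """Eq. (8) for one ATC code per drug: mean of S_k over the five levels."""
    n_levels = len(ATC_PREFIX_LENGTHS)
    total = sum(level_similarity([code_a], [code_b], k) for k in range(1, n_levels + 1))
    return total / n_levels


def clinical_similarity(codes_a, codes_b):
    """Average of Eq. (8) over all code pairs (assumed handling of multiple codes)."""
    scores = [code_pair_similarity(a, b) for a in codes_a for b in codes_b]
    return sum(scores) / len(scores) if scores else 0.0
```

Two codes that agree on the first four levels and differ only at the fifth, such as N07BA01 and N07BA03, score 4/5 = 0.8, whereas codes from different anatomical groups score 0.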

### Comparison with the target set-overlapping approach

In this section, we compared the introduced network-based separation (Eq. (2)) of drugs with overlap measures that are based solely on shared targets, without using the PPI network. Here, we tested two measures to quantify the overlap between the target sets of drug A and drug B:

$$\mathrm{overlap\ coefficient}\;\;C = |A \cap B| / \min(|A|, |B|)$$

(9)

$$\mathrm{Jaccard\text{-}index}\;\;J = |A \cap B| / |A \cup B|$$

(10)

Both values range from 0 to 1: J, C = 0 indicates that the drugs share no common targets. An overlap coefficient C = 1 indicates that one set is a complete subset of the other, whereas a Jaccard-index J = 1 indicates two identical target sets (Supplementary Fig. 4a). Supplementary Figs. 4b and 4c show the distribution of C and J for all 1,955,253 drug pairs. The target-set overlap is low for most drug pairs, and the majority (96.8% = 1,892,455/1,955,253) do not share any targets. To investigate the statistical significance of the observed overlaps, we used a hypergeometric model. The null hypothesis is that drug targets are randomly drawn from the space of all N protein-coding genes in the human interactome. The overlap expected for two target sets A and B is then given by

$$c_{\mathrm{expected}} = \frac{|A| \cdot |B|}{N}$$

(11)

For every observed overlap \(c_{\mathrm{observed}} = |A \cap B|\), we then determined the fold-change

$$fc = \frac{c_{\mathrm{observed}}}{c_{\mathrm{expected}}}$$

(12)

and the P-values for enrichment and depletion (i.e., fewer common targets than expected), based on the hypergeometric distribution.
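The overlap measures of Eqs. (9) to (12) and the hypergeometric P-values can be sketched with the Python standard library. The function names are illustrative, both target sets are assumed to be non-empty, and the one-sided tail definitions (P(X ≥ c) for enrichment, P(X ≤ c) for depletion) are the usual choice rather than something the text spells out.

```python
from math import comb


def overlap_coefficient(a, b):
    """Eq. (9): |A ∩ B| / min(|A|, |B|); assumes both sets are non-empty."""
    return len(a & b) / min(len(a), len(b))


def jaccard_index(a, b):
    """Eq. (10): |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)


def expected_overlap(a, b, n_genes):
    """Eq. (11): expected overlap when targets are drawn at random from N genes."""
    return len(a) * len(b) / n_genes


def fold_change(a, b, n_genes):
    """Eq. (12): observed overlap divided by expected overlap."""
    return len(a & b) / expected_overlap(a, b, n_genes)


def overlap_p_values(a, b, n_genes):
    """Return (P_enrichment, P_depletion) under the hypergeometric null model."""
    observed, size_a, size_b = len(a & b), len(a), len(b)
    lowest = max(0, size_a + size_b - n_genes)   # smallest possible overlap
    highest = min(size_a, size_b)                # largest possible overlap
    total = comb(n_genes, size_b)
    pmf = {
        k: comb(size_a, k) * comb(n_genes - size_a, size_b - k) / total
        for k in range(lowest, highest + 1)
    }
    p_enrichment = sum(p for k, p in pmf.items() if k >= observed)
    p_depletion = sum(p for k, p in pmf.items() if k <= observed)
    return p_enrichment, p_depletion
```

As a toy check with N = 10, A = {1, 2, 3}, and B = {3, 4}: C = 0.5, J = 0.25, the expected overlap is 0.6, and the fold-change is 1/0.6 ≈ 1.67.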

### Network-based separation of drugs

A network-based separation of a drug pair, A and B, is calculated via Eq. (2). We evaluated four other distance measures that take into account the path lengths between two drug target sets: (a) the closest measure, representing the average shortest path length between the targets of drug A and the nearest target of drug B; (b) the shortest measure, representing the average shortest path length among all targets of the drugs; (c) the kernel measure, down-weighting longer paths via an exponential penalty; and (d) the centre measure, representing the shortest path length between the targets of each drug with the greatest closeness centrality among that drug's targets. Given A and B, the sets of drug targets for drugs A and B, and d(a, b), the shortest path length between nodes a and b in the interactome, we define these distance measures as follows:

$$\mathrm{Closest:}\ \langle d_{AB}^{C} \rangle = \frac{1}{\|A\| + \|B\|} \left( \sum_{a \in A} \min_{b \in B} d(a,b) + \sum_{b \in B} \min_{a \in A} d(a,b) \right)$$

(13)

$$\mathrm{Shortest:}\ \langle d_{AB}^{S} \rangle = \frac{1}{\|A\| \times \|B\|} \sum_{a \in A,\, b \in B} d(a,b)$$

(14)

$$\mathrm{Kernel:}\ \langle d_{AB}^{k} \rangle = \frac{-1}{\|A\| + \|B\|} \left( \sum_{a \in A} \ln \sum_{b \in B} \frac{e^{-(d(a,b)+1)}}{\|B\|} + \sum_{b \in B} \ln \sum_{a \in A} \frac{e^{-(d(a,b)+1)}}{\|A\|} \right)$$

(15)

$$\mathrm{Centre:}\ \langle d_{AB}^{cc} \rangle = d(centre_A, centre_B)$$

(16)

where \(centre_B\), the topological centre of B, is defined as follows (\(centre_A\) is defined analogously for A):

$$centre_B = \mathrm{argmin}_{u \in B} \sum_{b \in B} d(b,u)$$

(17)

If \(centre_A\) or \(centre_B\) is not unique, all tied nodes are used to define the centre, and the shortest path lengths to these nodes are averaged.
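The four measures of Eqs. (13) to (17) can be sketched on a toy undirected graph stored as an adjacency dictionary. Breadth-first search stands in for whatever shortest-path routine was actually used, the graph is assumed to be connected, and the helper names are illustrative.

```python
from collections import deque
from math import exp, log


def bfs_lengths(graph, source):
    """Shortest path lengths from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def make_distance(graph, A, B):
    """Return d(u, v) for u, v in A ∪ B; assumes all targets are mutually reachable."""
    lengths = {node: bfs_lengths(graph, node) for node in set(A) | set(B)}
    return lambda u, v: lengths[u][v]


def closest(d, A, B):
    """Eq. (13): mean distance from each target to the nearest target of the other drug."""
    total = sum(min(d(a, b) for b in B) for a in A)
    total += sum(min(d(a, b) for a in A) for b in B)
    return total / (len(A) + len(B))


def shortest(d, A, B):
    """Eq. (14): mean distance over all target pairs."""
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))


def kernel(d, A, B):
    """Eq. (15): exponentially down-weights longer paths."""
    from_a = sum(log(sum(exp(-(d(a, b) + 1)) for b in B) / len(B)) for a in A)
    from_b = sum(log(sum(exp(-(d(a, b) + 1)) for a in A) / len(A)) for b in B)
    return -(from_a + from_b) / (len(A) + len(B))


def centres(d, S):
    """Eq. (17): node(s) of S with minimal summed distance to all nodes of S."""
    totals = {u: sum(d(s, u) for s in S) for u in S}
    best = min(totals.values())
    return [u for u, total in totals.items() if total == best]


def centre(d, A, B):
    """Eq. (16): distance between centres, averaged over tied centre nodes."""
    centres_a, centres_b = centres(d, A), centres(d, B)
    return sum(d(u, v) for u in centres_a for v in centres_b) / (len(centres_a) * len(centres_b))
```

On the path graph 1–2–3–4–5 with A = {1, 2} and B = {4, 5}, the closest measure is 2.5 and the shortest measure is 3.0. Both nodes of each set tie as centre, so the centre measure averages the four pairwise distances and also gives 3.0.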

### Collecting disease-associated genes

We integrated disease–gene annotation data from eight different sources and excluded duplicated entries (Supplementary Note 4). We annotated all protein-coding genes using gene Entrez ID, chromosomal location, and the official gene symbols from the NCBI database55. Each cardiovascular event was defined by MeSH and UMLS vocabularies47. In this study, we built disease-associated gene sets for four types of cardiovascular events: arrhythmia (MeSH ID: D001145), heart failure (MeSH ID: D006333), myocardial infarction (MeSH ID: D009203), and hypertension (MeSH ID: D006973).

### Performance evaluation

We used the area under the receiver operating characteristic (ROC) curve (AUC) to evaluate how well the network proximity discriminates FDA-approved or experimentally validated pairwise combinations from random drug pairs. We computed the true positive rate and false positive rate at different network proximity thresholds to plot the ROC curve. As negative drug pairs are not typically reported in the literature or publicly available databases, we used all unknown drug pairs as negative samples. In addition, to adjust for the size imbalance, we randomly selected as many unknown drug pairs as there were positive samples. We repeated this procedure 100 times and reported the average AUC values to compare the performance of different approaches.
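The balanced evaluation can be sketched as below. The AUC is computed through its rank interpretation (the probability that a positive pair outscores a negative pair, with ties counted as one half), which is equivalent to the area under the ROC curve. Scores are assumed to be oriented so that higher means "more likely a combination"; a proximity measure for which smaller values indicate positives would be negated first. The function names, the fixed seed, and the equal-size sampling are assumptions of this example.

```python
import random


def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def mean_balanced_auc(pos_scores, unknown_scores, repeats=100, seed=0):
    """Average AUC over repeated draws of as many unknown pairs as positives."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    sample_size = min(len(pos_scores), len(unknown_scores))
    aucs = [
        auc(pos_scores, rng.sample(unknown_scores, sample_size))
        for _ in range(repeats)
    ]
    return sum(aucs) / repeats
```

The quadratic pairwise comparison is adequate for a few hundred positives per draw; for much larger sets a sort-based rank computation would be the more efficient choice.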

### Statistical analysis

All statistical analyses were performed using the R package (v3.2.3, http://www.r-project.org/).

### Reporting summary

Further information on experimental design is available in the Nature Research Reporting Summary linked to this article.