CEthreader version 1.0
============================

Directories in this source folder:
  bin
  lib        (includes dssp, ResPRE, deepMSA and psipred4)
  scripts    (buildquery.pl builds a ce file from a FASTA file;
              buildtemplate_profile.pl builds the deepMSA-based profile of a template)
  test       (example)
  test.sh
  readme.txt
  licence.txt

CEthreader contains 3 programs: CEdecomposition, CEthreading and CEthreading_mixsearch.

Overview
CEthreader (Contact Eigenvector-based threader) is a template-based protein structure prediction algorithm guided by contact maps. CEthreader converts the contact map predicted by ResPRE into a set of single-body eigenvectors through eigendecomposition, and then performs dynamic programming on the contact eigenvectors together with secondary structure and sequence profile to identify templates. Combining contacts, secondary structures, and profiles enables CEthreader to detect templates more accurately than profile-based threading.

Reference
W Zheng, Q Wuyun, Y Li, SM Mortuza, C Zhang, R Pearce, J Ruan, Y Zhang. Detecting distant-homology protein structures by aligning deep neural-network based contact maps. PLOS Computational Biology, 15: e1007411 (2019).

################################################################################
1. Installation

You can run the programs directly on a Linux system.

################################################################################
2. Programs instruction

(1). CEdecomposition
This program performs eigendecomposition of a contact map. It is designed to prepare the ce file for the CEthreading program.
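To make the idea of contact map eigendecomposition concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the code used by CEdecomposition: the contact map is taken as a symmetric 0/1 matrix, the helper names (contact_matrix, top_eigenpairs) are hypothetical, and plain power iteration with deflation is used only to keep the sketch free of dependencies. Power iteration can fail when the two largest eigenvalues have equal magnitude and opposite sign, so a real implementation would normally call a symmetric eigensolver instead. The sketch does not write the ce file format.

```python
import math


def contact_matrix(length, contacts):
    """Build a symmetric 0/1 contact matrix from 1-based residue pairs."""
    matrix = [[0.0] * length for _ in range(length)]
    for i, j in contacts:
        matrix[i - 1][j - 1] = 1.0
        matrix[j - 1][i - 1] = 1.0
    return matrix


def mat_vec(matrix, vector):
    """Multiply a square matrix by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def normalize(vector):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0.0 else vector


def top_eigenpairs(matrix, k, iterations=500):
    """Return k (eigenvalue, eigenvector) pairs, largest magnitude first.

    Hypothetical helper: power iteration finds one eigenvector, then the
    matrix is deflated so the next round finds the following one.
    """
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        # Deterministic, non-uniform start vector.
        vec = normalize([1.0 + 0.1 * i for i in range(n)])
        for _ in range(iterations):
            vec = normalize(mat_vec(work, vec))
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        value = sum(x * y for x, y in zip(vec, mat_vec(work, vec)))
        pairs.append((value, vec))
        # Deflation: remove the component that was just found.
        for r in range(n):
            for c in range(n):
                work[r][c] -= value * vec[r] * vec[c]
    return pairs


if __name__ == "__main__":
    # Toy map: three residues, all in contact with each other.
    toy = contact_matrix(3, [(1, 2), (2, 3), (1, 3)])
    for value, vec in top_eigenpairs(toy, 2):
        print(round(value, 6), [round(x, 3) for x in vec])
```

Each eigenvector is defined only up to its sign, so the signs printed by this sketch carry no meaning on their own.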
Usage:
CEdecomposition -i inputtype[q: query t: template] -f fasta [-n native pdb] -s psipred.horiz [-d dssp] -c metapsicov_contact_format[if -n no predicted_contact need] -o outfile -m [linear num or exp num or top num] -mtx psiblastmtx

options:
  -i: input type
      q: query
      t: template
  for query:
  -f: input fasta for query
  -s: psipred horiz output file
  -c: metapsicov format contact file
  -m: choose one of the following three cutoffs, default exp (L: sequence length)
      linear: linear model, keep the top num*L contacts, default num=2
      exp:    exponent model, keep the top L^num contacts, default num=1.2
      top:    keep the top num contacts (fixed number)
  for template:
  -n: native structure for template
  -d: dssp file
  common:
  -mtx: psiblast mtx file
  -o: output file

Example files can be found in the test folder.

Some example commands

For template (native structure)
./bin/CEdecomposition -i t -n ./test/d2ccya_.pdb -d ./test/d2ccya_.dssp -mtx ./test/d2ccya_.mtx -o ./test/native.ce

For query (sequence)
./bin/CEdecomposition -i q -f ./test/d5csma_.fasta -s ./test/d5csma_.horiz -c ./test/d5csma_.meta -mtx ./test/d5csma_.mtx -o ./test/query.ce -m exp 1.2

How to build a ce file from your own selected contacts

For example, take an original contact map in CASP format:
1 19 0 8 0.991
1 18 0 8 0.71
91 103 0 8 0.700
...........
32 54 0 8 0.001
(total 3000 contacts)

You can select any contacts you want, for example (only two contacts, ignoring the confidence scores):
91 103 0 8 0.700
32 54 0 8 0.001

Write these to a file (mycontact.con), then use CEdecomposition to do the eigendecomposition:
CEdecomposition -i q -f fastafile -s psipred.horiz -c mycontact.con -o outfile -m top 2 -mtx psiblastmtx

You will then get an input file based on a contact map that contains only these two contacts. The "-m top" parameter is useful when you build your own contact map. You do not need to change any source code!

--------------------------------------------------------------------------------
(2). CEthreading
This program is a contact-based threading method. It contains two contact map alignment algorithms:
1. Eigen Decomposition Based Alignment
2. Contact Map Alignment

Usage:
CEthreading -q queryfile [-p query profile] -d databasefile -o outbase [options]

options:
  -m: 0,1,2,3. Contact Map Alignment Method.
      0: MapAlign
      1: EigenAlign
      2: EigenProfileAlign
      3: MixAlign (longest run time; runs all of 0,1,2 to get the maximal CMOQ score)
      default=1

EigenAlign/EigenProfileAlign/MixAlign options:
  -eignum: minimal Contact Eigen Vector length, set to min(queryEigNum,eignum,templateEigNum). default=7
  -st: 0,1. Eigen Vector Score function.
      0: inner product, V1*V2
      1: inner product divided by the square of the max vector norm, V1*V2/max(|V1|,|V2|)^2, or 0 if |V1|=|V2|=0
      default=1
  -gt: 0,1. Eigen Contact Gap Penalty Type.
      0: minus the maximum absolute element of the score matrix
      1: user defined
      default=1 with go=-1.0 ge=-0.1
  -go: Negative value. Contact Gap Open Penalty. default=-1.0. Has no effect if -gt is 0!
  -ge: Negative value. Contact Gap Extension Penalty. default=-0.1. Has no effect if -gt is 0!
  -ot: Output format. 0: best. 1: all. default=0

EigenProfileAlign/MixAlign options:
  -pw: [0,1], PsiBlast Profile Weight. default=0.4
  -sw: [0,1], Secondary Structure Weight. default=0.1
  -b: Positive value. Bonus Score for Aligned Pair. default=0.0 if EigenAlign in MixAlign, or 0.1 for EigenProfileAlign/MixAlign
  EigenProfileAlign score=(1.0-pw-sw)*ContactScore+pw*ProfileScore+sw*SecondStructureScore+BonusScore, so pw+sw should be <=1.0

Example command
CEthreading -q ./test/query.ce -p ./test/query.prf -d ./test/native.ce -o ./test/test -m 2

--------------------------------------------------------------------------------
(3). CEthreading_mixsearch
This program is a contact-based threading method with a hybrid threading approach. It contains two contact map alignment algorithms:
1. Eigen Decomposition Based Alignment
2. Contact Map Alignment (Baker's Map_align)

Usage:
CEthreading_mixsearch -q queryfile [-p query profile] -d databasefile -o outbase [options]

options:
  -m: 0,1,2. Contact Map Alignment Method.
      0: MapAlign
      1: EigenAlign
      2: EigenProfileAlign
      default=1

EigenAlign/EigenProfileAlign options: [hybrid threading approach options, greedy search strategy]
  -eignum: minimal Contact Eigen Vector length, set to min(queryEigNum,eignum,templateEigNum). default=7
  -npeignum: np search Contact Eigen Vector length. default=7
      O(time)=2^(npeignum+1)+2*(eignum-npeignum)-2
      if npeignum=0: full greedy search
      if npeignum=eignum: full enumerative search
      otherwise: enumerative search first, then greedy search
  -st: 0,1. Eigen Vector Score function.
      0: inner product, V1*V2
      1: inner product divided by the square of the max vector norm, V1*V2/max(|V1|,|V2|)^2, or 0 if |V1|=|V2|=0
      default=1
  -gt: 0,1. Eigen Contact Gap Penalty Type.
      0: minus the maximum absolute element of the score matrix
      1: user defined
      default=1 with go=-1.0 ge=-0.1
  -go: Negative value. Contact Gap Open Penalty. default=-1.0. Has no effect if -gt is 0!
  -ge: Negative value. Contact Gap Extension Penalty. default=-0.1. Has no effect if -gt is 0!
  -ot: Output format. 0: best. 1: all. default=0

EigenProfileAlign options:
  -pw: [0,1], PsiBlast Profile Weight. default=0.4
  -sw: [0,1], Secondary Structure Weight. default=0.1
  -b: Positive value. Bonus Score for Aligned Pair. default=0.0 if EigenAlign in MixAlign, or 0.1 for EigenProfileAlign/MixAlign
  EigenProfileAlign score=(1.0-pw-sw)*ContactScore+pw*ProfileScore+sw*SecondStructureScore+BonusScore, so pw+sw should be <=1.0

Generally, you can use -eignum and -npeignum to control whether the search is enumerative, mixed or greedy.
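To get a feeling for how -eignum and -npeignum trade run time against search depth, the O(time) formula given for -npeignum can be evaluated directly. The Python sketch below is only an illustration: the function names dp_runs and search_mode are hypothetical, and it assumes that O(time) counts dynamic-programming runs for one query/template pair.

```python
def dp_runs(eignum, npeignum):
    """Evaluate O(time) = 2^(npeignum+1) + 2*(eignum-npeignum) - 2."""
    if not 0 <= npeignum <= eignum:
        raise ValueError("expected 0 <= npeignum <= eignum")
    return 2 ** (npeignum + 1) + 2 * (eignum - npeignum) - 2


def search_mode(eignum, npeignum):
    """Name the search strategy implied by the two options."""
    if npeignum == 0:
        return "greedy"
    if npeignum == eignum:
        return "enumerative"
    return "mixed"


if __name__ == "__main__":
    # (eignum, npeignum) settings to compare; (7, 7) are the defaults.
    for eignum, npeignum in [(7, 0), (10, 4), (7, 7)]:
        print(eignum, npeignum,
              search_mode(eignum, npeignum),
              dp_runs(eignum, npeignum))
```

With the default -eignum 7 -npeignum 7 the formula gives 254, while -eignum 7 -npeignum 0 gives 14, which is why lowering -npeignum matters when many templates are scanned.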
example:
./bin/CEthreading_mixsearch -q ./test/query.ce -p ./test/d5csma_.prf -d ./test/native.ce -o ./test/test_mixsearch -m 2 -eignum 10 -npeignum 4

This command uses 10 eigenvectors for the alignment: the first 4 go through enumerative search and the remaining 6 through greedy search.

  greedy search:      -eignum k -npeignum 0 (or 1), 2k times DP
  mix search:         -eignum n -npeignum k (n>=k), 2^(k+1)-2+2(n-k) times DP
  enumerative search: -eignum k -npeignum k, 2^(k+1)-2 times DP

The greedy search of CEthreading_mixsearch is useful when scanning a large database; you can then use enumerative search (CEthreading) to realign your top templates. In our benchmark test, this approach significantly reduces the search time while keeping the search accuracy.

################################################################################
3. Databases:

(1). For building deep MSA
Users should download three sequence databases:
  Uniclust30 from http://gwdu111.gwdg.de/~compbiol/uniclust/2018_08/
  Uniref90 from https://www.uniprot.org/downloads
  Metaclust from https://metaclust.mmseqs.org/2018_06/
and change the variables "$msa_hhblitsdb", "$msa_jackhmmerdb" and "$msa_hmssearchdb" in scripts/buildquery.pl.

(2). For threading library
You can either download the library from https://zhanglab.dcmb.med.umich.edu/CEthreader/ (you need to download both the CE database and the PDB database if you want to build models or get the aligned CA coordinates) or build your own library.
To build your own library, please check the CEdecomposition command. Starting from the pdb file:
  a. generate the sequence file from the pdb file
  b. build the deepMSA-based profile with scripts/buildtemplate_profile.pl, then keep the *.mtx file
  c. generate the secondary structure file from the pdb file with lib/dssp
  d. generate the *.ce file with CEdecomposition

################################################################################
4. Dependence

We attach "dssp", "psipred-4.01", "deepMSA" and "ResPRE" in the lib folder. To use ResPRE, please read the README file in the ResPRE folder and install python with "numpy", "scipy" and "pytorch", then change the variable "$python" in scripts/buildquery.pl.

################################################################################
5. Output

After you run
CEthreading -q ./test/query.ce -p ./test/query.prf -d ./test/native.ce -o ./test/mytest -m 2
you will get two files: mytest.fasta and mytest.txt

The *.txt file contains the alignment/threading summary.
example
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
d1al3a_  d4n13a_   6                     000                0.4041958
^        ^         ^                     ^                  ^
query    template  best_alignment_index  eigen_vector_sign  contact_map_overlap_score(CMOq)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The *.fasta file contains the sequence alignment information.
example
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
#d1al3a__d4n13a__6
>d1al3a_
TWPDKGSLYVATTHTQARYALPGVIKGFIERYPRVSLHMHQGSPTQIAEAVSKGNADFAIATEAL---HLYDDLVMLPCYHWNRSIVVTPEHPLATKGSVSIEELAQ---------------YPLVTY--------TFGFTGRSELDTAFNRAGLTPRIVFTATDADVIKTYVR
LGLGVGVIASMAVDPVSDPDLVKLDANGIFS------------HSTTKIGFRRSTFLRSYMYDFIQRFAPHLTRDVVDTAVALRSNEDIEAMFKDIKLPEK
>d4n13a_
----EKIVSIGGSTTVSPI-LDEMILRYDKINNNTKVTYDAQGSSVGINGLFNKIYKIAISSRDLTKEEIEQGAKETVFAYDALIFITSPEIKITNITEENLAKILNGEIQNWKQVGGPDAKINFINRDSSSGSYSSIKDLLLNKIFKTHEEAQFRQDGIVVKSNGEVIEKTSL
TPHSIGYIGLGYAKNSIEKGLNILSVNSTYPTKETINSNKYTIKRNLIIVTNY---EDKSVTQFIDFMTSSTGQDIVEEQGFLGIKT--------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

If you have any questions, please contact Wei Zheng (zhengwei@umich.edu or jlspzw139@sina.com).
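
Appendix: reading the *.txt summary in a script
The summary line shown in the Output section can be read back with a few lines of Python when results are post-processed. This is a minimal sketch, assuming each hit is one line of five whitespace-separated fields in the order given in that section. The names ThreadingHit and parse_summary_line are hypothetical and not part of CEthreader. The eigen_vector_sign field is kept as a string so that leading zeros such as "000" are preserved.

```python
from typing import NamedTuple


class ThreadingHit(NamedTuple):
    """One line of the *.txt summary (field names follow the readme)."""
    query: str
    template: str
    best_alignment_index: int
    eigen_vector_sign: str
    cmo_score: float


def parse_summary_line(line):
    """Split one summary line into its five fields."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d" % len(fields))
    query, template, index, sign, score = fields
    return ThreadingHit(query, template, int(index), sign, float(score))


if __name__ == "__main__":
    # Example line copied from the Output section.
    hit = parse_summary_line("d1al3a_ d4n13a_ 6 000 0.4041958")
    print(hit.query, hit.template, hit.cmo_score)
```

Sorting a list of such hits by cmo_score is one simple way to rank templates after scanning a database.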