This Supplementary Dataset is provided as a compressed ZIP file, which contains the following folders:

seqs_PET_MHET_TPA: Folder containing the amino acid sequences of putative PET hydrolases (PETases), MHET hydrolases (MHETases) and TPA-degrading enzymes (TphA1, TphA2, TphA3, TphB) found in metagenomes from temperate (Ocean Microbial Reference Gene Catalog, OM-RGC.v1) and polar marine samples (Polar Marine Reference Gene Catalog, PM-RGC; Antarctic samples collected in Chile Bay, Western Antarctic Peninsula) identified by searches using Hidden Markov Models. The sequences are stored in FASTA files named:
  prot_PETases.fasta
  prot_MHETases.fasta
  prot_TphA1.fasta
  prot_TphA2.fasta
  prot_TphA3.fasta
  prot_TphB.fasta

seqs_MetaGs: Folder containing the amino acid sequences of the 9 selected putative PETases (MetaG1-MetaG9) from a high-confidence PETase-like clade in a maximum likelihood phylogenetic reconstruction of 443 putative PETases from temperate and polar metagenomes. These MetaGs clustered together with known PETases used as reference for the HMM search of putative PETases. The sequences are stored in FASTA files named:
  prot_PETases_selected.fasta: Contains the 9 selected putative PETases with the identifiers from the metagenomic analysis.
  prot_PETases_selected_renamed.fasta: Contains the same selected putative PETases in the same order as the previous file but renamed as MetaG1-MetaG9.
  prot_PETases_selected_SignalP_v5_PrDOS.fasta: Contains the truncated sequences of the MetaG1-MetaG9 enzymes after eliminating signal peptides and disordered regions using SignalP v5 and PrDOS, respectively.
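To illustrate how prot_PETases_selected.fasta and prot_PETases_selected_renamed.fasta relate, the following minimal Python sketch pairs each MetaG name with its original metagenomic identifier by record order. It assumes both files hold identical sequences in the same order, as described above. The helper names (read_fasta, map_metag_ids) and the toy records in the usage example are hypothetical and are not part of the dataset.

```python
from typing import Dict, List, Tuple


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA-formatted text into (identifier, sequence) pairs, keeping file order."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            parts = line[1:].split()
            # Use the first whitespace-delimited token as the identifier.
            header = parts[0] if parts else ""
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def map_metag_ids(original_text: str, renamed_text: str) -> Dict[str, str]:
    """Map renamed identifiers (e.g. MetaG1) to original identifiers by position.

    Assumption: both files list the same sequences in the same order.
    """
    original = read_fasta(original_text)
    renamed = read_fasta(renamed_text)
    if len(original) != len(renamed):
        raise ValueError("The two files contain a different number of records")
    mapping: Dict[str, str] = {}
    for (old_id, old_seq), (new_id, new_seq) in zip(original, renamed):
        if old_seq != new_seq:
            raise ValueError(f"Sequence mismatch between {old_id} and {new_id}")
        mapping[new_id] = old_id
    return mapping


if __name__ == "__main__":
    # Toy records for illustration only; these are not sequences from the dataset.
    toy_original = ">gene_A\nMKT\nAAL\n>gene_B\nMSS\n"
    toy_renamed = ">MetaG1\nMKTAAL\n>MetaG2\nMSS\n"
    print(map_metag_ids(toy_original, toy_renamed))
```

With the real files, the two text arguments would be read from disk, for example with pathlib.Path("prot_PETases_selected.fasta").read_text().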

CF_MetaGs: Folder containing all results from the protein structure predictions of the MetaGs (MetaGN, where N ranges from 1 to 9) selected from the high-confidence PETase-like clade, obtained using ColabFold v1.5.5 with default parameters (1 seed, 5 models, 3 recycles, no templates, alphafold2_ptm model for monomers, MSAs generated with MMseqs2 searching against UniRef100 and the ColabFoldDB environmental database). The folder contains:
  MetaGN_env: Folder with the MMseqs2 results from searching for sequences similar to the query against the UniRef100 and ColabFoldDB environmental databases.
  MetaGN.a3m: Final MSA used in the ColabFold protein structure predictions, in a3m format.
  MetaGN_coverage.png: Plot showing the depth (number of sequences per position) and diversity (sequence identity) of the retrieved sequences in the MSA relative to the query sequence.
  MetaGN_plddt.png: Plot of the predicted local distance difference test (pLDDT) per position for all 5 models, ranked by average pLDDT and predicted Template Modeling score (pTM).
  MetaGN_pae.png: Plots of the Predicted Aligned Error (PAE) for the 5 ranked models.
  MetaGN_predicted_aligned_error_v1.json: PAE matrix in JSON format, including the max_PAE (maximum PAE value in the matrix).
  MetaGN_scores_rank_XXX_alphafold2_ptm_model_Y_seed_000.json: JSON file including all confidence metrics (pLDDT, PAE, max_PAE, pTM) for the predicted structure of the corresponding query sequence, where XXX is the ranking of the model (from 001 to 005) and Y is the model number (from 1 to 5).
  MetaGN_unrelaxed_rank_XXX_alphafold2_ptm_model_Y_seed_000: PDB file of the predicted structure for the corresponding query sequence, where XXX is the ranking of the model (from 001 to 005) and Y is the model number (from 1 to 5).

CF_MetaGs_bestpred: Folder containing the best predicted protein structures (rank 1) in PDB format for the MetaG1-MetaG9 enzymes, ranked by pLDDT and pTM.
The CF_MetaGs_bestpred folder also includes a text file with the pLDDT and pTM values of these predicted structures.

rvET_PETase_clade: Folder containing the real-value Evolutionary Trace (rvET) analysis of 33 of the 37 sequences from the high-confidence PETase-like clade and of the same clade split into Type I (11 sequences) and Type II PETase-like enzymes (22 sequences), using the Universal Evolutionary Trace server. The analysis contains the following folders and files:
  high-confidence_PETase-like_clade.fasta: FASTA file containing all sequences in the high-confidence PETase-like clade.
  high-confidence_PETase-like_clade_final.fasta: FASTA file from which sequences gene_956802, gene_9327, gene_870527, and NODE_94302, all from the Type II clade, were removed because they are substantially shorter (<220 amino acid residues) than the rest of the sequences in the clade (minimum length = 254 amino acid residues; maximum length = 448 amino acid residues).
  STAMP_struct_align: Folder with the results of the structural alignment of the experimental structures of known PETases from Moraxella sp. TA144 (PDB 8SPK), Ideonella sakaiensis 201-F6 (PDB 6EQE), Thermobifida alba AHK119 (PDB 6AID), Thermobifida fusca KW3 (PDB 4CG3), Thermobifida cellulosilytica (PDB 5LUI), Thermomonospora curvata (PDB 7YKO) and the metagenomic leaf-branch compost cutinase LCC (PDB 4EB0) using STAMP, a structural alignment tool available in the MultiSeq extension of VMD v1.9.4. It contains all PDB structures used in the structural alignment and the resulting Multiple Sequence Alignment (MSA) in FASTA format, named profile.fasta.
  clustal_align: Folder with the re-alignment of the sequences in the high-confidence PETase-like clade using the structure-based MSA as a profile for the sequence alignment in Clustal Omega, followed by a quick manual refinement of gappy regions.
  The clustal_align folder contains FASTA files of the resulting MSAs of all sequences in the clade (clustalo_MSA_all.fasta) and of the split between Type I (clustalo_MSA_TypeI.fasta) and Type II (clustalo_MSA_TypeII.fasta) PETase-like sequences. The redundant sequences from the structural alignment that was used as a profile for the re-alignment of the amino acid sequences were removed from the MSAs.
  rvET_results: Folder containing the results of the rvET analysis in the Universal Evolutionary Trace server, using as inputs the Clustal Omega alignments of the whole high-confidence PETase-like clade (rvET_all; MetaG9 was used as reference) and of the split between Type I (rvET_TypeI; MetaG1 was used as reference) and Type II (rvET_TypeII; MetaG9 was used as reference). The ET_myid.ranks file was used for further analysis.
  conservation_analysis.txt: Text file containing a summary of the rvET scores for the catalytic residues, the residues in subsites I and II and the extended loop that comprise the binding site of these putative PETases, and the extra cysteine residues in the active site of Type II enzymes. The scores range from strict conservation (rvET score = 1.00), through substitutions by amino acids with similar physicochemical properties (rvET < 1.50), to higher amino acid diversity (rvET > 1.50).

MAGs: Folder containing a total of 112 Metagenome-Assembled Genomes (MAGs) from temperate marine samples (OM-RGC.v1; 59 MAGs) and polar marine and ice samples (PM-RGC and Chile Bay; 53 Arctic and Antarctic MAGs), each containing at least one PET degradation pathway enzyme.
The files in the MAGs folder, which comprise the MAGs (.fasta files), the predicted genes (.fasta.fna files) and the corresponding predicted proteins (.fasta.faa files), carry the following prefixes:
  Ant_ICE_2016: Chile Bay sampling station, glacier ice sample (3 MAGs)
  Ant_P3_2_2016: Chile Bay sampling station P3 (62°27′6″ S; 59°40′6″ W), year 2016, 2 m depth (2 MAGs)
  Ant_P3_2_2018: Chile Bay sampling station P3 (62°27′6″ S; 59°40′6″ W), year 2018, 2 m depth (4 MAGs)
  Ant_P3_2_2019: Chile Bay sampling station P3 (62°27′6″ S; 59°40′6″ W), year 2019, 2 m depth (1 MAG)
  Ant_P3_30_2019: Chile Bay sampling station P3 (62°27′6″ S; 59°40′6″ W), year 2019, 30 m depth (5 MAGs)
  TARA_ANE: OM-RGC.v1, Atlantic Ocean (North East) (8 MAGs)
  TARA_ANW: OM-RGC.v1, Atlantic Ocean (North West) (5 MAGs)
  TARA_ASE: OM-RGC.v1, Atlantic Ocean (South East) (1 MAG)
  TARA_ASW: OM-RGC.v1, Atlantic Ocean (South West) (4 MAGs)
  TARA_ION: OM-RGC.v1, Indian Ocean (North) (1 MAG)
  TARA_IOS: OM-RGC.v1, Indian Ocean (South) (3 MAGs)
  TARA_MED: OM-RGC.v1, Mediterranean Sea (5 MAGs)
  TARA_PON: OM-RGC.v1, Pacific Ocean (North) (4 MAGs)
  TARA_PSE: OM-RGC.v1, Pacific Ocean (South East) (7 MAGs)
  TARA_PSW: OM-RGC.v1, Pacific Ocean (South West) (8 MAGs)
  TARA_RED: OM-RGC.v1, Red Sea (8 MAGs)
  TARA_SOC: OM-RGC.v1, Southern Ocean (5 MAGs)
  Genome_NN: PM-RGC (38 MAGs)
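
The prefix scheme above can be used to check an extracted copy of the MAGs folder against the stated counts. The following Python sketch is a rough illustration under two assumptions: each file name starts with one of the listed prefixes, and each ends in .fasta, .fasta.fna or .fasta.faa. The expected counts are copied from the list above; the function names and the example file names in the usage note are hypothetical, since the exact naming after each prefix is not specified here.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Expected number of MAGs per prefix, as stated in the dataset description.
# "Genome_" stands for the Genome_NN files from PM-RGC.
EXPECTED_MAGS = {
    "Ant_ICE_2016": 3, "Ant_P3_2_2016": 2, "Ant_P3_2_2018": 4,
    "Ant_P3_2_2019": 1, "Ant_P3_30_2019": 5,
    "TARA_ANE": 8, "TARA_ANW": 5, "TARA_ASE": 1, "TARA_ASW": 4,
    "TARA_ION": 1, "TARA_IOS": 3, "TARA_MED": 5, "TARA_PON": 4,
    "TARA_PSE": 7, "TARA_PSW": 8, "TARA_RED": 8, "TARA_SOC": 5,
    "Genome_": 38,
}

# File suffixes and the kind of content each one holds.
SUFFIX_KIND = (
    (".fasta.fna", "genes"),
    (".fasta.faa", "proteins"),
    (".fasta", "genome"),
)


def classify(filename: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (prefix, kind) for a file name, or None where nothing matches."""
    kind = next((k for suffix, k in SUFFIX_KIND if filename.endswith(suffix)), None)
    # Take the longest matching prefix so similar prefixes are not confused.
    prefix = max(
        (p for p in EXPECTED_MAGS if filename.startswith(p)),
        key=len,
        default=None,
    )
    return prefix, kind


def count_mags(filenames: Iterable[str]) -> Counter:
    """Count genome (.fasta) files per prefix, ignoring gene and protein files."""
    counts: Counter = Counter()
    for name in filenames:
        prefix, kind = classify(name)
        if prefix is not None and kind == "genome":
            counts[prefix] += 1
    return counts


def report_mismatches(filenames: Iterable[str]) -> dict:
    """Return {prefix: (found, expected)} for prefixes whose counts differ."""
    found = count_mags(filenames)
    return {
        prefix: (found.get(prefix, 0), expected)
        for prefix, expected in EXPECTED_MAGS.items()
        if found.get(prefix, 0) != expected
    }
```

For an extracted archive, the file names could be collected with something like [p.name for p in pathlib.Path("MAGs").iterdir()] and passed to report_mismatches; an empty result would indicate that all 112 MAGs were found under the expected prefixes.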