S1A serine proteases

This example shows how the S1A serine proteases dataset is organized and how you can use it to test cocoatree’s features.

This dataset is actually composed of two MSAs:
  • the MSA from Halabi et al., (2008)

  • the MSA from Rivoire et al., (2016)

Imports

import numpy as np
from cocoatree.datasets import load_S1A_serine_proteases

import cocoatree.msa as c_msa
import cocoatree.statistics.position as c_pos

Halabi dataset

Use load_S1A_serine_dataset function to load the Halabi dataset

halabi = load_S1A_serine_proteases(paper='halabi')

print(halabi.keys())
dict_keys(['sequence_ids', 'alignment', 'sector_positions', 'metadata', 'pdb_sequence', 'pdb_positions'])
halabi is a dictionary with 6 elements:
  • sequence_ids: a list of the sequence identifiers for the MSA

  • alignment: the aligned sequences

  • sector_positions: a Numpy NpzFile of the sectors’ positions as found in

the Halabi study
  • metadata: a Pandas dataframe containing metadata associated with the

alignment, such as taxonomic information.
  • pdb_sequence: a tuple of the pdb sequence of 3TGI (rat’s trypsin)

  • pdb_positions: a list of the named positions found in the PDB

Access the sequence data

loaded_seqs = halabi["alignment"]
loaded_seqs_id = halabi["sequence_ids"]
print(loaded_seqs[:5])
print(loaded_seqs_id[:5])
print()
print("Number of sequences:", len(loaded_seqs))

n_loaded_pos, n_loaded_seqs = len(loaded_seqs[0]), len(loaded_seqs)
print(f"The loaded MSA has {n_loaded_seqs} sequences and {n_loaded_pos} \
      positions.")
['----------------------------------------------------------------------------------------------------------------------------------IVGG-------------------------------YTCQEN-----------------------------------------SVPYQVSLNS--------------------GYHFCGGSLIN------DQWVVS------------AAHCYKS-----------------------------RIQVRLGEHNIN-----------------------VLEGNEQFVNAAKIIKHPNFDR------------------------KTLNNDIMLIKLSSP-VKLN-----------------------ARVATV----------------------ALP-----------------SSCAP-AG------------TQCLISGWG-NTLSSG---------------------------------------VNEPD----LL-QCLDAPLLP---------------------------QADCEASYP-----------------GKITDNMVCVGFLEGG---------KDSCQGDSG-GPVVCN----------GELQGIVSWGY-------GCALPD-----NPGVYTKVCN------YVDWIQDTIAAN------------------------------------------------------------------------------------------', '-----------------------------------------------------------------------------------------------------------------------------------VGG-------------------------------TEAQRN-----------------------------------------SWPSQISLQYRSG---------------SSWAHTCGGTLIR------QNWVMT------------AAHCVDRE---------------------------LTFRVVVGEHNLN-----------------------QNDGTEQYVGVQKIVVHPYWNTD----------------------DVAAGYDIALLRLAQS-VTLNSY-----------------------VQLGV----------------------LPRAG----------------TILANN------------SPCYITGWGL-TR-------------------TNG---------------------QLAQ----TL-QQAYLPTVD---------------------------YAICSSSSYW---------------GSTVKNSMVCAGGDGV----------RSGCQGDSG-GPLHCLVN------GQYAVHGVTSFVSR-----LGCNVTR-----KPTVFTRVSA------YISWINNVIASN------------------------------------------------------------------------------------------', '----------------------------------------------------------------------------------------------------------------------------------IVGG-------------------------------RRARPH-----------------------------------------AWPFMVSLQLR-------------------GGHFCGATLIA------PNFVMS------------AAHCVANVN-------------------------VRAVRVVLGAHNLS-----------------------RREPTRQVFAVQRIFEDGYDP-------------------------VNLLNDIVILQLNGS-ATINAN-----------------------VQVAQ----------------------LPAQG----------------RRLGNG------------VQCLAMGWG-LLGRNR----------------------------------------GIAS----VL-QELNVTVV----------------------------TSLCR-------------------------RSNVCTLVRGRQ---------AGVCFGDSG-SPLVCN----------GLIHGIASFVRG------GCASGL-----YPDAFAPVAQ------FVNWIDSIIQ--------------------------------------------------------------------------------------------', '----------------------------------------------------------------------------------------------------------------------------------IIGG-------------------------------TESKPH-----------------------------------------SRPYMAYLEIVTS---------------NGPSKFCGGFLIR------RNFVLT------------AAHCAGR-----------------------------SITVTLGAHNIT-----------------------EEEDTWQKLEVIKQFRHPKYNT------------------------STLHHDIMLLKLKEK-ASLTLA-----------------------VGTLP----------------------FPSQF----------------NFVPPG------------RMCRVAGWGRT---------------------GVLK--------------------PGSD----TL-QEVKLRLMD---------------------------PQACSHFRD------------------FDHNLQLCVGNPRKT---------KSAFKGDSG-GPLLCA----------GVAQGIVSYGRS------DAKP--------PAVFTRISH------YRPWINQILQAN------------------------------------------------------------------------------------------', '----------------------------------------------------------------------------------------------------------------------------------IIGG-------------------------------HEAKPH-----------------------------------------SRPYMAYLQIMDE---------------YSGSKKCGGFLIR------EDFVLT------------AAHCSGS-----------------------------KIQVTLGAHNIK-----------------------EQEKMQQIIPVVKIIPHPAYNS------------------------KTISNDIMLLKLKSK-AKRSSA-----------------------VKPLN----------------------LPRRN----------------VKVKPG------------DVCYVAGWGK-LG-------------------PMG---------------------KYSD----TL-QEVELTVQE---------------------------DQKCESYLKN----------------YFDKANEICAGDPKIK---------RASFRGDSG-GPLVCK----------KVAAGIVSYGQN------DGST--------PRAFTKVST------FLSWIKKTMKKS------------------------------------------------------------------------------------------']
['4139558', '7546312', '230004', '4139720', '10835843']

Number of sequences: 1470
The loaded MSA has 1470 sequences and 832       positions.

Filter the MSA

sequences, sequences_id, positions = c_msa.filter_sequences(
    loaded_seqs, loaded_seqs_id, gap_threshold=0.4, seq_threshold=0.2)
n_pos = len(positions)
print(f"After filtering, we have {n_pos} remaining positions.")
print(f"After filtering, we have {len(sequences)} remaining sequences.")
After filtering, we have 242 remaining positions.
After filtering, we have 1456 remaining sequences.

Compute the number of effective sequences

seq_weights, m_eff = c_pos.compute_seq_weights(sequences)
print('Number of effective sequences %d' % np.round(m_eff))
print()
Number of effective sequences 1053

Access the metadata

metadata = halabi["metadata"]
print(metadata)
        Seq_ID  ...         Family
0      4139558  ...        Muridae
1      7546312  ...         Suidae
2       230004  ...      Hominidae
3      4139720  ...      Hominidae
4     10835843  ...        Muridae
...        ...  ...            ...
1465  54641874  ...  Drosophilidae
1466  55236641  ...      Culicidae
1467  55238595  ...      Culicidae
1468  66772117  ...  Drosophilidae
1469  24660124  ...  Drosophilidae

[1470 rows x 14 columns]

The Seq_ID column corresponds to the sequence identifiers found in loaded_seqs_id

Access PDB information

pdb_seq = halabi["pdb_sequence"]
pdb_pos = halabi["pdb_positions"]
print("Start of 3TGI sequence:\n", pdb_seq[:10])
print("Named positions of the first ten residues in 3TGI protein sequence:\n",
      pdb_pos[:10])
Start of 3TGI sequence:
 ('IVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKSRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFDRKTLNNDIMLIKLSSPVKLNARVATVALPSSCAPAGTQCLISGWGNTLSSGVNEPDLLQCLDAPLLPQADCEASYPGKITDNMVCVGFLEGGKDSCQGDSGGPVVCNGELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN',)
Named positions of the first ten residues in 3TGI protein sequence:
 ['16', '17', '18', '19', '20', '21', '22', '23', '24', '25']

Access sector information

sectors = halabi["sector_positions"]
print(sectors)
NpzFile '/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/cocoatree/datasets/data/S1A_serine_proteases/halabi_sectors.npz' with keys: sector_1, sector, sector_3

There are three sectors that have been identified in Halabi et al. (2008)

print("The first sector identified in Halabi et al. is composed of the \
      following positions:", sectors["sector_1"])
The first sector identified in Halabi et al. is composed of the       following positions: [ 17 161 172 176 177 180 183 187 188 189 191 192 213 215 216 220 221 226
 227 228]

To see an example on how to use this dataset for a coevolution analysis, go to: Perform full SCA analysis on the S1A serine protease dataset

For an example on how to visualize cocoatree’s results with this dataset, go to : Plot XCoR together with (phylogenetic) tree and metadata

Rivoire dataset

Similarly, you can access the dataset from Rivoire et al. (2016) paper:

rivoire = load_S1A_serine_proteases(paper='rivoire')
print(rivoire.keys())

loaded_seqs = rivoire["alignment"]
loaded_seqs_id = rivoire["sequence_ids"]
print("Number of sequences:", len(loaded_seqs))

n_loaded_pos, n_loaded_seqs = len(loaded_seqs[0]), len(loaded_seqs)
print(f"The loaded MSA has {n_loaded_seqs} sequences and {n_loaded_pos} \
      positions.")
dict_keys(['sequence_ids', 'alignment', 'sector_positions', 'metadata', 'pdb_sequence', 'pdb_positions'])
Number of sequences: 1390
The loaded MSA has 1390 sequences and 832       positions.

The dataset is organized in the same way as Halabi’s, although in this case there are 6 sectors:

sectors = rivoire["sector_positions"]
print(sorted(sectors))

for sect in sectors:
    print(sect, "is composed of:", len(sectors[sect]), "positions.")
['sector', 'sector_1', 'sector_3', 'sector_4', 'sector_5', 'sector_6']
sector_1 is composed of: 32 positions.
sector is composed of: 22 positions.
sector_3 is composed of: 20 positions.
sector_4 is composed of: 10 positions.
sector_5 is composed of: 6 positions.
sector_6 is composed of: 7 positions.

References

Total running time of the script: (0 minutes 0.946 seconds)

Gallery generated by Sphinx-Gallery