Note

Go to the end to download the full example code or to run this example in your browser via Binder.

S1A serine proteases¶

This example shows how the S1A serine proteases dataset is organized and how you can use it to test cocoatree’s features.

This dataset is actually composed of two MSAs:

the MSA from Halabi et al., (2008)
the MSA from Rivoire et al., (2016)

Imports

import numpy as np
from cocoatree.datasets import load_S1A_serine_proteases

import cocoatree.msa as c_msa
import cocoatree.statistics.position as c_pos

Halabi dataset¶

Use load_S1A_serine_dataset function to load the Halabi dataset

halabi = load_S1A_serine_proteases(paper='halabi')

print(halabi.keys())

dict_keys(['sequence_ids', 'alignment', 'sector_positions', 'metadata', 'pdb_sequence', 'pdb_positions'])

halabi is a dictionary with 6 elements:

sequence_ids: a list of the sequence identifiers for the MSA
alignment: the aligned sequences
sector_positions: a Numpy NpzFile of the sectors’ positions as found in

the Halabi study

metadata: a Pandas dataframe containing metadata associated with the

alignment, such as taxonomic information.

pdb_sequence: a tuple of the pdb sequence of 3TGI (rat’s trypsin)
pdb_positions: a list of the named positions found in the PDB

Access the sequence data¶

loaded_seqs = halabi["alignment"]
loaded_seqs_id = halabi["sequence_ids"]
print(loaded_seqs[:5])
print(loaded_seqs_id[:5])
print()
print("Number of sequences:", len(loaded_seqs))

n_loaded_pos, n_loaded_seqs = len(loaded_seqs[0]), len(loaded_seqs)
print(f"The loaded MSA has {n_loaded_seqs} sequences and {n_loaded_pos} \
      positions.")

['----------------------------------------------------------------------------------------------------------------------------------IVGG-------------------------------YTCQEN-----------------------------------------SVPYQVSLNS--------------------GYHFCGGSLIN------DQWVVS------------AAHCYKS-----------------------------RIQVRLGEHNIN-----------------------VLEGNEQFVNAAKIIKHPNFDR------------------------KTLNNDIMLIKLSSP-VKLN-----------------------ARVATV----------------------ALP-----------------SSCAP-AG------------TQCLISGWG-NTLSSG---------------------------------------VNEPD----LL-QCLDAPLLP---------------------------QADCEASYP-----------------GKITDNMVCVGFLEGG---------KDSCQGDSG-GPVVCN----------GELQGIVSWGY-------GCALPD-----NPGVYTKVCN------YVDWIQDTIAAN------------------------------------------------------------------------------------------', '-----------------------------------------------------------------------------------------------------------------------------------VGG-------------------------------TEAQRN-----------------------------------------SWPSQISLQYRSG---------------SSWAHTCGGTLIR------QNWVMT------------AAHCVDRE---------------------------LTFRVVVGEHNLN-----------------------QNDGTEQYVGVQKIVVHPYWNTD----------------------DVAAGYDIALLRLAQS-VTLNSY-----------------------VQLGV----------------------LPRAG----------------TILANN------------SPCYITGWGL-TR-------------------TNG---------------------QLAQ----TL-QQAYLPTVD---------------------------YAICSSSSYW---------------GSTVKNSMVCAGGDGV----------RSGCQGDSG-GPLHCLVN------GQYAVHGVTSFVSR-----LGCNVTR-----KPTVFTRVSA------YISWINNVIASN------------------------------------------------------------------------------------------', '----------------------------------------------------------------------------------------------------------------------------------IVGG-------------------------------RRARPH-----------------------------------------AWPFMVSLQLR-------------------GGHFCGATLIA------PNFVMS------------AAHCVANVN-------------------------VRAVRVVLGAHNLS-----------------------RREPTRQVFAVQRIFEDGYDP-------------------------VNLLNDIVILQLNGS-ATINAN-----------------------VQVAQ----------------------LPAQG----------------RRLGNG------------VQCLAMGWG-LLGRNR----------------------------------------GIAS----VL-QELNVTVV----------------------------TSLCR-------------------------RSNVCTLVRGRQ---------AGVCFGDSG-SPLVCN----------GLIHGIASFVRG------GCASGL-----YPDAFAPVAQ------FVNWIDSIIQ--------------------------------------------------------------------------------------------', '----------------------------------------------------------------------------------------------------------------------------------IIGG-------------------------------TESKPH-----------------------------------------SRPYMAYLEIVTS---------------NGPSKFCGGFLIR------RNFVLT------------AAHCAGR-----------------------------SITVTLGAHNIT-----------------------EEEDTWQKLEVIKQFRHPKYNT------------------------STLHHDIMLLKLKEK-ASLTLA-----------------------VGTLP----------------------FPSQF----------------NFVPPG------------RMCRVAGWGRT---------------------GVLK--------------------PGSD----TL-QEVKLRLMD---------------------------PQACSHFRD------------------FDHNLQLCVGNPRKT---------KSAFKGDSG-GPLLCA----------GVAQGIVSYGRS------DAKP--------PAVFTRISH------YRPWINQILQAN------------------------------------------------------------------------------------------', '----------------------------------------------------------------------------------------------------------------------------------IIGG-------------------------------HEAKPH-----------------------------------------SRPYMAYLQIMDE---------------YSGSKKCGGFLIR------EDFVLT------------AAHCSGS-----------------------------KIQVTLGAHNIK-----------------------EQEKMQQIIPVVKIIPHPAYNS------------------------KTISNDIMLLKLKSK-AKRSSA-----------------------VKPLN----------------------LPRRN----------------VKVKPG------------DVCYVAGWGK-LG-------------------PMG---------------------KYSD----TL-QEVELTVQE---------------------------DQKCESYLKN----------------YFDKANEICAGDPKIK---------RASFRGDSG-GPLVCK----------KVAAGIVSYGQN------DGST--------PRAFTKVST------FLSWIKKTMKKS------------------------------------------------------------------------------------------']
['4139558', '7546312', '230004', '4139720', '10835843']

Number of sequences: 1470
The loaded MSA has 1470 sequences and 832       positions.

Filter the MSA¶

sequences, sequences_id, positions = c_msa.filter_sequences(
    loaded_seqs, loaded_seqs_id, gap_threshold=0.4, seq_threshold=0.2)
n_pos = len(positions)
print(f"After filtering, we have {n_pos} remaining positions.")
print(f"After filtering, we have {len(sequences)} remaining sequences.")

After filtering, we have 242 remaining positions.
After filtering, we have 1456 remaining sequences.

Compute the number of effective sequences¶

seq_weights, m_eff = c_pos.compute_seq_weights(sequences)
print('Number of effective sequences %d' % np.round(m_eff))
print()

Number of effective sequences 1053

Access the metadata¶

metadata = halabi["metadata"]
print(metadata)

        Seq_ID  ...         Family
    4139558  ...        Muridae
    7546312  ...         Suidae
     230004  ...      Hominidae
    4139720  ...      Hominidae
   10835843  ...        Muridae
...        ...  ...            ...
54641874  ...  Drosophilidae
55236641  ...      Culicidae
55238595  ...      Culicidae
66772117  ...  Drosophilidae
24660124  ...  Drosophilidae

[1470 rows x 14 columns]

The Seq_ID column corresponds to the sequence identifiers found in loaded_seqs_id

Access PDB information¶

pdb_seq = halabi["pdb_sequence"]
pdb_pos = halabi["pdb_positions"]
print("Start of 3TGI sequence:\n", pdb_seq[:10])
print("Named positions of the first ten residues in 3TGI protein sequence:\n",
      pdb_pos[:10])

Start of 3TGI sequence:
 ('IVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKSRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFDRKTLNNDIMLIKLSSPVKLNARVATVALPSSCAPAGTQCLISGWGNTLSSGVNEPDLLQCLDAPLLPQADCEASYPGKITDNMVCVGFLEGGKDSCQGDSGGPVVCNGELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN',)
Named positions of the first ten residues in 3TGI protein sequence:
 ['16', '17', '18', '19', '20', '21', '22', '23', '24', '25']

Access sector information¶

sectors = halabi["sector_positions"]
print(sectors)

NpzFile '/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/cocoatree/datasets/data/S1A_serine_proteases/halabi_sectors.npz' with keys: sector_1, sector, sector_3

There are three sectors that have been identified in Halabi et al. (2008)

print("The first sector identified in Halabi et al. is composed of the \
      following positions:", sectors["sector_1"])

The first sector identified in Halabi et al. is composed of the       following positions: [ 17 161 172 176 177 180 183 187 188 189 191 192 213 215 216 220 221 226
 227 228]

To see an example on how to use this dataset for a coevolution analysis, go to: Perform full SCA analysis on the S1A serine protease dataset

For an example on how to visualize cocoatree’s results with this dataset, go to : Plot XCoR together with (phylogenetic) tree and metadata

Rivoire dataset¶

Similarly, you can access the dataset from Rivoire et al. (2016) paper:

rivoire = load_S1A_serine_proteases(paper='rivoire')
print(rivoire.keys())

loaded_seqs = rivoire["alignment"]
loaded_seqs_id = rivoire["sequence_ids"]
print("Number of sequences:", len(loaded_seqs))

n_loaded_pos, n_loaded_seqs = len(loaded_seqs[0]), len(loaded_seqs)
print(f"The loaded MSA has {n_loaded_seqs} sequences and {n_loaded_pos} \
      positions.")

dict_keys(['sequence_ids', 'alignment', 'sector_positions', 'metadata', 'pdb_sequence', 'pdb_positions'])
Number of sequences: 1390
The loaded MSA has 1390 sequences and 832       positions.

The dataset is organized in the same way as Halabi’s, although in this case there are 6 sectors:

sectors = rivoire["sector_positions"]
print(sorted(sectors))

for sect in sectors:
    print(sect, "is composed of:", len(sectors[sect]), "positions.")

['sector', 'sector_1', 'sector_3', 'sector_4', 'sector_5', 'sector_6']
sector_1 is composed of: 32 positions.
sector is composed of: 22 positions.
sector_3 is composed of: 20 positions.
sector_4 is composed of: 10 positions.
sector_5 is composed of: 6 positions.
sector_6 is composed of: 7 positions.

References¶

Total running time of the script: (0 minutes 0.900 seconds)

Gallery generated by Sphinx-Gallery