Note

Go to the end to download the full example code. or to run this example in your browser via Binder

S1A serine proteases¶

Load the dataset

halabi
Number of sequences 1470
The loaded MSA has 1470 sequences and 832         positions.
After filtering, we have 242 remaining positions.
After filtering, we have 1456 remaining sequences.
computing weight of seq 1/1456
computing weight of seq 101/1456
computing weight of seq 201/1456
computing weight of seq 301/1456
computing weight of seq 401/1456
computing weight of seq 501/1456
computing weight of seq 601/1456
computing weight of seq 701/1456
computing weight of seq 801/1456
computing weight of seq 901/1456
computing weight of seq 1001/1456
computing weight of seq 1101/1456
computing weight of seq 1201/1456
computing weight of seq 1301/1456
computing weight of seq 1401/1456
Number of effective sequences 1053

rivoire
Number of sequences 1390
The loaded MSA has 1390 sequences and 832         positions.
After filtering, we have 243 remaining positions.
After filtering, we have 1376 remaining sequences.
computing weight of seq 1/1376
computing weight of seq 101/1376
computing weight of seq 201/1376
computing weight of seq 301/1376
computing weight of seq 401/1376
computing weight of seq 501/1376
computing weight of seq 601/1376
computing weight of seq 701/1376
computing weight of seq 801/1376
computing weight of seq 901/1376
computing weight of seq 1001/1376
computing weight of seq 1101/1376
computing weight of seq 1201/1376
computing weight of seq 1301/1376
Number of effective sequences 1019

import numpy as np
from cocoatree.datasets import load_S1A_serine_proteases

import cocoatree.msa as c_msa
import cocoatree.statistics.position as c_pos


for paper in ["halabi", "rivoire"]:
    print(paper)
    dataset = load_S1A_serine_proteases(paper=paper)

    print("Number of sequences", len(dataset["alignment"]))
    loaded_seqs = dataset["alignment"]
    loaded_seqs_id = dataset["sequence_ids"]
    n_loaded_pos, n_loaded_seqs = len(loaded_seqs[0]), len(loaded_seqs)

    print(f"The loaded MSA has {n_loaded_seqs} sequences and {n_loaded_pos} \
        positions.")

    sequences, sequences_id, positions = c_msa.filter_sequences(
        loaded_seqs, loaded_seqs_id, gap_threshold=0.4, seq_threshold=0.2)
    n_pos = len(positions)
    print(f"After filtering, we have {n_pos} remaining positions.")
    print(f"After filtering, we have {len(sequences)} remaining sequences.")

    seq_weights, m_eff = c_pos.compute_seq_weights(sequences)
    print('Number of effective sequences %d' %
          np.round(m_eff))
    print()

Total running time of the script: (0 minutes 1.479 seconds)

Gallery generated by Sphinx-Gallery