蛋白质序列聚类(有/无 MSA)

问题描述 投票:0回答:2

我有 NGS 数据(仅限唯一克隆),我想使用 Python 语言根据相似性(最好是聚类)对它们进行分组。请查看以下示例序列。还可以隔离图像中圈出的最远的克隆。

注意:我之前已经问过类似的问题,但那是针对相同长度的DNA,因此不需要MSA,但在下面的示例中,我想探索如果没有MSA,我们是否可以执行聚类并找到更远的克隆。

任何帮助将不胜感激。

                **NGS Sample Data is as below**

>3971 
AVTLDESGGGLQTPGGALSLVCKASGFTFSDRGMGWVRQAPGKGLEFVACIENDGSWTAYGSAVKGRATISRDNGQSTVRLQLNNLRAEDTATYYCAKSAGGSLLLTVVILTVGSIDAWGHGT 
>3186 
AVTLGESGGGLQTPGGALSLVCKASGFTFSSHGMAWVRQAPGKGLEFVAGIGNTGSNPNYGAAVKGRATISRDNRQSTVRLQLNNLRAEDTGTYFCAKRAYAASWSGSDRIDAWGHGT 
>3066 
AVTLGESGGGLQTPGGGLSLVCKASGFTFSSFNMFWVRQAPGKGLEYVAGIDNTGSYTAYGAAVKGRATISRDNGQSTLRLQLNNLRAEDTATYYCAKSFDGRYYHSIDGVDAIDAWGHGT 
>3719 
AVTLGESGGGLQTPGGTVSLVCKGSGFDFSSYNMQWVRQAPGKGLEFIAQINGAGSGTNYGPAVKGRATISRDNGQSTVRLQLNNLRAEDTAIYYCAKSYDGRYYHSIDGVDAIDAWGHGT
>127 
AVTLGESGGGLQTPGGALSLVCKGSGFTLSSFNMGWVRQAPGKGLEWVGVISSSGRYTEYGAAVKGRAAISRDDGQSTVRLQLNNLRAEDTAIYFCAKGIGTAYCGSCVGEIDTWGHGT 
>144 
AVTLDESGGGLQTPGGGLSLVCKASGFTFSSHGMGWVRQAPGKGLEFVADISGSGSSTNYGPAVKGRATISRDNGQSTVRLQLNDLRAEDTATYYCAKYVGSIGCGSTAGIDAWGHGT 
>291 
AVTLDESGGGLQTPGGALSLVCKASGFTFSDRGMHWVRQAPGKGLEWVAGIGNSGSGTTYGSAVKGRATISRDNGQSTLRLQLNNLRPEDTATYFCARATCIGSGYCGWGTYRIDAWGHGT 
>381 
AVTLDESGGGLQTPGGALSLVCKASGFTFSRFNMFWVRQAPGKGLEWVAAISSGSSTWYGSAVKGRATISRDNGQSTVRLQLNNLRAEDTGTYYCTKAAGNGYRGWTTYIAGWIDAWGHGT

                       **Desired output is as below**

python-3.x data-science bioinformatics biopython hierarchical-clustering
2个回答
0
投票
from Bio import SeqIO
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# Example protein sequences in a FASTA file
example_fasta = """>144
AVTLGESGGGLQTPGGALSLVCKGSGFTLSSFNMGWVRQAPGKGLEWVGVISSSGRYTEYGAAVKGRAAISRDDGQSTVRLQLNNLRAEDTAIYFCAKGIGTAYCGSCVGEIDTWGHGT
>291
AVTLDESGGGLQTPGGALSLVCKASGFTFSDRGMHWVRQAPGKGLEWVAGIGNSGSGTTYGSAVKGRATISRDNGQSTLRLQLNNLRPEDTATYFCARATCIGSGYCGWGTYRIDAWGHGT
>381
AVTLDESGGGLQTPGGALSLVCKASGFTFSRFNMFWVRQAPGKGLEWVAAISSGSSTWYGSAVKGRATISRDNGQSTVRLQLNNLRAEDTGTYYCTKAAGNGYRGWTTYIAGWIDAWGHGT
>3971
AVTLDESGGGLQTPGGALSLVCKASGFTFSDRGMGWVRQAPGKGLEFVACIENDGSWTAYGSAVKGRATISRDNGQSTVRLQLNNLRAEDTATYYCAKSAGGSLLLTVVILTVGSIDAWGHGT
>3186
AVTLGESGGGLQTPGGALSLVCKASGFTFSSHGMAWVRQAPGKGLEFVAGIGNTGSNPNYGAAVKGRATISRDNRQSTVRLQLNNLRAEDTGTYFCAKRAYAASWSGSDRIDAWGHGT
>3066
AVTLGESGGGLQTPGGGLSLVCKASGFTFSSFNMFWVRQAPGKGLEYVAGIDNTGSYTAYGAAVKGRATISRDNGQSTLRLQLNNLRAEDTATYYCAKSFDGRYYHSIDGVDAIDAWGHGT
>3719
AVTLGESGGGLQTPGGTVSLVCKGSGFDFSSYNMQWVRQAPGKGLEFIAQINGAGSGTNYGPAVKGRATISRDNGQSTVRLQLNNLRAEDTAIYYCAKSYDGRYYHSIDGVDAIDAWGHGT"""

# Save example data to a FASTA file
fasta_file = "example_protein.fasta"
with open(fasta_file, "w") as f:
    f.write(example_fasta)

# Read protein sequences from the FASTA file
sequences = list(SeqIO.parse(fasta_file, "fasta"))
protein_sequences = [str(seq.seq).upper() for seq in sequences]

# Preprocessing and numerical representation of protein sequences
# Replace this with your specific method for encoding protein sequences
def protein_to_num(seq):
    # Example: encoding amino acids as numerical values based on their properties
    amino_acid_properties = {'A': 1, 'R': 2, 'N': 3, 'D': 4, 'C': 5, 'Q': 6, 'E': 7, 'G': 8, 'H': 9, 'I': 10,
                             'L': 11, 'K': 12, 'M': 13, 'F': 14, 'P': 15, 'S': 16, 'T': 17, 'W': 18, 'Y': 19, 'V': 20}
    return np.array([amino_acid_properties.get(aa, 0) for aa in seq])

# Convert protein sequences to numerical representations
protein_data = [protein_to_num(seq) for seq in protein_sequences]

# Find the maximum length of protein sequences
max_len = max(len(seq) for seq in protein_data)

# Pad protein sequences with zeros to make them of equal length
protein_data_padded = [np.pad(seq, (0, max_len - len(seq)), 'constant') for seq in protein_data]

# Convert the list of padded arrays to a 2D numpy array
protein_data_array = np.array(protein_data_padded)

# Handle missing values (NaNs) by replacing them with zeros
protein_data_array[np.isnan(protein_data_array)] = 0

# Perform PCA
pca = PCA(n_components=2)
pca_data = pca.fit_transform(protein_data_array)

# Perform KMeans clustering
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(pca_data)

# Visualize the clusters
colors = ['r', 'g', 'b', 'c', 'm']
for i, label in enumerate(set(kmeans.labels_)):
    cluster_points = np.array([pca_data[j] for j, l in enumerate(kmeans.labels_) if l == label])
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], color=colors[i], label=f'Cluster {label}')

    # Annotate points with sequence identifiers
    for j, seq_label in enumerate(sequences):
        if kmeans.labels_[j] == label:
            plt.annotate(seq_label.id, (pca_data[j, 0], pca_data[j, 1]), textcoords="offset points", xytext=(5,5), ha='right')

plt.legend()
plt.show()

0
投票

我知道您提供的图像是 kmeans 聚类示例。我认为如果您想避免 MSA,这可能是一个不错的选择。您可以在此处

阅读更多相关信息

例如,您可以根据Levenshtein距离

使用kmeans对数据进行聚类
© www.soinside.com 2019 - 2024. All rights reserved.