我有 NGS 数据(仅限唯一克隆),我想使用 Python 语言根据相似性(最好是聚类)对它们进行分组。请查看以下示例序列。还可以隔离图像中圈出的最远的克隆。
注意:我之前已经问过类似的问题,但那是针对相同长度的DNA,因此不需要MSA,但在下面的示例中,我想探索如果没有MSA,我们是否可以执行聚类并找到更远的克隆。
任何帮助将不胜感激。
**NGS Sample Data is as below**
>3971
AVTLDESGGGLQTPGGALSLVCKASGFTFSDRGMGWVRQAPGKGLEFVACIENDGSWTAYGSAVKGRATISRDNGQSTVRLQLNNLRAEDTATYYCAKSAGGSLLLTVVILTVGSIDAWGHGT
>3186
AVTLGESGGGLQTPGGALSLVCKASGFTFSSHGMAWVRQAPGKGLEFVAGIGNTGSNPNYGAAVKGRATISRDNRQSTVRLQLNNLRAEDTGTYFCAKRAYAASWSGSDRIDAWGHGT
>3066
AVTLGESGGGLQTPGGGLSLVCKASGFTFSSFNMFWVRQAPGKGLEYVAGIDNTGSYTAYGAAVKGRATISRDNGQSTLRLQLNNLRAEDTATYYCAKSFDGRYYHSIDGVDAIDAWGHGT
>3719
AVTLGESGGGLQTPGGTVSLVCKGSGFDFSSYNMQWVRQAPGKGLEFIAQINGAGSGTNYGPAVKGRATISRDNGQSTVRLQLNNLRAEDTAIYYCAKSYDGRYYHSIDGVDAIDAWGHGT
>127
AVTLGESGGGLQTPGGALSLVCKGSGFTLSSFNMGWVRQAPGKGLEWVGVISSSGRYTEYGAAVKGRAAISRDDGQSTVRLQLNNLRAEDTAIYFCAKGIGTAYCGSCVGEIDTWGHGT
>144
AVTLDESGGGLQTPGGGLSLVCKASGFTFSSHGMGWVRQAPGKGLEFVADISGSGSSTNYGPAVKGRATISRDNGQSTVRLQLNDLRAEDTATYYCAKYVGSIGCGSTAGIDAWGHGT
>291
AVTLDESGGGLQTPGGALSLVCKASGFTFSDRGMHWVRQAPGKGLEWVAGIGNSGSGTTYGSAVKGRATISRDNGQSTLRLQLNNLRPEDTATYFCARATCIGSGYCGWGTYRIDAWGHGT
>381
AVTLDESGGGLQTPGGALSLVCKASGFTFSRFNMFWVRQAPGKGLEWVAAISSGSSTWYGSAVKGRATISRDNGQSTVRLQLNNLRAEDTGTYYCTKAAGNGYRGWTTYIAGWIDAWGHGT
**Desired output is as below**
from Bio import SeqIO
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
# Example protein sequences in a FASTA file
example_fasta = """>144
AVTLGESGGGLQTPGGALSLVCKGSGFTLSSFNMGWVRQAPGKGLEWVGVISSSGRYTEYGAAVKGRAAISRDDGQSTVRLQLNNLRAEDTAIYFCAKGIGTAYCGSCVGEIDTWGHGT
>291
AVTLDESGGGLQTPGGALSLVCKASGFTFSDRGMHWVRQAPGKGLEWVAGIGNSGSGTTYGSAVKGRATISRDNGQSTLRLQLNNLRPEDTATYFCARATCIGSGYCGWGTYRIDAWGHGT
>381
AVTLDESGGGLQTPGGALSLVCKASGFTFSRFNMFWVRQAPGKGLEWVAAISSGSSTWYGSAVKGRATISRDNGQSTVRLQLNNLRAEDTGTYYCTKAAGNGYRGWTTYIAGWIDAWGHGT
>3971
AVTLDESGGGLQTPGGALSLVCKASGFTFSDRGMGWVRQAPGKGLEFVACIENDGSWTAYGSAVKGRATISRDNGQSTVRLQLNNLRAEDTATYYCAKSAGGSLLLTVVILTVGSIDAWGHGT
>3186
AVTLGESGGGLQTPGGALSLVCKASGFTFSSHGMAWVRQAPGKGLEFVAGIGNTGSNPNYGAAVKGRATISRDNRQSTVRLQLNNLRAEDTGTYFCAKRAYAASWSGSDRIDAWGHGT
>3066
AVTLGESGGGLQTPGGGLSLVCKASGFTFSSFNMFWVRQAPGKGLEYVAGIDNTGSYTAYGAAVKGRATISRDNGQSTLRLQLNNLRAEDTATYYCAKSFDGRYYHSIDGVDAIDAWGHGT
>3719
AVTLGESGGGLQTPGGTVSLVCKGSGFDFSSYNMQWVRQAPGKGLEFIAQINGAGSGTNYGPAVKGRATISRDNGQSTVRLQLNNLRAEDTAIYYCAKSYDGRYYHSIDGVDAIDAWGHGT"""
# Save example data to a FASTA file
fasta_file = "example_protein.fasta"
with open(fasta_file, "w") as f:
f.write(example_fasta)
# Read protein sequences from the FASTA file
sequences = list(SeqIO.parse(fasta_file, "fasta"))
protein_sequences = [str(seq.seq).upper() for seq in sequences]
# Preprocessing and numerical representation of protein sequences
# Replace this with your specific method for encoding protein sequences
def protein_to_num(seq):
# Example: encoding amino acids as numerical values based on their properties
amino_acid_properties = {'A': 1, 'R': 2, 'N': 3, 'D': 4, 'C': 5, 'Q': 6, 'E': 7, 'G': 8, 'H': 9, 'I': 10,
'L': 11, 'K': 12, 'M': 13, 'F': 14, 'P': 15, 'S': 16, 'T': 17, 'W': 18, 'Y': 19, 'V': 20}
return np.array([amino_acid_properties.get(aa, 0) for aa in seq])
# Convert protein sequences to numerical representations
protein_data = [protein_to_num(seq) for seq in protein_sequences]
# Find the maximum length of protein sequences
max_len = max(len(seq) for seq in protein_data)
# Pad protein sequences with zeros to make them of equal length
protein_data_padded = [np.pad(seq, (0, max_len - len(seq)), 'constant') for seq in protein_data]
# Convert the list of padded arrays to a 2D numpy array
protein_data_array = np.array(protein_data_padded)
# Handle missing values (NaNs) by replacing them with zeros
protein_data_array[np.isnan(protein_data_array)] = 0
# Perform PCA
pca = PCA(n_components=2)
pca_data = pca.fit_transform(protein_data_array)
# Perform KMeans clustering
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(pca_data)
# Visualize the clusters
colors = ['r', 'g', 'b', 'c', 'm']
for i, label in enumerate(set(kmeans.labels_)):
cluster_points = np.array([pca_data[j] for j, l in enumerate(kmeans.labels_) if l == label])
plt.scatter(cluster_points[:, 0], cluster_points[:, 1], color=colors[i], label=f'Cluster {label}')
# Annotate points with sequence identifiers
for j, seq_label in enumerate(sequences):
if kmeans.labels_[j] == label:
plt.annotate(seq_label.id, (pca_data[j, 0], pca_data[j, 1]), textcoords="offset points", xytext=(5,5), ha='right')
plt.legend()
plt.show()
我知道您提供的图像是 kmeans 聚类示例。我认为如果您想避免 MSA,这可能是一个不错的选择。您可以在此处
阅读更多相关信息例如,您可以根据Levenshtein距离
使用kmeans对数据进行聚类