How to draw scatter plots between specific column pairs of a DataFrame, in order

Question  Votes: 0  Answers: 1

I have the following DataFrame, merged_dft, and want to draw a scatter plot of two of its columns, e.g. snv vs snv-drg.

samples snv het-hom ti-tv   snv-drg het-hom-drg ti-tv-drg   insertion-drg   deletion-drg    insertion   deletion    ins-del-ratio-drg   ins-del-ratio   Sample_name Sex Superpopulation_code
0   NA20126 4592368 2.14    1.97    4770140 2.26    1.96    523917  536443  472931  494200  0.98    0.96    NA20126 male    AFR
1   NA20127 4699751 2.04    1.97    4918959 2.18    1.97    562430  572733  485645  505302  0.98    0.96    NA20127 female  AFR
2   NA20128 4636463 2.09    1.97    4854107 2.22    1.97    552634  566283  478801  500632  0.98    0.96    NA20128 female  AFR
3   NA20129 4638940 2.11    1.97    4863336 2.23    1.97    552984  565534  478078  499867  0.98    0.96    NA20129 female  AFR
4   NA20274 4339811 2.10    1.96    4554995 2.23    1.96    524046  530728  456420  471116  0.99    0.97    NA20274 female  AFR
.... 
....

--

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import scipy.stats as stats

x = merged_dft['snv']
y = merged_dft['snv-drg']

# Shared axis range so the identity line spans both axes
lineStart = min(x.min(), y.min())
lineEnd = max(x.max(), y.max())

# Set the figure size before the figure is created
plt.rcParams.update({'figure.figsize':(10,8), 'figure.dpi':100})

# Create a scatter plot
sns.scatterplot(data=merged_dft, x='snv', y='snv-drg', hue='Superpopulation_code')

plt.xlabel('NPM')
plt.ylabel('Drgen')
plt.title('Count_SNVs')

# Dashed identity line (y = x)
plt.plot([lineStart, lineEnd], [lineStart, lineEnd], color = 'r', linestyle = 'dashed')
plt.xlim(lineStart, lineEnd)
plt.ylim(lineStart, lineEnd)

r, p = stats.pearsonr(x, y)
plt.annotate('r = {:.2f}'.format(r), xy=(0.1, 0.95), xycoords='axes fraction')

# plt.legend(bbox_to_anchor=(1.025,1), loc='upper left', borderaxespad=0.)


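I want to draw a scatter plot and compute the Pearson correlation for each pair of columns from npm_col and drg_col, in order. I can't get this to work with the code below.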

Example pairs: snv vs snv-drg, het-hom vs het-hom-drg, ti-tv vs ti-tv-drg
# set 1 columns
npm_col = merged_dft[['snv', 'het-hom', 'ti-tv']]
npm_col

# set 2 columns
drg_col = merged_dft[['snv-drg', 'het-hom-drg', 'ti-tv-drg']]
drg_col

--

for i in range(len(npm_col)):
    for j in range(len(drg_col)):
        plt.figure()
        plt.scatter(merged_dft[npm_col], merged_dft[drg_col])
        plt.xlabel(npm_col)
        plt.ylabel(drg_col)
        plt.title(f'Scatter plot between {npm_col} and {drg_col}')
        plt.rcParams.update({'figure.figsize':(10,8), 'figure.dpi':100})
        plt.plot([lineStart, lineEnd], [lineStart, lineEnd], color = 'r', linestyle = 'dashed')
        plt.xlim(lineStart, lineEnd)
        plt.ylim(lineStart, lineEnd)
        # r, p = stats.pearsonr(x, y)
        r, p = stats.pearsonr(merged_dft[npm_col], merged_dft[drg_col])
        plt.annotate('r = {:.2f}'.format(r), xy=(0.1, 0.95), xycoords='axes fraction')
        # plt.legend(bbox_to_anchor=(1.025,1), loc='upper left', borderaxespad=0.)
        plt.show()

Thanks for your help!

python pandas scatter-plot
1 Answer
0 votes

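Thanks for the detailed question. As I understand it, you want to draw a scatter plot and compute the Pearson correlation for each column pair, in order. Here is a Python script that should do what you are looking for: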

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# Define the column pairs
column_pairs = [
    ('snv', 'snv-drg'),
    ('het-hom', 'het-hom-drg'),
    ('ti-tv', 'ti-tv-drg')
]

# Create a plot for each column pair
for npm_col, drg_col in column_pairs:
    plt.figure(figsize=(10, 8))
    
    # Create scatter plot
    sns.scatterplot(data=merged_dft, x=npm_col, y=drg_col, hue='Superpopulation_code')
    
    # Add title and labels
    plt.title(f'{npm_col} vs {drg_col}')
    plt.xlabel(npm_col)
    plt.ylabel(drg_col)
    
    # Add diagonal line
    x_min, x_max = plt.xlim()
    y_min, y_max = plt.ylim()
    line_start = min(x_min, y_min)
    line_end = max(x_max, y_max)
    plt.plot([line_start, line_end], [line_start, line_end], color='r', linestyle='dashed')
    
    # Calculate and add Pearson correlation
    r, p = stats.pearsonr(merged_dft[npm_col], merged_dft[drg_col])
    plt.annotate(f'r = {r:.2f}', xy=(0.1, 0.95), xycoords='axes fraction')
    
    plt.tight_layout()
    plt.show()

This code creates the three plots you asked for (snv vs snv-drg, het-hom vs het-hom-drg, ti-tv vs ti-tv-drg) and computes the Pearson correlation for each pair.
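For reference, the loop in the question iterates over range(len(npm_col)), which counts rows of a DataFrame rather than column names, and nesting the two loops would visit every combination instead of matching pairs. If you prefer to keep two separate lists of names, zip pairs them up in order. Below is a minimal sketch that collects the correlations into one summary table instead of plotting. The names npm_cols, drg_cols and pairwise_pearson are made up for illustration, the small DataFrame only stands in for merged_dft using the five rows shown in the question, and the dropna() call is an assumption about how you would want missing values handled.

```python
import pandas as pd
import scipy.stats as stats

# Stand-in for merged_dft: the five rows shown in the question
merged_dft = pd.DataFrame({
    'snv':         [4592368, 4699751, 4636463, 4638940, 4339811],
    'het-hom':     [2.14, 2.04, 2.09, 2.11, 2.10],
    'ti-tv':       [1.97, 1.97, 1.97, 1.97, 1.96],
    'snv-drg':     [4770140, 4918959, 4854107, 4863336, 4554995],
    'het-hom-drg': [2.26, 2.18, 2.22, 2.23, 2.23],
    'ti-tv-drg':   [1.96, 1.97, 1.97, 1.97, 1.96],
})

# Lists of column NAMES (not DataFrames), in matching order
npm_cols = ['snv', 'het-hom', 'ti-tv']
drg_cols = ['snv-drg', 'het-hom-drg', 'ti-tv-drg']


def pairwise_pearson(df, x_cols, y_cols):
    """Return one row per (x, y) column pair with Pearson r and p-value."""
    rows = []
    # zip walks both lists in lockstep: first with first, second with second, ...
    for x_col, y_col in zip(x_cols, y_cols):
        # Drop rows where either value is missing (assumed handling of NaN)
        pair = df[[x_col, y_col]].dropna()
        r, p = stats.pearsonr(pair[x_col], pair[y_col])
        rows.append({'npm': x_col, 'drg': y_col, 'n': len(pair), 'r': r, 'p': p})
    return pd.DataFrame(rows)


summary = pairwise_pearson(merged_dft, npm_cols, drg_cols)
print(summary.round(3))
```

The same zip(npm_cols, drg_cols) loop can replace column_pairs in the plotting script above if you would rather not write the tuples by hand.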
