The data in the dataset consists purely of characters. For example:
p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
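A full copy of the data can be found in agaricus-lepiota.data, part of the Mushroom dataset in the UCI Machine Learning Repository.
Is there a way to visualize char data with matplotlib directly, rather than having to convert the dataset to numbers?
I am just after any form of visualization, i.e.: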
import pandas as pd

filename = 'mushrooms.csv'
df_mushrooms = pd.read_csv(filename, names = ["Classes", "Cap-Shape", "Cap-Surface", "Cap-Colour", "Bruises", "Odor", "Gill-Attachment", "Gill-Spacing", "Gill-Size", "Gill-Colour", "Stalk-Shape", "Stalk-Root", "Stalk-Surface-Above-Ring", "Stalk-Surface-Below-Ring", "Stalk-Colour-Above-Ring", "Stalk-Colour-Below-Ring", "Veil-Type", "Veil-Colour", "Ring-Number", "Ring-Type", "Spore-Print-Colour", "Population", "Habitat"])
# If there are any entries (rows) with missing values/NaNs, drop the row.
df_mushrooms.dropna(axis = 0, how = 'any', inplace = True)
df_mushrooms.plot.scatter(x = 'Classes', y = 'Cap-Shape')
This can be done, but from a graphing perspective it doesn't really make sense. If you do exactly what you ask, it will look like this:
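I know I shouldn't tell someone how to present their charts, but this one doesn't convey any information to me. The problem is that using the Classes and Cap-Shape fields as your x and y indices will always put the same letter in the same place. There is no variation. Perhaps there is some other field that could be used as an index, with Cap-Shape then used as the marker, but as it stands the plot adds no value. That is just my personal opinion.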
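To make the "no variation" point concrete, here is a minimal sketch that counts how many distinct (Classes, Cap-Shape) pairs occur. It uses only the five sample rows quoted in the question, reduced to the two plotted columns; the names sample and pair_counts are chosen for illustration and are not part of the original code.

```python
import pandas as pd

# The five sample rows quoted in the question, reduced to the two plotted columns.
sample = pd.DataFrame({
    "Classes":   ["p", "e", "e", "p", "e"],
    "Cap-Shape": ["x", "x", "b", "x", "x"],
})

# Every row with the same (Classes, Cap-Shape) pair lands on the same spot,
# so the scatter plot can show at most this many distinct points.
pair_counts = sample.groupby(["Classes", "Cap-Shape"]).size()
print(pair_counts)
print("distinct points:", len(pair_counts))
```

Five rows collapse to three distinct points here. Plotting the counts themselves (for example as a bar chart) would probably say more than the scatter does, although that goes beyond what was asked.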
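To use strings as markers, you can use the "$...$" marker described in matplotlib.markers, but again I have to add a warning: plotting this way is much slower than the conventional approach, because you have to iterate over the rows of your DataFrame.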
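import matplotlib.pyplot as plt

df = df_mushrooms  # the DataFrame loaded in the question
fig, ax = plt.subplots()
# Classes only has 'p' and 'e' as unique values, so map them to 1 and 2 on the x axis
df['Class_Id'] = df.Classes.map(lambda x: 1 if x == 'p' else 2)
# Map each Cap-Shape letter to its position in the alphabet ('a' -> 1, ..., 'z' -> 26)
df['Cap_Val'] = df['Cap-Shape'].map(lambda x: ord(x) - 96)
# One scatter call per row, so that each point can use its own letter as the marker
for idx, row in df.iterrows():
    ax.scatter(x=row.Class_Id, y=row.Cap_Val, marker=r"$ {} $".format(row['Cap-Shape']), color=plt.cm.nipy_spectral(row.Cap_Val / 26))
ax.set_xticks([0, 1, 2, 3])
ax.set_xticklabels(['', 'p', 'e', ''])
# Pin the y tick positions so the letter labels line up with their alphabet positions
ax.set_yticks([0, 5, 10, 15, 20, 25])
ax.set_yticklabels(['', 'e', 'j', 'o', 't', 'y'])
plt.show()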
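Since the row-by-row loop is the slow part, one possible variation is to call scatter once per distinct letter instead of once per row. The sketch below is an assumption on my part rather than part of the original answer: plot_letter_markers is a hypothetical helper, and the demo data is just the five sample rows from the question.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_letter_markers(df):
    """Scatter Classes against Cap-Shape, drawing each point as its letter."""
    fig, ax = plt.subplots()
    class_pos = {"p": 1, "e": 2}  # same x positions as the loop above
    # One scatter call per distinct letter instead of one per row.
    for letter, group in df.groupby("Cap-Shape"):
        cap_val = ord(letter) - 96  # 'a' -> 1, ..., 'z' -> 26
        ax.scatter(
            group["Classes"].map(class_pos),
            [cap_val] * len(group),
            marker=f"${letter}$",
            color=plt.cm.nipy_spectral(cap_val / 26),
        )
    ax.set_xticks([1, 2])
    ax.set_xticklabels(["p", "e"])
    return fig, ax


if __name__ == "__main__":
    # The five sample rows from the question, reduced to the two plotted columns.
    sample = pd.DataFrame({
        "Classes": ["p", "e", "e", "p", "e"],
        "Cap-Shape": ["x", "x", "b", "x", "x"],
    })
    plot_letter_markers(sample)
    plt.show()
```

This draws the same picture with far fewer scatter calls, but it does not change the underlying issue: identical pairs still land on top of each other.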