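I've written a pipeline that sends queries to UniProt, but one of the queries runs into a strange problem. I've reduced it to the small test case below.

I get the expected DataFrame (df) structure (one row and 15 columns, one per field), but when I export it to CSV and open it in Excel, it looks broken. Specifically, instead of one row I get two, with the second row starting partway through the 'Sequence' column of the DataFrame (I give more detail in the comments at the bottom). This is one of 99 queries; the rest are fine. I suspect the problem is in my pd.to_csv call, but I'd appreciate any more detail anyone can provide.

Thanks! Tim
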
import requests
import pandas as pd
import io

def queries_to_table(base, query, organism_id):
    rest_url = base + f'query=(({query})AND(organism_id:{organism_id}))'
    response = requests.get(rest_url)
    if response.status_code == 200:
        return pd.read_csv(io.StringIO(response.text),
                           sep = '\t')
    else:
        raise ValueError(f'The uniprot API returned a status code of {response.status_code}. '
                         'This was not 200 as expected, which may reflect an issue '
                         f'with your query: {query}.\n\nSee here for more '
                         'information: https://www.uniprot.org/help/rest-api-headers. '
                         f'Full url: {rest_url}')

size = 500
fields = 'accession,id,protein_name,gene_names,organism_name,'\
         'length,sequence,go_p,go_c,go,go_f,ft_topo_dom,'\
         'ft_transmem,cc_subcellular_location,ft_intramem'
url_base = f'https://rest.uniprot.org/uniprotkb/search?size={size}&'\
           f'fields={fields}&format=tsv&'
query = '(id:TITIN_HUMAN)'
organism_id = 9606

df = queries_to_table(url_base, query, organism_id)
#-> df looks fine - one row and 15 columns

pd.concat([df]).to_csv('test2_error.csv')
#-> opening in excel this is broken - it splits df['Sequence'] into two rows at
#the junction between 'RLLANAECQEGQSVCFEIRVSGIPPPTLKWEKDG' and
#'PLSLGPNIEIIHEGLDYYALHIRDTLPEDTGYY'. In df['Sequence'], this sequence is joined
#by a 'q' (the below string covers the junction, and has the previously quoted substrings in capitals):
#tdstlrpmfkRLLANAECQEGQSVCFEIRVSGIPPPTLKWEKDGqPLSLGPNIEIIHEGLDYYALHIRDTLPEDTGYYrvtatntags

If I run your code and open the CSV file in a text editor, it has one header row (
,Entry,Entry Name,Protein names,Gene Names,Organism,Length,Sequence,Gene Ontology (biological process),Gene Ontology (cellular component),Gene Ontology (GO),Gene Ontology (molecular function),Topological domain,Transmembrane,Subcellular location [CC],Intramembrane
) and one row of data.
Opening the file in Numbers (the default Mac spreadsheet software) also displays it correctly.
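
The text-editor check can also be done programmatically, which avoids relying on any spreadsheet program. The following is a minimal sketch using only Python's standard csv module; the helper name csv_shape and the demo file are hypothetical stand-ins (the demo row just mimics a single record with one very long 'Sequence' field), so point it at your own test2_error.csv to check the real export.

```python
import csv
import os
import tempfile


def csv_shape(path):
    """Return (number of records, set of per-record field counts)."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    return len(rows), {len(row) for row in rows}


# Hypothetical stand-in for the exported file: a header plus one data row
# whose 'Sequence' field is very long (the length is an illustrative assumption).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["", "Entry Name", "Sequence"])
    writer.writerow([0, "TITIN_HUMAN", "M" * 34350])

n_records, field_counts = csv_shape(demo_path)
print(n_records, field_counts)  # header + one data row, all with 3 fields
```

If the real file reports two records (header plus one data row) with a single consistent field count, the CSV itself is intact and the extra row is being introduced by whatever opens it.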
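IOW, I think Pandas is fine and your code is fine; it's just Excel misbehaving (as usual). If you need an Excel file, use
df.to_excel
...but in my opinion it's best not to use Excel for anything where you care about data integrity.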
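
One possible explanation, which is an assumption on my part and not something confirmed above, is Excel's documented limit of 32,767 characters per cell: a sequence longer than that cannot fit in one cell, which would fit with only this one query out of 99 being affected. The sketch below flags any CSV field over that limit. The helper name long_cells and the demo rows are hypothetical, and the 34,350-character sequence is an illustrative length; compare it against df['Length'] for the real entry.

```python
import csv
import io

EXCEL_CELL_LIMIT = 32767  # Excel's documented maximum characters per cell


def long_cells(csv_text, limit=EXCEL_CELL_LIMIT):
    """Return (row_index, column_name, length) for every field longer than limit."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    hits = []
    for row_index, row in enumerate(reader):
        for name, value in zip(header, row):
            if len(value) > limit:
                hits.append((row_index, name, len(value)))
    return hits


# Hypothetical demo data: one over-long sequence and one short one.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Entry Name", "Sequence"])
writer.writerow(["TITIN_HUMAN", "M" * 34350])
writer.writerow(["SHORT_EXAMPLE", "M" * 100])

print(long_cells(buffer.getvalue()))
```

To run this on the real export, pass in the text of the CSV file (for example, the result of reading test2_error.csv). Any row it reports is one that a spreadsheet with this cell limit is likely to mangle, whatever file format is used to get the data there.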