Pandas DataFrame appears broken when written to CSV

Question

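I have written a pipeline that sends queries to UniProt, but one of the queries runs into a strange problem. I have reduced it to the small test case below.

I get the expected DataFrame (df) structure (one row and 15 columns, one per field), but when I export it to CSV and open it in Excel, it looks broken. Specifically, instead of one row I get two, with the second row starting partway through the 'Sequence' column of the DataFrame (I give more detail in the comments at the bottom of the code). This is one of 99 queries, and the rest are all fine. I suspect the problem is in my pd.to_csv call, but I would appreciate it if anyone could provide more detail.

Thanks! Tim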

import requests 
import pandas as pd
import io 

def queries_to_table(base, query, organism_id):
    rest_url = base + f'query=(({query})AND(organism_id:{organism_id}))'
    response = requests.get(rest_url)
    if response.status_code == 200:
        return pd.read_csv(io.StringIO(response.text), 
                           sep = '\t')
    else:
        raise ValueError(f'The uniprot API returned a status code of {response.status_code}.  '\
                         'This was not 200 as expected, which may reflect an issue '\
                         f'with your query:  {query}.\n\nSee here for more '\
                         'information: https://www.uniprot.org/help/rest-api-headers.  '\
                         f'Full url: {rest_url}')

size = 500
fields = 'accession,id,protein_name,gene_names,organism_name,'\
          'length,sequence,go_p,go_c,go,go_f,ft_topo_dom,'\
          'ft_transmem,cc_subcellular_location,ft_intramem'
url_base = f'https://rest.uniprot.org/uniprotkb/search?size={size}&'\
           f'fields={fields}&format=tsv&'
query = '(id:TITIN_HUMAN)'
organism_id = 9606

df = queries_to_table(url_base, query, organism_id)
#-> df looks fine - one row and 15 columns

pd.concat([df]).to_csv('test2_error.csv')
#-> opening in excel this is broken - it splits df['Sequence'] into two rows at 
#the junction between 'RLLANAECQEGQSVCFEIRVSGIPPPTLKWEKDG' and 
#'PLSLGPNIEIIHEGLDYYALHIRDTLPEDTGYY'. In df['Sequence'], this sequence is joined 
#by a 'q' (the below string covers the junction, and has the previously quoted substrings in capitals):
#tdstlrpmfkRLLANAECQEGQSVCFEIRVSGIPPPTLKWEKDGqPLSLGPNIEIIHEGLDYYALHIRDTLPEDTGYYrvtatntags

Answer 1

If I run your code and open the CSV file in a text editor, it has one header row (

,Entry,Entry Name,Protein names,Gene Names,Organism,Length,Sequence,Gene Ontology (biological process),Gene Ontology (cellular component),Gene Ontology (GO),Gene Ontology (molecular function),Topological domain,Transmembrane,Subcellular location [CC],Intramembrane

) and one row of data.

Opening the file in Numbers (the default spreadsheet application on Mac) also displays it correctly.
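The same check can be done without any spreadsheet program. The following is a minimal sketch that uses a synthetic one-row DataFrame rather than the real UniProt response (the accession "X00000" and the repeated sequence are made-up placeholders), writes it to CSV in memory, and confirms that the output has one header line and one data line and that the long field survives a round trip.

```python
import io

import pandas as pd

# Synthetic stand-in for the UniProt result (hypothetical data, not a real entry):
# one row whose 'Sequence' field is 35,000 characters long.
long_sequence = "ACDEFGHIKLMNPQRSTVWY" * 1750
df = pd.DataFrame(
    {
        "Entry": ["X00000"],
        "Length": [len(long_sequence)],
        "Sequence": [long_sequence],
    }
)

# Write the CSV to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer)
csv_text = buffer.getvalue()

# A one-row frame should serialise to a header line plus one data line.
line_count = len(csv_text.splitlines())

# Read it back and compare the long field with the original string.
round_trip = pd.read_csv(io.StringIO(csv_text), index_col=0)
sequence_intact = round_trip.loc[0, "Sequence"] == long_sequence

print(line_count, sequence_intact)
```

If both checks pass on the real file as well, the CSV itself is intact and the extra row is introduced by whatever program opens it.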

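In other words, I think Pandas is fine and your code is fine; it is just Excel misbehaving (as usual). If you need an Excel file, use

df.to_excel

...but in my opinion, it is best not to use Excel for anything where you care about data integrity.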
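One plausible explanation, which this thread does not confirm, is Excel's documented limit of 32,767 characters per cell: a field longer than that cannot fit in a single cell, and titin is an unusually long protein. The sketch below flags cells that exceed the limit. The helper name cells_over_excel_limit and the example data are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Documented maximum number of characters a single Excel cell can hold.
EXCEL_CELL_CHAR_LIMIT = 32767


def cells_over_excel_limit(df, limit=EXCEL_CELL_CHAR_LIMIT):
    """Return (row_label, column, length) for each string cell longer than limit.

    Hypothetical helper; assumes column names are unique.
    """
    oversized = []
    for column in df.columns:
        for row_label, value in df[column].items():
            if isinstance(value, str) and len(value) > limit:
                oversized.append((row_label, column, len(value)))
    return oversized


# Made-up example: the first sequence is too long for one Excel cell.
example = pd.DataFrame(
    {
        "Entry": ["X00000", "X00001"],
        "Sequence": ["A" * 40000, "A" * 100],
    }
)
flagged = cells_over_excel_limit(example)
print(flagged)
```

Because the limit belongs to Excel cells rather than to the CSV format, it may be worth running a check like this before relying on df.to_excel for very long fields too.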
