How do I create and rearrange a new CSV file from an existing file, based on certain conditions?


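I have a CSV file of tweets with 4 columns (user_id, status, tweet_id, tweet_text) and more than 50,000 rows. The first column, user_id, contains 4 unique IDs that repeat throughout the column. The second column, status, is a binary label: each tweet has 0 or 1. The third column is the tweet ID and the fourth is the tweet text.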

The input file is already sorted on two columns, first by tweet_id and then by user_id. It looks like this:

  Sr#,  user_id,  status,  tweet_id,  tweet_text
    1,     3712,       1,    444567,  It is not easy to to do this you know...
    2,     3713,       0,    444567,  It is not easy to to do this you know...
    3,     3714,       1,    444567,  It is not easy to to do this you know...
    4,     3715,       1,    444567,  It is not easy to to do this you know...
    5,     3712,       1,    444572,  The process is yet to start
    6,     3713,       0,    444572,  The process is yet to start
    7,     3714,       0,    444572,  The process is yet to start
    8,     3712,       1,    444580,  I am betting on this
    9,     3714,       0,    444580,  I am betting on this
   10,     3715,       0,    444580,  I am betting on this

  and so on...

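If you look at the first 4 rows, the user_id values differ but the tweet_id and text are the same. The same holds for rows 5, 6 and 7: the user_id differs, but the tweet_id and text are the same.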

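I need to write a new CSV file in which, for each tweet_id and text, every user ID from the first column (4 in this example) becomes a new column, and the status value that user gave the tweet is written under that user's column. If a user has no status value for a tweet, the cell for that user_id is left blank.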

The output file might look like this:

  Sr#,  tweet_text,                                tweet_id,  3712,  3713,  3714,  3715
    1,  It is not easy to to do this you know...,  444567,       1,     0,     1,     1
    2,  The process is yet to start,               444572,       1,     0,     0,
    3,  I am betting on this,                      444580,       1,      ,     0,     0

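My idea was that whenever the tweet_id changes, the tweet_id, the tweet_text and the statuses for the four unique IDs get written to the new file. The code I used is as follows: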

 import csv
 import pandas as pd

 with open('combined_csvFinalSortedClean2.csv', 'w', newline='') as csvfile:
   filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
   filewriter.writerow(['tweet_id','tweet_text', '3712', '3713', '3714', '3714'])

 df = pd.read_csv('combined_csvFinalSortedClean2.csv', sep=',', header=None, index_col=False)

 with open("combined_csvFinalSorted2.csv", "r", encoding="utf-8") as csv_file:
   reader = csv.reader(csv_file, delimiter=',')
   header = next(reader) # get header
   curr_tweet=0
   curr_wid=0
   count=0

   for row in reader:
     wid=row[0]
     id=row[2]

     if (curr_tweet!=id) and (curr_wid!=wid):
      curr_tweet=id
      curr_wid=wid
      count=1
      df[0]=id
      df[1]=row[3]

     if wid==3712:
       df[2]=row[2]
     else: 
       df[2] = None

     if wid==3713:
       df[3]=row[2]
     else: 
       df[3]= None

     if wid==3714:
       df[4]=row[2]
     else: 
       df[4] = None

     if wid==3715:
       df[5]=row[2]
     else: 
       df[5] = None

     df.to_csv('output_file.csv', sep=',', encoding='utf-8', index=False)
     count+=1

     #else:
       #None
       #count+=1

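I tried this, but the problem is that pandas' to_csv writes only the last row to the new output file, and the if ... else conditions write nothing into the four unique ID columns. I would appreciate any help.

Thanks.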

python pandas csv twitter
1 answer

Here is one approach using pivot_table:
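A minimal sketch of how such a pivot_table call might look, assuming the four columns are named user_id, status, tweet_id and tweet_text as in the question. The sample rows are copied from the question; the variable names and the in-memory sample are only illustrative, and on the real data you would read the file with pd.read_csv instead.

```python
import io

import pandas as pd

# Sample rows from the question (assumption: the real file has exactly
# these four columns and a header row with these names).
sample = io.StringIO(
    "user_id,status,tweet_id,tweet_text\n"
    "3712,1,444567,It is not easy to to do this you know...\n"
    "3713,0,444567,It is not easy to to do this you know...\n"
    "3714,1,444567,It is not easy to to do this you know...\n"
    "3715,1,444567,It is not easy to to do this you know...\n"
    "3712,1,444572,The process is yet to start\n"
    "3713,0,444572,The process is yet to start\n"
    "3714,0,444572,The process is yet to start\n"
    "3712,1,444580,I am betting on this\n"
    "3714,0,444580,I am betting on this\n"
    "3715,0,444580,I am betting on this\n"
)
df = pd.read_csv(sample)

# Reshape from long to wide: one row per tweet, one column per user_id.
# Each (tweet, user) pair occurs at most once, so "first" just picks that value.
wide = df.pivot_table(
    index=["tweet_id", "tweet_text"],
    columns="user_id",
    values="status",
    aggfunc="first",
)

# Missing (tweet, user) pairs become NaN, which turns the columns into floats;
# the nullable Int64 dtype keeps 0/1 as integers and leaves the gaps empty.
wide = wide.astype("Int64").reset_index()
wide.columns.name = None
wide.columns = [str(c) for c in wide.columns]

# Missing values are written as empty fields by default.
csv_text = wide.to_csv(index=False)
print(csv_text)
```

To write the result to disk instead of a string, pass a path, for example wide.to_csv("output_file.csv", index=False). A running Sr# column can be added beforehand with wide.insert(0, "Sr#", range(1, len(wide) + 1)) if it is needed.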
