Stuck using BeautifulSoup for scraping while learning. Need some pointers


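I have started learning screen scraping with BeautifulSoup. To begin with, I took a Wikipedia article containing a table in the following format: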

<table class="wikitable sortable jquery-tablesorter">
    <caption></caption>
    <thead>
        <tr>
            <th colspan="2" style="width: 6%;" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Opening</th>
            <th style="width: 20%;" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Title</th>
            <th style="width: 10%;" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Director</th>
            <th style="width: 45%;" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Cast</th>
            <th style="width: 30%;" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Production company</th>
            <th class="unsortable" style="width: 1%;"><abbr title="Reference(s)">Ref.</abbr></th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td rowspan="3" style="text-align: center; background: #77bc83;">
                <b>
                    O<br />
                    C<br />
                    T
                </b>
            </td>
            <td rowspan="1" style="text-align: center; background: #77bc83;"><b>11</b></td>
            <td style="text-align: center;">
                <i><a href="/wiki/Viswam_(film)" title="Viswam (film)">Viswam</a></i>
            </td>
            <td>Sreenu Vaitla</td>
            <td>
                <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" />
                <div class="hlist">
                    <ul>
                        <li><a href="/wiki/Gopichand_(actor)" title="Gopichand (actor)">Gopichand</a></li>
                        <li><a href="/wiki/Kavya_Thapar" title="Kavya Thapar">Kavya Thapar</a></li>
                        <li><a href="/wiki/Vennela_Kishore" title="Vennela Kishore">Vennela Kishore</a></li>
                        <li><a href="/wiki/Sunil" title="Sunil">Sunil</a></li>
                        <li><a href="/wiki/Naresh" title="Naresh">Naresh</a></li>
                    </ul>
                </div>
            </td>
            <td>
                Chitralayam Studios<br />
                People Media Factory
            </td>
            <td style="text-align: center;">
                <sup id="cite_ref-180" class="reference">
                    <a href="#cite_note-180"><span class="cite-bracket">[</span>178<span class="cite-bracket">]</span></a>
                </sup>
            </td>
        </tr>
        <tr>
            <td rowspan="2" style="text-align: center; background: #77bc83;"><b>31</b></td>
            <td style="text-align: center;">
                <i><a href="/wiki/Lucky_Baskhar" title="Lucky Baskhar">Lucky Baskhar</a></i>
            </td>
            <td><a href="/wiki/Venky_Atluri" title="Venky Atluri">Venky Atluri</a></td>
            <td>
                <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" />
                <div class="hlist">
                    <ul>
                        <li><a href="/wiki/Dulquer_Salmaan" title="Dulquer Salmaan">Dulquer Salmaan</a></li>
                        <li><a href="/wiki/Meenakshi_Chaudhary" title="Meenakshi Chaudhary">Meenakshi Chaudhary</a></li>
                    </ul>
                </div>
            </td>
            <td><a href="/wiki/S._Radha_Krishna" title="S. Radha Krishna">Sithara Entertainments</a></td>
            <td style="text-align: center;">
                <sup id="cite_ref-181" class="reference">
                    <a href="#cite_note-181"><span class="cite-bracket">[</span>179<span class="cite-bracket">]</span></a>
                </sup>
            </td>
        </tr>
        <tr>
            <td style="text-align: center;">
                <i><a href="/wiki/Mechanic_Rocky" title="Mechanic Rocky">Mechanic Rocky</a></i>
            </td>
            <td>Ravi Teja Mullapudi</td>
            <td>
                <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" />
                <div class="hlist">
                    <ul>
                        <li><a href="/wiki/Vishwak_Sen" title="Vishwak Sen">Vishwak Sen</a></li>
                        <li><a href="/wiki/Meenakshi_Chaudhary" title="Meenakshi Chaudhary">Meenakshi Chaudhary</a></li>
                    </ul>
                </div>
            </td>
            <td>SRT Entertainments</td>
            <td style="text-align: center;">
                <sup id="cite_ref-182" class="reference">
                    <a href="#cite_note-182"><span class="cite-bracket">[</span>180<span class="cite-bracket">]</span></a>
                </sup>
            </td>
        </tr>
        <tr>
            <td style="text-align: center; background: #77ea83;">
                <b>
                    N<br />
                    O<br />
                    V
                </b>
            </td>
            <td style="text-align: center; background: #77ea83;"><b>9</b></td>
            <td style="text-align: center;">
                <i><a href="/wiki/Telusu_Kada" title="Telusu Kada">Telusu Kada</a></i>
            </td>
            <td><a href="/wiki/Neeraja_Kona" title="Neeraja Kona">Neeraja Kona</a></td>
            <td>
                <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" />
                <div class="hlist">
                    <ul>
                        <li><a href="/wiki/Siddhu_Jonnalagadda" title="Siddhu Jonnalagadda">Siddhu Jonnalagadda</a></li>
                        <li><a href="/wiki/Raashii_Khanna" title="Raashii Khanna">Raashii Khanna</a></li>
                        <li><a href="/wiki/Srinidhi_Shetty" title="Srinidhi Shetty">Srinidhi Shetty</a></li>
                    </ul>
                </div>
            </td>
            <td>People Media Factory</td>
            <td style="text-align: center;">
                <sup id="cite_ref-183" class="reference">
                    <a href="#cite_note-183"><span class="cite-bracket">[</span>181<span class="cite-bracket">]</span></a>
                </sup>
            </td>
        </tr>
        <tr>
            <td rowspan="2" style="text-align: center; background: #f4ca16; textcolor: #000;">
                <b>
                    D<br />
                    E<br />
                    C
                </b>
            </td>
            <td rowspan="1" style="text-align: center; background: #f8de7e;"><b>6</b></td>
            <td style="text-align: center;">
                <i><a href="/wiki/Pushpa_2:_The_Rule" title="Pushpa 2: The Rule">Pushpa 2: The Rule</a></i>
            </td>
            <td><a href="/wiki/Sukumar" title="Sukumar">Sukumar</a></td>
            <td>
                <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" />
                <div class="hlist">
                    <ul>
                        <li><a href="/wiki/Allu_Arjun" title="Allu Arjun">Allu Arjun</a></li>
                        <li><a href="/wiki/Fahadh_Faasil" title="Fahadh Faasil">Fahadh Faasil</a></li>
                        <li><a href="/wiki/Rashmika_Mandanna" title="Rashmika Mandanna">Rashmika Mandanna</a></li>
                    </ul>
                </div>
            </td>
            <td><a href="/wiki/Mythri_Movie_Makers" title="Mythri Movie Makers">Mythri Movie Makers</a></td>
            <td style="text-align: center;">
                <sup id="cite_ref-184" class="reference">
                    <a href="#cite_note-184"><span class="cite-bracket">[</span>182<span class="cite-bracket">]</span></a>
                </sup>
            </td>
        </tr>
        <tr>
            <td rowspan="1" style="text-align: center; background: #f8de7e;"><b>20</b></td>
            <td style="text-align: center;"><i>Robinhood</i></td>
            <td><a href="/wiki/Venky_Kudumula" title="Venky Kudumula">Venky Kudumula</a></td>
            <td>
                <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1129693374" />
                <div class="hlist">
                    <ul>
                        <li><a href="/wiki/Nithiin" title="Nithiin">Nithiin</a></li>
                        <li><a href="/wiki/Sreeleela" title="Sreeleela">Sreeleela</a></li>
                    </ul>
                </div>
            </td>
            <td><a href="/wiki/Mythri_Movie_Makers" title="Mythri Movie Makers">Mythri Movie Makers</a></td>
            <td style="text-align: center;">
                <sup id="cite_ref-185" class="reference">
                    <a href="#cite_note-185"><span class="cite-bracket">[</span>183<span class="cite-bracket">]</span></a>
                </sup>
            </td>
        </tr>
    </tbody>
    <tfoot></tfoot>
</table>

Here is the Python script I wrote:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_page, "html.parser")

tables = soup.find_all("table",{"class":"wikitable sortable"})
headers = ['month','day','movie','director','cast','producer','reference']
movie_tables = []
total_movies = 0
for table in tables:
    caption = table.find("caption")
    if not caption or not caption.get_text().strip():
        movie_tables.append(table)

#captions = soup.find_all("caption")

max_columns = len(headers)

# List to store dictionaries
data_dict_list = []

movies= []
for movie_table in movie_tables:
    table_rows = movie_table.find("tbody").find_all("tr")[1:]
    for table_row in table_rows:
        total_movies += 1
        columns = table_row.find_all('td')
        row_data = [col.get_text(strip=True) for col in columns]
        # If the row has fewer columns than the max, pad it with None
        if len(row_data) == 6:
            row_data.insert(0, None)
        elif len(row_data) == 5:
            row_data.insert(0, None)
            row_data.insert(1, None)
        for col in columns:
            li_tags = col.find_all('li')
            if li_tags:
                cast=""
                for li in li_tags:
                    li_values = li.get_text(strip=True)
                    cast = ', '.join(li_values)

                row_data.append(cast)
            else:
                row_data.append(col.get_text())
        # Create a dictionary mapping headers to row data
        row_dict = dict(zip(headers, row_data))
        
        # Append the dictionary to the list
        data_dict_list.append(row_dict)

# Print the list of dictionaries
for row_dict in data_dict_list:
    print(row_dict)

This is the output I get (only some items are shown here):

{'month': 'OCT', 'day': '11', 'movie': 'Viswam', 'director': 'Sreenu Vaitla', 'cast': 'GopichandKavya ThaparVennela KishoreSunilNaresh', 'producer': 'Chitralayam StudiosPeople Media Factory', 'reference': '[178]'}

{'month': None, 'day': '31', 'movie': 'Lucky Baskhar', 'director': 'Venky Atluri', 'cast': 'Dulquer SalmaanMeenakshi Chaudhary', 'producer': 'Sithara Entertainments', 'reference': '[179]'}

{'month': None, 'day': None, 'movie': 'Mechanic Rocky', 'director': 'Ravi Teja Mullapudi', 'cast': 'Vishwak SenMeenakshi Chaudhary', 'producer': 'SRT Entertainments', 'reference': '[180]'}

{'month': 'NOV', 'day': '9', 'movie': 'Telusu Kada', 'director': 'Neeraja Kona', 'cast': 'Siddhu JonnalagaddaRaashii KhannaSrinidhi Shetty', 'producer': 'People Media Factory', 'reference': '[181]'}

{'month': 'DEC', 'day': '6', 'movie': 'Pushpa 2: The Rule', 'director': 'Sukumar', 'cast': 'Allu ArjunFahadh FaasilRashmika Mandanna', 'producer': 'Mythri Movie Makers', 'reference': '[182]'}

{'month': None, 'day': '20', 'movie': 'Robinhood', 'director': 'Venky Kudumula', 'cast': 'NithiinSreeleela', 'producer': 'Mythri Movie Makers', 'reference': '[183]'}

This is what I want to get (showing only the last item here):

{'month': 'DEC', 'day': '20', 'movie': 'Robinhood', 'director': 'Venky Kudumula', 'cast': 'Nithiin|Sreeleela', 'producer': 'Mythri Movie Makers', 'reference': '[183]'}

I have been trying to debug this for the last day or so, but I can't figure out what is going wrong.

I am expecting:

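  1. The month and day should be filled in for every item, even when those columns span multiple rows and are not present in every row.
  2. Next, I want a separator between the different cast members so that I can easily build charts later.
  3. Also, while doing all of this, how do I extract the hyperlinks and store them under separate keys in the dictionary?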
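
For points 2 and 3, one possible direction is sketched below. It is only a sketch under stated assumptions: the helper names `cell_text` and `cell_links` and the `*_links` keys are hypothetical, and the HTML is a shortened copy of the director and cast cells from the Robinhood row of the table above.

```python
from bs4 import BeautifulSoup

# Shortened copy of two cells from the Robinhood row (director and cast).
html = """
<table><tr>
  <td><a href="/wiki/Venky_Kudumula" title="Venky Kudumula">Venky Kudumula</a></td>
  <td><div class="hlist"><ul>
    <li><a href="/wiki/Nithiin" title="Nithiin">Nithiin</a></li>
    <li><a href="/wiki/Sreeleela" title="Sreeleela">Sreeleela</a></li>
  </ul></div></td>
</tr></table>
"""


def cell_text(cell, sep="|"):
    """Text of a <td>; list items are joined with sep instead of run together."""
    items = cell.find_all("li")
    if items:
        return sep.join(li.get_text(strip=True) for li in items)
    return cell.get_text(strip=True)


def cell_links(cell):
    """All href values found inside a <td>, in document order."""
    return [a["href"] for a in cell.find_all("a", href=True)]


soup = BeautifulSoup(html, "html.parser")
director_cell, cast_cell = soup.find_all("td")

# Hypothetical layout: each text column gets a companion "<name>_links" key.
row = {
    "director": cell_text(director_cell),
    "director_links": cell_links(director_cell),
    "cast": cell_text(cast_cell),
    "cast_links": cell_links(cast_cell),
}
print(row)
```

Keeping the links in their own keys leaves the text columns unchanged for charting. For cells that separate values with `<br />` rather than a list, such as the producer column, `cell.get_text("|", strip=True)` places the separator between the text fragments.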
python web-scraping beautifulsoup
1 Answer

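If you want every row to show a month, you have to fill in the ones that are missing. The empty months all have one thing in common: each is the same as the month of the row before it. You can simply create a variable named lastMonth, assign it the first month, and then compare it with the month of the next row. If the next month is empty, replace it with lastMonth. If it is not empty and differs from lastMonth, update lastMonth to the current month. Repeat this for all of the dictionaries and every month will be filled in.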
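
The carry-forward described above can be sketched as follows. This is a minimal sketch rather than a definitive fix: the helper name `fill_down` is hypothetical, and it assumes the rows are dictionaries shaped like those in `data_dict_list`, with `None` marking a cell that a `rowspan` left out. Applying the same rule to `day` is an extension of the answer, not something it states.

```python
def fill_down(rows, keys=("month", "day")):
    """Replace None in the given keys with the most recent non-empty value."""
    last_seen = {}  # plays the role of lastMonth, one slot per key
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input list is left untouched
        for key in keys:
            if row.get(key) is None:
                # Empty cell: reuse the value from the nearest row above, if any.
                row[key] = last_seen.get(key)
            else:
                # New value: remember it for the rows that follow.
                last_seen[key] = row[key]
        filled.append(row)
    return filled


# Trimmed-down rows taken from the output shown in the question.
rows = [
    {"month": "DEC", "day": "6", "movie": "Pushpa 2: The Rule"},
    {"month": None, "day": "20", "movie": "Robinhood"},
]
filled_rows = fill_down(rows)
print(filled_rows[1])
# {'month': 'DEC', 'day': '20', 'movie': 'Robinhood'}
```

Calling `fill_down(data_dict_list)` after the scraping loop would apply the same rule to the full list.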
