Scraping lyrics with BeautifulSoup

Question    Votes: 0    Answers: 4

Using the Genius API, I retrieved the song URL of a lyrics page. I now want to scrape that page with beautifulsoup4; however, I'm running into an error. Here is the code:

from bs4 import BeautifulSoup
import requests

def scrap_song_url(url):
    page = requests.get(url)
    html = BeautifulSoup(page.text, 'html.parser')
    lyrics = html.find('div', class_='lyrics').get_text()

    return lyrics

Here I'm looking at the HTML of the lyrics page. As an example, take this particular URL: https://genius.com/Acceptance-permanent-lyrics. Exploring the HTML, the lyrics appear to be contained in a div with the class 'lyrics'.

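However, trying to find it with html.find returns None, so .get_text() raises an error. I take this to mean that, for some reason, the HTML tag (or whatever it's called; I really don't know HTML) isn't being found. How can I get the lyrics from the div with class 'lyrics' for a given lyrics URL?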

python html beautifulsoup
4 Answers
3
votes

There is a maintained and nice-looking Python wrapper for the Genius API: LyricsGenius. You should give it a try. Installing it with pip is simple:

pip install lyricsgenius

Judging from its documentation, collecting lyrics looks much easier:

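from lyricsgenius import Genius

token = 'YOUR_GENIUS_ACCESS_TOKEN'  # placeholder: your own Genius API access token
genius = Genius(token)
artist = genius.search_artist('Andy Shauf')  # keep the returned artist object
artist.save_lyrics()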

2
votes

Hmm, I don't think that's where the lyrics are. For that particular page, I did:

lyrics = html.select("div[class*=Lyrics__Container]")

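and got the lyrics (mixed in with a bunch of other HTML). There is a lot of cleanup left to do. The '*=' operator matches every class attribute that contains Lyrics__Container, which covers the classes that start with it; that matters because the name is followed by a string of digits and letters that I think may change.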
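
As a rough illustration of the cleanup step, here is a minimal sketch that runs the same substring selector against a small, made-up HTML snippet rather than the live page. SAMPLE_HTML and extract_lyrics are hypothetical names, and the markup is only an assumption about what such a container might look like.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for a lyrics page (not real Genius markup).
SAMPLE_HTML = """
<div class="Lyrics__Container-sc-abc123-1 xYzAbC">[Verse 1]<br/>First line<br/><a href="#"><span>Second line</span></a></div>
<div class="Lyrics__Container-sc-abc123-1 xYzAbC">[Chorus]<br/>Third line</div>
"""

def extract_lyrics(html_text):
    """Return the text of all matching containers, one string per line, or None."""
    soup = BeautifulSoup(html_text, "html.parser")
    # [class*=...] matches any class attribute containing the substring.
    containers = soup.select('div[class*="Lyrics__Container"]')
    if not containers:
        return None  # avoid calling methods on a missing element
    lines = []
    for container in containers:
        # stripped_strings yields each text fragment without surrounding whitespace.
        lines.extend(container.stripped_strings)
    return "\n".join(lines)

print(extract_lyrics(SAMPLE_HTML))
```

One caveat: stripped_strings splits at every nested tag, so a lyric line that contains inline markup may come back as several fragments.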


0
votes

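Once you have isolated the verse/chorus sections with an attribute selector, you can use stripped_strings to pick out the individual lines. The outer comprehension flattens the nested lists.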

import requests
from bs4 import BeautifulSoup as bs
from pprint import pprint

r = requests.get('https://genius.com/Acceptance-permanent-lyrics')
soup = bs(r.content, 'lxml')
pprint([i for j in [[line for line in verse.stripped_strings] for verse in soup.select('[data-scrolltrigger-pin]')] for i in j])

# pprint('\n'.join([i for j in [[line for line in verse.stripped_strings] for verse in soup.select('[data-scrolltrigger-pin]')] for i in j]))

0
votes

Here is an answer that does not use authentication. First, install the following packages:

pip install requests beautifulsoup4

The following code uses a hard-coded lyrics page:

import requests
from bs4 import BeautifulSoup

# URL of the song lyrics page
url = 'https://genius.com/Drake-gods-plan-lyrics'

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
lyrics_div = soup.find('div', class_='Lyrics__Container-sc-1ynbvzw-1 kUgSbL')

lyrics = lyrics_div.get_text(strip=True) if lyrics_div else "Lyrics not found"
print(lyrics)

I got the class Lyrics__Container-sc-1ynbvzw-1 kUgSbL by inspecting the HTML source. Fortunately, it has the same name across URLs.

Note that the url commonly has the form https://genius.com/<artist_name>-<song_name>-lyrics
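
To make that URL pattern concrete, here is a small sketch that builds such a URL from an artist and song name. build_lyrics_url is a hypothetical helper, and its rules (drop apostrophes, hyphenate everything else, capitalize only the first letter) are assumptions inferred from the two example URLs on this page; they may well not hold for every title.

```python
import re

def build_lyrics_url(artist, song):
    """Guess a lyrics page URL; the slug rules are assumptions, not a documented API."""
    # Drop straight and curly apostrophes so "God's" becomes "Gods".
    raw = f"{artist} {song}".replace("'", "").replace("\u2019", "")
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    # Only the first character is upper-case in the example URLs.
    return f"https://genius.com/{slug.capitalize()}-lyrics"

print(build_lyrics_url("Drake", "God's Plan"))
print(build_lyrics_url("Acceptance", "Permanent"))
```

If a guessed URL does not resolve, looking the song up through the API is probably the safer route.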

The output looks like this:

[Intro]And they wishin' and wishin'And wishin' and wishin', they wishin' on meYeah[Verse 1]I been movin' calm, don't start no trouble with meTryna keep it peaceful is a struggle for meDon't pull up at 6 AM to cuddle with meYou know how I like it when you lovin' on meI don't wanna die for them to miss meYes, I see the things that they wishin' on meHope I got some brothers that outlive meThey gon' tell the story, shit was different with me[Chorus]God's plan, God's planI hold back, sometimes I won't, yeahI feel good, sometimes I don't (Ayy, don't)I finessed down Weston Road (Ayy, 'nessed)Might go down a G-O-D (Yeah, wait)I go hard on Southside G (Yeah, wait)I make sure that north-side eatAnd still[Post-Chorus]Bad thingsIt's a lot of bad things that they wishin' and wishin'And wishin' and wishin', they wishin' on meBad thingsIt's a lot of bad things that they wishin' and wishin'And wishin' and wishin', they wishin' on meYeah, ayy, ayy
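
The lines in that output run together because get_text(strip=True) joins the text fragments with an empty string. A minimal sketch of one way around it is to pass a separator as well; the HTML snippet here is made up for illustration and is not the real page markup.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that reuses the class name from the answer above.
SAMPLE_HTML = (
    '<div class="Lyrics__Container-sc-1ynbvzw-1 kUgSbL">'
    "[Intro]<br/>First line<br/>Second line</div>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
lyrics_div = soup.find("div", class_="Lyrics__Container-sc-1ynbvzw-1 kUgSbL")

# strip=True alone concatenates the fragments with nothing between them.
run_together = lyrics_div.get_text(strip=True)
# Adding a separator places each fragment on its own line.
one_per_line = lyrics_div.get_text(separator="\n", strip=True)

print(run_together)
print(one_per_line)
```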