Difficulties with web scraping

Question

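I just came across an article called "The 500 Greatest Songs of All Time" and thought "oh, that's cool, I bet they also made a Spotify/Apple Music playlist I can follow". Well... they didn't. Long story short, I would like to know whether it is possible to 1) scrape the website to extract the songs and 2) then do some kind of bulk upload to Spotify to create the playlist.

The title and artist of each song are structured on the website like this:

(screenshot of the website)

I have already tried to scrape the page with the importxml() formula in Google Sheets, but without success. I understand that the scraping part is the easier of the two, and since I am new to programming I would be happy to achieve even part of this. I am sure the task can be done easily in Python.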

Answer 1

1. Scraping the songs

I used python3 and selenium; their website does not block this. If necessary, make sure to adjust your chromedriver path, as well as the output path of the .txt file at the bottom. Once the script is done and you have the .txt file, you can close the browser.

    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service

    # adjust the chromedriver path to match your machine
    service = Service(r'/Users/main/Desktop/chromedriver')
    driver = webdriver.Chrome(service=service)

    # just setting some vars, I used XPath because I know that
    top_500 = 'https://www.rollingstone.com/music/music-lists/best-songs-of-all-time-1224767/'
    cookie_button_xpath = "//button[@id='onetrust-accept-btn-handler']"
    div_containing_links_xpath = "//div[@id='pmc-gallery-list-nav-bar-render']//child::a"
    song_names_xpath = "//article[@class='c-gallery-vertical-album']/child::h2"

    links = []
    songs = []

    driver.get(top_500)

    # accept cookies, give the page time to load
    time.sleep(3)
    driver.find_element(By.XPATH, cookie_button_xpath).click()
    time.sleep(1)

    # extract all the links, since there are only 50 songs per page
    for element in driver.find_elements(By.XPATH, div_containing_links_xpath):
        links.append(element.get_attribute('href'))

    # extract the songs, then go to the next page, and so on until we hit 500
    counter = 1  # we start at 1 because links[0] is the page we are already on
    while True:
        for element in driver.find_elements(By.XPATH, song_names_xpath):
            songs.append(element.text)
        # stop at 500 songs, or when there are no more pages left to visit
        if len(songs) >= 500 or counter >= len(links):
            break
        driver.get(links[counter])
        counter += 1
        time.sleep(2)

    # verify that there are no duplicates; if there were, something would be off
    if len(songs) != len(set(songs)):
        print('something went wrong')
    else:
        print('seems fine')

    with open('/Users/main/Desktop/output_songs.txt', 'w') as file:
        file.writelines(line + '\n' for line in songs)
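The duplicate check at the end of the script only reports that something is off, not which entries are affected. A minimal sketch of a more talkative check, assuming the scraped lines are already in a list. The helper name find_duplicates and the sample entries are made up for illustration:

```python
from collections import Counter


def find_duplicates(songs):
    """Return the entries that occur more than once, in first-seen order."""
    counts = Counter(songs)
    return [song for song, count in counts.items() if count > 1]


# hypothetical sample data, not taken from the real list
sample = ["Artist A, 'Song One'", "Artist B, 'Song Two'", "Artist A, 'Song One'"]
duplicates = find_duplicates(sample)
print(duplicates or 'seems fine')
```

An empty result means every line is unique, which matches the 'seems fine' branch of the script.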

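2. Preparing Spotify

- Go to the Spotify developer dashboard and create an account (use your Spotify account).
- Create an app and name it whatever you like.
- In your app, click Settings and whitelist the redirect URI that app.js uses in step 3.
- In your app, click "Users and Access" and add your Spotify account.
- Keep the tab open, we will come back to it later.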

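3. Preparing your environment

- This step relies on Node.js, so make sure it is installed on your computer.
- Download the example project (the one that contains the authorization_code folder used below).
- cd into the folder and run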
```
npm install
```
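- Go to the authorization_code folder and open app.js in your editor.
- Find var scope and append 'playlist-modify-public' to the string, so that your app gets access to your Spotify playlists.
- Now go back to your app in the Spotify developer dashboard. Copy the client ID and the client secret into var client_id and var client_secret respectively (in the app.js file).
- var redirect_uri will be http://localhost:8888/callback. Don't forget to save your changes.
- cd into the authorization_code folder and run app.js with node app.js (this is basically a server running on your PC).
- If that works, leave it running, open the page served by that local server in your browser, and authorize your Spotify account there.
- Copy the complete token, including the part that overflows the page. Use inspect element to get it.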
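For context, the scope string edited in app.js ends up as the scope parameter of the URL the browser is sent to for authorization. A rough Python sketch of how such a URL is put together for Spotify's authorization code flow. The client ID is a placeholder, the first scope is only an example, and build_authorize_url is a made-up helper name that is not part of the example project:

```python
from urllib.parse import urlencode, urlparse, parse_qs

AUTHORIZE_ENDPOINT = 'https://accounts.spotify.com/authorize'


def build_authorize_url(client_id, redirect_uri, scopes):
    """Build the URL a user visits to grant the app the given scopes."""
    params = {
        'response_type': 'code',
        'client_id': client_id,
        'scope': ' '.join(scopes),  # scopes are sent as one space-separated string
        'redirect_uri': redirect_uri,
    }
    return AUTHORIZE_ENDPOINT + '?' + urlencode(params)


url = build_authorize_url(
    client_id='YOUR_CLIENT_ID',  # placeholder
    redirect_uri='http://localhost:8888/callback',
    scopes=['user-read-private', 'playlist-modify-public'],
)
print(url)

# reading the scope back out shows what the app is asking for
print(parse_qs(urlparse(url).query)['scope'])
```

If 'playlist-modify-public' is missing from that list, the token will not be allowed to write to public playlists, so the playlist requests made with it will be rejected.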
In the following Python script, adjust the user_id and auth variables as well as the path to output_songs.txt (at with open), then run it. Songs that could not be found are printed at the end; google those. They are usually on Spotify as well, but Google seems to have the better search algorithm (surprised Pikachu face).

```
import re

import requests

# this is NOT your display name, it's your user name!!
user_id = 'YOUR_USERNAME'

# paste your auth token from spotify; it can time out and then you have to get
# a new one, so don't panic if you get a bunch of responses in the 400s after some time
auth = {"Authorization": "Bearer YOUR_AUTH_KEY_FROM_LOCALHOST"}

playlist = []
err_log = []

base_url = 'https://api.spotify.com/v1'
search_method = '/search'

with open('/Users/main/Desktop/output_songs.txt', 'r') as file:
    songs = [line.strip() for line in file if line.strip()]


# queries spotify, does some magic and then appends each track's spotify uri to a list
def query_song_uris():
    url = base_url + search_method
    for n, entry in enumerate(songs):
        try:
            # assumes entries look like: Artist, 'Title'
            # the title is whatever sits between the first and the last single quote
            title = re.search(r"'(.*)'", entry).group(1)
            # the 4 accounts for the ", '" before the title and the "'" after it
            artist = entry[:len(entry) - len(title) - 4]
            payload = {
                'q': entry,
                'track:': title,
                'artist:': artist,
                'type': 'track',
                'limit': 1
            }
            r = requests.get(url, params=payload, headers=auth)
            print('\nquerying spotify; ', r)
            track_uri = r.json()["tracks"]["items"][0]["uri"]
            playlist.append(track_uri)
            print(track_uri)
        except Exception:
            # no match (or a failed request): remember the entry for the report at the end
            err_log.append(f'\nNr. {len(songs) - n}: {entry}')
    # the file runs from 500 down to 1, so flip the order
    playlist.reverse()


# creates a playlist and returns the playlist id
def create_playlist():
    payload = {
        "name": "Rolling Stone: Top 500 (All Time)",
        "description": "music for old men xD with occasional hip hop appearances. just kidding"
    }
    url = base_url + f'/users/{user_id}/playlists'
    r = requests.post(url, headers=auth, json=payload)
    dic = r.json()
    print(f'\n\ncreating playlist @{dic["id"]}; ', r)
    return dic["id"]


# adds the collected tracks to the playlist, at most 100 per request
def add_to_playlist():
    playlist_id = create_playlist()
    url = base_url + f'/playlists/{playlist_id}/tracks'
    while playlist:
        p = playlist[:100]
        r = requests.post(url, headers=auth, json={"uris": p})
        print(f'\nadding {len(p)} songs to playlist; ', r)
        del playlist[:len(p)]


query_song_uris()
add_to_playlist()

print('\n\ncheck your spotify :)')
print("\n\n\nthese tracks didn't make it, check manually:\n")
for line in err_log:
    print(line)
print('\n\n')
```
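The search request in the script passes the whole line as q. Spotify's search endpoint also documents field filters such as track: and artist:, which are written inside q itself. A small sketch of how a line could be split and turned into such a query, assuming every line looks like Artist, 'Title'. parse_entry and build_query are hypothetical helper names, and whether filtered queries find more of the 500 songs than plain ones is untested:

```python
import re

# artist is everything before the last ", '" that still leaves a quoted title
ENTRY_PATTERN = re.compile(r"^(?P<artist>.+?),\s*'(?P<title>.+)'$")


def parse_entry(entry):
    """Split "Artist, 'Title'" into (artist, title); return None if it does not match."""
    match = ENTRY_PATTERN.match(entry.strip())
    if match is None:
        return None
    return match.group('artist'), match.group('title')


def build_query(artist, title):
    """Write the field filters into the q string itself."""
    return f'track:{title} artist:{artist}'


# hypothetical sample line, including an apostrophe inside the title
parsed = parse_entry("Some Artist, 'Don't Stop'\n")
if parsed:
    artist, title = parsed
    payload = {'q': build_query(artist, title), 'type': 'track', 'limit': 1}
    print(payload)
```

Lines that do not match the assumed shape return None, so they can go straight into the error log instead of raising.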

Done

If you don't want to run the code yourself, here is the playlist: https://open.spotify.com/playlist/5fdLKYNFlA4XSvhEl36KXS

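If you run into problems, everything from step 2 onward is also described in the Web API Quick Start or in the Web API documentation.

About Apple Music

So Apple seems to be pretty closed off (surprise, haha). But I found out that you can query the iTunes Store. The response also contains a direct link to the song on Apple Music. You might be able to go from there.

Related question: Getting ISRC code from iTunes Search API (Apple Music)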
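As a starting point for that idea, a hedged sketch of building an iTunes Search API request for a single song and reading the link out of the response. The endpoint and the term, media, entity and limit parameters come from Apple's public Search API. The helper names and the sample response are made up, and a real response carries many more fields:

```python
from urllib.parse import urlencode

ITUNES_SEARCH = 'https://itunes.apple.com/search'


def build_itunes_search_url(artist, title):
    """Build a search URL that asks for the single best-matching song."""
    params = {'term': f'{artist} {title}', 'media': 'music', 'entity': 'song', 'limit': 1}
    return ITUNES_SEARCH + '?' + urlencode(params)


def first_track_link(response):
    """Return the trackViewUrl of the first result, or None if nothing was found."""
    results = response.get('results', [])
    return results[0].get('trackViewUrl') if results else None


print(build_itunes_search_url('Some Artist', 'Some Song'))

# trimmed-down, made-up response in the shape the API returns
sample_response = {'resultCount': 1, 'results': [{'trackViewUrl': 'https://example.com/track'}]}
print(first_track_link(sample_response))
```

Fetching the URL, for example with requests.get(url).json(), would give a dict that can be passed to first_track_link.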

PS: Regex is witchcraft, admittedly, but you all have my back.
