Web scraping on chartink.com

Question

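Please help me scrape this link: https://chartink.com/screener/time-pass-48. I am trying to scrape the page, but the response does not contain the table I want. Please help me with this.

I have tried the code below, but it does not give me the result I want.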

import requests
from bs4 import BeautifulSoup

URL = 'https://chartink.com/screener/time-pass-48'
page = requests.get(URL)
print(page)

soup = BeautifulSoup(page.content, 'html.parser')
print(soup)

Answer 1

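The data does come from a POST request. You don't need to let JavaScript run. You only need to pick up one cookie (ci_session, which a Session object can hold from the initial landing-page request and pass along with the subsequent POST) and one token (X-CSRF-TOKEN, which can be read from a meta tag in the response to the initial request):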

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

data = {
  'scan_clause': '( {cash} ( monthly rsi( 14 ) > 60 and weekly rsi( 14 ) > 60 and latest rsi( 14 ) > 60 and 1 day ago  rsi( 14 ) <= 60 and latest volume > 100000 ) ) '
}

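with requests.Session() as s:
    # the initial GET stores the ci_session cookie on the session
    r = s.get('https://chartink.com/screener/time-pass-48')
    soup = bs(r.content, 'lxml')
    # send the CSRF token from the page's meta tag with every later request
    s.headers['X-CSRF-TOKEN'] = soup.select_one('[name=csrf-token]')['content']
    # r is already the decoded JSON (a dict) after .json()
    r = s.post('https://chartink.com/screener/process', data=data).json()
    #print(r)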
    df = pd.DataFrame(r['data'])
    print(df)
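
The token step can also be looked at in isolation. Below is a minimal sketch that reads the csrf-token meta tag using only the standard library instead of BeautifulSoup. The helper name extract_csrf_token and the parser class are made up for illustration, and the sketch assumes the page exposes the token as a meta tag with name="csrf-token", as in the code above.

```python
from html.parser import HTMLParser
from typing import Optional


class _CsrfMetaParser(HTMLParser):
    """Collects the content of the first <meta name="csrf-token"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.token: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except the first matching meta tag.
        if tag != "meta" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "csrf-token":
            self.token = attributes.get("content")


def extract_csrf_token(html: str) -> str:
    """Return the CSRF token found in the page HTML, or raise if it is missing."""
    parser = _CsrfMetaParser()
    parser.feed(html)
    if not parser.token:
        raise ValueError("no <meta name='csrf-token'> tag found in the page")
    return parser.token
```

Inside the Session block this could replace the BeautifulSoup lookup, e.g. s.headers['X-CSRF-TOKEN'] = extract_csrf_token(r.text). Raising on a missing tag makes a layout change on the site show up as a clear error rather than a failed POST.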

Answer 2

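You can access the table data by making a POST request. Check the Network tab in Chrome DevTools to see which elements are loaded from elsewhere.

The data in the table is loaded by a POST request to https://chartink.com/screener/process (look for the entry named "process" in the Network tab). As QHarr suggested, you can make the POST request with the requests library.

Alternatively, you can do this with the requests-html library without complicating things, although getting the data straight from the source, for example by making the POST request, would be much faster.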

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://chartink.com/screener/time-pass-48')
# renders javascript
response.html.render()

for result in response.html.xpath('//*[@id="DataTables_Table_0"]/tbody/tr'):
    print(f'{result.text}\n')

# part of the output:
'''
1
Kothari Products Limited
KOTHARIPRO
P&F | F.A
19.96%
106.7
262,997
'''

From there, all that is left is to split() the row text and pick the item you need by its index, for example:

for result in response.html.xpath('//*[@id="DataTables_Table_0"]/tbody/tr'):
    # get the row text, split it on newlines and take the item at index 1 (the stock name)
    # the process is the same for every other column
    stock_name = result.text.split('\n')[1]
    print(stock_name)

# part of the output:
'''
Kothari Products Limited
STEELXIND
Oswal Chemicals & Fertilizers Limited
Hbl Power Systems Limited
'''
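
The same idea extends to a whole row. Here is a rough sketch that splits one row's text into named fields and converts the numeric ones. The column names are hypothetical, guessed from the order of the sample output above, and parse_row is an illustrative helper, not part of requests-html.

```python
from typing import Dict, Union

# Hypothetical column names, inferred from the order of the sample output above.
COLUMNS = ("sr", "name", "symbol", "links", "pct_change", "price", "volume")


def parse_row(row_text: str) -> Dict[str, Union[str, int, float]]:
    """Split one table row's text on newlines and convert the numeric fields."""
    parts = [part.strip() for part in row_text.split("\n")]
    if len(parts) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(parts)}")
    row: Dict[str, Union[str, int, float]] = dict(zip(COLUMNS, parts))
    row["sr"] = int(parts[0])
    row["pct_change"] = float(parts[4].rstrip("%"))    # '19.96%' -> 19.96
    row["price"] = float(parts[5].replace(",", ""))    # strip thousands separators
    row["volume"] = int(parts[6].replace(",", ""))     # '262,997' -> 262997
    return row
```

In the loop above, parse_row(result.text) would return a dict per row, and a list of such dicts can be passed to pandas.DataFrame if a table is wanted. If the site shows a different set of columns, the COLUMNS tuple needs to be adjusted to match.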

Answer 3

import requests
import bs4
page = requests.get("https://chartink.com/screener/time-pass-48")
bs4.BeautifulSoup(page.text,'lxml')

I think this should do it.

Answer 4

I changed the scan clause in data as shown below. However, I get an empty DataFrame! Which part of the code should be changed?

data = {
  'scan_clause': '( {cash} ( latest close > 10 and latest tema(latest close,10) > latest tema( latest close,20) and latest volume > 50000 and market cap > 500) ) '
}

with requests.Session() as s:
    r = s.get('https://chartink.com/screener/tema-swing-buy')
    soup = bs(r.content, 'lxml')
    s.headers['X-CSRF-TOKEN'] = soup.select_one('[name=csrf-token]')['content']
    r = s.post('https://chartink.com/screener/process', data=data).json()
    #print(r)
    df = pd.DataFrame(r['data'])
    print(df)

Answer 5

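I noticed something here: when I copied the value, it came with plus signs (+) in it, as shown below. Before adding it to the scan clause, I replaced each + with ' ' and it worked.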

scan_clause = '(+{cash}+(+latest+max(+5+,+latest+close+)+>+6+days+ago+max(+150+,+latest+close+)+*+1.05+and+latest+volume+>+latest+sma(+volume,5+)+and+latest+close+>+1+day+ago+close+)+)+'
scan_clause = scan_clause.replace('+',' ')
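
A possible alternative, assuming the copied value is ordinary form-encoded text (which is where the + signs would come from), is to decode it with the standard library. urllib.parse.unquote_plus turns + into spaces and also decodes any %XX escapes that might be present. The helper name decode_scan_clause is made up for this sketch.

```python
from urllib.parse import unquote_plus


def decode_scan_clause(copied: str) -> str:
    """Decode a form-encoded scan clause copied from the browser's Network tab."""
    # '+' becomes a space; escapes such as %3E ('>') or %2B ('+') are decoded too.
    return unquote_plus(copied)
```

For a string that contains only + signs and no % escapes, this gives the same result as scan_clause.replace('+', ' ').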

Hope this helps everyone.
