Extracting the abstract / full text of scientific literature given a DOI or title

Question

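There are quite a lot of tools for extracting text from PDF files [1-4]. The problem with most scientific papers, however, is that the PDF itself is hard to get directly, mostly because it sits behind a paywall. There are also tools that give easy access to a paper's information, such as metadata or BibTeX [5-6]. What I want is to take one step further and go beyond BibTeX/metadata:

Assuming there is no direct access to the publication's PDF file, is there any way to obtain at least the abstract of a scientific paper, given its DOI or title? Through my search I found that there have been some attempts at a similar goal [7]. Does anyone know of a website or tool that can help me obtain or extract the abstract or full text of scientific papers? If no such tool exists, can you give me some suggestions on how I should go about solving this problem?

Thank you.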

[1] http://stackoverflow.com/questions/1813427/extracting-information-from-pdfs-of-research-papers
[2] https://stackoverflow.com/questions/6731735/extracting-the-actual-in-text-title-from-a-pdf
[3] http://stackoverflow.com/questions/6731735/extracting-the-actual-in-text-title-from-a-pdf?lq=1
[4] http://stackoverflow.com/questions/14291856/extracting-article-contents-from-pdf-magazines?rq=1
[5] https://stackoverflow.com/questions/10507049/get-metadata-from-doi
[6] https://github.com/venthur/gscholar
[7] https://stackoverflow.com/questions/15768499/extract-text-from-google-scholar

Answer 1

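You can have a look at the CrossRef Text and Data Mining (TDM) service (http://tdmsupport.crossref.org/). This organization provides a RESTful API for free. More than 4000 publishers contribute to this TDM service. You can find some examples at the link below:

https://github.com/crossref/rest-api-doc/blob/master/rest_api_tour.md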

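But to give a very simple example:

If you go to the link http://api.crossref.org/works/10.1080/10260220290013453 you will see that, besides some basic metadata, there are two other fields, namely license and link. The former states under which license the publication is provided, and the latter gives the URL of the full text. In our example, the license field shows a Creative Commons (CC) license, which means the publication is free to use for TDM purposes. By searching CrossRef for publications with CC licenses, you can access hundreds of thousands of publications together with their full texts. From my latest research I can say that Hindawi is the most friendly publisher; they alone provide more than 100K publications with a CC license. One last thing: full texts may be provided in either XML or PDF format. The XML versions are highly structured and thus easy to extract data from.

To sum up, you can automate access to many full texts through the CrossRef TDM service by using their API and simply writing a GET request. If you have further questions, do not hesitate to ask. Cheers.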
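A minimal sketch of that GET request in Python, using only the standard library. It reads the license and link fields described above; the helper names and the "creativecommons.org" check are my own assumptions rather than part of the Crossref API, and a record may carry neither field.

```python
import json
import urllib.parse
import urllib.request

API = "https://api.crossref.org/works/"  # REST endpoint used in the example above


def fetch_work(doi):
    """Fetch the Crossref metadata record (the "message" object) for one DOI."""
    with urllib.request.urlopen(API + urllib.parse.quote(doi), timeout=30) as resp:
        return json.load(resp)["message"]


def license_urls(work):
    """Return the license URLs listed in the record (empty list if none)."""
    return [lic["URL"] for lic in work.get("license", []) if "URL" in lic]


def is_cc_licensed(work):
    """Heuristic (assumption): any license URL on creativecommons.org counts as CC."""
    return any("creativecommons.org" in url for url in license_urls(work))


def full_text_links(work):
    """Return (content-type, URL) pairs from the record's "link" entries."""
    return [(entry.get("content-type", "unspecified"), entry["URL"])
            for entry in work.get("link", []) if "URL" in entry]
```

Calling fetch_work("10.1080/10260220290013453") and passing the result to is_cc_licensed and full_text_links should list what the record offers, provided the publisher deposited those fields.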

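Crossref may be worth checking. They allow members to include abstracts in the metadata, but this is optional, so coverage is not comprehensive. According to their helpdesk when I asked, they had abstracts available for around 450,000 registered DOIs as of June 2016.

If an abstract exists in their metadata, you can get it using their UNIXML format. Here is one specific example: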
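A rough Python sketch of this route, under two assumptions of mine: that the UNIXML record is requested from doi.org through content negotiation with the media type application/vnd.crossref.unixml+xml, and that the abstract sits in an element whose local name is "abstract". The function names are illustrative only.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed media type for Crossref's UNIXML representation
UNIXML = "application/vnd.crossref.unixml+xml"


def fetch_unixml(doi):
    """Request the UNIXML record for a DOI via content negotiation (returns bytes)."""
    req = urllib.request.Request("https://doi.org/" + doi, headers={"Accept": UNIXML})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


def abstract_from_xml(xml_data):
    """Return the text of the first <abstract> element, or None if there is none.

    Matching on the local tag name avoids hard-coding a namespace URI.
    """
    root = ET.fromstring(xml_data)
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] == "abstract":
            # Join all nested text pieces and collapse runs of whitespace
            return " ".join(" ".join(elem.itertext()).split())
    return None
```

abstract_from_xml(fetch_unixml(doi)) would then return the abstract text, or None when the publisher did not deposit one.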

Answer 2

If the article is on PubMed (which contains around 25 million documents), you can use the Python package Entrez to retrieve the abstract.
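As a rough illustration, here is a sketch that calls the NCBI E-utilities web service directly with the standard library instead of going through the Entrez package, so it runs without extra installs. The function names are mine, and the "[AID]" search tag (article identifier) is how I would expect a DOI lookup to work; treat both as assumptions.

```python
import json
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(doi):
    """URL that searches PubMed for a DOI; [AID] limits the match to article identifiers."""
    query = urllib.parse.urlencode({"db": "pubmed", "term": doi + "[AID]", "retmode": "json"})
    return EUTILS + "esearch.fcgi?" + query


def efetch_url(pmid):
    """URL that returns the plain-text abstract view of one PubMed record."""
    query = urllib.parse.urlencode(
        {"db": "pubmed", "id": pmid, "rettype": "abstract", "retmode": "text"}
    )
    return EUTILS + "efetch.fcgi?" + query


def pubmed_abstract(doi):
    """Return the abstract view for a DOI, or None if PubMed has no matching record."""
    with urllib.request.urlopen(esearch_url(doi), timeout=30) as resp:
        ids = json.load(resp)["esearchresult"]["idlist"]
    if not ids:
        return None
    with urllib.request.urlopen(efetch_url(ids[0]), timeout=30) as resp:
        return resp.read().decode("utf-8")
```

The text returned by pubmed_abstract includes the citation line, title and authors along with the abstract, so some trimming may still be needed.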

Answer 3

Using curl (works on my Linux):

curl http://api.crossref.org/works/10.1080/10260220290013453 2>&1 | # doi after works
grep -o -P '(?<=abstract":").*?(?=","DOI)' | # get text between abstract":" and ","DOI
sed -E 's/<jats:p>|<\\\/jats:p>/\n/g' | # substitute paragraph tags
sed 's/<[^>]*>/ /g' # remove other tags

# add "echo" to show unicode characters
echo -e $(curl http://api.crossref.org/works/10.1155/2016/3845247 2>&1 | # doi after works
grep -o -P '(?<=abstract":").*?(?=","DOI)' | # get text between abstract":" and ","DOI
sed -E 's/<jats:p>|<\\\/jats:p>/\n/g' | # substitute paragraph tags
sed 's/<[^>]*>/ /g') # remove other tags
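The same extraction can be sketched in Python by parsing the JSON response instead of matching it with a regular expression, which also takes care of escaped slashes and unicode characters. The function names are illustrative, and the abstract field is only present when the publisher deposited one.

```python
import html
import json
import re
import urllib.parse
import urllib.request


def clean_jats(abstract):
    """Turn a JATS-tagged abstract string into plain text."""
    text = re.sub(r"</?jats:p>", "\n", abstract)  # paragraph tags become line breaks
    text = re.sub(r"<[^>]*>", " ", text)          # every other tag becomes a space
    text = html.unescape(text)                    # decode entities such as &amp;
    # Collapse whitespace inside each line and drop empty lines
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def crossref_abstract(doi):
    """Return the cleaned abstract for a DOI, or None if Crossref has none."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=30) as resp:
        message = json.load(resp)["message"]
    raw = message.get("abstract")
    return clean_jats(raw) if raw else None
```

For example, crossref_abstract("10.1155/2016/3845247") mirrors the second curl command above.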

Using R:

library(rcrossref)
cr_abstract(doi = '10.1109/TASC.2010.2088091')

Answer 4

I wrote some Python code that works in most cases. Sometimes there is a connection error, so the function get_abstract_from_doi should be called inside a try/except block.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import requests

def main():
    URL = "https://doi.org/10.1016/j.compstruc.2012.09.003" # Specify the DOI here
    # Retry a few times because the request or the browser session can fail intermittently
    for _ in range(7):
        try:
            print(get_abstract_from_doi(URL))
            break # Stop retrying after the first success
        except Exception:
            pass

def get_abstract_from_doi(doi):
    r = requests.get(doi, allow_redirects=True) # Follow redirects from doi.org to the publisher's page

    # Setup Selenium WebDriver
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get(r.url)

        # Find all elements whose own text contains "bstract:" (matches "Abstract:" and "abstract:")
        elements = driver.find_elements(By.XPATH, '//*[contains(text(), "bstract:")]')
        if len(elements) == 0:
            elements = driver.find_elements(By.XPATH, '//*[contains(text(), "bstract")]')

        # Keep only elements with non-empty visible text
        elements = [elem for elem in elements if len(elem.text.strip()) >= 1]
        # The shortest match is most likely the "Abstract" heading itself
        element = min(elements, key=lambda x: len(x.text.strip()))
        characters = len(element.text.strip())
        # Climb to parent elements until the text is at least twice as long as the heading,
        # so that it also covers the abstract body
        while len(element.text.strip()) < 2 * characters:
            element = element.find_element(By.XPATH, "./..")
        return element.text.strip()
    finally:
        driver.quit() # Always close the browser, even when an error occurs

if __name__ == "__main__":
    main()
