Lemmatisation of web-scraped data


Let's say I have a text document such as the following:

document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'

(or a more complex text example:

document = '<p>Forde Education are looking to recruit a Teacher of Geography for an immediate start in a Doncaster Secondary school.</p> <p>The school has a thriving and welcoming environment with very high expectations of students both in progress and behaviour.&nbsp; This position will be working&nbsp;until Easter with a&nbsp;<em><strong>likely extension until July 2011.</strong></em></p> <p>The successful candidates will need to demonstrate good practical subject knowledge  but also possess the knowledge and experience to teach to GCSE level with the possibility of teaching to A’Level to smaller groups of students.</p> <p>All our candidate will be required to hold a relevant teaching qualifications with QTS  successful applicants will be required to provide recent relevant references and undergo a Enhanced CRB check.</p> <p>To apply for this post or to gain information regarding similar roles please either submit your CV in application or Call Debbie Slater for more information.&nbsp;</p>' 

)

I am applying a series of NLP pre-processing techniques to this document to get a "cleaner" version of it, and also to stem each of its words.

I am using the following code:

import re
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

stemmer_1 = PorterStemmer()
stemmer_2 = LancasterStemmer()
stemmer_3 = SnowballStemmer(language='english')

# Remove all the special characters
document = re.sub(r'\W', ' ', document)

# remove all single characters
document = re.sub(r'\b[a-zA-Z]\b', ' ', document)

# Substituting multiple spaces with single space
document = re.sub(r' +', ' ', document, flags=re.I)

# Converting to lowercase
document = document.lower()

# Tokenisation
document = document.split()

# Stemming
document = [stemmer_3.stem(word) for word in document]

# Join the words back to a single document
document = ' '.join(document)

This produces the following output for the text document above:

'am sent am anoth sent am third sent'

(and for the more complex example:

'ford educ are look to recruit teacher of geographi for an immedi start in doncast secondari school the school has thrive and welcom environ with veri high expect of student both in progress and behaviour nbsp this posit will be work nbsp until easter with nbsp em strong like extens until juli 2011 strong em the success candid will need to demonstr good practic subject knowledg but also possess the knowledg and experi to teach to gcse level with the possibl of teach to level to smaller group of student all our candid will be requir to hold relev teach qualif with qts success applic will be requir to provid recent relev refer and undergo enhanc crb check to appli for this post or to gain inform regard similar role pleas either submit your cv in applic or call debbi slater for more inform nbsp'

)

What I want now is output just like the above, but after applying lemmatisation instead of stemming.

However, unless I am missing something, this requires splitting the original document into (sensible) sentences, applying POS tagging, and then applying the lemmatisation.
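A quick illustration of why the POS matters (assuming NLTK's WordNetLemmatizer with the wordnet data downloaded): without a tag, the lemmatiser treats every word as a noun and leaves verbs untouched.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Without a POS tag, WordNet defaults to treating the word as a noun
print(lemmatizer.lemmatize('looking'))       # 'looking'
print(lemmatizer.lemmatize('looking', 'v'))  # 'look'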

But here things get a bit complicated, because the text data come from web scraping, so you run into many HTML tags such as <br>, <p>, etc.

My idea is that a sequence of words should be treated as a separate sentence whenever it ends with a common punctuation mark (full stop, exclamation mark, etc.) or with an HTML tag such as <br> or <p>.

For example, the original document above:

document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'

should be split into something like this:

['I am a sentence', 'I am another sentence', 'I am a third sentence']
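For illustration, here is a minimal sketch of the splitting idea I have in mind (the regex that turns tags into boundaries is just a placeholder, not something I am committed to):

import re

document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'

# Treat any HTML tag as a sentence boundary by replacing it with a full stop
text = re.sub(r'<[^>]+>', '. ', document)

# Split on runs of sentence-ending punctuation and drop the empty pieces
sentences = [s.strip() for s in re.split(r'[.!?]+\s*', text) if s.strip()]

print(sentences)  # ['I am a sentence', 'I am another sentence', 'I am a third sentence']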

Then, I think, we would apply POS tagging to each sentence, split each sentence into words, apply the lemmatisation, and .join() the words back into a single document, as I am doing with the code above.

How can I do this?

python nlp text-parsing stemming lemmatization
1 Answer

Removing HTML tags is a common part of text refinement. You can write your own rules, such as text.replace('<p>', '.'), but there is a better solution: html2text. This library can do all the dirty HTML refinement for you, for example:

>>> import html2text
>>> h = html2text.HTML2Text()
>>> h.ignore_links = True  # render the link as plain text
>>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
Hello, world!

You can import this library in your Python code, or use it as a standalone program.

Edit: here is a small chain example that splits the text into sentences:

>>> import re
>>> import nltk
>>> import html2text
>>> document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'
>>> text_without_html = html2text.html2text(document)
>>> refined_text = re.sub(r'\n+', '. ', text_without_html)
>>> sentences = nltk.sent_tokenize(refined_text)  # requires nltk.download('punkt')
>>> sentences
['I am a sentence.', 'I am another sentence.', 'I am a third sentence..']
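From here, one way to finish the pipeline the question describes (POS tagging, then lemmatisation) is NLTK's pos_tag together with WordNetLemmatizer. This is a sketch, not part of the original answer: it assumes the punkt, averaged_perceptron_tagger and wordnet data have been downloaded, and penn_to_wordnet is a made-up helper name for the tag mapping.

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to the WordNet POS constants lemmatize() expects
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
lemmas = []
for sentence in sentences:  # the sentences produced above
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        if word.isalpha():  # skip punctuation tokens
            lemmas.append(lemmatizer.lemmatize(word.lower(), penn_to_wordnet(tag)))

document = ' '.join(lemmas)

With the verb tag, 'am' lemmatises to 'be', so the result here should be something like 'i be a sentence i be another sentence i be a third sentence', compared with the stemmer's 'am sent am anoth sent am third sent'.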