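Split a text into lists of approximately 5K characters

Question  Votes: -1  Answers: 3

I have a text of more than 5,000 characters. I want to cut it into several pieces so that each piece has 5,000 characters or fewer.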

data2 = 'Wiki Loves Monuments (WLM) is an annual international photographic competition held during the month of September, organised worldwide by Wikipedia community members with the help of local Wikimedia affiliates across the globe. Participants take pictures of local historical monuments and heritage sites in their region, and upload them to Wikimedia Commons. The aim of event is to highlight the heritage sites of the participating countries with the goal to encourage people to capture pictures of these monuments, and to put them under a free licence which can then be re-used not only in Wikipedia but everywhere by everyone.\n\nThe first Wiki Loves Monuments competition was held in 2010 in the Netherlands as pilot project. The next year it spread to other countries in Europe and according to the Guinness Book of Records, the 2011 edition of the Wiki Loves Monuments broke the world record for the largest photography competition.[3] In 2012, the competition was extended beyond Europe, with a total of 35 participating countries.[4] During Wiki Loves Monuments 2012, more than 350,000 photographs of historic monuments were uploaded by more than 15,000 participants. In 2013, the Wiki Loves Monuments competition was held across six continents including Antarctica and had official participation from more than fifty countries around the world. The 2016 edition of WLM was supported by UNESCO and saw 10,700 contestants from 43 countries who submitted 277,000 photos.[5][6]WLM is the successor to Wiki Loves Art, which was held in the Netherlands in 2009. The original WLM contest for "Rijksmonuments" (Dutch for "national monuments") encouraged photographers to seek out Dutch National Heritage Sites. The Rijkmonuments include architecture and objects of general interest recognized for their beauty, scientific, and/or cultural importance. 
Such locations as the Drenthe archeological sites, the Noordeinde Royal Palace in The Hague, and the houses along the canals of Amsterdam were part of the more than 12,500 photographs submitted during the first event.[7]\n\nThis success generated interest in other European countries, and through a collaboration with the European Heritage Days, 18 states with the help of local Wikimedia chapters participated in the 2011 competition,[8][9] uploading nearly 170,000 images by its conclusion. The Guinness Book of Records recognizes the 2011 edition of Wiki Loves Monuments as the largest photography competition in the world with 168,208 pictures uploaded to Wikimedia Commons by more than 5,000 participants.[3] In total, some 171,000 photographs were contributed from 18 participants countries of Europe. Germany, France and Spain contributed highest number of photographs. Photo from Romania won the first international prize, whereas Estonia secured second and Germany third position in WLM 2011.\n\nIn 2012, the Wiki Loves Monuments competition had official participation of more than thirty countries and regions around the world: Andorra, Argentina, Austria, Belarus, Belgium, Canada, Catalonia, Chile, Colombia, the Czech Republic, Denmark, Estonia, France, Germany, Ghana, India, Israel, Italy, Kenya, Luxembourg, Mexico, the Netherlands, Norway, Panama, the Philippines, Poland, Romania, Russia, Serbia, Slovakia, Spain, South Africa, Sweden, Switzerland, Ukraine, and the United States. In total 363,000 photos were contributed from 35 participants countries. 
Germany, Spain and Poland contributed highest number of photographs.[4] A picture of Tomb of Safdarjung from Delhi, India, won the contest which saw more than 350,000 contributions.[10][11] Spain secured second and Philippines third position in 2012 edition of annual WLM photo contest.\n\nIn 2013, the Wiki Loves Monuments competition had official participation of more than fifty countries from all six continents including Antarctica. Among the new participant nations were Algeria, Chine, Azerbaijan, Hong Kong, Jordan, Venezuela, Thailand, Taiwan, Nepal, Tunisia, Egypt, the United Kingdom, war-torned Syria and many others. In total, some 370,000 photos were contributed from more than 52 participating countries. Germany, Ukraine and Poland contributed highest number of photographs. Switzerland won the first international prize, whereas Taiwan secured second and Hungary third position in the 2013 edition of WLM.\n\nThe 2014 version of the contests saw more than 8,750 contestants in 41 countries across the globe, who submitted more than 308,000 photographs. Pakistan, Macedonia, Ireland, Republic of Kosovo, Albania, Palestine, Lebanon, and Iraq made their debut in 2014. From Pakistan, more than 700 contestants from across the country submitted over 12,000 photographs.[12]\n\nThe 2015 edition saw more than 6,200 contestants participating from 33 countries, with over 220,000 photo submissions throughout the month of September.[13]\n\nThe 2016 edition of WLM was supported by UNESCO and saw 10,700 contestants from 43 countries who submitted 277,000 photos.[14]Overview\nWiki', 'Loves Monuments 2010 was a photo competition / scavenger hunt that took place throughout the Netherlands between September 1 and September 30, 2010. It\'s the successor to Wiki Loves Art in the Netherlands (in 2009).\n\nParticipants were encouraged to photograph some of the more than 50,000 national monuments (Rijksmonumenten) throughout the Netherlands. 
These are buildings or objects of general importance because of their beauty, importance to science, or cultural history – like archeological sites in Drenthe, the canal houses in Amsterdam, and the Royal Palace in The Hague.\n\nIn total 12,501 pictures were taken of monuments.\n\nRules\nThe only official rules description is to be found here\nPhotos are only admitted to the contest if:\n\nThe photo is taken by the uploader him/herself\nThe photo is freely licensed (default: cc-by-sa 3.0 nl)\nThe photo is uploaded between September 1 and September 30\nThe description includes the "rijksmonumentnummer" (a unique monument identifier) A description in Dutch how to find this identifier\nIf uploaded through Wikimedia Commons: the email address of the uploader is activated\nA jury will award the best photos with a prize. A separate prize will be awarded to the persons uploading photos of the most objects.\n\nUpload\nYou can upload files using the special Wiki Loves Monuments upload form\n\nAwards\nThere were three categories in which awards have been presented: a category for photos from Vlissingen (Flushing); a category for the contestants with the highest number of objects photographed and a category for the best photo over all, awarded by a jury. All prizes were announced at a prize ceremony in Utrecht on November 20, 2010. A motivation from the jury (in Dutch) is available here.\n\nJury prize\nThe top-10 of the jury selection can be found (in order 1-10) below:'

n = 5000  # maximum number of characters per chunk

list2 = [data2[i:i+n] for i in range(0, len(data2), n)]

print(list2)

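Now I need about 5K characters in one list, the next 5K characters in a second list, and so on, with the last list containing all the remaining characters.

Expected output:

[[5K or fewer characters], [5K or fewer characters], ..., [remaining characters]]

With the code above I get a single list. I am also looking for a solution that does not cut a sentence in half but keeps whole sentences together.

Any help is appreciated.

Thanks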

python-3.x list split list-comprehension slice
3 Answers
0
votes

Assuming each sentence corresponds to one line, you can do the following:

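text = ...  # the long input string

def special_split(text, limit=500):
    """Group non-empty lines into chunks of at most `limit` characters."""
    chunks = []
    chunk = ''
    for line in text.splitlines():
        if line:  # skip empty lines
            if len(chunk) + len(line) < limit:
                chunk += line + '\n'
            else:
                if chunk:
                    chunks.append(chunk)
                chunk = line + '\n'  # start a new chunk with this line
    if chunk:
        chunks.append(chunk)
    return chunks


chunks = special_split(text, 5000)
# Compare the length of each chunk, not the chunk itself, with the limit
print(all(len(chunk) <= 5000 for chunk in chunks))
# True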

Note that if a single line is longer than the specified limit, the code above produces a chunk that exceeds the limit.
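One possible way around that limitation is to hard-slice any oversize line before packing it into chunks. The following is only a sketch: the function name `split_with_fallback` and the slicing fallback are assumptions for illustration, not part of the original answer, and a hard slice can of course cut a sentence in the middle.

```python
def split_with_fallback(text, limit=5000):
    """Group non-empty lines into chunks of at most `limit` characters.

    A line that is itself longer than `limit` is hard-sliced into
    `limit`-sized pieces (hypothetical fallback, not in the original answer).
    """
    chunks = []
    chunk = ''
    for line in text.splitlines():
        if not line:
            continue  # skip empty lines
        line += '\n'
        # Slice the line so that every piece fits within the limit.
        pieces = [line[i:i + limit] for i in range(0, len(line), limit)]
        for piece in pieces:
            if len(chunk) + len(piece) > limit:
                chunks.append(chunk)  # current chunk is full, start a new one
                chunk = ''
            chunk += piece
    if chunk:
        chunks.append(chunk)
    return chunks


# Small demonstration with a tiny limit; the second line is longer than the limit.
sample = 'short line\n' + 'x' * 12 + '\nanother\n'
chunks = split_with_fallback(sample, limit=10)
print([len(c) for c in chunks])
```

With this variant every chunk stays within the limit, at the cost of possibly breaking a very long line at an arbitrary position.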


Edit

To get each chunk as a list of length-1 strings (i.e. characters), simply convert it to a list:

chunks_list = [list(chunk) for chunk in chunks]

0
votes
str1 = """Wiki Loves Monuments (WLM) is an annual international photographic competition held during the month of September, organised worldwide by Wikipedia community members with the help of local Wikimedia affiliates across the globe. Participants take pictures of local historical monuments and heritage sites in their region, and upload them to Wikimedia Commons. The aim of event is to highlight the heritage sites of the participating countries with the goal to encourage people to capture pictures of these monuments, and to put them under a free licence which can then be re-used not only in Wikipedia but everywhere by everyone.\n\nThe first Wiki Loves Monuments competition was held in 2010 in the Netherlands as pilot project. The next year it spread to other countries in Europe and according to the Guinness Book of Records, the 2011 edition of the Wiki Loves Monuments broke the world record for the largest photography competition.[3] In 2012, the competition was extended beyond Europe, with a total of 35 participating countries.[4] During Wiki Loves Monuments 2012, more than 350,000 photographs of historic monuments were uploaded by more than 15,000 participants. In 2013, the Wiki Loves Monuments competition was held across six continents including Antarctica and had official participation from more than fifty countries around the world. The 2016 edition of WLM was supported by UNESCO and saw 10,700 contestants from 43 countries who submitted 277,000 photos.[5][6]WLM is the successor to Wiki Loves Art, which was held in the Netherlands in 2009. The original WLM contest for "Rijksmonuments" (Dutch for "national monuments") encouraged photographers to seek out Dutch National Heritage Sites. The Rijkmonuments include architecture and objects of general interest recognized for their beauty, scientific, and/or cultural importance. 
Such locations as the Drenthe archeological sites, the Noordeinde Royal Palace in The Hague, and the houses along the canals of Amsterdam were part of the more than 12,500 photographs submitted during the first event.[7]\n\nThis success generated interest in other European countries, and through a collaboration with the European Heritage Days, 18 states with the help of local Wikimedia chapters participated in the 2011 competition,[8][9] uploading nearly 170,000 images by its conclusion. The Guinness Book of Records recognizes the 2011 edition of Wiki Loves Monuments as the largest photography competition in the world with 168,208 pictures uploaded to Wikimedia Commons by more than 5,000 participants.[3] In total, some 171,000 photographs were contributed from 18 participants countries of Europe. Germany, France and Spain contributed highest number of photographs. Photo from Romania won the first international prize, whereas Estonia secured second and Germany third position in WLM 2011.\n\nIn 2012, the Wiki Loves Monuments competition had official participation of more than thirty countries and regions around the world: Andorra, Argentina, Austria, Belarus, Belgium, Canada, Catalonia, Chile, Colombia, the Czech Republic, Denmark, Estonia, France, Germany, Ghana, India, Israel, Italy, Kenya, Luxembourg, Mexico, the Netherlands, Norway, Panama, the Philippines, Poland, Romania, Russia, Serbia, Slovakia, Spain, South Africa, Sweden, Switzerland, Ukraine, and the United States. In total 363,000 photos were contributed from 35 participants countries. 
Germany, Spain and Poland contributed highest number of photographs.[4] A picture of Tomb of Safdarjung from Delhi, India, won the contest which saw more than 350,000 contributions.[10][11] Spain secured second and Philippines third position in 2012 edition of annual WLM photo contest.\n\nIn 2013, the Wiki Loves Monuments competition had official participation of more than fifty countries from all six continents including Antarctica. Among the new participant nations were Algeria, Chine, Azerbaijan, Hong Kong, Jordan, Venezuela, Thailand, Taiwan, Nepal, Tunisia, Egypt, the United Kingdom, war-torned Syria and many others. In total, some 370,000 photos were contributed from more than 52 participating countries. Germany, Ukraine and Poland contributed highest number of photographs. Switzerland won the first international prize, whereas Taiwan secured second and Hungary third position in the 2013 edition of WLM.\n\nThe 2014 version of the contests saw more than 8,750 contestants in 41 countries across the globe, who submitted more than 308,000 photographs. Pakistan, Macedonia, Ireland, Republic of Kosovo, Albania, Palestine, Lebanon, and Iraq made their debut in 2014. From Pakistan, more than 700 contestants from across the country submitted over 12,000 photographs.[12]\n\nThe 2015 edition saw more than 6,200 contestants participating from 33 countries, with over 220,000 photo submissions throughout the month of September.[13]\n\nThe 2016 edition of WLM was supported by UNESCO and saw 10,700 contestants from 43 countries who submitted 277,000 photos.[14]Overview\nWiki', 'Loves Monuments 2010 was a photo competition / scavenger hunt that took place throughout the Netherlands between September 1 and September 30, 2010. It\'s the successor to Wiki Loves Art in the Netherlands (in 2009).\n\nParticipants were encouraged to photograph some of the more than 50,000 national monuments (Rijksmonumenten) throughout the Netherlands. 
These are buildings or objects of general importance because of their beauty, importance to science, or cultural history – like archeological sites in Drenthe, the canal houses in Amsterdam, and the Royal Palace in The Hague.\n\nIn total 12,501 pictures were taken of monuments.\n\nRules\nThe only official rules description is to be found here\nPhotos are only admitted to the contest if:\n\nThe photo is taken by the uploader him/herself\nThe photo is freely licensed (default: cc-by-sa 3.0 nl)\nThe photo is uploaded between September 1 and September 30\nThe description includes the "rijksmonumentnummer" (a unique monument identifier) A description in Dutch how to find this identifier\nIf uploaded through Wikimedia Commons: the email address of the uploader is activated\nA jury will award the best photos with a prize. A separate prize will be awarded to the persons uploading photos of the most objects.\n\nUpload\nYou can upload files using the special Wiki Loves Monuments upload form\n\nAwards\nThere were three categories in which awards have been presented: a category for photos from Vlissingen (Flushing); a category for the contestants with the highest number of objects photographed and a category for the best photo over all, awarded by a jury. All prizes were announced at a prize ceremony in Utrecht on November 20, 2010. A motivation from the jury (in Dutch) is available here.\n\nJury prize\nThe top-10 of the jury selection can be found (in order 1-10) below:"""

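list1 = []   # one list of characters per chunk
g_index = 0  # start position of the current chunk in str1
while g_index < len(str1):
    tmp = str1[g_index:g_index + 5000]

    # If more text follows, cut back to the last full stop
    # so that no sentence is split across two chunks
    if g_index + 5000 < len(str1) and '.' in tmp:
        tmp = tmp[:tmp.rindex('.') + 1]

    list1.append(list(tmp))
    g_index += len(tmp)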
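The same idea, keeping whole sentences together, can also be sketched by splitting the text into sentences first and then packing them into chunks. This is a minimal sketch under stated assumptions: the function name `split_by_sentences` is hypothetical, a sentence is assumed to end at '.', '!' or '?' followed by whitespace, and the whitespace between sentences is normalised to a single space. A sentence end followed directly by a footnote marker such as "[3]" is not detected by this pattern.

```python
import re


def split_by_sentences(text, limit=5000):
    """Pack whole sentences into chunks of at most `limit` characters.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    A single sentence longer than `limit` is kept whole (oversize chunk).
    """
    # Split on the whitespace that follows sentence-ending punctuation.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    current = ''
    for sentence in sentences:
        candidate = f'{current} {sentence}' if current else sentence
        if len(candidate) <= limit:
            current = candidate  # the sentence still fits in this chunk
        else:
            if current:
                chunks.append(current)
            current = sentence   # start a new chunk with this sentence
    if current:
        chunks.append(current)
    return chunks


demo = 'One fish. Two fish! Red fish? Blue fish.'
print(split_by_sentences(demo, limit=20))
# ['One fish. Two fish!', 'Red fish? Blue fish.']
```

To get lists of characters rather than strings, each chunk can be passed through `list(...)` afterwards.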

0
votes

If I understand you correctly, this sounds like a perfect job for textwrap.wrap, which is part of the Python standard library.

from textwrap import wrap

long_string = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent tristique et urna sed pretium. In hac habitasse platea dictumst. Donec nulla mi, dapibus ac massa vel, elementum fermentum libero. Sed venenatis faucibus ex ut rutrum. In rutrum justo id nunc suscipit, eu ullamcorper nibh tincidunt. Vivamus posuere, ipsum eget dapibus rutrum, sapien est hendrerit elit, in commodo neque ex in sem. Mauris ac libero at velit vehicula egestas et nec metus. Phasellus bibendum pretium libero, id feugiat est ultrices eu. Maecenas quis nulla eu mi lacinia pellentesque."

segments = wrap(long_string, width=50)

for index, segment in enumerate(segments):
    print(f"segment[{index}] (len == {len(segment)}): \"{segment}\"")

Output:

segment[0] (len == 50): "Lorem ipsum dolor sit amet, consectetur adipiscing"
segment[1] (len == 48): "elit. Praesent tristique et urna sed pretium. In"
segment[2] (len == 46): "hac habitasse platea dictumst. Donec nulla mi,"
segment[3] (len == 49): "dapibus ac massa vel, elementum fermentum libero."
segment[4] (len == 46): "Sed venenatis faucibus ex ut rutrum. In rutrum"
segment[5] (len == 43): "justo id nunc suscipit, eu ullamcorper nibh"
segment[6] (len == 46): "tincidunt. Vivamus posuere, ipsum eget dapibus"
segment[7] (len == 45): "rutrum, sapien est hendrerit elit, in commodo"
segment[8] (len == 42): "neque ex in sem. Mauris ac libero at velit"
segment[9] (len == 49): "vehicula egestas et nec metus. Phasellus bibendum"
segment[10] (len == 43): "pretium libero, id feugiat est ultrices eu."
segment[11] (len == 47): "Maecenas quis nulla eu mi lacinia pellentesque."

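As you can see from the output, the segment lengths are not exactly fixed. This is because textwrap.wrap respects whitespace by default and does not create a new segment in the middle of a word. However, it guarantees that no segment exceeds the width you specify.

You may want to explore the textwrap.TextWrapper class to adapt this to your needs. For the full list of optional parameters, see the documentation.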
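Applied to the 5K limit from the question, a sketch with textwrap.TextWrapper might look like the following. The sample text `long_text` is a hypothetical stand-in for the real data. Two caveats apply: by default each whitespace character, including newlines, is replaced by a space, and segments break at word boundaries rather than sentence boundaries.

```python
from textwrap import TextWrapper

# Stand-in for the long text from the question (hypothetical sample data).
long_text = 'The quick brown fox jumps over the lazy dog. ' * 300

# width is the maximum number of characters per segment.
wrapper = TextWrapper(width=5000)
segments = wrapper.wrap(long_text)

# One list of characters per segment, as the question asks for.
char_lists = [list(segment) for segment in segments]

print(len(segments), max(len(segment) for segment in segments))
```

If the original line breaks matter, passing `replace_whitespace=False` to TextWrapper keeps them in the segments.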
