02 May 2013

BeautifulSoup is a Python library for processing HTML and XML. It lets you reach elements of an HTML document with plain "." attribute access.
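
To make the "." access concrete, here is a minimal sketch. It assumes the newer bs4 package on Python 3 (not the BeautifulSoup 3 import used in the script in this post), and the HTML string is made up for illustration.

```python
from bs4 import BeautifulSoup

# A made-up document, only for illustration.
html = (
    "<html><head><title>Demo</title></head>"
    "<body><p id='first'>Hello <b>world</b></p></body></html>"
)

# "html.parser" is the parser bundled with the Python standard library.
soup = BeautifulSoup(html, "html.parser")

# Each "." step returns the first descendant tag with that name.
title_text = soup.title.string
bold_text = soup.body.p.b.string

# Attributes of a tag are read with dictionary-style indexing.
para_id = soup.p["id"]

print(title_text, bold_text, para_id)
```

Dotted access only ever returns the first matching tag, so it suits quick lookups; for anything more selective, find() with keyword arguments is the usual tool.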

Below is a Python script (Python 2, using BeautifulSoup 3) that fetches video subtitles from the TED website:

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
import sys


def main():
    # Expect the talk URL as the first command-line argument.
    if len(sys.argv) < 2:
        print 'usage: python {0:s} <ted_url>'.format(__file__)
        return

    url = sys.argv[1]
    print url
    soup = BeautifulSoup(urlopen(url).read())
    # The element with id "share_and_save" carries the talk's
    # data-id and data-title attributes.
    attr_map = soup.find(id="share_and_save").attrMap
    langs = ['en', 'zh-cn']
    for lang in langs:
        subtitle_url = 'http://www.ted.com/talks/subtitles/id/%s/lang/%s/format/html' % (attr_map['data-id'], lang)
        beautiful_soup = BeautifulSoup(urlopen(subtitle_url).read())

        # Join the text nodes with newlines, then encode to UTF-8 bytes.
        subtitle = beautiful_soup.getText("\n").encode('utf-8')

        # One output file per language, named "<title>-<lang>.txt".
        with open("%s-%s.txt" % (attr_map['data-title'], lang), 'w') as f:
            f.write(subtitle)


if __name__ == '__main__':
    main()

Save the code above as ted.py, then run it from the command line: python ted.py <ted_video_url>
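
The URL and file-name construction in the script can be pulled out into small functions that are easy to check without touching the network. This is only a sketch on Python 3: the function names are hypothetical, the URL pattern is copied from the script, and nothing here verifies that TED still serves that address.

```python
# URL pattern taken from the script above; whether the endpoint
# still exists is not checked here.
SUBTITLE_URL_TEMPLATE = (
    "http://www.ted.com/talks/subtitles/id/{talk_id}/lang/{lang}/format/html"
)


def build_subtitle_url(talk_id, lang):
    """Return the subtitle URL for one talk id and language code."""
    return SUBTITLE_URL_TEMPLATE.format(talk_id=talk_id, lang=lang)


def build_output_name(title, lang):
    """Return the output file name, "<title>-<lang>.txt" as in the script."""
    return "{0}-{1}.txt".format(title, lang)


def plan_downloads(talk_id, title, langs=("en", "zh-cn")):
    """Pair each subtitle URL with the file it would be saved to."""
    return [
        (build_subtitle_url(talk_id, lang), build_output_name(title, lang))
        for lang in langs
    ]


# Hypothetical id and title, standing in for data-id and data-title.
for url, name in plan_downloads("1234", "Demo Talk"):
    print(url, "->", name)
```

Keeping the string formatting separate from the downloading makes it simple to add another language code or change the naming scheme in one place.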


