Async Coroutines: Study Notes

Report Notes

0x01: Multithreading and Coroutines

Because of the GIL (global interpreter lock), Python's multithreading is of little use for CPU-bound tasks; it only pays off for I/O-bound work such as web crawling, where most of the time is spent waiting for sockets to return data. Put simply, Python threads still execute bytecode serially: even on a multi-CPU machine they merely time-slice, with only one thread running Python code at a time.

Coroutines are different. A coroutine is a single-threaded function that can be suspended partway through and resumed later, so its overhead is very small, it is not constrained by the global lock, and for most such I/O-bound tasks it is highly efficient.


0x02: About yield

Take the classic producer/consumer model as an example:

import time

def consumer():
    r = ''
    while True:
        n = yield r
        if not n:
            return
        print('[CONSUMER]Consuming %s' % n)
        time.sleep(1)
        r = '200 ok'

def produce(c):
    c.__next__()
    n = 0
    while n<10:
        n = n + 1
        print('[PRODUCER]Producing %s' % n)
        r = c.send(n)
        print('[PRODUCER] Consumer return: %s' % r)
    c.close()

if __name__ == '__main__':
    c = consumer()
    produce(c)

The producer produce starts the generator with __next__(), then uses send() to pass the value of n to the consumer, and finally the consumer passes a value back to the producer. The key point to understand is what yield does inside consumer: how it implements the communication between the two, and why the value returned is '200 ok'.

When the producer reaches send(), execution pauses there and switches to the consumer. The consumer receives the sent value through yield, and after processing it hands a value back through yield; the producer picks up that result and moves on to the next message.
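This two-way traffic through yield can be isolated in an even smaller sketch (a hypothetical echo coroutine, not part of the example above):

```python
def echo():
    reply = 'ready'
    while True:
        # send() resumes the coroutine here and delivers a value;
        # on the next loop iteration, yield hands a reply back to the caller
        received = yield reply
        reply = 'echo: %s' % received

e = echo()
print(e.__next__())    # runs up to the first yield -> 'ready'
print(e.send('hi'))    # -> 'echo: hi'
```

Every send() call delivers a value into the paused coroutine and returns whatever the coroutine yields next, which is exactly the round trip the producer and consumer perform.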


0x03: Decorators and next

As mentioned in the previous section, a coroutine must be started with __next__(). The difference from send() is that __next__() cannot pass a value in; in other words, __next__() is just a send() that passes None. But having to call __next__() by hand every time is tedious, and a decorator can remove this unnecessary step.
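The equivalence is easy to check: send(None) primes a generator exactly like __next__() does. A minimal sketch with a hypothetical counter generator:

```python
def counter():
    n = 0
    while True:
        n += 1
        yield n

a = counter()
b = counter()
print(a.__next__())   # starts the generator -> 1
print(b.send(None))   # send(None) starts it just the same -> 1
```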

A decorator lets you factor out code that is unrelated to a function's own purpose, enabling reuse. Take timing a function's execution as an example:

import time
def remember(func):
    def check():
        startime = time.time()
        func()
        endtime = time.time()
        sec = (endtime - startime)
        print("The function ran for %s sec" % sec)
    return check

def myfunc():
    print("This is a test")
    time.sleep(2)

myfunc = remember(myfunc)
myfunc()

As you can see, recording the run time is not part of myfunc() itself, yet with the decorator's help it is achieved without adding code to the function body. On top of that, Python's syntactic sugar lets us tidy this up further:

import time
def remember(func):
    def check():
        startime = time.time()
        func()
        endtime = time.time()
        sec = (endtime - startime)
        print("The function ran for %s sec" % sec)
    return check

@remember
def myfunc():
    print("This is a test")
    time.sleep(2)

myfunc()

Here @remember is equivalent to myfunc = remember(myfunc). The same trick removes the __next__() priming step for coroutines:

def prime_coroutine(func):
    def prime(*args, **kwargs):
        t = func(*args, **kwargs)
        t.__next__()
        return t
    return prime
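Applied to a consumer-style generator, the explicit __next__() call disappears. A sketch (the decorator is repeated here so the example is self-contained, and this simplified consumer is illustrative, not the one from section 0x02):

```python
def prime_coroutine(func):
    # same decorator as above, repeated so this sketch runs on its own
    def prime(*args, **kwargs):
        gen = func(*args, **kwargs)
        gen.__next__()   # advance to the first yield automatically
        return gen
    return prime

@prime_coroutine
def consumer():
    r = ''
    while True:
        n = yield r
        r = '200 ok, got %s' % n

c = consumer()        # already primed: no manual __next__() call needed
print(c.send(1))      # -> '200 ok, got 1'
```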


0x04: Async and Efficiency

Having covered both threads and coroutines, the natural next step is to compare their efficiency. As noted in the first section, because of the GIL Python's multithreading only helps with I/O-bound operations, so what do the numbers actually look like? First, the plain sequential version:

import requests
import time

def time_check(func):
    def check():
        startime = time.time()
        func()
        endtime = time.time()
        run_time = (endtime - startime)
        print("\n[!]The function run time is %s sec" % run_time)
    return check

urls = [
    'http://www.baidu.com',
    'http://www.lioneijonson.cn',
    'http://www.zhihu.com',
    'https://paper.seebug.org',
    'https://www.taobao.com',
    'http://www.gamersky.com',
    'https://www.github.com',
]

def show_result(results):
    for url,length in results.items():
        print("Length:{} \t URL:{}".format(length,url))

@time_check
def spider():
    results = {}
    for target in urls:
        r = requests.get(target)
        length = len(r.content)
        results[target] = length
    show_result(results)

if __name__ == '__main__':
    spider()

The multithreaded version:

import requests
import time
import threading 

results = {}

def time_check(func):
    def check(*args,**kwargs):
        startime = time.time()
        func(*args,**kwargs)
        endtime = time.time()
        run_time = (endtime - startime)
        print("\n[!]The function run time is %s sec" % run_time)
    return check

urls = [
    'http://www.baidu.com',
    'http://www.lioneijonson.cn',
    'http://www.zhihu.com',
    'https://paper.seebug.org',
    'https://www.taobao.com',
    'http://www.gamersky.com',
    'https://www.github.com',
]

def show_result(results):
    for url,length in results.items():
        print("Length:{} \t URL:{}".format(length,url))

def spider(url):
    r = requests.get(url)
    length = len(r.content)
    results[url] = length

@time_check
def main():
    ts = []
    for target in urls:
        t = threading.Thread(target=spider,args=(target,))
        ts.append(t)
        t.start()
    for t in ts:
        t.join()
    show_result(results)

if __name__ == '__main__':
    main()

And the async coroutine version:

import time 
import asyncio
import aiohttp

results = {}

def time_check(func):
    def check(*args,**kwargs):
        startime = time.time()
        func(*args,**kwargs)
        endtime = time.time()
        run_time = (endtime - startime)
        print("\n[!]The function run time is %s sec" % run_time)
    return check

urls = [
    'http://www.baidu.com',
    'http://www.lioneijonson.cn',
    'http://www.zhihu.com',
    'https://paper.seebug.org',
    'https://www.taobao.com',
    'http://www.gamersky.com',
    'https://www.github.com',
]

def show_result(results):
    for url,length in results.items():
        print("Length:{} \t URL:{}".format(length,url))

async def get_content(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            content = await resp.read()
            return len(content)

async def spider(url):
    length = await get_content(url)
    results[url] = length
    return True

@time_check
def main():
    loop = asyncio.get_event_loop()
    cor = [spider(url) for url in urls]
    result = loop.run_until_complete(asyncio.gather(*cor))
    show_result(results)

if __name__ == '__main__':
    main()

In the async coroutine code, async def func is equivalent to the @asyncio.coroutine decorator, and await is equivalent to yield from. The sequential version's timing is not very stable, but a reasonable estimate is around 17 seconds, while the coroutine version finishes in about 2 seconds.
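As a side note, on Python 3.7+ the get_event_loop()/run_until_complete() boilerplate in main() can be replaced by asyncio.run(). A minimal sketch, using asyncio.sleep() as a stand-in for the real aiohttp request so it runs without network access (fetch() and its return value are illustrative, not part of the original code):

```python
import asyncio

async def fetch(url):
    # placeholder for the real aiohttp request
    await asyncio.sleep(0.1)
    return url, len(url)

async def main():
    # gather() schedules all coroutines concurrently on one event loop
    pairs = await asyncio.gather(*(fetch(u) for u in ['http://a', 'http://bb']))
    return dict(pairs)

print(asyncio.run(main()))   # asyncio.run() creates and closes the loop for you
```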