SlideShare ist ein Scribd-Unternehmen logo
1 von 78
Downloaden Sie, um offline zu lesen
Webscraping
with Asyncio
José Manuel Ortega
@jmortegac
Python conferences
https://speakerdeck.com/jmortega
Python conferences
http://jmortega.github.io/
Github repository
https://github.com/jmortega/webscraping_asyncio_2016
Agenda
▶ Webscraping python tools
▶ Requests vs aiohttp
▶ Introduction to asyncio
▶ Async client/server
▶ Building a webcrawler with asyncio
▶ Alternatives to asyncio
Webscraping
Python tools
➢ Requests
➢ Beautiful Soup 4
➢ Pyquery
➢ Webscraping
➢ Scrapy
Python tools
➢ Mechanize
➢ Robobrowser
➢ Selenium
Requests http://docs.python-requests.org/en/latest
Web scraping with Python
1. Download webpage with HTTP
module(requests,urllib,aiohttp)
2. Parse the page with
BeautifulSoup/lxml
3. Select elements with Regular
expressions,XPath or css selectors
4. Store results in a database,csv,json
BeautifulSoup
BeautifulSoup
▶ soup =
BeautifulSoup(html_doc,’html.parser’)
▶ Print all: print(soup.prettify())
▶ Print text: print(soup.get_text())
from bs4 import BeautifulSoup
BeautifulSoup functions
▪ find_all(‘a’)→Returns all links
▪ find(‘title’)→Returns the first element <title>
▪ get(‘href’)→Returns the attribute href value
▪ (element).text → Returns the text inside an
element
for link in soup.find_all('a'):
print(link.get('href'))
External/internal links
External/internal links
http://python.ie/pycon-2016/
BeautifulSoup PyCon
BeautifulSoup PyCon Output
Parsers Comparison
PyQuery
PyQuery
PyQuery output
Spiders /crawlers
▶ A Web crawler is an Internet bot that
systematically browses the World Wide Web,
typically for the purpose of Web indexing. A
Web crawler may also be called a Web
spider.
https://en.wikipedia.org/wiki/Web_crawler
Spiders /crawlers
scrapinghub.com
Scrapy
https://pypi.python.org/pypi/Scrapy/1.1.2
Scrapy
▶ Uses a mechanism based on XPath
expressions called Xpath
Selectors.
▶ Uses Parser LXML to find elements
▶ Twisted for asynchronous
operations
Scrapy advantages
▶ Faster than mechanize because it
uses twisted for asynchronous operations.
▶ Scrapy has better support for html
parsing.
▶ Scrapy has better support for unicode
characters, redirections, gzipped
responses, encodings.
▶ You can export the extracted data directly
to JSON,XML and CSV.
Export data
▶ scrapy crawl <spider_name>
▶ $ scrapy crawl <spider_name> -o items.json -t json
▶ $ scrapy crawl <spider_name> -o items.csv -t csv
▶ $ scrapy crawl <spider_name> -o items.xml -t xml
▶
Scrapy concurrency
The concurrency problem
▶ Different approaches:
▶ Multiple processes
▶ Threads
▶ Separate distributed machines
▶ Asynchronous programming(event
loop)
Requests problems
▶ Requests operations are blocking the
main thread
▶ It pauses until operation completed
▶ We need one thread for each request if
we want non-blocking operations
Threads problems
▶ Get Overhead
▶ Stack size
▶ Context changes
▶ Synchronization
Solution
▶NOT USE THREADS
▶USE ONE THREAD
▶+ EVENT LOOP
New concepts
▶ Event loop
▶ Async
▶ Await
▶ Futures
▶ Coroutines
▶ Tasks
▶ Executors
Event loop implementations
▶ Asyncio
▶ https://docs.python.org/3.4/library/asyncio.html
▶ Tornado web server
▶ http://www.tornadoweb.org/en/stable
▶ Twisted
▶ https://twistedmatrix.com
▶ Gevent
▶ http://www.gevent.org
Asyncio def.
Asyncio
▶ Python >=3.3
▶ Event-loop framework
▶ I/O Asynchronous
▶ Non-blocking approach with sockets
▶ All requests in one thread
▶ Event-driven switching
▶ aio-http module for make requests
asynchronously
Asyncio
▶ Interoperatibility with other frameworks
Requests vs aiohttp
#!/usr/local/bin/python3.5
import asyncio
from aiohttp import ClientSession
async def hello():
async with ClientSession() as session:
async with session.get("http://httpbin.org/headers") as response:
response = await response.read()
print(response)
loop = asyncio.get_event_loop()
loop.run_until_complete(hello())
import requests
def hello()
return requests.get("http://httpbin.org/get")
print(hello())
Event Loop
▶ An event loop allow us to write asynchronous
code using callbacks or coroutines.
▶ Event loop function like task switcher,just the way
operating systems switch between active tasks on the
CPU.
▶ The idea is that we have an event loop running until all
tasks scheduled are completed.
▶ Features and tasks are created through the event loop.
Event Loop
▶ An event loop is used to orchestrate the
execution of the coroutines.
▶ asyncio.get_event_loop()
▶ asyncio.run_until_complete(coroutines,futures)
▶ asyncio.run_forever()
▶ asyncio.stop()
Starting Event Loop
Coroutines
▶ Coroutines are functions that allow for
multitasking without requiring multiple
threads or processes.
▶ Coroutines are like functions, but they can be
suspended or resumed at certain points in the
code.
▶ Coroutines allow write asynchronous code that
combines the efficiency of callbacks with the
classic good looks of multithreaded.
Coroutines 3.4 vs 3.5
import asyncio
@asyncio.coroutine
def fetch(self, url):
response = yield from self.session.get(url)
body = yield from response.read()
import asyncio
async def fetch(self, url):
response = await self.session.get(url)
body = await response.read()
Coroutines in event loop
#!/usr/local/bin/python3.5
import asyncio
import aiohttp
async def get_page(url):
response = await aiohttp.request('GET', url)
body = await response.read()
print(body)
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait([get_page('http://python.org'),
get_page('http://pycon.org')]))
Requests in event loop
async def getpage_with_requests(url):
return await loop.run_in_executor(None,requests.get,url)
#methods equivalents
async def getpage_with_aiohttp(url):
with aitohttp.ClientSession() as session:
async with session.get(url) as response:
return await response.read()
Tasks
▶ The asyncio.Task class is a subclass of
asyncio.Future to encapsulate and manage
coroutines.
▶ Allow independently running tasks to run
concurrently with other tasks on the same event
loop.
▶ When a coroutine is wrapped in a task, it
connects the task to the event loop.
Tasks
Tasks
Tasks
Tasks execution
Futures
▶ To manage an object Future in Asyncio, we
must declare the following:
▶ import asyncio
▶ future = asyncio.Future()
▶ https://docs.python.org/3/library/asyncio
-task.html#future
▶ https://docs.python.org/3/library/concurr
ent.futures.html
Futures
▶ The asyncio.Future class is essentially a
promise of a result.
▶ A Future will returns the results when they
are available, and once it receives results, it
will pass them along to all the registered
callbacks.
▶ Each future is a task to be executed in the
event loop
Futures
Semaphores
▶ Adding synchronization
▶ Limiting number of concurrent requests.
▶ The argument indicates the number of
simultaneous requests we want to allow.
▶ sem = asyncio.Semaphore(5)
with (await sem):
page = await get(url, compress=True)
Async Client /server
▶ asyncio.start_server
▶ server =
asyncio.start_server(handle_connection,host=HOST,port=PORT)
Async Client /server
▶ asyncio.open_connection
Async Client /server
Async Web crawler
Async Web crawler
▶ Send asynchronous requests to all the links
on a web page and add the responses to a
queue to be processed as we go.
▶ Coroutines allow running independent tasks and
processing their results in 3 ways:
▶ Using asyncio.as_completed →by processing
the results as they come.
▶ Using asyncio.gather→ only once they have all
finished loading.
▶ Using asyncio.ensure_future
Async Web crawler
import asyncio
import random
@asyncio.coroutine
def get_url(url):
wait_time = random.randint(1, 4)
yield from asyncio.sleep(wait_time)
print('Done: URL {} took {}s to get!'.format(url, wait_time))
return url, wait_time
@asyncio.coroutine
def process_results_as_come_in():
coroutines = [get_url(url) for url in ['URL1', 'URL2', 'URL3']]
for coroutine in asyncio.as_completed(coroutines):
url, wait_time = yield from coroutine
print('Coroutine for {} is done'.format(url))
def main():
loop = asyncio.get_event_loop()
print(“Process results as they come in:")
loop.run_until_complete(process_results_as_come_in())
if __name__ == '__main__':
main()
asyncio.as_completed
Async Web crawler execution
Async Web crawler
import asyncio
import random
@asyncio.coroutine
def get_url(url):
wait_time = random.randint(1, 4)
yield from asyncio.sleep(wait_time)
print('Done: URL {} took {}s to get!'.format(url, wait_time))
return url, wait_time
@asyncio.coroutine
def process_once_everything_ready():
coroutines = [get_url(url) for url in ['URL1', 'URL2', 'URL3']]
results = yield from asyncio.gather(*coroutines)
print(results)
def main():
loop = asyncio.get_event_loop()
print(“Process results once they are all ready:")
loop.run_until_complete(process_once_everything_ready())
if __name__ == '__main__':
main()
asyncio.gather
asyncio.gather
From Python documentation, this is what asyncio.gather does:
asyncio.gather(*coros_or_futures, loop=None,
return_exceptions=False)
Return a future aggregating results from the given coroutine
objects or futures.
All futures must share the same event loop. If all the tasks
are done successfully, the returned future’s result is the
list of results (in the order of the original sequence, not
necessarily the order of results arrival). If
return_exceptions is True, exceptions in the tasks are
treated the same as successful results, and gathered in the
result list; otherwise, the first raised exception will be
immediately propagated to the returned future.
Async Web crawler
import asyncio
import random
@asyncio.coroutine
def get_url(url):
wait_time = random.randint(1, 4)
yield from asyncio.sleep(wait_time)
print('Done: URL {} took {}s to get!'.format(url, wait_time))
return url, wait_time
@asyncio.coroutine
def process_ensure_future():
tasks= [asyncio.ensure_future(get_url(url) )for url in ['URL1',
'URL2', 'URL3']]
results = yield from asyncio.wait(tasks)
print(results)
def main():
loop = asyncio.get_event_loop()
print(“Process ensure future:")
loop.run_until_complete(process_ensure_future())
if __name__ == '__main__':
main()
asyncio.ensure_future
Async Web crawler execution
Async Web downloader
Async Web downloader faster
Async Web downloader
▶ With get_partial_content
▶ With download_coroutine
Async Extracting links with r.e
Async Extracting links with bs4
Async Extracting links execution
▶ With bs4
▶ With regex
Alternatives to asyncio
▶ ThreadPoolExecutor
▶ https://docs.python.org/3.5/library/concurrent.futures.html#concurrent.fut
ures.ThreadPoolExecutor
▶ ProcessPoolExecutor
▶ https://docs.python.org/3.5/library/concurrent.futures.html#concur
rent.futures.ProcessPoolExecutor
▶ Parallel python
▶ http://www.parallelpython.com
Parallel python
▶ SMP(symmetric multiprocessing)
architecture with multiple cores in the same
machine
▶ Distribute tasks in multiple machines
▶ Cluster
ProcessPoolExecutor
number_of_cpus = cpu_count()
References
▶ http://www.crummy.com/software/BeautifulSoup
▶ http://scrapy.org
▶ http://docs.webscraping.com
▶ https://github.com/KeepSafe/aiohttp
▶ http://aiohttp.readthedocs.io/en/stable/
▶ https://docs.python.org/3.4/library/asyncio.html
▶ https://github.com/REMitchell/python-scraping
Books
Books
Thank you!
@jmortegac
http://speakerdeck.com/jmortega
http://github.com/jmortega

Weitere ähnliche Inhalte

Was ist angesagt?

Logstash for SEO: come monitorare i Log del Web Server in realtime
Logstash for SEO: come monitorare i Log del Web Server in realtimeLogstash for SEO: come monitorare i Log del Web Server in realtime
Logstash for SEO: come monitorare i Log del Web Server in realtimeAndrea Cardinale
 
Scrapy talk at DataPhilly
Scrapy talk at DataPhillyScrapy talk at DataPhilly
Scrapy talk at DataPhillyobdit
 
From zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and ElasticsearchFrom zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and ElasticsearchRafał Kuć
 
Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014
Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014
Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014Puppet
 
HTTP For the Good or the Bad
HTTP For the Good or the BadHTTP For the Good or the Bad
HTTP For the Good or the BadXavier Mertens
 
Mining Ruby Gem vulnerabilities for Fun and No Profit.
Mining Ruby Gem vulnerabilities for Fun and No Profit.Mining Ruby Gem vulnerabilities for Fun and No Profit.
Mining Ruby Gem vulnerabilities for Fun and No Profit.Larry Cashdollar
 
Selenium sandwich-3: Being where you aren't.
Selenium sandwich-3: Being where you aren't.Selenium sandwich-3: Being where you aren't.
Selenium sandwich-3: Being where you aren't.Workhorse Computing
 
Testing your infrastructure with litmus
Testing your infrastructure with litmusTesting your infrastructure with litmus
Testing your infrastructure with litmusBram Vogelaar
 
Http capturing
Http capturingHttp capturing
Http capturingEric Ahn
 
Py conkr 20150829_docker-python
Py conkr 20150829_docker-pythonPy conkr 20150829_docker-python
Py conkr 20150829_docker-pythonEric Ahn
 
Legacy applications - 4Developes konferencja, Piotr Pasich
Legacy applications  - 4Developes konferencja, Piotr PasichLegacy applications  - 4Developes konferencja, Piotr Pasich
Legacy applications - 4Developes konferencja, Piotr PasichPiotr Pasich
 
Deploying E.L.K stack w Puppet
Deploying E.L.K stack w PuppetDeploying E.L.K stack w Puppet
Deploying E.L.K stack w PuppetColin Brown
 
Ansible tips & tricks
Ansible tips & tricksAnsible tips & tricks
Ansible tips & tricksbcoca
 
{{more}} Kibana4
{{more}} Kibana4{{more}} Kibana4
{{more}} Kibana4琛琳 饶
 
Rugged Driven Development with Gauntlt
Rugged Driven Development with GauntltRugged Driven Development with Gauntlt
Rugged Driven Development with GauntltJames Wickett
 
Building A Poor man’s Fir3Ey3 Mail Scanner
Building A Poor man’s Fir3Ey3 Mail ScannerBuilding A Poor man’s Fir3Ey3 Mail Scanner
Building A Poor man’s Fir3Ey3 Mail ScannerXavier Mertens
 
Asynchronous PHP and Real-time Messaging
Asynchronous PHP and Real-time MessagingAsynchronous PHP and Real-time Messaging
Asynchronous PHP and Real-time MessagingSteve Rhoades
 

Was ist angesagt? (20)

Logstash for SEO: come monitorare i Log del Web Server in realtime
Logstash for SEO: come monitorare i Log del Web Server in realtimeLogstash for SEO: come monitorare i Log del Web Server in realtime
Logstash for SEO: come monitorare i Log del Web Server in realtime
 
Scrapy talk at DataPhilly
Scrapy talk at DataPhillyScrapy talk at DataPhilly
Scrapy talk at DataPhilly
 
From zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and ElasticsearchFrom zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and Elasticsearch
 
Designing net-aws-glacier
Designing net-aws-glacierDesigning net-aws-glacier
Designing net-aws-glacier
 
Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014
Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014
Got Logs? Get Answers with Elasticsearch ELK - PuppetConf 2014
 
HTTP For the Good or the Bad
HTTP For the Good or the BadHTTP For the Good or the Bad
HTTP For the Good or the Bad
 
dotCloud and go
dotCloud and godotCloud and go
dotCloud and go
 
Mining Ruby Gem vulnerabilities for Fun and No Profit.
Mining Ruby Gem vulnerabilities for Fun and No Profit.Mining Ruby Gem vulnerabilities for Fun and No Profit.
Mining Ruby Gem vulnerabilities for Fun and No Profit.
 
Selenium sandwich-3: Being where you aren't.
Selenium sandwich-3: Being where you aren't.Selenium sandwich-3: Being where you aren't.
Selenium sandwich-3: Being where you aren't.
 
Testing your infrastructure with litmus
Testing your infrastructure with litmusTesting your infrastructure with litmus
Testing your infrastructure with litmus
 
Http capturing
Http capturingHttp capturing
Http capturing
 
Py conkr 20150829_docker-python
Py conkr 20150829_docker-pythonPy conkr 20150829_docker-python
Py conkr 20150829_docker-python
 
Legacy applications - 4Developes konferencja, Piotr Pasich
Legacy applications  - 4Developes konferencja, Piotr PasichLegacy applications  - 4Developes konferencja, Piotr Pasich
Legacy applications - 4Developes konferencja, Piotr Pasich
 
Deploying E.L.K stack w Puppet
Deploying E.L.K stack w PuppetDeploying E.L.K stack w Puppet
Deploying E.L.K stack w Puppet
 
Ansible tips & tricks
Ansible tips & tricksAnsible tips & tricks
Ansible tips & tricks
 
{{more}} Kibana4
{{more}} Kibana4{{more}} Kibana4
{{more}} Kibana4
 
Using Logstash, elasticsearch & kibana
Using Logstash, elasticsearch & kibanaUsing Logstash, elasticsearch & kibana
Using Logstash, elasticsearch & kibana
 
Rugged Driven Development with Gauntlt
Rugged Driven Development with GauntltRugged Driven Development with Gauntlt
Rugged Driven Development with Gauntlt
 
Building A Poor man’s Fir3Ey3 Mail Scanner
Building A Poor man’s Fir3Ey3 Mail ScannerBuilding A Poor man’s Fir3Ey3 Mail Scanner
Building A Poor man’s Fir3Ey3 Mail Scanner
 
Asynchronous PHP and Real-time Messaging
Asynchronous PHP and Real-time MessagingAsynchronous PHP and Real-time Messaging
Asynchronous PHP and Real-time Messaging
 

Andere mochten auch

OSINT tools for security auditing with python
OSINT tools for security auditing with pythonOSINT tools for security auditing with python
OSINT tools for security auditing with pythonJose Manuel Ortega Candel
 
Dive into Python Class
Dive into Python ClassDive into Python Class
Dive into Python ClassJim Yeh
 
A deep dive into PEP-3156 and the new asyncio module
A deep dive into PEP-3156 and the new asyncio moduleA deep dive into PEP-3156 and the new asyncio module
A deep dive into PEP-3156 and the new asyncio moduleSaúl Ibarra Corretgé
 
Comandos para ubuntu 400 que debes conocer
Comandos para ubuntu 400 que debes conocerComandos para ubuntu 400 que debes conocer
Comandos para ubuntu 400 que debes conocerGeek Advisor Freddy
 
Practical continuous quality gates for development process
Practical continuous quality gates for development processPractical continuous quality gates for development process
Practical continuous quality gates for development processAndrii Soldatenko
 
Async Tasks with Django Channels
Async Tasks with Django ChannelsAsync Tasks with Django Channels
Async Tasks with Django ChannelsAlbert O'Connor
 
The Awesome Python Class Part-4
The Awesome Python Class Part-4The Awesome Python Class Part-4
The Awesome Python Class Part-4Binay Kumar Ray
 
Async programming and python
Async programming and pythonAsync programming and python
Async programming and pythonChetan Giridhar
 
What is the best full text search engine for Python?
What is the best full text search engine for Python?What is the best full text search engine for Python?
What is the best full text search engine for Python?Andrii Soldatenko
 
Python as number crunching code glue
Python as number crunching code gluePython as number crunching code glue
Python as number crunching code glueJiahao Chen
 

Andere mochten auch (20)

Scraping the web with python
Scraping the web with pythonScraping the web with python
Scraping the web with python
 
OSINT tools for security auditing with python
OSINT tools for security auditing with pythonOSINT tools for security auditing with python
OSINT tools for security auditing with python
 
Faster Python, FOSDEM
Faster Python, FOSDEMFaster Python, FOSDEM
Faster Python, FOSDEM
 
Dive into Python Class
Dive into Python ClassDive into Python Class
Dive into Python Class
 
Python on Rails 2014
Python on Rails 2014Python on Rails 2014
Python on Rails 2014
 
Scrapy-101
Scrapy-101Scrapy-101
Scrapy-101
 
Python class
Python classPython class
Python class
 
The future of async i/o in Python
The future of async i/o in PythonThe future of async i/o in Python
The future of async i/o in Python
 
A deep dive into PEP-3156 and the new asyncio module
A deep dive into PEP-3156 and the new asyncio moduleA deep dive into PEP-3156 and the new asyncio module
A deep dive into PEP-3156 and the new asyncio module
 
Python, do you even async?
Python, do you even async?Python, do you even async?
Python, do you even async?
 
Comandos para ubuntu 400 que debes conocer
Comandos para ubuntu 400 que debes conocerComandos para ubuntu 400 que debes conocer
Comandos para ubuntu 400 que debes conocer
 
Python master class 3
Python master class 3Python master class 3
Python master class 3
 
Python Async IO Horizon
Python Async IO HorizonPython Async IO Horizon
Python Async IO Horizon
 
Practical continuous quality gates for development process
Practical continuous quality gates for development processPractical continuous quality gates for development process
Practical continuous quality gates for development process
 
Async Tasks with Django Channels
Async Tasks with Django ChannelsAsync Tasks with Django Channels
Async Tasks with Django Channels
 
The Awesome Python Class Part-4
The Awesome Python Class Part-4The Awesome Python Class Part-4
The Awesome Python Class Part-4
 
Async programming and python
Async programming and pythonAsync programming and python
Async programming and python
 
Regexp
RegexpRegexp
Regexp
 
What is the best full text search engine for Python?
What is the best full text search engine for Python?What is the best full text search engine for Python?
What is the best full text search engine for Python?
 
Python as number crunching code glue
Python as number crunching code gluePython as number crunching code glue
Python as number crunching code glue
 

Ähnlich wie Webscraping with asyncio

HOW TO DEAL WITH BLOCKING CODE WITHIN ASYNCIO EVENT LOOP
HOW TO DEAL WITH BLOCKING CODE WITHIN ASYNCIO EVENT LOOPHOW TO DEAL WITH BLOCKING CODE WITHIN ASYNCIO EVENT LOOP
HOW TO DEAL WITH BLOCKING CODE WITHIN ASYNCIO EVENT LOOPMykola Novik
 
Binary Studio Academy: Concurrency in C# 5.0
Binary Studio Academy: Concurrency in C# 5.0Binary Studio Academy: Concurrency in C# 5.0
Binary Studio Academy: Concurrency in C# 5.0Binary Studio
 
Presentation: Everything you wanted to know about writing async, high-concurr...
Presentation: Everything you wanted to know about writing async, high-concurr...Presentation: Everything you wanted to know about writing async, high-concurr...
Presentation: Everything you wanted to know about writing async, high-concurr...Baruch Sadogursky
 
Everything you wanted to know about writing async, concurrent http apps in java
Everything you wanted to know about writing async, concurrent http apps in java Everything you wanted to know about writing async, concurrent http apps in java
Everything you wanted to know about writing async, concurrent http apps in java Baruch Sadogursky
 
HTML5: huh, what is it good for?
HTML5: huh, what is it good for?HTML5: huh, what is it good for?
HTML5: huh, what is it good for?Remy Sharp
 
HTML5 tutorial: canvas, offfline & sockets
HTML5 tutorial: canvas, offfline & socketsHTML5 tutorial: canvas, offfline & sockets
HTML5 tutorial: canvas, offfline & socketsRemy Sharp
 
Plack perl superglue for web frameworks and servers
Plack perl superglue for web frameworks and serversPlack perl superglue for web frameworks and servers
Plack perl superglue for web frameworks and serversTatsuhiko Miyagawa
 
Denys Serhiienko "ASGI in depth"
Denys Serhiienko "ASGI in depth"Denys Serhiienko "ASGI in depth"
Denys Serhiienko "ASGI in depth"Fwdays
 
asyncio community, one year later
asyncio community, one year laterasyncio community, one year later
asyncio community, one year laterVictor Stinner
 
Developing cacheable PHP applications - Confoo 2018
Developing cacheable PHP applications - Confoo 2018Developing cacheable PHP applications - Confoo 2018
Developing cacheable PHP applications - Confoo 2018Thijs Feryn
 
Leverage HTTP to deliver cacheable websites - Codemotion Rome 2018
Leverage HTTP to deliver cacheable websites - Codemotion Rome 2018Leverage HTTP to deliver cacheable websites - Codemotion Rome 2018
Leverage HTTP to deliver cacheable websites - Codemotion Rome 2018Thijs Feryn
 
Leverage HTTP to deliver cacheable websites - Thijs Feryn - Codemotion Rome 2018
Leverage HTTP to deliver cacheable websites - Thijs Feryn - Codemotion Rome 2018Leverage HTTP to deliver cacheable websites - Thijs Feryn - Codemotion Rome 2018
Leverage HTTP to deliver cacheable websites - Thijs Feryn - Codemotion Rome 2018Codemotion
 
CP3108B (Mozilla) Sharing Session on Add-on SDK
CP3108B (Mozilla) Sharing Session on Add-on SDKCP3108B (Mozilla) Sharing Session on Add-on SDK
CP3108B (Mozilla) Sharing Session on Add-on SDKMifeng
 
Implementing Comet using PHP
Implementing Comet using PHPImplementing Comet using PHP
Implementing Comet using PHPKing Foo
 
Lucio Grenzi - Building serverless applications on the Apache OpenWhisk platf...
Lucio Grenzi - Building serverless applications on the Apache OpenWhisk platf...Lucio Grenzi - Building serverless applications on the Apache OpenWhisk platf...
Lucio Grenzi - Building serverless applications on the Apache OpenWhisk platf...Codemotion
 
Using Flow-based programming to write tools and workflows for Scientific Comp...
Using Flow-based programming to write tools and workflows for Scientific Comp...Using Flow-based programming to write tools and workflows for Scientific Comp...
Using Flow-based programming to write tools and workflows for Scientific Comp...Samuel Lampa
 
Developing cacheable PHP applications - PHPLimburgBE 2018
Developing cacheable PHP applications - PHPLimburgBE 2018Developing cacheable PHP applications - PHPLimburgBE 2018
Developing cacheable PHP applications - PHPLimburgBE 2018Thijs Feryn
 

Ähnlich wie Webscraping with asyncio (20)

AsyncIO To Speed Up Your Crawler
AsyncIO To Speed Up Your CrawlerAsyncIO To Speed Up Your Crawler
AsyncIO To Speed Up Your Crawler
 
HOW TO DEAL WITH BLOCKING CODE WITHIN ASYNCIO EVENT LOOP
HOW TO DEAL WITH BLOCKING CODE WITHIN ASYNCIO EVENT LOOPHOW TO DEAL WITH BLOCKING CODE WITHIN ASYNCIO EVENT LOOP
HOW TO DEAL WITH BLOCKING CODE WITHIN ASYNCIO EVENT LOOP
 
Binary Studio Academy: Concurrency in C# 5.0
Binary Studio Academy: Concurrency in C# 5.0Binary Studio Academy: Concurrency in C# 5.0
Binary Studio Academy: Concurrency in C# 5.0
 
Presentation: Everything you wanted to know about writing async, high-concurr...
Presentation: Everything you wanted to know about writing async, high-concurr...Presentation: Everything you wanted to know about writing async, high-concurr...
Presentation: Everything you wanted to know about writing async, high-concurr...
 
Everything you wanted to know about writing async, concurrent http apps in java
Everything you wanted to know about writing async, concurrent http apps in java Everything you wanted to know about writing async, concurrent http apps in java
Everything you wanted to know about writing async, concurrent http apps in java
 
aiohttp intro
aiohttp introaiohttp intro
aiohttp intro
 
HTML5: huh, what is it good for?
HTML5: huh, what is it good for?HTML5: huh, what is it good for?
HTML5: huh, what is it good for?
 
HTML5 tutorial: canvas, offfline & sockets
HTML5 tutorial: canvas, offfline & socketsHTML5 tutorial: canvas, offfline & sockets
HTML5 tutorial: canvas, offfline & sockets
 
Plack perl superglue for web frameworks and servers
Plack perl superglue for web frameworks and serversPlack perl superglue for web frameworks and servers
Plack perl superglue for web frameworks and servers
 
Python at Facebook
Python at FacebookPython at Facebook
Python at Facebook
 
Denys Serhiienko "ASGI in depth"
Denys Serhiienko "ASGI in depth"Denys Serhiienko "ASGI in depth"
Denys Serhiienko "ASGI in depth"
 
asyncio community, one year later
asyncio community, one year laterasyncio community, one year later
asyncio community, one year later
 
Developing cacheable PHP applications - Confoo 2018
Developing cacheable PHP applications - Confoo 2018Developing cacheable PHP applications - Confoo 2018
Developing cacheable PHP applications - Confoo 2018
 
Leverage HTTP to deliver cacheable websites - Codemotion Rome 2018
Leverage HTTP to deliver cacheable websites - Codemotion Rome 2018Leverage HTTP to deliver cacheable websites - Codemotion Rome 2018
Leverage HTTP to deliver cacheable websites - Codemotion Rome 2018
 
Leverage HTTP to deliver cacheable websites - Thijs Feryn - Codemotion Rome 2018
Leverage HTTP to deliver cacheable websites - Thijs Feryn - Codemotion Rome 2018Leverage HTTP to deliver cacheable websites - Thijs Feryn - Codemotion Rome 2018
Leverage HTTP to deliver cacheable websites - Thijs Feryn - Codemotion Rome 2018
 
CP3108B (Mozilla) Sharing Session on Add-on SDK
CP3108B (Mozilla) Sharing Session on Add-on SDKCP3108B (Mozilla) Sharing Session on Add-on SDK
CP3108B (Mozilla) Sharing Session on Add-on SDK
 
Implementing Comet using PHP
Implementing Comet using PHPImplementing Comet using PHP
Implementing Comet using PHP
 
Lucio Grenzi - Building serverless applications on the Apache OpenWhisk platf...
Lucio Grenzi - Building serverless applications on the Apache OpenWhisk platf...Lucio Grenzi - Building serverless applications on the Apache OpenWhisk platf...
Lucio Grenzi - Building serverless applications on the Apache OpenWhisk platf...
 
Using Flow-based programming to write tools and workflows for Scientific Comp...
Using Flow-based programming to write tools and workflows for Scientific Comp...Using Flow-based programming to write tools and workflows for Scientific Comp...
Using Flow-based programming to write tools and workflows for Scientific Comp...
 
Developing cacheable PHP applications - PHPLimburgBE 2018
Developing cacheable PHP applications - PHPLimburgBE 2018Developing cacheable PHP applications - PHPLimburgBE 2018
Developing cacheable PHP applications - PHPLimburgBE 2018
 

Mehr von Jose Manuel Ortega Candel

Asegurando tus APIs Explorando el OWASP Top 10 de Seguridad en APIs.pdf
Asegurando tus APIs Explorando el OWASP Top 10 de Seguridad en APIs.pdfAsegurando tus APIs Explorando el OWASP Top 10 de Seguridad en APIs.pdf
Asegurando tus APIs Explorando el OWASP Top 10 de Seguridad en APIs.pdfJose Manuel Ortega Candel
 
PyGoat Analizando la seguridad en aplicaciones Django.pdf
PyGoat Analizando la seguridad en aplicaciones Django.pdfPyGoat Analizando la seguridad en aplicaciones Django.pdf
PyGoat Analizando la seguridad en aplicaciones Django.pdfJose Manuel Ortega Candel
 
Ciberseguridad en Blockchain y Smart Contracts: Explorando los Desafíos y Sol...
Ciberseguridad en Blockchain y Smart Contracts: Explorando los Desafíos y Sol...Ciberseguridad en Blockchain y Smart Contracts: Explorando los Desafíos y Sol...
Ciberseguridad en Blockchain y Smart Contracts: Explorando los Desafíos y Sol...Jose Manuel Ortega Candel
 
Evolution of security strategies in K8s environments- All day devops
Evolution of security strategies in K8s environments- All day devops Evolution of security strategies in K8s environments- All day devops
Evolution of security strategies in K8s environments- All day devops Jose Manuel Ortega Candel
 
Evolution of security strategies in K8s environments.pdf
Evolution of security strategies in K8s environments.pdfEvolution of security strategies in K8s environments.pdf
Evolution of security strategies in K8s environments.pdfJose Manuel Ortega Candel
 
Implementing Observability for Kubernetes.pdf
Implementing Observability for Kubernetes.pdfImplementing Observability for Kubernetes.pdf
Implementing Observability for Kubernetes.pdfJose Manuel Ortega Candel
 
Seguridad en arquitecturas serverless y entornos cloud
Seguridad en arquitecturas serverless y entornos cloudSeguridad en arquitecturas serverless y entornos cloud
Seguridad en arquitecturas serverless y entornos cloudJose Manuel Ortega Candel
 
Construyendo arquitecturas zero trust sobre entornos cloud
Construyendo arquitecturas zero trust sobre entornos cloud Construyendo arquitecturas zero trust sobre entornos cloud
Construyendo arquitecturas zero trust sobre entornos cloud Jose Manuel Ortega Candel
 
Tips and tricks for data science projects with Python
Tips and tricks for data science projects with Python Tips and tricks for data science projects with Python
Tips and tricks for data science projects with Python Jose Manuel Ortega Candel
 
Sharing secret keys in Docker containers and K8s
Sharing secret keys in Docker containers and K8sSharing secret keys in Docker containers and K8s
Sharing secret keys in Docker containers and K8sJose Manuel Ortega Candel
 
Python para equipos de ciberseguridad(pycones)
Python para equipos de ciberseguridad(pycones)Python para equipos de ciberseguridad(pycones)
Python para equipos de ciberseguridad(pycones)Jose Manuel Ortega Candel
 
Shodan Tips and tricks. Automatiza y maximiza las búsquedas shodan
Shodan Tips and tricks. Automatiza y maximiza las búsquedas shodanShodan Tips and tricks. Automatiza y maximiza las búsquedas shodan
Shodan Tips and tricks. Automatiza y maximiza las búsquedas shodanJose Manuel Ortega Candel
 
ELK para analistas de seguridad y equipos Blue Team
ELK para analistas de seguridad y equipos Blue TeamELK para analistas de seguridad y equipos Blue Team
ELK para analistas de seguridad y equipos Blue TeamJose Manuel Ortega Candel
 
Monitoring and managing Containers using Open Source tools
Monitoring and managing Containers using Open Source toolsMonitoring and managing Containers using Open Source tools
Monitoring and managing Containers using Open Source toolsJose Manuel Ortega Candel
 
Python memory managment. Deeping in Garbage collector
Python memory managment. Deeping in Garbage collectorPython memory managment. Deeping in Garbage collector
Python memory managment. Deeping in Garbage collectorJose Manuel Ortega Candel
 

Mehr von Jose Manuel Ortega Candel (20)

Asegurando tus APIs Explorando el OWASP Top 10 de Seguridad en APIs.pdf
Asegurando tus APIs Explorando el OWASP Top 10 de Seguridad en APIs.pdfAsegurando tus APIs Explorando el OWASP Top 10 de Seguridad en APIs.pdf
Asegurando tus APIs Explorando el OWASP Top 10 de Seguridad en APIs.pdf
 
PyGoat Analizando la seguridad en aplicaciones Django.pdf
PyGoat Analizando la seguridad en aplicaciones Django.pdfPyGoat Analizando la seguridad en aplicaciones Django.pdf
PyGoat Analizando la seguridad en aplicaciones Django.pdf
 
Ciberseguridad en Blockchain y Smart Contracts: Explorando los Desafíos y Sol...
Ciberseguridad en Blockchain y Smart Contracts: Explorando los Desafíos y Sol...Ciberseguridad en Blockchain y Smart Contracts: Explorando los Desafíos y Sol...
Ciberseguridad en Blockchain y Smart Contracts: Explorando los Desafíos y Sol...
 
Evolution of security strategies in K8s environments- All day devops
Evolution of security strategies in K8s environments- All day devops Evolution of security strategies in K8s environments- All day devops
Evolution of security strategies in K8s environments- All day devops
 
Evolution of security strategies in K8s environments.pdf
Evolution of security strategies in K8s environments.pdfEvolution of security strategies in K8s environments.pdf
Evolution of security strategies in K8s environments.pdf
 
Implementing Observability for Kubernetes.pdf
Implementing Observability for Kubernetes.pdfImplementing Observability for Kubernetes.pdf
Implementing Observability for Kubernetes.pdf
 
Computación distribuida usando Python
Computación distribuida usando PythonComputación distribuida usando Python
Computación distribuida usando Python
 
Seguridad en arquitecturas serverless y entornos cloud
Seguridad en arquitecturas serverless y entornos cloudSeguridad en arquitecturas serverless y entornos cloud
Seguridad en arquitecturas serverless y entornos cloud
 
Construyendo arquitecturas zero trust sobre entornos cloud
Construyendo arquitecturas zero trust sobre entornos cloud Construyendo arquitecturas zero trust sobre entornos cloud
Construyendo arquitecturas zero trust sobre entornos cloud
 
Tips and tricks for data science projects with Python
Tips and tricks for data science projects with Python Tips and tricks for data science projects with Python
Tips and tricks for data science projects with Python
 
Sharing secret keys in Docker containers and K8s
Sharing secret keys in Docker containers and K8sSharing secret keys in Docker containers and K8s
Sharing secret keys in Docker containers and K8s
 
Implementing cert-manager in K8s
Implementing cert-manager in K8sImplementing cert-manager in K8s
Implementing cert-manager in K8s
 
Python para equipos de ciberseguridad(pycones)
Python para equipos de ciberseguridad(pycones)Python para equipos de ciberseguridad(pycones)
Python para equipos de ciberseguridad(pycones)
 
Python para equipos de ciberseguridad
Python para equipos de ciberseguridad Python para equipos de ciberseguridad
Python para equipos de ciberseguridad
 
Shodan Tips and tricks. Automatiza y maximiza las búsquedas shodan
Shodan Tips and tricks. Automatiza y maximiza las búsquedas shodanShodan Tips and tricks. Automatiza y maximiza las búsquedas shodan
Shodan Tips and tricks. Automatiza y maximiza las búsquedas shodan
 
ELK para analistas de seguridad y equipos Blue Team
ELK para analistas de seguridad y equipos Blue TeamELK para analistas de seguridad y equipos Blue Team
ELK para analistas de seguridad y equipos Blue Team
 
Monitoring and managing Containers using Open Source tools
Monitoring and managing Containers using Open Source toolsMonitoring and managing Containers using Open Source tools
Monitoring and managing Containers using Open Source tools
 
Python Memory Management 101(Europython)
Python Memory Management 101(Europython)Python Memory Management 101(Europython)
Python Memory Management 101(Europython)
 
SecDevOps containers
SecDevOps containersSecDevOps containers
SecDevOps containers
 
Python memory managment. Deeping in Garbage collector
Python memory managment. Deeping in Garbage collectorPython memory managment. Deeping in Garbage collector
Python memory managment. Deeping in Garbage collector
 

Kürzlich hochgeladen

How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...Akihiro Suda
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfYashikaSharma391629
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 

Kürzlich hochgeladen (20)

How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 

Webscraping with asyncio

  • 5. Agenda ▶ Webscraping python tools ▶ Requests vs aiohttp ▶ Introduction to asyncio ▶ Async client/server ▶ Building a webcrawler with asyncio ▶ Alternatives to asyncio
  • 7. Python tools ➢ Requests ➢ Beautiful Soup 4 ➢ Pyquery ➢ Webscraping ➢ Scrapy
  • 8. Python tools ➢ Mechanize ➢ Robobrowser ➢ Selenium
  • 10. Web scraping with Python 1. Download webpage with HTTP module(requests,urllib,aiohttp) 2. Parse the page with BeautifulSoup/lxml 3. Select elements with Regular expressions,XPath or css selectors 4. Store results in a database,csv,json
  • 12. BeautifulSoup ▶ soup = BeautifulSoup(html_doc,’html.parser’) ▶ Print all: print(soup.prettify()) ▶ Print text: print(soup.get_text()) from bs4 import BeautifulSoup
  • 13. BeautifulSoup functions ▪ find_all(‘a’)→Returns all links ▪ find(‘title’)→Returns the first element <title> ▪ get(‘href’)→Returns the attribute href value ▪ (element).text → Returns the text inside an element for link in soup.find_all('a'): print(link.get('href'))
  • 22. Spiders /crawlers ▶ A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider. https://en.wikipedia.org/wiki/Web_crawler
  • 25. Scrapy ▶ Uses a mechanism based on XPath expressions called Xpath Selectors. ▶ Uses Parser LXML to find elements ▶ Twisted for asynchronous operations
  • 26. Scrapy advantages ▶ Faster than mechanize because it uses twisted for asynchronous operations. ▶ Scrapy has better support for html parsing. ▶ Scrapy has better support for unicode characters, redirections, gzipped responses, encodings. ▶ You can export the extracted data directly to JSON,XML and CSV.
  • 27. Export data ▶ scrapy crawl <spider_name> ▶ $ scrapy crawl <spider_name> -o items.json -t json ▶ $ scrapy crawl <spider_name> -o items.csv -t csv ▶ $ scrapy crawl <spider_name> -o items.xml -t xml ▶
  • 29. The concurrency problem ▶ Different approaches: ▶ Multiple processes ▶ Threads ▶ Separate distributed machines ▶ Asynchronous programming(event loop)
  • 30. Requests problems ▶ Requests operations are blocking the main thread ▶ It pauses until operation completed ▶ We need one thread for each request if we want non-blocking operations
  • 31. Threads problems ▶ Get Overhead ▶ Stack size ▶ Context changes ▶ Synchronization
  • 32. Solution ▶NOT USE THREADS ▶USE ONE THREAD ▶+ EVENT LOOP
  • 33. New concepts ▶ Event loop ▶ Async ▶ Await ▶ Futures ▶ Coroutines ▶ Tasks ▶ Executors
  • 34. Event loop implementations ▶ Asyncio ▶ https://docs.python.org/3.4/library/asyncio.html ▶ Tornado web server ▶ http://www.tornadoweb.org/en/stable ▶ Twisted ▶ https://twistedmatrix.com ▶ Gevent ▶ http://www.gevent.org
  • 36. Asyncio ▶ Python >=3.3 ▶ Event-loop framework ▶ I/O Asynchronous ▶ Non-blocking approach with sockets ▶ All requests in one thread ▶ Event-driven switching ▶ aio-http module for make requests asynchronously
  • 38. Requests vs aiohttp #!/usr/local/bin/python3.5 import asyncio from aiohttp import ClientSession async def hello(): async with ClientSession() as session: async with session.get("http://httpbin.org/headers") as response: response = await response.read() print(response) loop = asyncio.get_event_loop() loop.run_until_complete(hello()) import requests def hello() return requests.get("http://httpbin.org/get") print(hello())
  • 39. Event Loop ▶ An event loop allow us to write asynchronous code using callbacks or coroutines. ▶ Event loop function like task switcher,just the way operating systems switch between active tasks on the CPU. ▶ The idea is that we have an event loop running until all tasks scheduled are completed. ▶ Features and tasks are created through the event loop.
  • 40. Event Loop ▶ An event loop is used to orchestrate the execution of the coroutines. ▶ asyncio.get_event_loop() ▶ asyncio.run_until_complete(coroutines,futures) ▶ asyncio.run_forever() ▶ asyncio.stop()
  • 42. Coroutines ▶ Coroutines are functions that allow for multitasking without requiring multiple threads or processes. ▶ Coroutines are like functions, but they can be suspended or resumed at certain points in the code. ▶ Coroutines allow write asynchronous code that combines the efficiency of callbacks with the classic good looks of multithreaded.
  • 43. Coroutines 3.4 vs 3.5 import asyncio @asyncio.coroutine def fetch(self, url): response = yield from self.session.get(url) body = yield from response.read() import asyncio async def fetch(self, url): response = await self.session.get(url) body = await response.read()
  • 44. Coroutines in event loop #!/usr/local/bin/python3.5 import asyncio import aiohttp async def get_page(url): response = await aiohttp.request('GET', url) body = await response.read() print(body) loop = asyncio.get_event_loop() loop.run_until_complete(asyncio.wait([get_page('http://python.org'), get_page('http://pycon.org')]))
  • 45. Requests in event loop async def getpage_with_requests(url): return await loop.run_in_executor(None,requests.get,url) #methods equivalents async def getpage_with_aiohttp(url): with aitohttp.ClientSession() as session: async with session.get(url) as response: return await response.read()
  • 46. Tasks ▶ The asyncio.Task class is a subclass of asyncio.Future to encapsulate and manage coroutines. ▶ Allow independently running tasks to run concurrently with other tasks on the same event loop. ▶ When a coroutine is wrapped in a task, it connects the task to the event loop.
  • 47. Tasks
  • 48. Tasks
  • 49. Tasks
  • 51. Futures ▶ To manage an object Future in Asyncio, we must declare the following: ▶ import asyncio ▶ future = asyncio.Future() ▶ https://docs.python.org/3/library/asyncio -task.html#future ▶ https://docs.python.org/3/library/concurr ent.futures.html
  • 52. Futures ▶ The asyncio.Future class is essentially a promise of a result. ▶ A Future will returns the results when they are available, and once it receives results, it will pass them along to all the registered callbacks. ▶ Each future is a task to be executed in the event loop
  • 54. Semaphores ▶ Adding synchronization ▶ Limiting number of concurrent requests. ▶ The argument indicates the number of simultaneous requests we want to allow. ▶ sem = asyncio.Semaphore(5) with (await sem): page = await get(url, compress=True)
  • 55. Async Client /server ▶ asyncio.start_server ▶ server = asyncio.start_server(handle_connection,host=HOST,port=PORT)
  • 56. Async Client /server ▶ asyncio.open_connection
  • 59. Async Web crawler ▶ Send asynchronous requests to all the links on a web page and add the responses to a queue to be processed as we go. ▶ Coroutines allow running independent tasks and processing their results in 3 ways: ▶ Using asyncio.as_completed →by processing the results as they come. ▶ Using asyncio.gather→ only once they have all finished loading. ▶ Using asyncio.ensure_future
  • 60. Async Web crawler import asyncio import random @asyncio.coroutine def get_url(url): wait_time = random.randint(1, 4) yield from asyncio.sleep(wait_time) print('Done: URL {} took {}s to get!'.format(url, wait_time)) return url, wait_time @asyncio.coroutine def process_results_as_come_in(): coroutines = [get_url(url) for url in ['URL1', 'URL2', 'URL3']] for coroutine in asyncio.as_completed(coroutines): url, wait_time = yield from coroutine print('Coroutine for {} is done'.format(url)) def main(): loop = asyncio.get_event_loop() print(“Process results as they come in:") loop.run_until_complete(process_results_as_come_in()) if __name__ == '__main__': main() asyncio.as_completed
  • 61. Async Web crawler execution
  • 62. Async Web crawler import asyncio import random @asyncio.coroutine def get_url(url): wait_time = random.randint(1, 4) yield from asyncio.sleep(wait_time) print('Done: URL {} took {}s to get!'.format(url, wait_time)) return url, wait_time @asyncio.coroutine def process_once_everything_ready(): coroutines = [get_url(url) for url in ['URL1', 'URL2', 'URL3']] results = yield from asyncio.gather(*coroutines) print(results) def main(): loop = asyncio.get_event_loop() print(“Process results once they are all ready:") loop.run_until_complete(process_once_everything_ready()) if __name__ == '__main__': main() asyncio.gather
  • 63. asyncio.gather From Python documentation, this is what asyncio.gather does: asyncio.gather(*coros_or_futures, loop=None, return_exceptions=False) Return a future aggregating results from the given coroutine objects or futures. All futures must share the same event loop. If all the tasks are done successfully, the returned future’s result is the list of results (in the order of the original sequence, not necessarily the order of results arrival). If return_exceptions is True, exceptions in the tasks are treated the same as successful results, and gathered in the result list; otherwise, the first raised exception will be immediately propagated to the returned future.
  • 64. Async Web crawler import asyncio import random @asyncio.coroutine def get_url(url): wait_time = random.randint(1, 4) yield from asyncio.sleep(wait_time) print('Done: URL {} took {}s to get!'.format(url, wait_time)) return url, wait_time @asyncio.coroutine def process_ensure_future(): tasks= [asyncio.ensure_future(get_url(url) )for url in ['URL1', 'URL2', 'URL3']] results = yield from asyncio.wait(tasks) print(results) def main(): loop = asyncio.get_event_loop() print(“Process ensure future:") loop.run_until_complete(process_ensure_future()) if __name__ == '__main__': main() asyncio.ensure_future
  • 65. Async Web crawler execution
  • 68. Async Web downloader ▶ With get_partial_content ▶ With download_coroutine
  • 71. Async Extracting links execution ▶ With bs4 ▶ With regex
  • 72. Alternatives to asyncio ▶ ThreadPoolExecutor ▶ https://docs.python.org/3.5/library/concurrent.futures.html#concurrent.fut ures.ThreadPoolExecutor ▶ ProcessPoolExecutor ▶ https://docs.python.org/3.5/library/concurrent.futures.html#concur rent.futures.ProcessPoolExecutor ▶ Parallel python ▶ http://www.parallelpython.com
  • 73. Parallel python ▶ SMP(symmetric multiprocessing) architecture with multiple cores in the same machine ▶ Distribute tasks in multiple machines ▶ Cluster
  • 75. References ▶ http://www.crummy.com/software/BeautifulSoup ▶ http://scrapy.org ▶ http://docs.webscraping.com ▶ https://github.com/KeepSafe/aiohttp ▶ http://aiohttp.readthedocs.io/en/stable/ ▶ https://docs.python.org/3.4/library/asyncio.html ▶ https://github.com/REMitchell/python-scraping
  • 76. Books
  • 77. Books