It’s that time of year again! Google Summer of Code 2016 applications are upon us and we welcome any and all students who are interested in open source and web scraping. For those just hearing about this program, Google Summer of Code provides stipends to students who are interested in writing code for open source projects. This is a full-time commitment and a great way to get involved in the open source community.
This is our third year participating in this prestigious program and we’re excited to announce projects around Scrapy, Portia, Splash, and Frontera. We feature projects ranging from “Easy” to “Advanced” and we’re happy to welcome students with different levels of technical skill.
Student applications are accepted from March 14 to 25 and students accepted to our projects will be announced on April 22, 2016. We’re very excited to mentor Python enthusiasts!
Here are our available open source project ideas. Browse around:
Scrapy is our Python-based web scraping framework.
- Asyncio Prototype: build a working prototype of an asyncio-based Scrapy.
- IPython IDE for Scrapy: make it possible to develop Scrapy spiders interactively through IPython Notebooks.
- Scrapy benchmarking suite: build a more comprehensive benchmarking suite to profile and address CPU bottlenecks and memory issues.
- New HTTP/1.1 download handler: replace the current HTTP/1.1 download handler with an in-house solution that is easily customizable to crawling needs.
- Support for spiders in other languages: allow Scrapy to run spiders defined in languages other than Python.
- Scrapy integration tests: add integration tests for different networking scenarios.
- New Scrapy signal dispatching: replace the PyDispatcher-based signal dispatcher backend with an alternative.
Portia is our visual web scraper. This tool allows you to get the data you need from websites without needing to write a single line of code.
- Portia spider generation: make Portia spiders less sensitive to layout changes on websites by detecting when the layout of a website changes and using crawled datasets and the new page structure to auto-repair the spiders.
Splash is our JavaScript rendering service, a scriptable headless browser for crawling dynamic websites.
- Web scraping helpers: provide an easy way to click a link, submit a form, and extract data from a webpage using Splash Scripts.
- Migrate to QtWebEngine: migrate the Splash rendering engine from QtWebKit to QtWebEngine.
Frontera is a web crawling framework consisting of crawl frontier and distribution/scaling primitives, allowing you to build large-scale online web crawlers. Frontera takes care of the logic and policies to follow during a crawl. It stores and prioritizes links extracted by the crawler to decide which pages to visit next, and is capable of doing this in a distributed manner.
- Reliable Queue|Spider communication: provide reliable communication between the ZeroMQ queue and spiders, and fix known issues with message queues being consumed when no spiders are running.
- Frontera Web UI: create a web management UI for Frontera, allowing people to monitor errors, download speed, and storage contents, as well as perform advanced crawler management.
- Frontera cluster provisioning service: build a service to monitor host resources and Frontera processes in a cluster, automatically restarting failed processes and providing an easy way to configure each component.
- Python 3 support: migrate the framework code to Python 3 while maintaining compatibility with Python 2.
- Docker support: provide Docker containers for all Frontera components allowing an easier setup of the distributed mode.
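To give a rough sense of how a Frontera crawl is wired together, configuration happens through a settings module; the sketch below is illustrative only, and the specific backend path and limits are assumptions rather than recommendations:

```python
# Sketch of a Frontera settings module (values are illustrative;
# consult the Frontera documentation for your version's settings).
BACKEND = 'frontera.contrib.backends.memory.FIFO'  # in-memory FIFO queue backend
MAX_REQUESTS = 2000       # stop the crawl after this many requests (0 = no limit)
MAX_NEXT_REQUESTS = 64    # how many queued requests to hand to the fetcher per batch
```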
Scrapinghub is a sub-organization under the umbrella of the Python Software Foundation, so please take a moment to read through their guidelines and expectations.
Not only is this a great opportunity to hone your coding skills on popular open source projects, but we’ve actually hired two of our previous participants. You never know what might come of participating in Scrapinghub’s Google Summer of Code 2016 projects!