Goutte, a simple PHP Web Scraper
================================
Goutte is a screen scraping and web crawling library for PHP.
Goutte provides a nice API to crawl websites and extract data from the HTML/XML
responses.
Requirements
------------
Goutte depends on PHP 5.5+ and Guzzle 6+.
.. tip::

    If you need support for PHP 5.4 or Guzzle 4-5, use Goutte 2.x.
    If you need support for PHP 5.3 or Guzzle 3, use Goutte 1.x.

Installation
------------
Add ``fabpot/goutte`` as a dependency in the ``require`` section of your
``composer.json`` file:

.. code-block:: bash

    composer require fabpot/goutte

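The command above records the dependency for you. As a rough sketch, the resulting entry in ``composer.json`` might look like the fragment below; the ``^3.0`` constraint is an assumption based on the requirements listed earlier (2.x and 1.x cover the older PHP and Guzzle versions), so check the version Composer actually selects:

.. code-block:: json

    {
        "require": {
            "fabpot/goutte": "^3.0"
        }
    }
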
.. tip::

    You can also download the `Goutte.phar`_ file:

    .. code-block:: php

        require_once '/path/to/goutte.phar';

    The phars for Goutte 1.x are also available for `download
    <http://get.sensiolabs.org/goutte-v1.0.7.phar>`_.

Usage
-----
Create a Goutte Client instance (which extends
``Symfony\Component\BrowserKit\Client``):

.. code-block:: php

    use Goutte\Client;

    $client = new Client();

Make requests with the ``request()`` method:

.. code-block:: php

    // Go to the symfony.com website
    $crawler = $client->request('GET', 'http://www.symfony.com/blog/');

The method returns a ``Crawler`` object
(``Symfony\Component\DomCrawler\Crawler``).

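Fine-tune the HTTP layer by passing in your own Guzzle client. Guzzle 6 takes
its settings as constructor options, for example a 60 second request timeout:

.. code-block:: php

    $guzzleClient = new \GuzzleHttp\Client(array('timeout' => 60));
    $client->setClient($guzzleClient);
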
Click on links:

.. code-block:: php

    // Click on the "Security Advisories" link
    $link = $crawler->selectLink('Security Advisories')->link();
    $crawler = $client->click($link);

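To see what ``link()`` hands over to ``click()``, the lookup can be tried offline on an inline HTML string. This is a minimal sketch rather than documented Goutte usage: the markup, the ``example.com`` base URI, and the ``vendor/autoload.php`` path (a standard Composer install) are assumptions made for illustration.

.. code-block:: php

    require_once __DIR__.'/vendor/autoload.php'; // assumes a Composer install

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical page fragment; the second argument is the URI the page
    // is pretended to come from, needed to resolve relative links
    $html = '<html><body>'
        .'<a href="/blog/category/security-advisories">Security Advisories</a>'
        .'</body></html>';
    $crawler = new Crawler($html, 'http://example.com/blog/');

    // selectLink() matches on the link text, link() wraps the first match
    $link = $crawler->selectLink('Security Advisories')->link();

    // The relative href is resolved against the base URI
    print $link->getUri()."\n";

With a live client, this ``$link`` object is what ``$client->click()`` would follow.
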
Extract data:

.. code-block:: php

    // Get the latest posts in this category and display their titles
    $crawler->filter('h2 > a')->each(function ($node) {
        print $node->text()."\n";
    });

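The same ``filter()`` and ``each()`` calls can also collect values instead of printing them, since ``each()`` returns an array of whatever the closure returns. The sketch below runs against a hypothetical HTML string rather than a live page; the markup and the ``vendor/autoload.php`` path are assumptions.

.. code-block:: php

    require_once __DIR__.'/vendor/autoload.php'; // assumes a Composer install

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical blog listing, standing in for a fetched page
    $html = '<html><body>'
        .'<h2><a href="/blog/first">First post</a></h2>'
        .'<h2><a href="/blog/second">Second post</a></h2>'
        .'<p><a href="/about">About</a></p>'
        .'</body></html>';
    $crawler = new Crawler($html);

    // "h2 > a" keeps only links that are direct children of an h2,
    // so the "About" link inside the paragraph is skipped
    $posts = $crawler->filter('h2 > a')->each(function ($node) {
        return array(
            'title' => $node->text(),
            'url'   => $node->attr('href'),
        );
    });

    print_r($posts);
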
Submit forms:

.. code-block:: php

    $crawler = $client->request('GET', 'http://github.com/');
    $crawler = $client->click($crawler->selectLink('Sign in')->link());
    $form = $crawler->selectButton('Sign in')->form();
    $crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));
    $crawler->filter('.flash-error')->each(function ($node) {
        print $node->text()."\n";
    });

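Form values do not have to be passed to ``submit()``; they can also be set on the form object beforehand. The following offline sketch illustrates this with a hypothetical login form (not GitHub's actual markup); the ``example.com`` URI and the ``vendor/autoload.php`` path are likewise assumptions.

.. code-block:: php

    require_once __DIR__.'/vendor/autoload.php'; // assumes a Composer install

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical login form; the base URI is needed to build the form
    $html = '<html><body>'
        .'<form action="/session" method="post">'
        .'<input type="text" name="login" />'
        .'<input type="password" name="password" />'
        .'<input type="submit" value="Sign in" />'
        .'</form>'
        .'</body></html>';
    $crawler = new Crawler($html, 'http://example.com/login');

    // Locate the form through its submit button, as in the example above
    $form = $crawler->selectButton('Sign in')->form();

    // Fields can be set one at a time through array access...
    $form['login'] = 'fabpot';
    // ...or several at once
    $form->setValues(array('password' => 'xxxxxx'));

    print $form->getMethod()."\n";
    print_r($form->getValues());

With a live client, the populated form would then be sent with ``$client->submit($form)``.
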
More Information
----------------
Read the documentation of the BrowserKit and `DomCrawler
<http://symfony.com/doc/any/components/dom_crawler.html>`_ Symfony Components
for more information about what you can do with Goutte.

Pronunciation
-------------
Goutte is pronounced ``goot``, i.e. it rhymes with ``boot`` and not ``out``.

Technical Information
---------------------
Goutte is a thin wrapper around the following fine PHP libraries:
* Symfony Components: BrowserKit, CssSelector and DomCrawler;
* `Guzzle`_ HTTP Component.
License
-------
Goutte is licensed under the MIT license.
.. _`Composer`: http://getcomposer.org
.. _`Goutte.phar`: http://get.sensiolabs.org/goutte.phar
.. _`Guzzle`: http://docs.guzzlephp.org