2015-08-17 17:00:26 -07:00 
										
									 
								 
							 
							
								
							 
							
								 
							
							
Goutte, a simple PHP Web Scraper
================================

Goutte is a screen scraping and web crawling library for PHP.

Goutte provides a nice API to crawl websites and extract data from the HTML/XML
responses.
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
Requirements
------------

Goutte depends on PHP 5.5+ and Guzzle 6+.

.. tip::

    If you need support for PHP 5.4 or Guzzle 4-5, use Goutte 2.x (latest `phar
    <https://github.com/FriendsOfPHP/Goutte/releases/download/v2.0.4/goutte-v2.0.4.phar>`__).

    If you need support for PHP 5.3 or Guzzle 3, use Goutte 1.x (latest `phar
    <https://github.com/FriendsOfPHP/Goutte/releases/download/v1.0.7/goutte-v1.0.7.phar>`__).
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
Installation
------------

Add ``fabpot/goutte`` as a require dependency in your ``composer.json`` file:

.. code-block:: bash

    composer require fabpot/goutte
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
Usage
-----

Create a Goutte Client instance (which extends
``Symfony\Component\BrowserKit\Client``):

.. code-block:: php

    use Goutte\Client;

    $client = new Client();
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
Make requests with the ``request()`` method:

.. code-block:: php

    // Go to the symfony.com website
    $crawler = $client->request('GET', 'http://www.symfony.com/blog/');

The method returns a ``Crawler`` object
(``Symfony\Component\DomCrawler\Crawler``).
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
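Fine-tune cURL options by passing a configured Guzzle client to Goutte (Guzzle 6
clients are immutable, so there is no ``setDefaultOption()`` to call on them):

.. code-block:: php

    // Configure a new Guzzle client with a 60-second cURL timeout and hand it to Goutte
    $client->setClient(new \GuzzleHttp\Client(array('curl' => array(CURLOPT_TIMEOUT => 60))));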
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
Click on links:

.. code-block:: php

    // Click on the "Security Advisories" link
    $link = $crawler->selectLink('Security Advisories')->link();
    $crawler = $client->click($link);
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
Extract data:

.. code-block:: php

    // Get the latest posts in this category and display their titles
    $crawler->filter('h2 > a')->each(function ($node) {
        print $node->text()."\n";
    });
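
When the extracted values are needed as data rather than printed, the closure
passed to ``each()`` can return them, and ``each()`` collects the return values
into an array. The following is a minimal sketch, not taken from the Goutte
documentation: it assumes a Composer autoloader at ``vendor/autoload.php`` and
uses a made-up HTML fragment in place of a live page so that it runs offline.

.. code-block:: php

    <?php
    require __DIR__.'/vendor/autoload.php';

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical markup standing in for a fetched blog page
    $html = '<html><body>'
        .'<h2><a href="/blog/first">First post</a></h2>'
        .'<h2><a href="/blog/second">Second post</a></h2>'
        .'</body></html>';

    $crawler = new Crawler($html);

    // each() returns an array holding whatever the closure returns per node
    $posts = $crawler->filter('h2 > a')->each(function ($node) {
        return array('title' => $node->text(), 'url' => $node->attr('href'));
    });

    foreach ($posts as $post) {
        print $post['title'].' -> '.$post['url']."\n";
    }

The same closure works unchanged on the ``$crawler`` returned by
``$client->request()``.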
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
Submit forms:

.. code-block:: php

    $crawler = $client->request('GET', 'http://github.com/');
    $crawler = $client->click($crawler->selectLink('Sign in')->link());
    $form = $crawler->selectButton('Sign in')->form();
    $crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));
    $crawler->filter('.flash-error')->each(function ($node) {
        print $node->text()."\n";
    });
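
Field values can also be set on the ``Form`` object before submitting, since
DomCrawler forms support array access by field name. A small offline sketch,
assuming a Composer autoloader at ``vendor/autoload.php``; the login markup and
the ``http://example.com/login`` URI are invented for illustration:

.. code-block:: php

    <?php
    require __DIR__.'/vendor/autoload.php';

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical login page; a Form needs an absolute URI to resolve its action
    $html = '<html><body><form action="/session" method="post">'
        .'<input type="text" name="login" />'
        .'<input type="password" name="password" />'
        .'<button type="submit">Sign in</button>'
        .'</form></body></html>';

    $crawler = new Crawler($html, 'http://example.com/login');
    $form = $crawler->selectButton('Sign in')->form();

    // Set fields one by one instead of passing an array to submit()
    $form['login'] = 'fabpot';
    $form['password'] = 'xxxxxx';

    // getValues() shows what would be sent with the request
    $values = $form->getValues();
    print_r($values);

With a live page, ``$client->submit($form)`` then sends these values.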
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
More Information
----------------

Read the documentation of the BrowserKit and `DomCrawler
<http://symfony.com/doc/any/components/dom_crawler.html>`_ Symfony Components
for more information about what you can do with Goutte.
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
Pronunciation
-------------

Goutte is pronounced ``goot``, i.e. it rhymes with ``boot`` and not ``out``.
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
Technical Information
---------------------

Goutte is a thin wrapper around the following fine PHP libraries:

* Symfony Components: BrowserKit, CssSelector and DomCrawler;

* `Guzzle`_ HTTP Component.
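
To see how the pieces fit together, the sketch below swaps Guzzle's network
layer for a canned response, so the BrowserKit and DomCrawler side can be
exercised offline. It assumes Guzzle 6 (``MockHandler``, ``HandlerStack``) and a
Composer autoloader at ``vendor/autoload.php``; the HTML body and the
``http://example.com/`` URL are placeholders.

.. code-block:: php

    <?php
    require __DIR__.'/vendor/autoload.php';

    use Goutte\Client;
    use GuzzleHttp\Client as GuzzleClient;
    use GuzzleHttp\Handler\MockHandler;
    use GuzzleHttp\HandlerStack;
    use GuzzleHttp\Psr7\Response;

    // Guzzle layer: queue one fake HTML response instead of hitting the network
    $mock = new MockHandler(array(
        new Response(200, array('Content-Type' => 'text/html'), '<html><body><h1>Hello</h1></body></html>'),
    ));
    $guzzle = new GuzzleClient(array('handler' => HandlerStack::create($mock)));

    // Goutte layer: a BrowserKit client that delegates HTTP to Guzzle
    $client = new Client();
    $client->setClient($guzzle);

    // DomCrawler layer: the response body comes back wrapped in a Crawler
    $crawler = $client->request('GET', 'http://example.com/');
    $heading = $crawler->filter('h1')->text();
    print $heading."\n";

Injecting a mocked Guzzle client this way is also a convenient basis for unit
tests of scraping code.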
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
License
-------

Goutte is licensed under the MIT license.
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
.. _`Composer`: http://getcomposer.org
.. _`Guzzle`:   http://docs.guzzlephp.org