NGA Advanced Python Programming for GIS, GLGI 3001-1

Sites' Robots.txt


Believe it or not, the internet does try to have some rule and order, a 'code of conduct'. After all, without rules there is chaos. It is up to each website to create and provide this code of conduct, which sets rules on what bots, crawlers, and scrapers may access within the site. This request can be a legally binding document, so it is a good idea to make sure you stay within what is allowed. The code of conduct is the robots.txt file.

You can view a site's robots.txt file to see which endpoints it allows for scraping by adding /robots.txt to the end of the domain in the browser. For example:

http://www.timeanddate.com/robots.txt 

returns a long list of endpoints the site does and does not allow, as well as rules for specific crawlers and bots.

# http://web.nexor.co.uk/mak/doc/robots/norobots.html 
# 
# internal note, this file is in git now! 

User-agent: MSIECrawler 
Disallow: /
User-agent: PortalBSpider 
Disallow: / 
User-agent: Bytespider 
Disallow: / 
User-agent: Mediapartners-Google* 
Disallow: /
User-agent: Yahoo Pipes 1.0 
Disallow: / 
# disallow any urls with ? in 
User-Agent: AhrefsBot 
Disallow: /*? 
Disallow: /information/privacy.html 
Disallow: /information/terms-conditions.html 
User-Agent: dotbot 
Disallow: /*? 
Disallow: /worldclock/results.html 
Disallow: /scripts/tzq.php 
User-agent: ScoutJet 
Disallow: /worldclock/distanceresult.html 
Disallow: /worldclock/distanceresult.html* 
Disallow: /information/privacy.html 
Disallow: /information/terms-conditions.html 
Crawl-delay: 1 

User-agent: * 
Allow: /calendar/create.html?*year=2022 
Allow: /calendar/create.html?*year=2023 
Allow: /calendar/create.html?*year=2024 
Disallow: /adverts/ 
Disallow: /advgifs/ 
Disallow: /eclipse/in/*?starty= 
Disallow: /eclipse/in/*?iso 
Allow:    /eclipse/in/*?iso=202 
… 
Disallow: /worldclock/sunearth.html? 
Disallow: /worldclock/sunearth.html?iso 

Give that a try in a browser for your favorite website and read through it to see what they request not to be scanned by bots/scrapers and what they do allow.
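You can also check a robots.txt file programmatically. Python's standard library module urllib.robotparser reads and interprets the file for you; the sketch below uses the timeanddate.com rules shown above (the example URLs are just illustrations):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.timeanddate.com/robots.txt")
rp.read()   # download and parse the robots.txt file

# Ask whether the generic user agent "*" may fetch particular URLs.
# Given the rules above, /worldclock/ should be allowed (True) while
# /adverts/ is explicitly disallowed (False).
print(rp.can_fetch("*", "https://www.timeanddate.com/worldclock/"))
print(rp.can_fetch("*", "https://www.timeanddate.com/adverts/"))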

In addition, some things can be done to keep the load that web scraping puts on the server as low as possible, e.g. by making sure the results are stored/cached while the program is running and not constantly queried again unless they may have changed. In this example, while the time changes constantly, one could still run the query only once, calculate the offset from the local computer's current time once, and then always recalculate the current time for State College from this offset and the current local time, as sketched below.
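Here is a minimal sketch of that idea. The function get_time_from_website() is a hypothetical stand-in for the scraping code from the earlier example; here it simply fabricates a time five hours ahead so the sketch runs on its own:

from datetime import datetime, timedelta

def get_time_from_website():
    # hypothetical stand-in for the scraping code; pretend the
    # scraped page reported a time five hours ahead of local time
    return datetime.now() + timedelta(hours=5)

remote_time = get_time_from_website()    # run the query only once
offset = remote_time - datetime.now()    # compute the offset only once

def state_college_now():
    # derive the current remote time from local time + cached offset
    return datetime.now() + offset

print(state_college_now())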

The examples we have seen so far all used simple URLs, although this last example already encoded query parameters in the URL (country and place name), and the response was always an HTML page intended to be displayed in a browser. In addition, there exist web APIs that realize a form of programming interface that can be used via URLs and HTTP requests. Such web APIs are available, for instance, from Twitter to search within recent tweets, from Google Maps, and from Esri. Often there is a business model behind these APIs that requires license fees and some form of authorization.

Web APIs often allow additional parameters to be provided for a particular request by including them in the URL. This works very similarly to a function call; the syntax is just a bit different, with the special symbol ? used to separate the base URL of a particular web API call from its parameters, and the special symbol & used to separate individual parameters. Here is an example of using a URL to query the Google Books API for the search terms "Zandbergen Python":

https://www.googleapis.com/books/v1/volumes?q=Zandbergen%20Python

www.googleapis.com/books/v1/volumes is the base URL for using the web API to perform this kind of query, and q=Zandbergen%20Python is the query parameter specifying what terms we want to search for. The %20 encodes a single space in a URL. If there were more parameters, they would be separated by & symbols like this:

<parameter 1>=<value 1>&<parameter 2>=<value 2>&…  
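In Python, such a URL can be assembled and requested with the standard library; urllib.parse.urlencode takes care of encoding special characters such as the space. A short sketch (it assumes the API returns at least one matching volume):

import json
import urllib.parse
import urllib.request

base_url = "https://www.googleapis.com/books/v1/volumes"
# quote_via=quote encodes the space as %20 (the default would use +)
params = urllib.parse.urlencode({"q": "Zandbergen Python"},
                                quote_via=urllib.parse.quote)

with urllib.request.urlopen(base_url + "?" + params) as response:
    data = json.load(response)   # the API responds with JSON

# print the title of the first matching volume
print(data["items"][0]["volumeInfo"]["title"])

Additional parameters would simply be added to the dictionary passed to urlencode, which inserts the & separators automatically.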

Portions of lesson content developed by Jan Wallgrun and James O’Brien