Open Source Data

One of the best ways to increase your effectiveness as a GIS programmer is to learn how to manipulate text-based information. GIS data is often collected and shared in more "raw" formats, such as a spreadsheet saved as CSV (comma-separated values), a list of coordinates in a text file, or a response from a web service in XML, JSON, or GeoJSON.

When faced with these files, first find out whether your GIS software already comes with a tool or script that can read the data or convert it to a format the software can use. If no such tool or script exists, you'll need to do some programmatic work to read the file and separate out the pieces of text that you really need.

For example, a web service may return many lines of XML describing all the readings at a weather station, when all you're really interested in are the coordinates of the weather station and the annual average temperature. Parsing the response involves writing some code to read through the lines and tags in the XML and isolate only those three values (two coordinates and one temperature). Similarly, many APIs provide data in JSON (JavaScript Object Notation) format, and parsing the response involves accessing the desired keys. Both JSON and GeoJSON have published documentation. There are some differences between the two formats, and working with them can call for different packages.
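
The weather station case can be sketched as follows. This is only an illustration: the XML text, its tag names, and the station values are made up for the example, and a real service will use its own structure. The sketch uses the standard library's xml.etree.ElementTree module as a convenient choice; other XML libraries would work in a similar way.

```python
import xml.etree.ElementTree as ET

# Hypothetical response from a weather web service. The tag names and
# values are invented for illustration only.
response = """
<station id="KXYZ">
  <name>Example Station</name>
  <latitude>40.79</latitude>
  <longitude>-77.86</longitude>
  <readings>
    <reading date="2020-01-01" temp="-1.5"/>
    <reading date="2020-01-02" temp="0.3"/>
  </readings>
  <annualAverageTemp units="C">10.2</annualAverageTemp>
</station>
"""

# Turn the text into a tree of elements.
root = ET.fromstring(response.strip())

# Pull out only the three values of interest and ignore everything else.
lat = float(root.findtext("latitude"))
lon = float(root.findtext("longitude"))
avg_temp = float(root.findtext("annualAverageTemp"))

print(lat, lon, avg_temp)
```

The individual readings are never touched; the code asks the parsed tree for the three tags it needs by name.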

There are several different approaches to parsing. Usually, the wisest is to see whether a Python module exists that will examine the text for you and turn it into an object that you can then work with. In this lesson, you will work with the Python "csv" module, which can read comma-delimited values and turn them into a Python list. Other helpful libraries of this kind include lxml and xml.dom for parsing XML, BeautifulSoup for parsing HTML, and the built-in json module for working with JSON.
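
As a minimal sketch of letting a module do the parsing, the example below reads a small piece of CSV text with the csv module and a small GeoJSON-style feature with the json module. The sample text, station names, and coordinates are assumptions made for illustration; with a real file you would pass an open file object to csv.reader instead of an in-memory string.

```python
import csv
import io
import json

# Hypothetical CSV content; io.StringIO lets us treat the string like a file.
csv_text = "name,lat,lon\nStation A,40.79,-77.86\nStation B,41.2,-76.9\n"

# csv.reader yields one list of strings per row.
rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]
print(header)
print(records)

# Hypothetical GeoJSON-style point feature. GeoJSON stores point
# coordinates in longitude, latitude order.
json_text = (
    '{"type": "Feature",'
    ' "geometry": {"type": "Point", "coordinates": [-77.86, 40.79]},'
    ' "properties": {"name": "Station A"}}'
)

# json.loads turns the text into nested dictionaries and lists,
# so the values are reached by key.
feature = json.loads(json_text)
lon, lat = feature["geometry"]["coordinates"]
print(feature["properties"]["name"], lat, lon)
```

Note that the csv module returns every value as a string, so numeric columns still need to be converted with float() or int() before use.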

If no module or library fits your parsing needs, you'll have to extract the information from the data yourself using a combination of Python's string manipulation methods and packages. It is common to translate the data from one package's structure to another, or from one type to another, to be able to extract what you need. One of the most helpful string manipulation methods is split(), which turns a big string into a list of smaller strings based on some delimiting character, such as a space or comma. When you write your own parser, however, it's hard to anticipate all the exceptional cases you might run across. For example, a comma-separated value file might have substrings that naturally contain commas, such as dates or addresses. In these cases, splitting the string using a simple comma as the delimiter is not sufficient, and you need to add extra logic or use regular expressions (regex).

text = "green,red,blue"
text.split(",")

['green', 'red', 'blue']
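
The comma pitfall is easy to reproduce. The sketch below uses an invented line of data in which an address field contains a comma and is wrapped in quotes, and compares a plain split with the csv module, which understands quoted fields.

```python
import csv

# Hypothetical record: the second field contains a comma, so it is quoted.
line = 'Old Main,"University Park, PA",40.7965,-77.8627'

# A plain split cuts the address in two and leaves the quotes in place.
naive = line.split(",")
print(len(naive))

# csv.reader accepts any iterable of lines and keeps the quoted field whole.
parsed = next(csv.reader([line]))
print(len(parsed))
print(parsed[1])
```

The plain split produces five pieces instead of four, so every column after the address would be shifted by one position.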

Another pitfall when parsing is the use of "magic numbers" to slice off a particular number of characters in a string, to refer to a specific column number in a spreadsheet, and so on. If the structure of the data changes, or if the script is applied to data with a slightly different structure, the code could be rendered inoperable and would require some precision surgery to fix. People who read your code and see a number other than 0 (to begin a series) or 1 (to increase a counter) will often be left wondering how the number was derived and what it refers to. In programming, numbers other than 0 or 1 are magic numbers that should typically be avoided, or at least accompanied by a comment explaining what the number refers to.
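
One way to avoid a magic column number, sketched here under the assumption that the file has a header row, is to look the column up by name. The column names and values are hypothetical.

```python
import csv
import io

# Hypothetical CSV content with a header row.
csv_text = "station,lat,lon,avg_temp\nA,40.79,-77.86,10.2\nB,41.2,-76.9,9.4\n"

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)

# Instead of hard-coding row[3], find the position from the header.
# If the columns are reordered later, this line still finds the right one.
temp_index = header.index("avg_temp")

temps = [float(row[temp_index]) for row in reader]
print(temps)
```

If the expected column is missing, header.index() raises a ValueError, which is usually preferable to silently reading the wrong column.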

There is an endless variety of parsing scenarios that you can encounter. This lesson will attempt to teach you the general approach by walking through a couple of examples.

Lesson content developed by Jan Wallgrun and James O’Brien
