Python has firmly established itself as one of the main programming languages used in Data Science [1]. There are many freely available Python packages for working with all kinds of data and performing different kinds of analysis, from general statistics to very domain-specific procedures. The same holds true for the spatial data that we deal with in typical GIS projects. There are various packages for importing and exporting data in different GIS formats and for manipulating, analyzing, and visualizing the data with Python code, and you will get to know quite a few of these packages in this lesson. We provide a short overview of the packages we consider most important below.
In Data Science, one common principle is that projects should be cleanly and exhaustively documented, including all data used, how the data has been processed and analyzed, and the results of the analyses. The underlying point of view is that science should be easily reproducible to ensure high quality and to benefit future research as well as application in practice. One way to achieve full transparency and reproducibility is to combine descriptive text, code, and analysis results into a single report that can be published, shared, and used by anyone to rerun the steps of the analysis.
In the Python world, such executable reports are very commonly created in the form of Jupyter Notebooks. Jupyter Notebook [2] is an open-source, web-based software tool that allows you to create documents that combine runnable Python code (and code from other languages as well), its output, and formatted text, images, etc. in a single document. Figure 3.1 shows you a brief part of a Jupyter Notebook, the one we are going to create in this lesson’s walkthrough.
While Jupyter Notebook has been developed within the Python ecosystem, it can be used with other programming languages. For instance, the R language that you may have experience in or heard about as one of the main languages used for statistical computing can be used in a Jupyter Notebook. One of the things you will see in this lesson is how one can actually combine Python and R code within a Jupyter notebook to realize a somewhat complex spatial data science project in the area of species distribution modeling, also termed ecological niche modeling [3].
It may be interesting for you to know that Esri also supports Jupyter Notebook [4] as a platform for conducting GIS projects with the help of their ArcGIS API for Python [5] library. Jupyter Notebook has also been integrated into several Esri products, including ArcGIS Pro [6].
After a quick look at the Python packages most commonly used in the context of data science projects, we will provide a more detailed overview of what is coming in the remainder of the lesson, so that you will be able to follow along easily without getting confused by all the different software packages we are going to use.
It would be impossible to introduce or even just list all the packages available for conducting spatial data analysis projects in Python here, so the following is just a small selection of those that we consider most important.
numpy (Python numpy page [7], Wikipedia numpy page [8]) stands for “Numerical Python” and is a library that adds support for efficiently dealing with large and multi-dimensional arrays and matrices to Python together with a large number of mathematical operations to apply to these arrays, including many matrix and linear algebra operations. Many other Python packages are built on top of the functionality provided by numpy.
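To make the array idea more concrete, here is a minimal sketch of the kind of operations described above. The variable names and elevation values are made up for illustration and do not come from the lesson data.

```python
import numpy as np

# A small 2 x 3 array; the values are invented for illustration only.
elevations = np.array([[120.0, 135.5, 150.0],
                       [110.0, 128.0, 142.5]])

# Arithmetic is applied element-wise to the whole array, no explicit loop needed.
elevations_ft = elevations * 3.28084

# Aggregations can run over the whole array or along a single axis
# (axis=1 gives one mean per row).
row_means = elevations.mean(axis=1)

# Linear algebra: multiplying the 2 x 3 array with its 3 x 2 transpose
# yields a 2 x 2 matrix.
gram = elevations @ elevations.T

print(elevations.shape, row_means, gram.shape)
```

The same pattern of whole-array operations is what other packages such as pandas build on internally.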
matplotlib (Python matplotlib page [9], Wikipedia matplotlib page [10]) is an example of a Python library that builds on numpy. Its main focus is on producing plots and embedding them into Python applications. Take a quick look at its Wikipedia page to see a few examples of plots that can be generated with matplotlib. We will be using matplotlib a few times in this lesson’s walkthrough to quickly create simple map plots of spatial data.
SciPy (Python SciPy page [11], Wikipedia SciPy page [12]) is a large Python library for applications in mathematics, science, and engineering. It is built on top of numpy and is commonly used together with matplotlib, providing methods for optimization, integration, interpolation, signal processing, and image processing. Together, numpy, matplotlib, and SciPy provide functionality roughly similar to that of the well-known software MATLAB. While we won’t be using SciPy in this lesson, it is definitely worth checking out if you're interested in advanced mathematical methods.
pandas (Python pandas page [13], Wikipedia pandas software page [14]) provides functionality to efficiently work with tabular data, so-called data frames, in a way similar to what is possible in R. Reading and writing tabular data (e.g., to and from .csv files), manipulating and subsetting data frames, merging and joining multiple data frames, and time series support are key functionalities provided by the library. A more detailed overview of pandas will be given in Section 3.8.
Shapely (Python Shapely page [15], Shapely User Manual [16]) adds the functionality to work with planar geometric features in Python, including the creation and manipulation of geometries such as points, polylines, and polygons, as well as set-theoretic analysis capabilities (intersection, union, …). It is based on the widely used GEOS [17] library, the geometry engine that is also used in PostGIS [18]. GEOS in turn is a C++ port of the Java Topology Suite [19] (JTS) and largely follows the OGC’s Simple Features Access Specification [20].
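As a rough preview of these data frame operations, the following sketch reads a small table from CSV text, subsets it, joins it with a second table, and writes the result back out as CSV. The species observations, site names, and column names are invented for illustration, and an in-memory string stands in for a real .csv file.

```python
import io
import pandas as pd

# Invented example data; io.StringIO stands in for a .csv file on disk.
csv_text = """species,site,count
robin,A,4
robin,B,7
wren,A,2
"""
observations = pd.read_csv(io.StringIO(csv_text))

# A second, hypothetical table with one row per site.
sites = pd.DataFrame({"site": ["A", "B"], "elevation_m": [120, 340]})

# Subsetting: keep only the rows for one species.
robins = observations[observations["species"] == "robin"]

# Joining: attach the site attributes to each observation via the shared column.
merged = observations.merge(sites, on="site", how="left")

# Writing: without a file path, to_csv returns the CSV content as a string.
csv_out = merged.to_csv(index=False)
print(csv_out)
```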
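The following minimal sketch illustrates creating geometries and applying set-theoretic operations with Shapely. The coordinates are arbitrary planar values chosen so that the results are easy to check by hand.

```python
from shapely.geometry import LineString, Point, Polygon

# Two overlapping 2 x 2 squares in arbitrary planar coordinates.
square = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
other = Polygon([(1, 1), (3, 1), (3, 3), (1, 3)])

# A point and a polyline.
point = Point(0.5, 0.5)
line = LineString([(0, 0), (3, 4)])

# Set-theoretic operations return new geometry objects.
overlap = square.intersection(other)   # the shared 1 x 1 square
combined = square.union(other)         # both squares merged into one polygon

# Spatial predicates and simple measurements.
inside = square.contains(point)
print(overlap.area, combined.area, inside, line.length)
```

Note that Shapely works purely in planar coordinates; it has no notion of a coordinate reference system, which is one of the things geopandas adds on top.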
geopandas (Python geopandas page [21], GeoPandas page [22]) combines pandas and Shapely to facilitate working with geospatial vector data sets in Python. While we will mainly use it to create a shapefile from Python, the provided functionality goes significantly beyond that and includes geoprocessing operations, spatial join, projections, and map visualizations.
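A minimal sketch of how pandas-style attributes and Shapely geometries come together in a GeoDataFrame might look as follows. The place names, approximate coordinates, and the output file name are assumptions made for illustration and are not taken from the walkthrough.

```python
import geopandas as gpd
from shapely.geometry import Point

# Attribute columns behave like a pandas data frame; the geometry column
# holds Shapely objects. Coordinates are approximate lon/lat values.
cities = gpd.GeoDataFrame(
    {"name": ["State College", "Philadelphia"]},
    geometry=[Point(-77.86, 40.79), Point(-75.17, 39.95)],
    crs="EPSG:4326",
)

# Reproject from geographic WGS84 coordinates to Web Mercator.
cities_3857 = cities.to_crs(epsg=3857)

# Writing a shapefile is a single call; it is left commented out here so the
# sketch does not create files on disk ("cities.shp" is a hypothetical name).
# cities.to_file("cities.shp")

print(cities_3857)
```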
GDAL/OGR (Python GDAL page [23], GDAL/OGR Python [24]) is a powerful library for working with GIS data in many different formats, and it is widely used from many programming languages. Originally, it consisted of two separate libraries, GDAL (‘Geospatial Data Abstraction Library’) for working with raster data and OGR (which used to stand for ‘OpenGIS Simple Features Reference Implementation’) for working with vector data, but these have now been merged. The gdal Python package provides an interface to the GDAL/OGR library, which is written in C++. In Section 3.9 and the lesson’s walkthrough, you will see some examples of applying GDAL/OGR.
As we already mentioned at the beginning, Esri provides its own Python API (ArcGIS API for Python page [5]) for working with maps and GIS data via their ArcGIS Online and Portal for ArcGIS web platforms. The API allows for conducting administrative tasks, performing vector and raster analyses, running geocoding tasks, creating map visualizations, and more. While some services can be used autonomously, many are tightly coupled to Esri’s web platforms, and you will need at least a free ArcGIS Online account. The ArcGIS API for Python will be further discussed in Section 3.10.
In this lesson, we will start to work with some software that you are probably not familiar with. We will make extensive use of Python packages that we have not used before to demonstrate how a complex GIS project can be solved in Python by combining different languages and packages within a Jupyter Notebook. Therefore, it is probably a good idea to prepare you with an overview of what will happen in the remainder of the lesson.
Links
[1] https://en.wikipedia.org/wiki/Data_science
[2] http://jupyter.org/
[3] https://en.wikipedia.org/wiki/Environmental_niche_modelling
[4] https://developers.arcgis.com/python/guide/using-the-jupyter-notebook-environment/
[5] https://developers.arcgis.com/python/
[6] https://pro.arcgis.com/en/pro-app/latest/arcpy/get-started/pro-notebooks.htm
[7] https://pypi.python.org/pypi/numpy
[8] https://en.wikipedia.org/wiki/NumPy
[9] https://pypi.python.org/pypi/matplotlib
[10] https://en.wikipedia.org/wiki/Matplotlib
[11] https://pypi.python.org/pypi/scipy
[12] https://en.wikipedia.org/wiki/SciPy
[13] https://pypi.python.org/pypi/pandas/
[14] https://en.wikipedia.org/wiki/Pandas_(software)
[15] https://pypi.python.org/pypi/Shapely
[16] https://shapely.readthedocs.io/en/latest/
[17] https://trac.osgeo.org/geos
[18] https://postgis.net/
[19] https://live.osgeo.org/en/overview/jts_overview.html
[20] https://en.wikipedia.org/wiki/Simple_Features
[21] https://pypi.python.org/pypi/geopandas/
[22] http://geopandas.org/
[23] https://pypi.python.org/pypi/GDAL/
[24] https://gdal.org/api/index.html#python-api