NGA Advanced Python Programming for GIS, GLGI 3001-1

Multiprocessing with vector data

The best practices for multiprocessing that we introduced earlier are even more important when working with vector data than with raster data. In particular, the geodatabase locking issue tends to become a bigger factor, since vector workflows typically involve more data sets and feature classes are usually stored in geodatabases.

The example we’re going to use here involves clipping a feature layer by the polygons of another feature layer. A sample use case is segmenting one or several infrastructure layers by state or county (or an even smaller subdivision). If you want to provide each state or county with its own version of the roads, sewer, water, or electricity layers, for example, a script like this would be helpful. To test out the code in this section (and also the first homework assignment), you can again use the data from the USA.gdb geodatabase (Section 1.5) we provided. The application then is to clip the data from the roads, cities, or hydrology data sets to the individual state polygons from the States data set in the geodatabase.

To achieve this task, one could run the Clip tool manually in ArcGIS Pro, but if there are many polygons in the clip data set, it is more efficient to write a script that performs the task. Since each state/county is unrelated to the others, this is an example of an operation that can be run in parallel.

The code example below has been adapted from a code example written by Duncan Hornby at the University of Southampton in the United Kingdom that has been used to demonstrate multiprocessing and also how to create a script tool that supports multiprocessing. We will take advantage of Mr. Hornby’s efforts and make use of his code (with attribution of course) but we have also reorganized and simplified it quite a bit and added some enhancements.

Let us examine the code’s logic and then we’ll dig into the syntax.

The code consists of two Python files. This is important because when we want to be able to run it as a script tool in ArcGIS, the worker function that performs the individual tasks must be defined in its own module file, not in the main script file that contains the multiprocessing code calling the worker function. The first file, called scripttool.py, imports arcpy, multiprocessing, and the worker code contained in the second Python file, called WorkerScript.py. It contains the definition of the main function mp_handler(), which is responsible for managing the multiprocessing operations, similar to the hi_ho_cherry_o multiprocessing version. It uses two script tool parameters: the file containing the polygons to use for clipping (variable clipping_fc) and the file to be clipped (variable data_to_be_clipped). The main function mp_handler() calls the clipper(...) worker function located in the WorkerScript module, passing it the files to be used and the other information needed to perform the clipping operation. This will be further explained below. The code for the first file, including the main function, is shown below.

import arcpy
import multiprocessing as mp
from WorkerScript import clipper

# Input parameters
clipping_fc = arcpy.GetParameterAsText(0) if arcpy.GetParameterAsText(0) else r"C:\489\USA.gdb\States"
data_to_be_clipped = arcpy.GetParameterAsText(1) if arcpy.GetParameterAsText(1) else r"C:\489\USA.gdb\Roads"

def mp_handler():
    try:
        # Create a list of object IDs for clipper polygons
        arcpy.AddMessage("Creating Polygon OID list...")
        clipperDescObj = arcpy.Describe(clipping_fc)
        field = clipperDescObj.OIDFieldName

        # Create the idList by list comprehension and SearchCursor
        idList = [row[0] for row in arcpy.da.SearchCursor(clipping_fc, [field])]

        arcpy.AddMessage(f"There are {len(idList)} object IDs (polygons) to process.")

        # Create the job list: one tuple of parameters (clipping_fc, data_to_be_clipped,
        # field, oid) for each call of the worker function, built by list comprehension.
        jobs = [(clipping_fc, data_to_be_clipped, field, id) for id in idList]

        arcpy.AddMessage(f"Job list has {len(jobs)} elements.\n Sending to pool")

        cpuNum = mp.cpu_count()  # determine number of cores to use
        arcpy.AddMessage(f"There are: {cpuNum} cpu cores on this machine")

        # Create and run multiprocessing pool.
        with mp.Pool(processes=cpuNum) as pool:  # Create the pool object
            # run jobs in job list; res is a list with return dictionary values from the worker function
            res = pool.starmap(clipper, jobs)

        # After the worker processes complete, iterate over the results and check for errors.
        for r in res:
            if r['errorMsg'] is not None:
                arcpy.AddError(f'Task {r["name"]} failed with: {r["errorMsg"]}')

        arcpy.AddMessage("Finished multiprocessing!")

    except Exception as ex:
        arcpy.AddError(ex)


if __name__ == '__main__':
    mp_handler()

Let's now have a closer look at the logic of the two main functions which do the work. The first one is the mp_handler() function shown in the code section above. It uses the input variables and has the job of processing the polygons in the clipping file to get a list of their unique IDs, building a job list of parameter tuples that will be given to the individual calls of the worker function, setting up the multiprocessing pool and running it, and taking care of error handling.

The second function is the worker function called by the pool (named clipper in this example), located in the WorkerScript.py file (code shown below). This function takes as parameters the name of the clipping feature layer, the name of the layer to be clipped, the name of the field that contains the unique IDs of the polygons in the clipping feature layer, and the feature ID identifying the particular polygon to use for the clipping. This function will be called from the pool constructed in mp_handler().

The worker function then makes a selection from the clipping layer. This has to happen in the worker function because all parameters given to that function in a multiprocessing scenario need to be of a simple type that can be "pickled". Pickling data means converting it to a byte stream, which in the simplest terms means that the data is converted to a sequence of simple Python types (string, number, etc.). Feature classes are much more complicated than that, containing both spatial and non-spatial data, so they cannot be readily converted to a simple type. That means feature classes cannot be pickled, and any selections that might have been made in the calling function are not shared with the worker functions. Therefore, we need to think about creative ways of getting our data shared with our sub-processes. In this case, that means we’re not going to do the selection in the master module and pass the polygon to the worker module. Instead, we create a list of feature IDs that we want to process and pass an ID from that list as a parameter with each call of the worker function, which can then perform the selection with that ID on its own before doing the clipping operation. For this, the worker function selects the polygon matching the oid parameter when creating a layer with MakeFeatureLayer_management() and uses this selection to clip the feature layer to be clipped. The results are saved in a shapefile that includes the OID in the file's name to ensure that the names are unique.
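To see why only simple values are passed to the workers, here is a minimal, arcpy-free sketch: a job tuple like the ones we build (paths as strings, a field name, an integer ID) pickles cleanly, which is exactly what multiprocessing needs in order to ship arguments to a child process.

```python
import pickle

# A job tuple like the ones built in mp_handler(): paths (strings),
# a field name (string), and an object ID (integer) -- all picklable.
job = (r"C:\489\USA.gdb\States", r"C:\489\USA.gdb\Roads", "OBJECTID", 7)

# Round-trip through pickle, as multiprocessing does internally
# when it sends arguments to a worker process.
payload = pickle.dumps(job)
restored = pickle.loads(payload)

print(restored == job)  # → True
```

A feature class or layer object, by contrast, has no such simple byte-stream representation, which is why we pass its path and an OID instead.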

import arcpy


def clipper(clipper, tobeclipped, field, oid):
    """
       Worker function that clips the input feature class to one of the polygons from the
       clipper feature class. Note that this function does not try to write to
       arcpy.AddMessage() because output from worker processes is never displayed.
       param clipper: full path of the feature class containing the clipping polygons
       param tobeclipped: full path of the feature class to be clipped
       param field: name of the field containing the unique polygon IDs
       param oid: ID of the polygon to use for clipping
    """
    # Create a result dictionary that is exclusive to this process when run in parallel.
    result_dict = {'name': None, 'errorMsg': None}
    try:
        # Record the name of the data set being clipped in the result dictionary.
        result_dict['name'] = tobeclipped

        # Create a layer with only the polygon with ID oid. Each clipper layer needs a unique name, so we include oid in the layer name.
        query = f"{field} = {oid}"
        tmp_flayer = arcpy.MakeFeatureLayer_management(clipper, f"clipper_{oid}", query)

        # Do the clip. We include the oid in the name of the output feature class.
        outFC = fr"C:\NGA\Lesson 1 Data\output\clip_{oid}.shp"
        arcpy.Clip_analysis(tobeclipped, tmp_flayer, outFC)

        print(f"finished clipping: {oid}")
        return result_dict  # everything went well so we return the dictionary

    except Exception as ex:
        # Some error occurred, so store the exception message and return the dictionary.
        result_dict['errorMsg'] = str(ex)
        print(f"error condition: {ex}")
        return result_dict

Having covered the logic of the code, let's review the specific syntax used to make it all work. While you’re reading this, try visualizing how this code might run sequentially first – that is one polygon being used to clip the to-be-clipped feature class, then another polygon being used to clip the to-be-clipped feature class and so on (maybe through 4 or 5 iterations). Then once you have an understanding of how the code is running sequentially try to visualize how it might run in parallel with the worker function being called 4 times simultaneously and each worker performing its task independently of the other workers.
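If it helps the visualization, here is a hedged, arcpy-free sketch of the sequential version: a stand-in worker (not the real clipper(), just a placeholder that reports what it would clip) is called once per ID in a plain loop, which is exactly the work the pool later parallelizes.

```python
def fake_clipper(clipper, tobeclipped, field, oid):
    """Stand-in for the real clipper(): just reports what it would produce."""
    return {'name': f"clip_{oid}", 'errorMsg': None}

id_list = [1, 2, 3, 4]  # pretend OIDs from the clipping polygons

# Sequential version: one clip after another, in a single process.
results = [fake_clipper("States", "Roads", "OBJECTID", oid) for oid in id_list]
print([r['name'] for r in results])  # → ['clip_1', 'clip_2', 'clip_3', 'clip_4']
```

In the parallel version, the only change is that the pool hands these same calls out to several worker processes at once.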

We’ll start by exploring the syntax within the mp_handler() function.

The mp_handler() function begins by determining the name of the field that contains the unique IDs of the clipping feature class using the arcpy.Describe(...) function. The code then uses a search cursor to build a list of all the object (feature) IDs from the clipping polygon feature class. This gives us a list of IDs that we can pass to our worker function along with the other parameters. As a check, the length of that list is printed out.

Next, we create the job list with one entry for each call of the clipper() function we want to make. Each element in this list is a tuple of the parameters that should be given to that particular call of clipper(). This list will be required when we set up the pool by calling pool.starmap(...). To construct the list, we simply loop through the ID list and add a parameter tuple to the jobs list. The first three parameters are the same for all tuples in the job list; only the polygon ID differs. In the homework assignment for this lesson, you will adapt this code to work with multiple input files to be clipped. As a result, the parameter tuples will vary in both the oid parameter and the tobeclipped parameter.
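For the homework variant with several to-be-clipped layers, the job list can be built as a cross product of layers and IDs. A sketch with made-up layer names (adapt the paths and the list of layers to your own data):

```python
clipping_fc = r"C:\489\USA.gdb\States"
field = "OBJECTID"
id_list = [1, 2, 3]  # pretend OIDs gathered by the search cursor

# Hypothetical list of layers to clip (an assumption for illustration):
to_be_clipped = [r"C:\489\USA.gdb\Roads", r"C:\489\USA.gdb\Hydrology"]

# One parameter tuple per (layer, polygon) combination.
jobs = [(clipping_fc, fc, field, oid) for fc in to_be_clipped for oid in id_list]

print(len(jobs))  # → 6  (2 layers x 3 IDs)
```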

To prepare the multiprocessing pool, we start it using the with statement. The code sets the size of the pool to the number of available processors (as we have done in previous examples) and then, using the starmap() method of Pool, calls the worker function clipper(...) once for each parameter tuple in the jobs list.

Any outputs from the worker function are collected in a list of result dictionaries, the values returned by the clipper() function. The results are then iterated over and checked for errors by testing whether the r['errorMsg'] key holds a value other than None.
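The same result-checking pattern can be exercised without arcpy. Here is a sketch with simulated return values from two worker calls (the error message shown is made up for illustration):

```python
# Simulated result dictionaries: one successful task, one failed task.
res = [
    {'name': 'clip_1', 'errorMsg': None},
    {'name': 'clip_2', 'errorMsg': 'ERROR: dataset does not exist'},
]

# Collect and report any tasks whose errorMsg is not None.
failed = [r for r in res if r['errorMsg'] is not None]
for r in failed:
    print(f"Task {r['name']} failed with: {r['errorMsg']}")

print(len(failed))  # → 1
```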

Let's now look at the code in our worker function clipper(...). As we noted in the logic section above, it receives four parameters: the full paths of the clipping and to-be-clipped feature classes, the name of the field that contains the unique IDs in the clipper feature class, and the OID of the polygon it is to use for the clipping.

We create a result dictionary to help return useful information back to the main process. The 'name' key is set to the tobeclipped parameter. The 'errorMsg' key defaults to None to indicate that everything went ok and is set to an exception message if the operation failed. In the main function, the results are iterated over to print out any error messages that were encountered during the clipping process.

Notice that the MakeFeatureLayer_management(...) function is used to create an in-memory layer which is a copy of the original clipper layer. This use of an in-memory layer is important in three ways: first, performance – memory layers are faster; second, an in-memory layer can help prevent any chance of file locking (although not if we were writing back to the file); third, selections only work on layers, so even if we wanted to, we couldn’t get away without creating this layer.

The call of MakeFeatureLayer_management(...) also includes an SQL query string, defined one line earlier, to create the layer with just the polygon that matches the oid that was passed as a parameter. The name of the layer we are producing here needs to be unique; this is why we include the oid in the layer name given as the second parameter.

Now, with our selection held in a uniquely named in-memory feature layer, we perform the clip against the to-be-clipped layer and store the result in outFC, which is defined as a hardcoded folder plus a unique file name starting with "clip_" followed by the oid. To run the code, you will most likely have to adapt the path used in the variable outFC.

The worker function then returns, and the process is supplied with another oid from the job list. This repeats until a call has been made for each polygon in the clipping feature class.

We are going to use this code as the basis for our Lesson 1 homework project. Have a look at the Assignment Page for full details.

You can test this code out by running it in a number of ways. If you run it from ArcGIS Pro as a script tool, the arcpy.GetParameterAsText() calls will pick up the two tool parameters; when those are empty (e.g., when running the script outside of Pro), the hardcoded default paths are used instead. Be sure to set the parameter type for both parameters to Feature Class. If you make changes to the code and the changes are not reflected in Pro, delete your script tool from the toolbox, restart Pro, and re-add the script tool.

You can also run the code from PyScripter as a stand-alone script. Make sure you're running scripttool.py in PyScripter (not WorkerScript.py). Alternatively, you can run the code from the command prompt, which is the fastest way with the smallest resource overhead.

The final thing to remember about this code is that it has a hardcoded output path defined in the variable outFC in the clipper() function, which you will want to change, create, and/or parameterize so that you have some output to investigate. If the output folder does not exist, no output will be created.

When the code runs, it will create a shapefile for every unique object ID in the "clipper" feature class (there are 51 in the States data set from the sample data), named using the OID (that is, clip_1.shp through clip_59.shp; note that the OIDs are not necessarily contiguous).

Lesson content developed by Jan Wallgrun and James O’Brien