NGA Advanced Python Programming for GIS, GLGI 3001-1

Converting from sequential to multiprocessing

So with that in mind, let us examine how we can convert a simple program like a programmatic version of the game Hi Ho Cherry-O from sequential to multiprocessing.

You can download the Hi Ho Cherry-O script.

We need to add a couple of basic steps to our code to support multiprocessing. The first is to import multiprocessing, the Python library that, as you will have guessed from the name, provides multiprocessing support. We’ll add that import at the top of our code, alongside the other imports.

The second thing our code needs is a check on __name__ that serves as the entry point of the script (Python has no dedicated __main__ method; this if-statement plays that role). We’ll add it to our code at the very bottom with:

if __name__ == '__main__': 
    play_a_game()

With this, we make sure that the code in the body of the if-statement is only executed by the main process we start by running our script file in Python, and not by the subprocesses we will create when using multiprocessing, which also load this file. Otherwise, this would result in an infinite creation of subprocesses, subsubprocesses, and so on. Next, we need to define the play_a_game() function we are calling. This is the function that will set up our pool of worker processes and also assign (map) each of our tasks onto a worker (usually one per processor) in that pool.
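As a minimal sketch of how this guard behaves, the following script reports which process is executing it. The helper describe_process() is a hypothetical name used only for illustration; multiprocessing.current_process() is part of the standard library.

```python
import multiprocessing


def describe_process():
    # Hypothetical helper: return the name of the process running this call
    return multiprocessing.current_process().name


if __name__ == '__main__':
    # Only the process started by running this file reaches this branch;
    # subprocesses that load the file skip it.
    print(f"Started as the main script in {describe_process()}")
```

When the file is run directly, the name reported is MainProcess. Any code placed outside the if-statement would also run each time a subprocess loads the file.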

Our play_a_game() function is very simple. It has two main lines of code based on the multiprocessing module:

The first instantiates a pool with a number of workers (usually our number of processors or a number slightly less than our number of processors). There’s a function to determine how many processors we have, multiprocessing.cpu_count(), so that our code can take full advantage of whichever machine it is running on. That first line is:

with multiprocessing.Pool(multiprocessing.cpu_count()) as myPool:
   ... # code for setting up the pool of jobs

You have probably already seen this notation from working with arcpy cursors. This with ... as ... statement creates an object of the Pool class defined in the multiprocessing module and assigns it to the variable myPool. The parameter given to it is the number of processors on the machine (the value that multiprocessing.cpu_count() returns), so here we are making sure that all processor cores will be used. All code that uses the variable myPool (e.g., for setting up the pool of multiprocessing jobs) now needs to be indented relative to the "with", and the construct makes sure that everything is cleaned up afterwards. The same could be achieved with the following lines of code:

myPool = multiprocessing.Pool(multiprocessing.cpu_count())
... # code for setting up the pool of jobs
myPool.close()
myPool.join()

Here the Pool variable is created without the with ... as ... statement. As a result, the statements in the last two lines are needed for telling Python that we are done adding jobs to the pool and for cleaning up all sub-processes when we are done to free up resources. We prefer to use the version using the with ... as ... construct in this course.
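The pool does not have to use every core. As mentioned above, a number slightly less than the number of processors is also common. A minimal sketch of that choice follows; choose_worker_count() and its reserve parameter are hypothetical names introduced here for illustration, not part of the multiprocessing module.

```python
import multiprocessing


def choose_worker_count(reserve=1):
    # Hypothetical helper: leave `reserve` cores free for other programs,
    # but never return fewer than one worker.
    return max(1, multiprocessing.cpu_count() - reserve)


if __name__ == '__main__':
    print(f"Cores reported: {multiprocessing.cpu_count()}")
    print(f"Workers we would use: {choose_worker_count()}")
```

The returned value could then be given to multiprocessing.Pool(...) in place of multiprocessing.cpu_count().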

The next line that we need in our code after the with ... as ... line is for adding tasks (also called jobs) to that pool:

    res = myPool.map(hi_ho_cherry_o, range(10000))

What we have here is the name of another function, hi_ho_cherry_o(), which does the work of running a single game and returns the number of turns as the result. The second parameter given to map() contains the values that will be passed, one per call, to the hi_ho_cherry_o() function. This is how we pass the data to be processed to the worker function in a multiprocessing application. In this case, the worker function hi_ho_cherry_o() does not really need any input data to work with. What we are providing is simply the number of the game this call of the function is for, so we use range(10000), i.e., the numbers 0 to 9,999, for this. That means we will have to introduce a parameter into the definition of the hi_ho_cherry_o() function for playing a single game. While the function will not make any use of this parameter, the number of elements in the range (10000 in this case) determines how many times hi_ho_cherry_o() will be run in our multiprocessing pool and, hence, how many games will be played to determine the average number of turns. In the final version, we will replace the hard-coded number with an argument called numGames. Later in this part of the lesson, we will show you how you can use a different function called starmap(...) instead of map(...) for worker functions that take more than one argument, so that we can pass different parameters to them.
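To see what map() hands to the worker, here is a minimal sketch with a stand-in worker. The function echo_game() is a hypothetical placeholder, not the lesson's hi_ho_cherry_o(); it simply returns the game number it receives, and the pool size of 2 is an arbitrary choice for the demonstration.

```python
import multiprocessing


def echo_game(game):
    # Hypothetical stand-in worker: each call receives one element of the range
    return game


if __name__ == '__main__':
    with multiprocessing.Pool(2) as myPool:
        res = myPool.map(echo_game, range(5))
    # map() returns a list with one result per element, in the order of the input
    print(type(res), len(res), res)
```

Five elements in the range lead to five calls of the worker and five entries in the result list, which is why the length of the range controls how many games are played.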

Python will now run the pool of calls of the hi_ho_cherry_o() worker function by distributing them over the number of cores that we provided when creating the Pool object. The returned results, i.e., the number of turns for each game played, will be collected in a single list, and we store this list in the variable res. We’ll then compute the average number of turns per game using the mean() function from the Python library statistics.
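The averaging step on its own can be sketched as follows. The list of turns is made-up illustration data, not output from the game.

```python
import statistics

# Hypothetical results for four games (illustration data only)
res = [12, 7, 20, 9]

# mean() collapses the list of per-game results into a single value
average = statistics.mean(res)
print(f'Average turns for {len(res)} games is {average}')
```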

To prepare for the multiprocessing version, we’ll take our Cherry-O code from before and make a couple of small changes. We’ll define the function hi_ho_cherry_o() around this code (taking the game number as a parameter, as explained above), and we’ll remove the while loop that currently executes the code 10,000 times (our map range above will take care of that), so we’ll need to “dedent” the code.

Here’s what our revised function will look like:

def hi_ho_cherry_o(game): 
    spinnerChoices = [-1, -2, -3, -4, 2, 2, 10] 
    turns = 0 
    cherriesOnTree = 10 
 
    # Take a turn as long as you have more than 0 cherries  
    while cherriesOnTree > 0: 
        # Spin the spinner  
        spinIndex = random.randrange(0, 7) 
        spinResult = spinnerChoices[spinIndex] 
        # Print the spin result      
        # print ("You spun " + str(spinResult) + ".")  
        # Add or remove cherries based on the result  
        cherriesOnTree += spinResult 
        # Make sure the number of cherries is between 0 and 10     
        if cherriesOnTree > 10: 
            cherriesOnTree = 10 
        elif cherriesOnTree < 0: 
            cherriesOnTree = 0 
        # Print the number of cherries on the tree
        # print ("You have " + str(cherriesOnTree) + " cherries on your tree.")  
        turns += 1 
    # return the number of turns it took to win the game  
    return turns   

Now let’s put it all together. We’ve made a couple of other changes to our code: play_a_game() now takes a parameter called numGames that defines the size of our range (we pass in 10000 when calling it), and we use the time module to report how long the run took.

import random
import multiprocessing
import statistics
import time

def hi_ho_cherry_o(game):
    spinnerChoices = [-1, -2, -3, -4, 2, 2, 10]
    turns = 0
    cherriesOnTree = 10

    # Take a turn as long as you have more than 0 cherries
    while cherriesOnTree > 0:
        # Spin the spinner
        spinIndex = random.randrange(0, 7)
        spinResult = spinnerChoices[spinIndex]
        # Print the spin result
        # print ("You spun " + str(spinResult) + ".")
        # Add or remove cherries based on the result
        cherriesOnTree += spinResult
        # Make sure the number of cherries is between 0 and 10
        if cherriesOnTree > 10:
            cherriesOnTree = 10
        elif cherriesOnTree < 0:
            cherriesOnTree = 0
        # Print the number of cherries on the tree
        # print ("You have " + str(cherriesOnTree) + " cherries on your tree.")
        turns += 1
    # return the number of turns it took to win the game
    return turns


def play_a_game(numGames):
    with multiprocessing.Pool(multiprocessing.cpu_count()) as myPool:
        # The map step of the MapReduce pattern is on the right of the =; on the left, the per-game results are collected into a list (the reduce step is the mean() call below).
        turns = myPool.map(hi_ho_cherry_o, range(numGames))
        # Uncomment this line to print out the list of total turns (but note this will slow down your code's execution)
        # print(turns)
    # Use the statistics library function mean() to calculate the mean of turns
    print(f'Average turns for {len(turns)} games is {statistics.mean(turns)}')


if __name__ == '__main__':
    start_time = time.time()
    play_a_game(10000)
    # Output how long the process took.
    print(f" Process took {time.time() - start_time} seconds")

You will also see that the list of returned results is on the left side of the = before our map function (~line 35). We’re taking all of the returned results and putting them into a list called turns (feel free to add a print or type statement here to check that it's a list). Once all of the workers have finished playing the games, we use the mean() function from the Python library statistics, which we imported at the top of our code (right after multiprocessing), to calculate the mean of the list in the variable turns. The call to mean() acts as our reduce step, as it takes our list and returns the single value that we're really interested in.
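For comparison, the same map-then-reduce structure can be written without a pool. This is a sketch under stated assumptions, not the lesson's original sequential script: the game function is a condensed version of the one above (it uses random.choice() and min()/max() instead of the index lookup and the if/elif), and play_sequentially() is a hypothetical helper name.

```python
import random
import statistics


def hi_ho_cherry_o(game):
    # Condensed version of the lesson's worker; `game` is accepted but unused
    spinnerChoices = [-1, -2, -3, -4, 2, 2, 10]
    turns = 0
    cherriesOnTree = 10
    while cherriesOnTree > 0:
        cherriesOnTree += random.choice(spinnerChoices)
        # Keep the number of cherries between 0 and 10
        cherriesOnTree = max(0, min(10, cherriesOnTree))
        turns += 1
    return turns


def play_sequentially(numGames):
    # Hypothetical helper: the built-in map() makes the same calls as
    # myPool.map(), but one after another in a single process.
    return list(map(hi_ho_cherry_o, range(numGames)))


if __name__ == '__main__':
    turns = play_sequentially(1000)
    print(f'Average turns for {len(turns)} games is {statistics.mean(turns)}')
```

Wrapping this call with the same time.time() measurements used above is one way to compare its run time with the multiprocessing version on your own machine.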

When you have finished writing the code in PyScripter, you can run it.

Lesson content developed by Jan Wallgrun and James O’Brien