Again, we are going to close out the lesson with a few practice exercises that focus on the new Python concepts introduced in this lesson (regular expressions and higher-order functions) as well as on working with tabular data with pandas in preparation for this lesson's homework assignment. In the homework assignment, you are also going to use geopandas, the Esri ArcGIS API for Python, and GDAL/OGR again to get some more practice with these libraries. What was said in the introduction to the practice exercises of Lesson 2 holds here as well: don't worry if you have trouble finding the perfect solution on your own. Studying the solutions carefully is another way of learning and improving your skills. The solutions to the three practice exercises can again be found in the following subsections.
Write a function that tests whether an entered string is a valid date using the format "YYYY-MM-DD". The function takes the string to test as a parameter and then returns True or False. The YYYY can be any 4-digit number, but the MM needs to be a valid 2-digit number for a month (with a leading 0 for January to September). The DD needs to be a number between 01 and 31 but you don’t have to check whether this is a valid number for the given month. Your function should use a single regular expression to solve this task.
Here are a few examples you can test your implementation with:
"1977-01-01" -> True
"1977-00-01" -> False (00 not a valid month)
"1977-23-01" -> False (23 not a valid month)
"1977-12-31" -> True
"1977-11-01asdf" -> False (you need to make sure there are no additional characters after the date)
"asdf1977-11-01" -> False (you need to make sure there are no additional characters before the date)
"9872-12-31" -> True
"0000-12-33" -> False (33 is not a valid day)
"0000-12-00" -> False (00 not a valid day)
"9872-15-31" -> False (15 is not a valid month)
We mentioned that the higher-order function reduce(...) can be used to do things like testing whether all elements in a list of Booleans are True. This exercise has three parts:
Below is an imaginary list of students and scores for three different assignments.
| | Name | Assignment 1 | Assignment 2 | Assignment 3 |
|---|---|---|---|---|
| 1 | Mike | 7 | 10 | 5.5 |
| 2 | Lisa | 6.5 | 9 | 8 |
| 3 | George | 4 | 3 | 7 |
| 4 | Maria | 7 | 9.5 | 4 |
| 5 | Frank | 5 | 5 | 5 |
Create a pandas data frame for this data (e.g. in a fresh Jupyter notebook). The column and row labels should be as in the table above.
Now, use pandas operations to add a new column to that data frame and assign it the average score over all assignments for each row.
Next, perform the following subsetting operations using pandas filtering with Boolean indexing:
```python
import re

# raw string so that \d reaches the regex engine unchanged
datePattern = re.compile(r'\d\d\d\d-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$')

def isValidDate(s):
    return datePattern.match(s) is not None
```
Explanation: Since we are using match(…) to compare the compiled pattern in variable datePattern to the string in parameter s given to our function isValidDate(…), we don’t have to worry about additional characters before the start of the date because match(…) will always try to match the pattern to the start of the string. However, we use $ as the last character in our pattern to make sure there are no additional characters following the date. That means the pattern has the form
“…-…-…$”
where the dots have to be replaced with some regular expression notation for the year, month, and day parts. The year part is easy, since we allow for any 4-digit number here. So we can use \d\d\d\d here, or alternatively \d{4,4} (remember that \d stands for the predefined class of all digits).
For the month, we need to distinguish two cases: either it is a 0 followed by one of the digits 1-9 (but not another 0) or a 1 followed by one of the digits 0-2. We therefore write this part as a case distinction (…|…) with the first part 0[1-9] representing the first option and the second part 1[0-2] representing the second option.
For the day, we need to distinguish three cases: (1) a 0 followed by one of the digits 1-9, (2) a 1 or 2 followed by any digit, or (3) a 3 followed by a 0 or a 1. Therefore we use a case-distinction with three options (…|…|…) for this part. The first part 0[1-9] is for option (1), the second part [12]\d for option (2), and the third part 3[01] for the third option.
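The test strings from the exercise can be run through the pattern as a quick sanity check. The following is only a sketch: the names `DATE_PATTERN`, `is_valid_date`, and `cases` are chosen for illustration, and it uses `re.fullmatch(...)` as an alternative to `match(...)` combined with `$`.

```python
import re

# Same pattern as described above, but without the trailing $ because
# fullmatch(...) already requires the entire string to match.
DATE_PATTERN = re.compile(r'\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])')

def is_valid_date(s):
    """Return True if s has the form YYYY-MM-DD with a plausible month and day."""
    return DATE_PATTERN.fullmatch(s) is not None

# Test strings and expected results taken from the exercise description.
cases = {
    "1977-01-01": True,
    "1977-00-01": False,
    "1977-23-01": False,
    "1977-12-31": True,
    "1977-11-01asdf": False,
    "asdf1977-11-01": False,
    "9872-12-31": True,
    "0000-12-33": False,
    "0000-12-00": False,
    "9872-15-31": False,
}

for text, expected in cases.items():
    assert is_valid_date(text) == expected, text
print("all test cases passed")
```

One practical difference between the two approaches: `$` also matches just before a newline at the very end of the string, so `match(...)` with `$` accepts "1977-01-01\n", whereas `fullmatch(...)` rejects it.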
```python
import operator
from functools import reduce

l = [True, False, True]

r = reduce(operator.and_, l, True)

print(r)  # output will be False in this case
```
To check whether or not at least one element is True, the call has to be changed to:
```python
r = reduce(operator.or_, l, False)
```
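For comparison, Python's built-in functions `all(...)` and `any(...)` cover the same two checks. The sketch below only shows that they agree with the two `reduce(...)` calls on the example list; the variable names are illustrative.

```python
import operator
from functools import reduce

values = [True, False, True]

all_true = reduce(operator.and_, values, True)   # True only if every element is True
any_true = reduce(operator.or_, values, False)   # True if at least one element is True

# The built-ins give the same answers for a list of Booleans.
assert all_true == all(values)
assert any_true == any(values)

print(all_true, any_true)  # prints: False True
```

The initial values are chosen so that they do not change the result: True is neutral for "and", False is neutral for "or". For an empty list, `reduce(...)` simply returns the initial value.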
```python
import operator
from functools import reduce

l = [-4, 2, 1, -6]

r = reduce(operator.and_, map(lambda n: n > 0, l), True)

print(r)  # will print False in this case
```
We use map(…) with a lambda expression for checking whether or not an individual element from the list is >0. Then we apply the reduce(…) version from part 1 to the resulting list of Boolean values we get from map(…) to check whether or not all elements are True.
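To make the two steps visible, the intermediate list of Booleans can be printed before it is reduced. This is just a sketch; the names `numbers`, `flags`, and `all_positive` are illustrative.

```python
import operator
from functools import reduce

numbers = [-4, 2, 1, -6]

# Step 1: map(...) turns each number into a Boolean (True if the number is > 0).
flags = list(map(lambda n: n > 0, numbers))
print(flags)  # prints: [False, True, True, False]

# Step 2: reduce(...) combines the Booleans with "and".
all_positive = reduce(operator.and_, flags, True)
print(all_positive)  # prints: False
```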
```python
import operator

l = [True, False, True]

def myReduce(f, l, i):
    intermediateResult = i
    for element in l:
        intermediateResult = f(intermediateResult, element)
    return intermediateResult

r = myReduce(operator.and_, l, True)
print(r)  # output will be False in this case
```
Maybe you were expecting that an implementation of reduce would be more complicated, but it’s actually quite simple. We set up a variable to always contain the intermediate result while working through the elements in the list and initialize it with the initial value provided in the third parameter i. When looping through the elements, we always apply the function given in parameter f to the intermediate result and the element itself and update the intermediate result variable with the result of this operation. At the end, we return the value of this variable as the result of the entire reduce operation.
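To see how the intermediate result develops, here is a small variant that records every intermediate value. It is a sketch for illustration only; `traced_reduce` is a hypothetical helper, and addition is used instead of "and" because the running sum is easier to follow.

```python
import operator

def traced_reduce(f, items, initial):
    """Like the reduce implementation described above, but also records each intermediate result."""
    intermediate = initial
    trace = [intermediate]
    for element in items:
        intermediate = f(intermediate, element)
        trace.append(intermediate)
    return intermediate, trace

result, trace = traced_reduce(operator.add, [1, 2, 3, 4], 0)
print(trace)   # prints: [0, 1, 3, 6, 10]
print(result)  # prints: 10
```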
```python
import pandas as pd

# create the data frame from a list of tuples
data = pd.DataFrame([
    ('Mike', 7, 10, 5.5),
    ('Lisa', 6.5, 9, 8),
    ('George', 4, 3, 7),
    ('Maria', 7, 9.5, 4),
    ('Frank', 5, 5, 5),
])

# set column names
data.columns = ['Name', 'Assignment 1', 'Assignment 2', 'Assignment 3']

# set row names
data.index = range(1, len(data) + 1)

# show table
print(data)

# add column with averages
data['Average'] = (data['Assignment 1'] + data['Assignment 2'] + data['Assignment 3']) / 3

# part a (all students with a1 score < 7)
print(data[data['Assignment 1'] < 7])

# part b (all students with a1 and a2 score > 6)
print(data[(data['Assignment 1'] > 6) & (data['Assignment 2'] > 6)])

# part c (at least one assignment < 5)
print(data[data[['Assignment 1', 'Assignment 2', 'Assignment 3']].min(axis=1) < 5])

# part d (name starts with M, only Name and Average columns)
print(data[data['Name'].map(lambda x: x.startswith('M'))][['Name', 'Average']])

# sort by Name
print(data.sort_values(by=['Name']))
```
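Some of these steps can also be written with other pandas operations. The sketch below is one possible variant, not the reference solution: it computes the average with `mean(axis=1)` and uses the vectorized string method `str.startswith(...)` in place of `map(...)` with a lambda. The names `score_columns` and `starts_with_m` are illustrative.

```python
import pandas as pd

# Same example data as in the exercise, built from a dictionary of columns.
data = pd.DataFrame(
    {
        'Name': ['Mike', 'Lisa', 'George', 'Maria', 'Frank'],
        'Assignment 1': [7, 6.5, 4, 7, 5],
        'Assignment 2': [10, 9, 3, 9.5, 5],
        'Assignment 3': [5.5, 8, 7, 4, 5],
    },
    index=range(1, 6),
)

score_columns = ['Assignment 1', 'Assignment 2', 'Assignment 3']

# Row-wise mean over the three score columns.
data['Average'] = data[score_columns].mean(axis=1)

# Vectorized string test instead of map(...) with a lambda.
starts_with_m = data['Name'].str.startswith('M')
print(data.loc[starts_with_m, ['Name', 'Average']])
```

Using a list of score columns means the average does not have to be rewritten if another assignment column is added later.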
If any of these steps is unclear to you, please ask for further explanation on the forums.