In this homework assignment, you will get more practice with pandas and the other Python packages introduced in this lesson by creating and submitting a nice-looking Jupyter Notebook that includes well-formatted Markdown explanations of each step. The assignment could be solved using pandas, geopandas, and the Esri Python API alone, but we are asking you to use GDAL/OGR for one of the steps involved so that you get some further practice with that library as well. To solve the task, you will occasionally have to use the packages in ways that we did not show in the lesson materials. In addition, you will have to work with the Python datetime module for representing dates & times in Python code. That means you will also have to practice working with the respective documentation and complementary web resources a bit. However, we did include some pointers in the instructions below so that you have an idea of where to look, and we also provided some examples.
The situation is the following: You have been hired by a company active in the northeast of the United States to analyze and produce different forms of summaries of the travel activities of its traveling salespersons. Unfortunately, the way the company has been keeping track of the related information leaves a lot to be desired: the information is spread out over numerous .csv files. Please download the .zip file containing all the (imaginary) data [1] you need for this assignment and extract it to a new folder. Then open the files in a text editor and read the explanations below.
File employees.csv: Most data files of the company do not use the names of their salespersons directly but instead refer to them through an employee ID. This file maps employee name to employee ID number. It has two columns, the first contains the full names in the format first name, last name and the second contains the ID number. The double quotes around the names are needed in the csv file to signal that this is the content of a single cell containing a comma rather than two cells separated by a comma.
```
"Smith, Richard",1234421
"Moore, Lisa",1231233
"Jones, Frank",2132222
"Brown, Justine",2132225
"Samulson, Roger",3981232
"Madison, Margaret",1876541
```
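As a quick check that the quoting behaves as described, pandas' read_csv(...) treats each quoted name as a single cell despite the embedded comma. Here is a minimal sketch using an in-memory string in place of the actual file; the column names 'Name' and 'ID' are our own choice:

```python
import io
import pandas as pd

# two example rows of employees.csv, read from an in-memory string;
# each quoted cell is parsed as a single value despite the embedded comma
csvText = '"Smith, Richard",1234421\n"Moore, Lisa",1231233\n'
employees = pd.read_csv(io.StringIO(csvText), header=None, names=['Name', 'ID'])

print(employees.Name[0])  # Smith, Richard
print(employees.ID[1])    # 1231233
```

In your solution you would of course pass the path of the actual employees.csv file instead of the io.StringIO object.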
Files travel_???.csv: each of these files describes a single trip by a salesperson. The number in the file name is not the employee ID but a trip number. There are 75 such files with numbers from 1001 to 1075. Each file contains just a single row; here is the content of one of the files, the one named travel_1001.csv:
2132222,2016-01-07 16:00:00,2016-01-26 12:00:00,Cleveland;Bangor;Erie;Philadelphia;New York;Albany;Cleveland;Syracuse
The four columns (separated by commas) have the following content: the employee ID of the salesperson, the date & time at which the trip started, the date & time at which the trip ended, and the route of the trip given as a list of visited city names separated by semicolons.
File ne_cities.shp: You already know this shapefile from the lesson content. It contains larger cities in the northeast U.S.A. as WGS84 points. The only attribute relevant for this exercise in addition to the point geometry is the NAME field containing the city names.
There are a few more files in the folder. They are actually empty, but you are not allowed to delete these from the folder. This is to make sure that you have to be as specific as possible when using regular expressions for file names in your solution.
You should develop your solution as a Jupyter Notebook with nicely formatted explanations of each step in your solution, similar to the L3 walkthrough notebook from Section 3.11 [2]. Your notebook should contain a map widget from the Esri Python API that displays the polyline feature service as a layer (similar to the lesson walkthrough notebook) at the end. You will submit this notebook file together with your write-up to the L3 assignment drop box on Canvas. The two images below have been produced using the example input values below. You can use this example for testing your solution.
The Python code you are supposed to write should take three things as input:
It should then produce two output files:
The assignment will require you to work with objects of the classes datetime and timedelta defined in the module datetime of the Python standard library to represent time stamps (combinations of date & time) and differences between them. The official documentation for the module is available at this Python documentation page [3]. In addition, you can find two links to introductions to datetime below that may be a bit easier to digest. Please check these out and make sure you understand how to work with the datetime class, how to compare datetime objects to see whether one is earlier than the other and how to calculate a timedelta object for the difference between two datetime objects. Time zones won’t matter in this assignment.
Below are a few examples illustrating how to create datetime objects representing concrete dates, how to calculate the time difference (a timedelta object) between two datetime objects, and how to compare two datetime objects using the < or > operators. These examples should be easy to understand, in particular once you have read through the documentation linked above. If you have any remaining questions about using datetime, please ask them on the course forums.
```python
import datetime

# create datetime objects for specific dates
date1 = datetime.datetime(2019, 1, 31, 17, 55)  # January 31, 2019, 17:55
date2 = datetime.datetime(2019, 3, 12, 0, 0)    # March 12, 2019, midnight

print(date1)
print(type(date1))
print(date2)
```
```
2019-01-31 17:55:00
<class 'datetime.datetime'>
2019-03-12 00:00:00
```
```python
# calculate the time difference between two datetime objects
delta = date2 - date1

print(delta)
print(type(delta))
print(delta.days)     # full days of the difference
print(delta.seconds)  # remaining seconds beyond the full days
```
```
39 days, 6:05:00
<class 'datetime.timedelta'>
39
21900
```
```python
# comparing datetime objects
if date2 < date1:
    print('before')
else:
    print('after')
```
```
after
```
Your notebook should roughly follow the steps below; in particular you should use the APIs mentioned for performing each of the steps:
Pandas provides functions for reading and writing csv files (and quite a few other file formats). They are called read_csv(...) and to_csv(...). See this Pandas documentation site [6] for more information. When your input file contains dates that you want to become datetime objects, you should use the parse_dates and date_format keyword arguments of read_csv(…) to let the method know which columns contain dates and how to interpret them. Here is an example of how this kind of command should look. The None for the header argument signals that the table in the csv file does not contain column names as the first row. The [...] for the parse_dates argument needs to be replaced by a list of column indices for the columns that contain dates in the csv file. The ??? needs to be replaced by the desired date format token that will produce the date in the YYYY-MM-DD HH:MM:SS format.
```python
import pandas as pd
import datetime

df = pd.read_csv(r'C:\489\test.csv', sep=",", header=None,
                 parse_dates=[...], date_format='???')
```
The pandas concat(…) [7] function can be used to combine several data frames with the same columns stored in a list to form a single data frame. This can be a good approach for this step. Let's say you have the individual data frames stored in a list variable called dataframes. You'd then simply call concat like this:
```python
combinedDataFrame = pd.concat(dataframes)
```
This means your main task will be to create the list of pandas data frames, one for each travel_???.csv file before calling concat(...). For this, you will first need to use a regular expression to filter the list of all files in the input folder you get from calling os.listdir(inputFolder) to only the travel_???.csv files and then use read_csv(...) as described under Hint 1 to create a pandas DataFrame object from the csv file and add this to the data frame list.
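A minimal sketch of such a filter is shown below; the file names in the list are made up, and in your solution the list would come from os.listdir(inputFolder) instead:

```python
import re

# hypothetical directory listing; in the assignment this comes from os.listdir(inputFolder)
fileNames = ['employees.csv', 'travel_1001.csv', 'travel_1075.csv',
             'travel_extra.csv', 'ne_cities.shp', 'notes.txt']

# exactly four digits between "travel_" and ".csv"; anchored so that
# names like travel_extra.csv do not slip through
pattern = re.compile(r'^travel_\d{4}\.csv$')
travelFiles = [name for name in fileNames if pattern.match(name)]

print(travelFiles)  # ['travel_1001.csv', 'travel_1075.csv']
```

Note how the anchors and the \d{4} quantifier keep the expression from matching the empty decoy files mentioned earlier.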
You can compare a datetime object (e.g. the start or end date) to a datetime column in a pandas data frame resulting in a Boolean vector that tells you whether the comparison is true or not for each row. Furthermore, you can use the pandas method isin(…) [8] to check whether the string in the cells of a data frame or single column are contained in a given list of strings. The result is again a Boolean data frame/column. Together this allows you to select the desired rows via Boolean indexing as shown in Section 3.8.6. Here is a simple example showing how isin(...) is used to create a Boolean vector based on whether the name of each row is from a given list of names:
```python
import pandas as pd

names = ['Frank', 'James', 'Jane', 'Stevie']  # names we are interested in

df = pd.DataFrame([['Martin', 5],             # simple data frame with two columns
                   ['James', 3],
                   ['Sue', 1],
                   ['Mark', 11],
                   ['Stevie', 3]], columns=['Name', 'Grade'])

booleanVector = df.Name.isin(names)
print(booleanVector)
```
Output:
```
0    False
1     True
2    False
3    False
4     True
Name: Name, dtype: bool
```
The GDAL cookbook contains several examples of creating a polyline geometry from a WKT LineString that should be helpful to implement this step. In principle, the entire translation of the semi-colon-separated city list into a WKT LineString can be done with the following expression using two nested list comprehensions, but it is also ok if you break this down into several steps.
```python
wkt = ['LineString (' + ','.join([f'{cities[cities.NAME == city].geometry.x.iloc[0]} {cities[cities.NAME == city].geometry.y.iloc[0]}' for city in r.split(';')]) + ')' for r in fullInfoSorted.Route]
```
This code assumes that the geopandas data frame with the city data is stored in variable cities and that the combined trip data from step (5) is stored in variable fullInfoSorted such that fullInfoSorted.Route refers to the column with the route information consisting of city names separated by semicolons. In the outer list comprehension, we have variable r go through the cells (= rows) of the Route column. In the inner list comprehension

```python
[f'{cities[cities.NAME == city].geometry.x.iloc[0]} {cities[cities.NAME == city].geometry.y.iloc[0]}' for city in r.split(';')]
```

we split the cell content at the semicolons with r.split(';') and have variable city go through all the cities in the given route. With the expression cities[cities.NAME == city], we get the row for the given city from the cities data frame and, by appending .geometry.x.iloc[0] or .geometry.y.iloc[0], we get the corresponding x and y coordinates from the content of the geometry column of that row.
The result of this inner list comprehension is a list of strings in which the x and y coordinates for each city are separated by a space, e.g. ['cx1 cy1', 'cx2 cy2', ..., 'cxn cyn'] where cxi / cyi stand for the x/y coordinates of the i-th city in the trip. By using 'LineString (' + ','.join(...) + ')' in the outer list comprehension, we turn this list into a single comma-separated string, 'cx1 cy1,cx2 cy2,...,cxn cyn', and add the prefix "LineString (" at the beginning and the closing ")" at the end, producing the WKT string "LineString (cx1 cy1,cx2 cy2,...,cxn cyn)" for each trip.
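To see the join logic in isolation, here is a toy version of the same nested-comprehension expression, with a plain dictionary of made-up coordinates standing in for the cities geodataframe and a short list standing in for fullInfoSorted.Route:

```python
# toy stand-in for the cities geodataframe: city name -> (x, y); coordinates are made up
coords = {'Erie': (-80.1, 42.1), 'Albany': (-73.8, 42.7), 'Bangor': (-68.8, 44.8)}

routes = ['Erie;Albany', 'Albany;Bangor;Erie']  # stand-in for fullInfoSorted.Route

# same structure as the expression above: inner comprehension builds "x y" strings,
# outer comprehension joins them with commas and wraps them in "LineString (...)"
wkt = ['LineString (' + ','.join([f'{coords[city][0]} {coords[city][1]}'
                                  for city in r.split(';')]) + ')'
       for r in routes]

print(wkt[0])  # LineString (-80.1 42.1,-73.8 42.7)
```

Only the coordinate lookup differs from the real expression; everything else maps one-to-one.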
The resulting list of WKT LineStrings in variable wkt can now be added as a new column to the fullInfoSorted data frame as a basis for creating the GDAL features for the new shapefile by using ogr.CreateGeometryFromWkt(...) for each individual WKT LineString.
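Attaching the list as a new column is a one-liner in pandas. A sketch with made-up data follows; the column name 'WKT' is our own choice, and the list must be in the same row order as the data frame:

```python
import pandas as pd

# stand-in for the combined trip data frame with a Route column (made-up data)
fullInfoSorted = pd.DataFrame({'Route': ['Erie;Albany', 'Albany;Bangor']})

# hypothetical WKT strings, one per trip, in the same row order as the data frame
wkt = ['LineString (-80.1 42.1,-73.8 42.7)', 'LineString (-73.8 42.7,-68.8 44.8)']

fullInfoSorted['WKT'] = wkt   # assign the list as a new column
print(fullInfoSorted.WKT[1])  # LineString (-73.8 42.7,-68.8 44.8)
```

When writing the shapefile, you would then iterate over the rows and call ogr.CreateGeometryFromWkt(...) on each cell of this new column.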
The criteria your notebook submission will be graded on will include how elegant and efficient your code is (e.g. try to make use of regular expressions and use list comprehension instead of for-loops where a simple list comprehension is sufficient) and how well your notebook documents and describes your solution in a visually appealing way.
Successful completion of the above requirements and the write-up discussed below is sufficient to earn 95% of the credit on this project. The remaining 5% is reserved for "over and above" efforts which could include, but are not limited to, the following:
Produce a 400-word write-up on how the assignment went for you; reflect on and briefly discuss the issues and challenges you encountered and what you learned from the assignment. Please also briefly mention what you did for "over and above" points in the write-up.
Submit a single .zip file to the corresponding drop box on Canvas; the zip file should contain:
Links
[1] https://www.e-education.psu.edu/geog489/sites/www.e-education.psu.edu.geog489/files/assignment3_data_March19_3.zip
[2] https://www.e-education.psu.edu/geog489/node/2412
[3] https://docs.python.org/3/library/datetime.html
[4] https://www.guru99.com/date-time-and-datetime-classes-in-python.html
[5] https://howchoo.com/g/ywi5m2vkodk/working-with-datetime-objects-and-timezones-in-python
[6] https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
[7] https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
[8] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html