NGA Advanced Python Programming for GIS, GLGI 3001-1

Practice Exercises


Practice Exercise 2: Requests and BeautifulSoup4

We want to write a script that extracts the text of the three paragraphs in Section 1.7.2 on profiling, leaving out the heading and the list of subsections that follows them. Write the script using the requests module to load the HTML code and BeautifulSoup4 to extract the text (Section 2.3).

Finally, use a list comprehension (Section 2.2) to create a list that contains the number of characters for each word in the three paragraphs. The output should start like this:

[2, 4, 12, 4, 4, 4, 6, 4, 9, 2, 11, …]

Hint 1

If you use Inspect in your browser, you will see that the text is the content of a <div> element within another <div> element within an <article> element with a unique id attribute ("node-book-2269"). This should help you write a call to the soup.select(…) method that returns the <div> element you are interested in. An <article> element with this particular id is written as "article#node-book-2269" in the string passed to soup.select(…).
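To illustrate how such a selector string behaves, here is a minimal sketch that runs soup.select(…) on a small hand-written HTML string. The markup in sample_html, including the class names "outer" and "inner", is a made-up stand-in and not the real page, whose structure may differ in detail.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the course page; only the article id is taken
# from the hint above, everything else is invented for illustration.
sample_html = """
<html><body>
<article id="node-book-2269">
  <h2>Heading</h2>
  <div class="outer">
    <div class="inner"><p>First paragraph.</p><p>Second paragraph.</p></div>
  </div>
</article>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "article#node-book-2269" matches the <article> by its id; "div div" then
# matches any <div> that is nested inside another <div> below that article.
matches = soup.select("article#node-book-2269 div div")
inner_div = matches[0]

print(len(matches))       # number of elements the selector found
print(inner_div.name)     # tag name of the first match
```

soup.select(…) always returns a list, even when only one element matches, so the element itself is taken with an index.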

Hint 2

Remember that you can get the plain text content of an element returned by BeautifulSoup from its .text property (as in the www.timeanddate.com example in Section 2.3).
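The following sketch shows what .text returns for a small, invented HTML snippet (not the course page). It also shows get_text(…) with a separator argument as one possible alternative, since .text joins the text of neighboring elements without inserting any whitespace between them.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration only.
snippet = (
    "<div><p>Profiling <b>measures</b> run time.</p>"
    "<p>It finds bottlenecks.</p></div>"
)

soup = BeautifulSoup(snippet, "html.parser")
div = soup.select("div")[0]

# .text concatenates all text inside the element, tags removed.
plain = div.text
print(plain)

# get_text(" ") puts a space between the individual text pieces, which
# keeps the last word of one paragraph apart from the first word of the next.
spaced = div.get_text(" ")
print(spaced.split())
```

Whether this difference matters for the real page depends on how much whitespace its HTML already contains between the paragraphs.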

Hint 3

It’s OK to ignore punctuation marks, etc. in this exercise and simply use the string method split() to split the text into words at any whitespace. The number of characters in a string can be computed with the Python function len(...).
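As a small sketch of this last step, the list comprehension below combines split() and len(...) on a made-up sentence. The sentence is not taken from Section 1.7.2, so the numbers do not match the expected output of the exercise.

```python
# Invented example sentence, used only to demonstrate the technique.
text = "In this section, we look at profiling."

# split() without arguments splits at any run of whitespace; punctuation
# stays attached to the words and is counted as part of their length.
word_lengths = [len(word) for word in text.split()]

print(word_lengths)  # [2, 4, 8, 2, 4, 2, 10]
```

Note that "section," counts as 8 characters and "profiling." as 10 because the comma and the period are not removed.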