CS 100 HW 4: Python Lists and Web Searching
- Due: 5:00 pm Friday 10/16/2015
- Submit to Blackboard by deadline
CHANGELOG:
- Tue Oct 13 08:54:25 EDT 2015
  - Include the line
        from urllib.request import *
    above the code for Problem 3 in your hw4.py file. This line now appears in the code for Problem 3 as well.
- Thu Oct 8 09:46:03 EDT 2015
  - Updated web scores to match what is currently on the pages online. Fixed a few minor bugs in calls.
1 Instructions
- Complete the problems in each section below which is labelled Problem
- This is a pair assignment: you may select one partner to work with on this assignment as a group of 2. You may also opt to work alone as a group of 1.
- You may collaborate with your group members on all aspects of the assignment including writing code and creating your HW write-up document. All members of your group should retain copies of the code and writeup document. Your group is not allowed to collaborate with other groups. If you are struggling, ask questions on Piazza, seek help from the TA or professor, and review the class notes and tutorial links provided.
- Write your code in a file called hw4.py. You must submit this file to Blackboard for full credit on the assignment.
- You must also submit a HW Writeup which is an electronic document in one of the following formats.
- A Microsoft Word Document (.doc or .docx extension)
- A PDF (portable document format, .pdf)
You may work in other programs (Apple Words, Google Docs, etc.) but make sure you export/"save as" your work to an acceptable format before submitting it.
- Only 1 member of your group should submit the hw4.py and the HW Writeup on Blackboard
- Make sure the HW Writeup has all group members' information in it:
    CS 100 HW 4 Writeup
    Group of 2
    Turanga Leela tleela4 G07019321
    Philip J Fry pfry99 G00000001
- Make sure that hw4.py has all group members' information in it in comments
    # CS 100 HW 4 source code
    # Group of 2
    # Turanga Leela tleela4 G07019321
    # Philip J Fry pfry99 G00000001
- Submit 2 files to our Blackboard site.
- Log into Blackboard
- Click on CS 100->HW Assignments->HW 4
- Press the button marked "Attach File: Browse My Computer"
- Select your HW Writeup file (.doc .docx .pdf .html .txt only)
- Again: Press the button marked "Attach File: Browse My Computer"
- This time select your hw4.py and attach it.
- Click Submit
- If you want verbose feedback print a copy of your HW writeup and submit it in class on the due date. You must still submit an electronic version.
2 Returning Values
Python functions can return values to a calling function. This allows functions to communicate with one another. For example, the following function counts how many numbers are odd in a list of numbers and returns the value to whoever asked.
# Count how many odd numbers appear in alist
#
# Parameters
#   alist: a list of numbers like [1,2] or [8,6,7,5,3,0,9]
#
# Return: a number like 1 or 4
def count_odds(alist):
    oddcount = 0
    for x in alist:
        if x % 2 == 1:
            oddcount = oddcount+1
    return oddcount
Notice the return oddcount line at the end which returns a value to whoever called the function.
Return values can be assigned to variables in Python's interactive loop.
>>> how_many_odds = count_odds([1,2])
>>> how_many_odds
1
>>> how_many_odds = count_odds([8,6,7,5,3,0,9])
>>> how_many_odds
4
Return values can also be used in functions to figure things out. The following function figures out which of two lists has more odd numbers and returns it.
# Calculate which list has more odds and return it
# Parameters
#   list1: a list of numbers like [1,2] or [8,6,7,5,3,0,9]
#   list2: a list of numbers like [1,2] or [8,6,7,5,3,0,9]
#
# Return: the list which has more odd numbers in it
def list_with_more_odds(list1,list2):
    oddcount1 = count_odds(list1)
    oddcount2 = count_odds(list2)
    if oddcount1 > oddcount2:
        return list1
    else:
        return list2
In the interactive loop one can use it like the following
>>> odder_list = list_with_more_odds([1,2], [8,6,7,5,3,0,9])
>>> odder_list
[8, 6, 7, 5, 3, 0, 9]
or build further functions on top of it.
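As a small sketch of building a further function on top of return values, the example below defines odd_count_difference, a hypothetical helper invented for illustration and not part of the assignment. It calls count_odds twice and combines the two returned numbers. count_odds is repeated so the example runs on its own.

```python
# count_odds is repeated from above so this example runs on its own
def count_odds(alist):
    oddcount = 0
    for x in alist:
        if x % 2 == 1:
            oddcount = oddcount + 1
    return oddcount

# Hypothetical helper (not part of the assignment): return how many
# more odd numbers list1 has than list2
def odd_count_difference(list1, list2):
    return count_odds(list1) - count_odds(list2)

difference = odd_count_difference([8,6,7,5,3,0,9], [1,2])
print(difference)   # prints 3
```

The first list has 4 odd numbers and the second has 1, so the returned difference is 3.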
3 Problem 1: Count Words in a List
Part of Google's ranking of web pages is to determine how many times a query word appears on a given page. The computer will break all the words on a page into a list of words and then check a query word to see if it is there.
Define a function called count_words(word, word_list) to count how many times word appears in word_list.
- Start a counter variable at 0
- Use a for loop of some kind to look at each element of word_list
- You can detect if two things are equal using an if statement and the == (equality) operator. Examples
    if i % 2 == 0:        # checks if i is even
    if w == "kitty":      # checks if w is the word "kitty"
    if word1 == word2:    # checks if word1 is equal to word2
- Check if each element of the list is equal to the given word and if it is, add one onto your counter variable
- After the loop, return the counter variable using return
Start your code using the following comments to help you remember what the function is supposed to do.
# Count how many times the given word appears in the word_list. Return
# a number
#
# Parameters:
#   word: a word like "red", "fish", or "python"
#   word_list: a list of words like ["this","is","some","text"]
#
# Return: a number like 0 or 10
def count_words(word, word_list):
Examples:
>>> count_words("a", ["b","a","r","a","z"])
2
>>> count_words("r", ["b","a","r","a","z"])
1
>>> count_words("w", ["b","a","r","a","z"])
0
>>> lyrics = ['A-well-a', "everybody's", 'heard', 'about', 'the', 'bird', 'bird', 'bird', 'bird', 'b-bird', 'is', 'the', 'word']
>>> count_words("bird",lyrics)
4
>>> count_words("word",lyrics)
1
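One possible way to double-check the numbers your count_words produces is Python's built-in list method count(), which returns how many times a value appears in a list. This is offered only as a checking aid; the problem still asks you to write the loop yourself. The list names below are made up for this sketch.

```python
letters = ["b","a","r","a","z"]
print(letters.count("a"))   # prints 2
print(letters.count("w"))   # prints 0

# Comparison with == is case sensitive, so "Bird" does not match "bird"
birds = ["bird", "Bird", "bird"]
print(birds.count("bird"))  # prints 2
```

The last call is a reminder that capitalized words in a list are treated as different words by the == operator.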
The full lyrics of Surfin' Bird by the Trashmen are contained in the below list of words which you can paste into your hw4.py file.
full_lyrics = ['A', 'well', 'a', 'everybody', 'is', 'heard', 'about', 'the', 'bird', 'Bird', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'bird', 'the', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'bird', 'well', 'the', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'bird', 'well', 'the', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'bird', 'well', 'the', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', "don't", 'you', 'know', 'about', 'the', 'bird', 'Well', 'everybody', 'knows', 'that', 'the', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'A', 'well', 'a', 'everybody', 'is', 'heard', 'about', 'the', 'bird', 'Bird', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', "don't", 'you', 'know', 'about', 'the', 'bird', 'Well', 'everybody', 'is', 'talking', 'about', 'the', 'bird', 'A', 'well', 'a', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', "Surfin'", 'bird', 'Bbbbbbbbbbbbbbbbbb', 'aaah', 'Pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 
'pa', 'pa', 'Pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'ooma', 'mow', 'mow', 'Papa', 'ooma', 'mow', 'mow', 'Papa', 'ooma', 'mow', 'mow', 'papa', 'ooma', 'mow', 'mow', 'Papa', 'ooma', 'mow', 'mow', 'papa', 'ooma', 'mow', 'mow', 'Ooma', 'mow', 'mow', 'papa', 'ooma', 'mow', 'mow', 'Papa', 'ooma', 'mow', 'mow', 'papa', 'ooma', 'mow', 'mow', 'Papa', 'ooma', 'mow', 'mow', 'papa', 'ooma', 'mow', 'mow', 'Oom', 'oom', 'oom', 'oom', 'ooma', 'mow', 'mow', 'Papa', 'ooma', 'mow', 'mow', 'papa', 'oom', 'oom', 'oom', 'Oom', 'ooma', 'mow', 'mow', 'papa', 'ooma', 'mow', 'mow', 'Ooma', 'mow', 'mow', 'papa', 'ooma', 'mow', 'mow', 'Papa', 'a', 'mow', 'mow', 'papa', 'ooma', 'mow', 'mow', 'Papa', 'ooma', 'mow', 'mow', 'ooma', 'mow', 'mow', 'Papa', 'ooma', 'mow', 'mow', 'ooma', 'mow', 'mow', 'Papa', 'oom', 'oom', 'oom', 'oom', 'ooma', 'mow', 'mow', 'Oom', 'oom', 'oom', 'oom', 'ooma', 'mow', 'mow', 'Ooma', 'mow', 'mow', 'papa', 'ooma', 'mow', 'mow', 'Papa', 'ooma', 'mow', 'mow', 'ooma', 'mow', 'mow', 'Well', "don't" , 'you', 'know', 'about', 'the', 'bird', 'Well', 'everybody', 'knows', 'that', 'the', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word']
Modern renditions of the song are also quite entertaining.
What to put in your HW Write-up
- Paste in your code for count_words(word,word_list)
- Answer the following questions
  - How many words are in Surfin' Bird? (Hint: use the len() function to count how long the list is in the interactive loop.)
  - How many times does the word bird occur in full_lyrics?
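As a small illustration of the hint, len() returns the number of elements in a list. The list short_lyrics below is made up for this example and is not part of the assignment.

```python
# A made-up short list, used only to show how len() behaves
short_lyrics = ['the', 'bird', 'is', 'the', 'word']
print(len(short_lyrics))   # prints 5
```

Repeated words each count as a separate element, so 'the' contributes 2 to the length here.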
4 Appending To Lists
Lists can be modified in a number of ways. One of the most useful ways is to append things to the end of them, for which the append() function is used. The syntax is a little funny but it makes it clear which list is having elements added to it. Examples:
>>> my_list = ['a','b','c']
>>> print(my_list)
['a', 'b', 'c']
>>> my_list.append('d')
>>> print(my_list)
['a', 'b', 'c', 'd']
>>> my_list.append('e')
>>> my_list.append('f')
>>> print(my_list)
['a', 'b', 'c', 'd', 'e', 'f']
>>> my_other_list = []
>>> print(my_other_list)
[]
>>> my_other_list.append("bird")
>>> my_other_list.append("word")
>>> print(my_other_list)
['bird', 'word']
>>> print(my_list)
['a', 'b', 'c', 'd', 'e', 'f']
Appending is useful when one wants to build up a list of answers in a function. The following function creates a list of all odd numbers that appear in a list of numbers.
# Parameters
#   num_list: a list of numbers like [2,4,6] or [1,2,5], or [3,3,2,2,1,3]
#
# Return: A list of all the odd numbers that appear in num_list like
# [], [1,5], or [3,3,1,3]
def get_all_odds(num_list):
    odds = []
    for x in num_list:
        if x % 2 == 1:
            odds.append(x)
    return odds
Note the use of odds.append(x) which appends a number to the growing list of odds. Ultimately the list of odd numbers, called odds, is returned from the function.
Examples of use are below.
>>> get_all_odds([2,4,6])
[]
>>> get_all_odds([1,2,5])
[1, 5]
>>> get_all_odds([3,3,2,2,1,3])
[3, 3, 1, 3]
>>> odd_list = get_all_odds([3,3,2,2,1,3])
>>> print(odd_list)
[3, 3, 1, 3]
Append is useful to grow a new list while visiting elements of another list.
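The same pattern works when every element of the input list contributes one element to the new list. The sketch below uses get_word_lengths, a hypothetical function invented for illustration and not part of the assignment, which appends the length of each word.

```python
# Hypothetical example (not part of the assignment): build a list of
# the lengths of each word in word_list
def get_word_lengths(word_list):
    lengths = []                  # start with an empty list
    for w in word_list:
        lengths.append(len(w))    # add one number per word
    return lengths

print(get_word_lengths(["bird", "is", "the", "word"]))   # prints [4, 2, 3, 4]
```

Because one number is appended per word, the returned list always has the same length as the input list.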
5 Problem 2: Count All Words
A web search often has several words in it which can be represented as a list of words. Define a function that takes two lists:
- A list of query words to search for
- A list of words that appear on a web page
You will count all occurrences of all the query words on the list of page words and return a list of those counts.
Call your function count_all_words(query_words, page_words) and use the following comments/code to start your function.
# Count how many times each word in the list query_words occurs in the
# list page_words. Return a list of numbers that are the counts. Use
# the function count_words to make this easier. Use
# alist.append(number) to add a number at the end of a list.
#
# Parameters:
#   query_words: a list of words to search for like ["hello","kitty"]
#   page_words: a list of words that appears on a page like ["my","favorite","kitty","says","hello"]
#
# Return: a list of counts like [1,1] or [0,0,4,8,7]
def count_all_words(query_words, page_words):
Your basic approach should be as follows.
- Start with an empty list of word counts
- Use a loop to go through all words in query_words
- For each query word q, use a call to count_words(q, page_words) to count how many times the query word appears in the list page_words.
- Append the count for the query word onto the word counts using the append() function.
Examples
>>> count_all_words(["hello","kitty"], ["here", "kitty", "kitty","kitty","kitty","cat"])
[0, 4]
>>> count_all_words(["cat","dog","mouse"], ["here", "kitty", "kitty","kitty","kitty","cat"])
[1, 0, 0]
>>> count_all_words(["word","heard","ooma","mow"], full_lyrics)
[22, 2, 26, 60]
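The returned counts are in the same order as the query words, so the first count belongs to the first query word and so on. As an optional sketch for reading the output, the built-in zip() function can pair each word with its count. The counts list below is typed in by hand from the first example above rather than computed.

```python
query_words = ["hello", "kitty"]
counts = [0, 4]    # copied by hand from the first example above

# zip() pairs up elements at the same position in the two lists
for word, count in zip(query_words, counts):
    print(word + " : " + str(count))
```

This prints one line per query word, showing the word next to its count.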
What to put in your HW Write-up
- Paste in your code for count_all_words(query_words,page_words)
- Calculate how many times the words Papa, everybody, and the appear in the full_lyrics list from Problem 1. Show how you would call count_all_words() to count those three words and the output list of numbers like what is shown in the Examples above.
6 Problem 3: A Simple Web Page Scoring System
The following two functions build on your answers to Problem 1 and Problem 2 to create a simple web page ranking system. Paste these functions into your hw4.py file.
# Import some libraries to download web pages
from urllib.request import *

# Download the given url, split up the query string, return a list of
# how many times each word in the query string appears on the web
# page.
#
# Parameters:
#   query_string: string of words to search for like "hello kitty cartoon"
#   url: a web page like http://www.sanrio.com/ or http://www.tyrusbooks.com/books/hello-kitty-must-die
#
# Return: a list of how many times each query word occurs on page at
# the given url. The count is generated by the function
# count_all_words(q,p)
def count_web_words(query_string,url):
    connection = urlopen(url)                   # connect to the given url
    thebytes = connection.read()                # read whole page
    text = thebytes.decode("UTF-8").lower()     # decode and make all lower case
    page_words = text.split()                   # split page text on whitespace
    query_words = query_string.split()          # split query on whitespace
    return count_all_words(query_words,page_words)

# For the given query string, print the web_score of each page in
# url_list. The web score is the sum of how many times each query
# word in the query_string appears on given url. Use the function
# count_web_words(q,u) to generate a list of how many times each query
# word occurs and sum them.
#
# Parameters
#   query_string: string of words to search for like "hello kitty cartoon"
#   url_list: a list of url strings like ["http://www.google.com", "http://www.gmu.edu/"]
#
# Return: no return value, just print results to the screen
def print_all_web_scores(query_string, url_list):
    for url in url_list:
        query_counts = count_web_words(query_string,url)
        web_score = sum(query_counts)
        print(str(web_score)+" : "+url)
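To see what lower() and split() in count_web_words() do to page text, the sketch below applies them to a short string. The string is made up for illustration and does not come from any real page. Note that split() only breaks on whitespace, so punctuation and HTML tags stay attached to the words around them.

```python
# A made-up snippet of page text, used only for illustration
text = "Hello, Kitty! <b>Hello</b> kitty"
page_words = text.lower().split()
print(page_words)
# prints ['hello,', 'kitty!', '<b>hello</b>', 'kitty']

# Only exact matches are counted, so attached punctuation hides words
print(page_words.count("hello"))   # prints 0
print(page_words.count("kitty"))   # prints 1
```

This behavior may be worth keeping in mind when comparing the scores from this simple system with what a real search engine reports.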
The system is very simple: grab all words from a web page and count how many times the words in a query string occur in each page. The function print_all_web_scores(query_string,urls) is used to generate scores for a list of web pages for the given query. For example, here is a list of urls that are among the top Google hits for the query "computer science with python". Paste the list of urls into your hw4.py.
# Some test urls to use with the query "computer science with python"
python_urls=[
    "https://www.khanacademy.org/computing/computer-science",
    "http://www.openbookproject.net/thinkcs/python/english2e/",
    "http://www.amazon.com/Python-Programming-Introduction-Computer-Science/dp/1887902996",
    "https://www.edx.org/course/mitx/mitx-6-00-1x-introduction-computer-2841#.VCsIS3VdVhE",
    "http://neopythonic.blogspot.com/2013/10/book-review-charles-dierbach.html",
    "http://www.greenteapress.com/thinkpython/",
    "https://www.udacity.com/course/cs101",
    "http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-00sc-introduction-to-computer-science-and-programming-spring-2011/unit-1/lecture-5-objects-in-python/",
    "http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-01sc-introduction-to-electrical-engineering-and-computer-science-i-spring-2011/python-tutorial/",
    "http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.111.6062"
]
You can calculate the ranking of the pages for our system using the following call to print_all_web_scores().
>>> print_all_web_scores("computer science with python",python_urls)
84 : https://www.khanacademy.org/computing/computer-science
17 : http://www.openbookproject.net/thinkcs/python/english2e/
99 : http://www.amazon.com/Python-Programming-Introduction-Computer-Science/dp/1887902996
15 : https://www.edx.org/course/mitx/mitx-6-00-1x-introduction-computer-2841#.VCsIS3VdVhE
35 : http://neopythonic.blogspot.com/2013/10/book-review-charles-dierbach.html
17 : http://www.greenteapress.com/thinkpython/
63 : https://www.udacity.com/course/cs101
25 : http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-00sc-introduction-to-computer-science-and-programming-spring-2011/unit-1/lecture-5-objects-in-python/
32 : http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-01sc-introduction-to-electrical-engineering-and-computer-science-i-spring-2011/python-tutorial/
16 : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.111.6062
The scores are printed on the left and the url with that score on the right. The best scoring url has score 99 and is a link to an Amazon textbook.
- Check that your results match above. If not, there might be something wrong with one of your functions from Problem 1 or Problem 2
- You'll need to be connected to the internet for print_all_web_scores() to work because it downloads web pages and processes them. This can take a few seconds depending on how good your connection is.
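Note that print_all_web_scores() prints pages in the order they appear in the url list, not in order of score. When comparing against a ranked list of search results, it may help to sort the scores. The sketch below is one possible way to do that by hand: it copies three of the scores printed above into a list of (score, url) pairs and sorts them. The names scored_pages and ranked are invented for this example.

```python
# Three of the scores printed above, stored as (score, url) pairs
scored_pages = [
    (84, "https://www.khanacademy.org/computing/computer-science"),
    (17, "http://www.openbookproject.net/thinkcs/python/english2e/"),
    (99, "http://www.amazon.com/Python-Programming-Introduction-Computer-Science/dp/1887902996"),
]

# sorted() with reverse=True puts the highest score first
ranked = sorted(scored_pages, reverse=True)
for score, url in ranked:
    print(str(score) + " : " + url)
```

With these three pairs, the page with score 99 is printed first, then 84, then 17.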
Here is another list of urls generated from searching Google for the query "web search engine".
web_search_urls = [
    "http://en.wikipedia.org/wiki/Web_search_engine",
    "http://www.dogpile.com/",
    "http://www.bing.com/",
    "https://duckduckgo.com/",
    "http://search.yahoo.com/",
    "http://www.thesearchenginelist.com/",
    "https://www.ixquick.com/",
    "http://www.wordstream.com/articles/internet-search-engines-history"
]
Paste them into your hw4.py and use them in the questions below.
What to put in your HW Write-up
Answer the following questions
- The results for the query "computer science with python" among the urls in python_urls are given above. Do a Google search with the same query and report whether our word counting system gives the same or different results on how Google ranks pages. If you see differences in the way Google ranks pages, describe why you think the rankings differ from our system.
- Call the print_all_web_scores() function with the query "computer science python" and the url list python_urls. Note that the query is missing the word "with" compared to the first version of it. Copy the results you see into your write up. Do the relative ranks of the pages change with this small change to the query? Does the Amazon textbook site still have the top score or does some other site?
- Call the print_all_web_scores() function with the query "web search engine" and the url list web_search_urls. Paste the results into your HW writeup. Remember that the urls in web_search_urls were selected from the first page of Google results. Describe any differences you perceive with how Google must determine "good" pages to report versus our simple word counting scheme. Are there any sites missing from web_search_urls that you would expect to see there?