Last Updated: 2015-10-13 Tue 08:55

CS 100 HW 4: Python Lists and Web Searching

CHANGELOG:

Tue Oct 13 08:54:25 EDT 2015
Include the line
from urllib.request import *

above the code for Problem 3 in your hw4.py file. This line now appears in the code for problem 3 as well.

Thu Oct 8 09:46:03 EDT 2015
Updated web scores to match what is currently on the pages online. Fixed a few minor bugs in calls.

1 Instructions

  • Complete the problems in each section below which is labelled Problem
  • This is a pair assignment: you may select one partner to work with on this assignment as a group of 2. You may also opt to work alone as a group of 1.
  • You may collaborate with your group members on all aspects of the assignment including writing code and creating your HW write-up document. All members of your group should retain copies of the code and writeup document. Your group is not allowed to collaborate with other groups. If you are struggling, ask questions on Piazza, seek help from the TA or professor, and review the class notes and tutorial links provided.
  • Write your code in a file called hw4.py. You must submit this file to blackboard for full credit on the assignment.
  • You must also submit a HW Writeup which is an electronic document in one of the following formats.
    • A Microsoft Word Document (.doc or .docx extension)
    • A PDF (portable document format, .pdf)

    You may work in other programs (Apple Pages, Google Docs, etc.) but make sure you export/"save as" your work to an acceptable format before submitting it.

  • Only 1 member of your group should submit hw4.py and the HW Writeup on Blackboard
  • Make sure the HW Writeup has all group members' information in it:
    CS 100 HW 4 Writeup
    Group of 2
    Turanga Leela tleela4 G07019321
    Philip J Fry pfry99 G00000001
    
  • Make sure that hw4.py has all group members' information in it in comments
    # CS 100 HW 4 source code
    # Group of 2
    # Turanga Leela tleela4 G07019321
    # Philip J Fry pfry99 G00000001
    
  • Submit 2 files to our Blackboard site.
    • Log into Blackboard
    • Click on CS 100->HW Assignments->HW 4
    • Press the button marked "Attach File: Browse My Computer"
    • Select your HW Writeup file (.doc .docx .pdf .html .txt only)
    • Again: Press the button marked "Attach File: Browse My Computer"
    • This time select your hw4.py and attach it.
    • Click Submit
  • If you want verbose feedback print a copy of your HW writeup and submit it in class on the due date. You must still submit an electronic version.

2 Returning Values

Python functions can return values to a calling function. This allows functions to communicate with one another. For example, the following function counts how many numbers are odd in a list of numbers and returns the value to whoever asked.

# Count how many odd numbers appear in alist
#
# Parameters
#   alist: a list of numbers like [1,2] or [8,6,7,5,3,0,9]
#
# Return: a number like 1 or 4
def count_odds(alist):
    oddcount = 0
    for x in alist:
        if x % 2 == 1:
            oddcount = oddcount+1
    return oddcount

Notice the return oddcount line at the end which returns a value to whoever called the function.

Return values can be assigned to variables in Python's interactive loop.

>>> how_many_odds = count_odds([1,2])
>>> how_many_odds
1
>>> how_many_odds = count_odds([8,6,7,5,3,0,9])
>>> how_many_odds
4

Return values can also be used inside other functions to figure things out. The following function figures out which of two lists has more odd numbers and returns it.

# Calculate which list has more odds and return it
# Parameters
#   list1: a list of numbers like [1,2] or [8,6,7,5,3,0,9]
#   list2: a list of numbers like [1,2] or [8,6,7,5,3,0,9]
#
# Return: the list which has more odd numbers in it
def list_with_more_odds(list1,list2):
    oddcount1 = count_odds(list1)
    oddcount2 = count_odds(list2)
    if oddcount1 > oddcount2:
        return list1
    else:
        return list2

In the interactive loop one can use it like the following

>>> odder_list = list_with_more_odds([1,2], [8,6,7,5,3,0,9])
>>> odder_list
[8, 6, 7, 5, 3, 0, 9]

or build further functions on top of it.
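
As one possible illustration of building on top of it, the sketch below defines a hypothetical helper, most_odds_of_three(), which is not part of the assignment. It calls list_with_more_odds() twice to pick among three lists. count_odds() and list_with_more_odds() are repeated from above only so that the example runs on its own.

```python
# Repeated from above so this example is self-contained
def count_odds(alist):
    oddcount = 0
    for x in alist:
        if x % 2 == 1:
            oddcount = oddcount+1
    return oddcount

def list_with_more_odds(list1,list2):
    oddcount1 = count_odds(list1)
    oddcount2 = count_odds(list2)
    if oddcount1 > oddcount2:
        return list1
    else:
        return list2

# Hypothetical helper (not part of the assignment): return whichever
# of three lists has the most odd numbers
def most_odds_of_three(list1,list2,list3):
    best_of_two = list_with_more_odds(list1,list2)
    return list_with_more_odds(best_of_two,list3)

print(most_odds_of_three([1,2], [8,6,7,5,3,0,9], [2,4]))
```

The return value of the first call is stored in best_of_two and passed straight into the second call. On a tie the later list wins, because list_with_more_odds() returns list2 unless list1 has strictly more odd numbers.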

3 Problem 1: Count Words in a List

Part of Google's ranking of web pages is to determine how many times a query word appears on a given page. The computer will break all the words on a page into a list of words and then check a query word to see if it is there.

Define a function called count_words(word, word_list) to count how many times word appears in word_list.

  • Start a counter variable at 0
  • Use a for loop of some kind to look at each element of word_list
  • You can detect if two things are equal using an if statement and the == (equality) operator. Examples
    if i % 2 == 0:  # checks if i is even
    if w == "kitty": # checks if w is the word "kitty"
    if word1 == word2: # checks if word1 is equal to word2
    
  • Check if each element of the list is equal to the given word and if it is, add one onto your counter variable
  • After the loop, return the counter variable using return
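
One detail of == worth trying before writing the function: on strings it compares every character, so upper and lower case letters count as different. The short sketch below only demonstrates the comparisons; the words in it are made up for illustration and it does not do any counting.

```python
# == on strings compares every character, including upper/lower case
print("bird" == "bird")    # same characters, so they are equal
print("Bird" == "bird")    # capital B differs from lowercase b
print("bird" == "bird ")   # the second string has a trailing space

# Compare a variable against each element of a small list
w = "kitty"
for item in ["hello", "kitty", "Kitty"]:
    if item == w:
        print(item, "matches")
    else:
        print(item, "does not match")
```

Only the exact string "kitty" matches in the loop. This matters for the full_lyrics list later in this problem, which contains both 'bird' and 'Bird'.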

Start your code using the following comments to help you remember what the function is supposed to do.

# Count how many times the given word appears in the word_list. Return
# a number
#
# Parameters:
#   word: a word like "red", "fish", or "python"
#   word_list: a list of words like ["this","is","some","text"]
#
# Return: a number like 0 or 10
def count_words(word, word_list):

Examples:

>>> count_words("a", ["b","a","r","a","z"])
2
>>> count_words("r", ["b","a","r","a","z"])
1
>>> count_words("w", ["b","a","r","a","z"])
0

>>> lyrics = ['A-well-a', "everybody's", 'heard', 'about', 'the',
              'bird', 'bird', 'bird', 'bird', 'b-bird', 'is',
              'the', 'word']
>>> count_words("bird",lyrics)
4
>>> count_words("word",lyrics)
1

The full lyrics of Surfin' Bird by the Trashmen are contained in the list of words below, which you can paste into your hw4.py file.

full_lyrics = ['A', 'well', 'a', 'everybody', 'is', 'heard', 'about',
               'the', 'bird', 'Bird', 'bird', 'bird', 'b', 'bird',
               'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird',
               'bird', 'the', 'bird', 'is', 'the', 'word', 'A', 'well',
               'a', 'bird', 'bird', 'bird', 'well', 'the', 'bird', 'is',
               'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'bird',
               'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird',
               'bird', 'bird', 'well', 'the', 'bird', 'is', 'the', 'word',
               'A', 'well', 'a', 'bird', 'bird', 'b', 'bird', 'is', 'the',
               'word', 'A', 'well', 'a', 'bird', 'bird', 'bird', 'b',
               'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird',
               'bird', 'bird', 'well', 'the', 'bird', 'is', 'the', 'word',
               'A', 'well', 'a', 'bird', 'bird', 'b', 'bird', 'is', 'the',
               'word', 'A', 'well', 'a', "don't", 'you', 'know', 'about',
               'the', 'bird', 'Well', 'everybody', 'knows', 'that', 'the',
               'bird', 'is', 'the', 'word', 'A', 'well', 'a', 'bird',
               'bird', 'b', 'bird', 'is', 'the', 'word', 'A', 'well',
               'a', 'A', 'well', 'a', 'everybody', 'is', 'heard', 'about',
               'the', 'bird', 'Bird', 'bird', 'bird', 'b', 'bird', 'is',
               'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'bird',
               'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a',
               'bird', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word',
               'A', 'well', 'a', 'bird', 'bird', 'b', 'bird', 'is',
               'the', 'word', 'A', 'well', 'a', 'bird', 'bird', 'bird',
               'b', 'bird', 'is', 'the', 'word', 'A', 'well', 'a',
               'bird', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word',
               'A', 'well', 'a', 'bird', 'bird', 'bird', 'b', 'bird',
               'is', 'the', 'word', 'A', 'well', 'a', 'bird', 'bird',
               'bird', 'b', 'bird', 'is', 'the', 'word', 'A', 'well',
               'a', "don't", 'you', 'know', 'about', 'the', 'bird',
               'Well', 'everybody', 'is', 'talking', 'about', 'the', 'bird', 'A',
               'well', 'a', 'bird', 'bird', 'b', 'bird', 'is', 'the',
               'word', 'A', 'well', 'a', 'bird', "Surfin'", 'bird', 'Bbbbbbbbbbbbbbbbbb',
               'aaah', 'Pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa',
               'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa',
               'pa', 'Pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa',
               'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'pa', 'ooma',
               'mow', 'mow', 'Papa', 'ooma', 'mow', 'mow', 'Papa', 'ooma',
               'mow', 'mow', 'papa', 'ooma', 'mow', 'mow', 'Papa', 'ooma',
               'mow', 'mow', 'papa', 'ooma', 'mow', 'mow', 'Ooma', 'mow',
               'mow', 'papa', 'ooma', 'mow', 'mow', 'Papa', 'ooma', 'mow',
               'mow', 'papa', 'ooma', 'mow', 'mow', 'Papa', 'ooma', 'mow',
               'mow', 'papa', 'ooma', 'mow', 'mow', 'Oom', 'oom', 'oom',
               'oom', 'ooma', 'mow', 'mow', 'Papa', 'ooma', 'mow', 'mow',
               'papa', 'oom', 'oom', 'oom', 'Oom', 'ooma', 'mow', 'mow',
               'papa', 'ooma', 'mow', 'mow', 'Ooma', 'mow', 'mow', 'papa',
               'ooma', 'mow', 'mow', 'Papa', 'a', 'mow', 'mow', 'papa',
               'ooma', 'mow', 'mow', 'Papa', 'ooma', 'mow', 'mow', 'ooma',
               'mow', 'mow', 'Papa', 'ooma', 'mow', 'mow', 'ooma', 'mow',
               'mow', 'Papa', 'oom', 'oom', 'oom', 'oom', 'ooma', 'mow',
               'mow', 'Oom', 'oom', 'oom', 'oom', 'ooma', 'mow', 'mow',
               'Ooma', 'mow', 'mow', 'papa', 'ooma', 'mow', 'mow', 'Papa',
               'ooma', 'mow', 'mow', 'ooma', 'mow', 'mow', 'Well', "don't",
               'you', 'know', 'about', 'the', 'bird', 'Well', 'everybody', 'knows',
               'that', 'the', 'bird', 'is', 'the', 'word', 'A', 'well',
               'a', 'bird', 'bird', 'b', 'bird', 'is', 'the', 'word']

Modern renditions of the song are also quite entertaining.

What to put in your HW Write-up

  • Paste in your code for count_words(word,word_list)
  • Answer the following questions
    1. How many words are in Surfin' Bird? (Hint: use the len() function in the interactive loop to count how long the list is.)
    2. How many times does the word bird occur in full_lyrics?

4 Appending To Lists

Lists can be modified in a number of ways. One of the most useful is to append things to the end of a list using the append() function. The syntax is a little funny, but it makes clear which list is having elements added to it. Examples

>>> my_list = ['a','b','c']
>>> print(my_list)
['a', 'b', 'c']
>>> my_list.append('d')
>>> print(my_list)
['a', 'b', 'c', 'd']
>>> my_list.append('e')
>>> my_list.append('f')
>>> print(my_list)
['a', 'b', 'c', 'd', 'e', 'f']
>>> my_other_list = []
>>> print(my_other_list)
[]
>>> my_other_list.append("bird")
>>> my_other_list.append("word")
>>> print(my_other_list)
['bird', 'word']
>>> print(my_list)
['a', 'b', 'c', 'd', 'e', 'f']

Appending is useful when one wants to build up a list of answers in a function. The following function creates a list of all odd numbers that appear in a list of numbers.

# Parameters
#   num_list: a list of numbers like [2,4,6] or [1,2,5], or [3,3,2,2,1,3]
#
# Return: A list of all the odd numbers that appear in num_list like
#        [], [1,5], or [3,3,1,3]
def get_all_odds(num_list):
    odds = []
    for x in num_list:
        if x % 2 == 1:
            odds.append(x)
    return odds

Note the use of odds.append(x), which appends a number to the growing list of odds. Ultimately the list of odd numbers, called odds, is returned from the function.

Examples of use are below.

>>> get_all_odds([2,4,6])
[]
>>> get_all_odds([1,2,5])
[1, 5]
>>> get_all_odds([3,3,2,2,1,3])
[3, 3, 1, 3]
>>> odd_list = get_all_odds([3,3,2,2,1,3])
>>> print(odd_list)
[3, 3, 1, 3]

Append is useful to grow a new list while visiting elements of another list.
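
The same pattern also works when the new list should get one entry for every element visited, not just for some of them. The sketch below uses a hypothetical function, get_word_lengths(), which is not part of the assignment, to illustrate this.

```python
# Hypothetical example (not part of the assignment)
#
# Parameters
#   word_list: a list of words like ["bird","is","the","word"]
#
# Return: a list with the length of each word, like [4,2,3,4]
def get_word_lengths(word_list):
    lengths = []
    for w in word_list:
        lengths.append(len(w))   # one entry per word visited
    return lengths

print(get_word_lengths(["bird","is","the","word"]))
print(get_word_lengths([]))
```

Because append() is called once per pass through the loop, the returned list always has the same number of elements as word_list, and the entries stay in the same order.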

5 Problem 2: Count All Words

A web search often has several words in it, which can be represented as a list of words. Define a function that takes two lists

  • A list of query words to search for
  • A list of words that appear on a web page

You will count all occurrences of all the query words on the list of page words and return a list of those counts.

Call your function count_all_words(query_words, page_words) and use the following comments/code to start your function.

# Count how many times each word in the list query_words occurs in the
# list page_words.  Return a list of numbers that are the counts. Use
# the function count_words to make this easier. Use
# alist.append(number) to add a number at the end of a list.
#
# Parameters:
#   query_words: a list of words to search for like ["hello","kitty"]
#   page_words: a list of words that appears on a page like ["my","favorite","kitty","says","hello"]
#
# Return: a list of counts like [1,1] or [0,0,4,8,7]
def count_all_words(query_words, page_words):

Your basic approach should be as follows.

  • Start with an empty list of word counts
  • Use a loop to go through all words in query_words
  • For each query word q, use a call to count_words(q,p) to count how many times the query word appears in the list page_words.
  • Append the count for each query word onto the list of word counts using the append() function.

Examples

>>> count_all_words(["hello","kitty"], ["here", "kitty", "kitty","kitty","kitty","cat"])
[0, 4]

>>> count_all_words(["cat","dog","mouse"], ["here", "kitty", "kitty","kitty","kitty","cat"])
[1, 0, 0]

>>> count_all_words(["word","heard","ooma","mow"], full_lyrics)
[22, 2, 26, 60]

What to put in your HW Write-up

  • Paste in your code for count_all_words(query_words,page_words)
  • Calculate how many times the words Papa, everybody, and the appear in the full_lyrics list from Problem 1. Show how you would call count_all_words() to count those three words and the output list of numbers like what is shown in the Examples above.

6 Problem 3: A Simple Web Page Scoring System

The following two functions build on your answers to Problem 1 and Problem 2 to create a simple web page ranking system. Paste these functions into your hw4.py file.

# Import some libraries to download web pages
from urllib.request import *

# Download the given url, split up the query string, return a list of
# how many times each word in the query string appears on the web
# page.
#
# Parameters:
#   query_string: string of words to search for like "hello kitty cartoon"
#   url: a web page like http://www.sanrio.com/ or http://www.tyrusbooks.com/books/hello-kitty-must-die
#
# Return: a list of how many times each query word occurs on page at
#   the given url. The count is generated by the function
#   count_all_words(q,p)
def count_web_words(query_string,url):
    connection = urlopen(url)       # connect to the given url
    thebytes = connection.read()    # read whole page
    text = thebytes.decode("UTF-8").lower() # decode and make all lower case
    page_words = text.split()
    query_words = query_string.split()
    return count_all_words(query_words,page_words)
    
# For the given query string, print the web_score of each page in
# url_list.  The web score is the sum of how many times each query
# word in the query_string appears on given url.  Use the function
# count_web_words(q,u) to generate a list of how many times each query
# word occurs and sum them.
#
# Parameters
#   query_string: string of words to search for like "hello kitty cartoon"
#   url_list: a list of url strings like ["http://www.google.com", "http://www.gmu.edu/"]
#
# Return: no return value, just print results to the screen
def print_all_web_scores(query_string, url_list):
    for url in url_list:
        query_counts = count_web_words(query_string,url)
        web_score = sum(query_counts)
        print(str(web_score)+" : "+url)
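
The text handling inside count_web_words() can be tried without a network connection. The sketch below uses a made-up string in place of a downloaded page and a made-up list of counts in place of the result of count_all_words(); it only shows what lower(), split(), and sum() produce.

```python
# A made-up snippet of page text; real pages are downloaded by urlopen()
text = "Learn Python: computer science with Python. Python is fun!"

# Same steps as count_web_words(): lower case, then split on whitespace
page_words = text.lower().split()
print(page_words)

query_words = "computer science with python".split()
print(query_words)

# Made-up counts, one per query word, chosen to match what an exact ==
# comparison against page_words would give for this text
example_counts = [1, 1, 1, 1]
print(sum(example_counts))
```

split() breaks the text at whitespace only, so punctuation stays attached: page_words contains 'python:' and 'python.' as well as 'python', and only the last of these is equal to the query word "python". Because the page text is converted to lower case, query strings should be written in lower case as well.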

The system is very simple: grab all words from a web page and count how many times the words in a query string occur in each page. The function print_all_web_scores(query_string,urls) is used to generate scores for a list of web pages for the given query. For example, here is a list of urls that are among the top Google hits for the query computer science with python. Paste the list of urls into your hw4.py.

# Some test urls to use with the query "computer science with python"
python_urls=[
    "https://www.khanacademy.org/computing/computer-science",
    "http://www.openbookproject.net/thinkcs/python/english2e/",
    "http://www.amazon.com/Python-Programming-Introduction-Computer-Science/dp/1887902996",
    "https://www.edx.org/course/mitx/mitx-6-00-1x-introduction-computer-2841#.VCsIS3VdVhE",
    "http://neopythonic.blogspot.com/2013/10/book-review-charles-dierbach.html",
    "http://www.greenteapress.com/thinkpython/",
    "https://www.udacity.com/course/cs101",
    "http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-00sc-introduction-to-computer-science-and-programming-spring-2011/unit-1/lecture-5-objects-in-python/",
    "http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-01sc-introduction-to-electrical-engineering-and-computer-science-i-spring-2011/python-tutorial/",
    "http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.111.6062"
]

You can calculate the ranking of the pages for our system using the following call to print_all_web_scores().

>>> print_all_web_scores("computer science with python",python_urls)
84 : https://www.khanacademy.org/computing/computer-science
17 : http://www.openbookproject.net/thinkcs/python/english2e/
99 : http://www.amazon.com/Python-Programming-Introduction-Computer-Science/dp/1887902996
15 : https://www.edx.org/course/mitx/mitx-6-00-1x-introduction-computer-2841#.VCsIS3VdVhE
35 : http://neopythonic.blogspot.com/2013/10/book-review-charles-dierbach.html
17 : http://www.greenteapress.com/thinkpython/
63 : https://www.udacity.com/course/cs101
25 : http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-00sc-introduction-to-computer-science-and-programming-spring-2011/unit-1/lecture-5-objects-in-python/
32 : http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-01sc-introduction-to-electrical-engineering-and-computer-science-i-spring-2011/python-tutorial/
16 : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.111.6062

The scores are printed on the left and the url with that score on the right. The best scoring url has score 99 and is a link to an Amazon textbook.

  • Check that your results match those above. If not, there might be something wrong with one of your functions from Problem 1 or Problem 2.
  • You'll need to be connected to the internet for print_all_web_scores() to work because it downloads web pages and processes them. This can take a few seconds depending on how good your connection is.
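
If a page cannot be reached, urlopen() raises an error and the whole call stops. The sketch below shows one way such errors could be caught. It is optional and not required for the assignment; the helper name try_download() and the 10 second timeout are assumptions made for illustration.

```python
from urllib.request import urlopen
from urllib.error import URLError

# Hypothetical helper (not part of the assignment): return the text of
# the page at url in lower case, or None if a common download error
# occurs (bad url, no connection, page not found, timeout)
def try_download(url):
    try:
        connection = urlopen(url, timeout=10)   # give up after 10 seconds
        thebytes = connection.read()
    except (URLError, ValueError, OSError) as err:
        print("Could not download "+url+" : "+str(err))
        return None
    # Replace undecodable bytes rather than stopping with an error
    return thebytes.decode("UTF-8", errors="replace").lower()

print(try_download("not-a-url"))
```

The call at the end passes a string that is not a valid url, so urlopen() raises a ValueError before any network access, the message is printed, and None is returned. Code that uses a helper like this would need to check for None before splitting the text into words.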

Here is another list of urls generated from searching Google for the query "web search engine".

web_search_urls = [
    "http://en.wikipedia.org/wiki/Web_search_engine",
    "http://www.dogpile.com/",
    "http://www.bing.com/",
    "https://duckduckgo.com/",
    "http://search.yahoo.com/",
    "http://www.thesearchenginelist.com/",
    "https://www.ixquick.com/",
    "http://www.wordstream.com/articles/internet-search-engines-history"
]

Paste them into your hw4.py and use them in the questions below.

What to put in your HW Write-up

Answer the following questions

  1. The results for the query computer science with python among the urls in python_urls are given above. Do a Google search with the same query and report whether our word counting system ranks the pages the same way Google does or differently. If you see differences, describe why you think Google's rankings differ from those of our system.
  2. Call the print_all_web_scores() function with the query computer science python and the url list python_urls. Note that the query is missing the word with compared to the first version of it. Copy the results you see into your write up. Do the relative ranks of the pages change with this small change to the query? Does the Amazon textbook site still have the top score or does some other site?
  3. Call the print_all_web_scores() function with the query web search engine and the url list web_search_urls. Paste the results into your HW writeup. Remember that the urls in web_search_urls were selected from the first page of Google results. Describe any differences you perceive with how Google must determine "good" pages to report versus our simple word counting scheme. Are there any sites missing from web_search_urls that you would expect to see there?

Author: Chris Kauffman (kauffman@cs.gmu.edu)
Date: 2015-10-13 Tue 08:55