Programming Project 4

In this project, you're going to write some classes that allow an HTML file to be read in and be converted into a tree structure. Once such a tree is obtained, we'll be attempting to create our own version of Google Image Search where we'll look at the images, text, and other information in HTML files in order to try to return good image search results. A good search engine will do more than simply rely on the text displayed on a webpage; it will use the hierarchical structure of the webpage to make decisions about what textual elements may relate closely to the images we're trying to find. In particular, after successfully completing this project, you'll have met the following goals:


Once you have completed this project, you will be ready to take Assessment 7 and Assessment 8. You will be uploading and using your own code for these assessments, so make sure to spend time commenting your code and making it clean and easy for you to work with. Assessment 7 will measure your understanding of trees, and Assessment 8 will measure your understanding of recursion.


Introduction


In this project, you'll be writing some code to create various classes that model the HTML code of a website (note: you don't need to know HTML to complete this project, but you may end up picking up some of it along the way). HTML is a type of markup language, which is a fancy way of saying that it is used to indicate how text and information are going to be organized and displayed in a web browser. Did you know web browsers are perfectly capable of displaying simple text files? However, plain text files don't usually look very appealing, which is where HTML comes into play.

If you right-click on this page and select View Source, you can see the HTML code responsible for rendering this page in the browser. An HTML file is a nested set of tags that the browser interprets in order to render the page in a visually appealing fashion. Besides text, you can have images, media files, and many other types of elements in an HTML file that a browser can display. Each tag is used by the browser to decide not only what to display, but also how to display it: tags are full of attributes. For example, the tag to display an image is <IMG>, and to set an image's height to be 150 pixels, we could specify this as an attribute of height=150.

You will have to do the following for this assignment:

The diagram below shows the relationship between various classes:


Step 1: Write the following classes


abstract class TreeNode
The purpose of this class is to model a node in a tree data structure. Trees are used to give a hierarchical ordering to information we want to store; for example, one might use a family tree to track a hereditary disease or recessive gene, or, in computer science, a tree to diagram the relationships between various classes, as we've been showing this semester in the projects.

Our trees will be made up of nodes, which look identical to one another; each node will store some data, and a reference to a set of child nodes. Similarly, each node will store a reference to its parent node. A tree can be built from nodes by storing the root node in a variable, and then adding its child nodes to it. Then, more generations of children can be added to each child node, etc., forming a tree.
In the binary tree image below, the root would be node 2 at the top, and its children would be nodes 7 and 5.
ATTRIBUTES (please do not change their names)
static int count This will store the number of tree nodes that have been created, and will be used to generate their IDs.
String id Each node will have a unique id that corresponds to the number of nodes that have been created so far. The first node to be created would have an id of 1.
List<TreeNode> children This will store a list of all the children of the node.
TreeNode parent This will store a reference to the parent of the node, or null if the node is a root.
METHODS (please do not change their names)
public TreeNode(List children) This constructor will set the children attribute to the incoming argument. It will also set the id of the node and make the current node the parent of every child passed in.
getters/setters for all attributes Use the Source option to automatically generate these in Eclipse.
public void addChild(TreeNode child) This method will add a single child to the current node, as well as connect the child node to its parent, the current node.
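
To make the parent/child bookkeeping concrete, here is a minimal sketch of how such a node class might look. It is only one possible reading of the description above: the typed List<TreeNode> parameter, the DemoNode subclass, and the TreeNodeDemo driver are assumptions made for illustration, and only the getters and setters the demo needs are shown.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the TreeNode described above; attribute names follow the spec.
abstract class TreeNode {
    private static int count = 0;      // number of nodes created so far
    private String id;                 // unique id derived from count
    private List<TreeNode> children;   // may be null for node types without children
    private TreeNode parent;           // null for a root

    public TreeNode(List<TreeNode> children) {
        count++;
        this.id = String.valueOf(count);
        this.children = children;
        if (children != null) {
            // Hook every incoming child up to this node as its parent.
            for (TreeNode child : children) {
                child.setParent(this);
            }
        }
    }

    public String getId() { return id; }
    public List<TreeNode> getChildren() { return children; }
    public TreeNode getParent() { return parent; }
    public void setParent(TreeNode parent) { this.parent = parent; }

    // Links in both directions: parent -> child and child -> parent.
    public void addChild(TreeNode child) {
        children.add(child);
        child.setParent(this);
    }
}

// Hypothetical concrete subclass, present only so the abstract class can be instantiated.
class DemoNode extends TreeNode {
    public DemoNode() { super(new ArrayList<>()); }
}

class TreeNodeDemo {
    public static void main(String[] args) {
        TreeNode root = new DemoNode();
        TreeNode left = new DemoNode();
        TreeNode right = new DemoNode();
        root.addChild(left);
        root.addChild(right);
        System.out.println(root.getChildren().size());   // 2
        System.out.println(left.getParent() == root);    // true
    }
}
```

Because count is static, ids keep increasing across every node created in the program, not per tree.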

TagNode
The purpose of this class is to model an HTML node that serves a specific purpose in the browser, denoted by its tag. It extends the TreeNode. These types of nodes can also store additional attributes, as name-value pairs.
ATTRIBUTES (please do not change their names)
String tag This will hold the tag of the node, which has meaning for the browser.
Map<String,String> attributes This uses a map data structure to store the name-value pairs.
METHODS (please do not change their names)
public TagNode(String tag) Sets the tag attribute to the incoming argument and calls the parent class's constructor. Its children should be a new list.
getters/setters for all attributes Use the Source option to automatically generate these in Eclipse.
public void addAttribute(String name, String value) Records a name-value pair in the mapping.
public String getValue(String name) Returns the value associated with that incoming name in the attributes.
public String mineCloseText() Looks at all the children, and then the siblings, of the current node; if any of these nodes are text nodes, it collects their text pieces, separated by spaces.
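
As a rough illustration of the attribute map and of mineCloseText, the sketch below builds a tiny tree by hand. The TreeNode and TextNode classes here are pared-down stand-ins rather than the full classes from this project, TagNodeDemo is a hypothetical driver, and the private collectText helper is an assumption about how the work might be split up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Pared-down stand-ins for the project's classes (assumed shapes, not the full spec).
abstract class TreeNode {
    private List<TreeNode> children;
    private TreeNode parent;
    TreeNode(List<TreeNode> children) { this.children = children; }
    public List<TreeNode> getChildren() { return children; }
    public TreeNode getParent() { return parent; }
    public void addChild(TreeNode child) { children.add(child); child.parent = this; }
}

class TextNode extends TreeNode {
    private String text;
    TextNode(String text) { super(null); this.text = text; }
    public String getText() { return text; }
}

class TagNode extends TreeNode {
    private String tag;
    private Map<String, String> attributes = new HashMap<>();

    TagNode(String tag) { super(new ArrayList<>()); this.tag = tag; }
    public String getTag() { return tag; }
    public void addAttribute(String name, String value) { attributes.put(name, value); }
    public String getValue(String name) { return attributes.get(name); }

    // Children first, then siblings (the parent's children); text pieces joined by spaces.
    public String mineCloseText() {
        List<String> pieces = new ArrayList<>();
        collectText(getChildren(), pieces);
        if (getParent() != null) {
            collectText(getParent().getChildren(), pieces);
        }
        return String.join(" ", pieces);
    }

    // Hypothetical helper: appends the text of every TextNode in the given list.
    private void collectText(List<TreeNode> nodes, List<String> pieces) {
        for (TreeNode node : nodes) {
            if (node instanceof TextNode) {
                pieces.add(((TextNode) node).getText());
            }
        }
    }
}

class TagNodeDemo {
    public static void main(String[] args) {
        TagNode div = new TagNode("DIV");
        TagNode img = new TagNode("IMG");
        img.addAttribute("src", "cat.png");
        div.addChild(new TextNode("A sleepy cat"));
        div.addChild(img);
        div.addChild(new TextNode("on a sofa"));
        System.out.println(img.getValue("src"));   // cat.png
        System.out.println(img.mineCloseText());   // A sleepy cat on a sofa
    }
}
```

The node itself appears in its parent's list of children, but it is a TagNode, so the instanceof check skips it without any special case.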

class TextNode
The purpose of this class is to model text, which is not a part of any tag. It extends TreeNode.
ATTRIBUTES (please do not change their names)
String text This will hold the text.
METHODS (please do not change their names)
public TextNode(String text) Sets the attributes to the incoming argument. A text node should not have any children, so it should set these to null.
getters/setters for all attributes Use the Source option to automatically generate these in Eclipse.

class FileParser
The purpose of this class is to parse an HTML file into a tree, and to mine elements of such a tree. Specifically, we're interested in locating images in an HTML file and finding surrounding and related text for an image to improve Google-like searches for images.
ATTRIBUTES (please do not change their names)
TagNode root This will hold the root of the HTML file parsed.
METHODS (please do not change their names)
public void createTree(ArrayList<String> lines) Creates a tree of nodes from the incoming HTML file (a list of strings). Each node begins with an opening tag and ends with a closing tag:

<TAG>
...children...
</TAG>

The two tags above correspond to a single TagNode. Between this opening and closing tag are the tag's children, which are either more tags or text nodes. Text nodes just appear as text, without any tags.

A tag can also be missing its closing tag, or be written inline on a single line with a slash at the end, in which case it has no children:

<TAG/>

A tag may also have name-value pairs, which need to be stored in the tag as its attributes:

<TAG align="right" bgcolor="blue" />
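
As a rough illustration of pulling the tag name and the name-value pairs out of a single line, the sketch below uses a regular expression. The class TagLineSketch and its method names are hypothetical and not part of the required classes. It assumes attribute values are always wrapped in double quotes, and it tolerates either slash direction before the closing >.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; nothing here is required by the assignment.
class TagLineSketch {
    // Matches name="value" pairs; assumes values are always double-quoted.
    private static final Pattern PAIR = Pattern.compile("(\\w+)\\s*=\\s*\"([^\"]*)\"");

    // True for a closing tag such as </TAG>.
    static boolean isClosingTag(String line) {
        return line.trim().startsWith("</");
    }

    // True for an inline tag that ends with a slash before the >.
    static boolean isSelfClosing(String line) {
        String trimmed = line.trim();
        return trimmed.endsWith("/>") || trimmed.endsWith("\\>");
    }

    // Strips the leading < or </ and the trailing >, then takes the first word.
    static String tagName(String line) {
        String inner = line.trim().replaceAll("^</?|[/\\\\]?>$", "").trim();
        return inner.split("\\s+")[0];
    }

    // Collects every name="value" pair on the line, in order of appearance.
    static Map<String, String> attributes(String line) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher matcher = PAIR.matcher(line);
        while (matcher.find()) {
            pairs.put(matcher.group(1), matcher.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "<TAG align=\"right\" bgcolor=\"blue\" />";
        System.out.println(tagName(line));         // TAG
        System.out.println(attributes(line));      // {align=right, bgcolor=blue}
        System.out.println(isSelfClosing(line));   // true
    }
}
```

In a tree builder, each pair returned by attributes could be passed to addAttribute on the TagNode created for that line.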

getters/setters for all attributes Use the Source option to automatically generate these in Eclipse.
public void mineImages(ArrayList images, TreeNode node) Uses recursion to populate the incoming images list with all the nodes that have IMG as their tag, starting with the node passed in. The method must call itself (direct recursion).
public String getKeywordsForImage(String filename) Looks for a node with an IMG tag, and if its src name-value pair matches the incoming filename, attempts to mine nearby text. First, it calls the mineCloseText method on the node; then, if the node has an alt name-value pair, it appends that value to the result. Otherwise, if it hasn't found the image yet, it makes recursive calls on all of the current node's children.

For ease of implementation, you may choose to use a helper method, which must be named getKeywordsForImageHelper, to implement this method. Either the getKeywordsForImage or the getKeywordsForImageHelper method must call itself (direct recursion). If you don't use getKeywordsForImageHelper, please implement it as a dummy method so the style checkers pass.
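
The two recursive methods described above might be sketched as follows. This is a sketch under several assumptions, not a reference solution: the node classes are pared-down stand-ins (mineCloseText here only looks at siblings), FileParserSketch and RecursionDemo are hypothetical names, the helper's parameter list is a guess, src is compared to the filename by exact equality, and an empty string stands for "image not found".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Pared-down stand-ins for the project's node classes (assumed shapes, not the full spec).
abstract class TreeNode {
    private final List<TreeNode> children;
    private TreeNode parent;
    TreeNode(List<TreeNode> children) { this.children = children; }
    public List<TreeNode> getChildren() { return children; }
    public TreeNode getParent() { return parent; }
    public void addChild(TreeNode child) { children.add(child); child.parent = this; }
}

class TextNode extends TreeNode {
    private final String text;
    TextNode(String text) { super(null); this.text = text; }
    public String getText() { return text; }
}

class TagNode extends TreeNode {
    private final String tag;
    private final Map<String, String> attributes = new HashMap<>();
    TagNode(String tag) { super(new ArrayList<>()); this.tag = tag; }
    public String getTag() { return tag; }
    public void addAttribute(String name, String value) { attributes.put(name, value); }
    public String getValue(String name) { return attributes.get(name); }

    // Simplified stand-in: joins the text of sibling text nodes only.
    public String mineCloseText() {
        List<String> pieces = new ArrayList<>();
        if (getParent() != null) {
            for (TreeNode sibling : getParent().getChildren()) {
                if (sibling instanceof TextNode) pieces.add(((TextNode) sibling).getText());
            }
        }
        return String.join(" ", pieces);
    }
}

// Hypothetical class name; in the project these methods belong to FileParser.
class FileParserSketch {
    private final TagNode root;
    FileParserSketch(TagNode root) { this.root = root; }

    private static boolean isImage(TreeNode node) {
        return node instanceof TagNode && "IMG".equalsIgnoreCase(((TagNode) node).getTag());
    }

    // Direct recursion: record this node if it is an IMG, then visit each child.
    public void mineImages(ArrayList<TagNode> images, TreeNode node) {
        if (node == null) return;
        if (isImage(node)) images.add((TagNode) node);
        if (node.getChildren() != null) {
            for (TreeNode child : node.getChildren()) mineImages(images, child);
        }
    }

    public String getKeywordsForImage(String filename) {
        return getKeywordsForImageHelper(filename, root);
    }

    // Assumed signature. Returns "" when no matching image is found at or below node.
    private String getKeywordsForImageHelper(String filename, TreeNode node) {
        if (node == null) return "";
        if (isImage(node) && filename.equals(((TagNode) node).getValue("src"))) {
            TagNode image = (TagNode) node;
            String text = image.mineCloseText();
            String alt = image.getValue("alt");
            return alt == null ? text : (text + " " + alt).trim();
        }
        if (node.getChildren() != null) {
            for (TreeNode child : node.getChildren()) {
                String found = getKeywordsForImageHelper(filename, child);
                if (!found.isEmpty()) return found;
            }
        }
        return "";
    }
}

class RecursionDemo {
    public static void main(String[] args) {
        TagNode body = new TagNode("BODY");
        TagNode div = new TagNode("DIV");
        TagNode img = new TagNode("IMG");
        img.addAttribute("src", "cat.png");
        img.addAttribute("alt", "tabby");
        body.addChild(div);
        div.addChild(new TextNode("A sleepy cat"));
        div.addChild(img);

        FileParserSketch parser = new FileParserSketch(body);
        ArrayList<TagNode> images = new ArrayList<>();
        parser.mineImages(images, body);
        System.out.println(images.size());                           // 1
        System.out.println(parser.getKeywordsForImage("cat.png"));   // A sleepy cat tabby
    }
}
```

Both methods follow the same shape: handle the current node, then make the same call on each child. Text nodes have a null list of children in this sketch, which is why the null check comes before the loop.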


Step 2: Testing Your Code For Functionality and Elegance



The unit tests will, in addition to correctness, measure the following elegance metrics:

Coding style and readability Make sure you use descriptive variable names, proper indentation, etc.
Class, attribute, and method documentation Make sure you comment the purpose of each of these items.
Adherence to abstraction barriers
  • Public methods are used to access private class attributes, rather than accessing them directly.
Code elegance
  • createTree should be simplified to be fewer than 50 lines of code.
  • Recursion must be used in the mineImages function.
  • Recursion must be used in the getKeywordsForImage function (or its helper function, if you wrote one).
Inheritance correctness and safety
  • All child classes appropriately use the parent class's constructors.
  • The @Override tag is used in all the appropriate places.
  • All appropriate attributes are private.


Step 4: Preparing for Assessments 7 and 8

Create a copy of all of your files in a separate directory (preferably a new project in Eclipse). Then, modify those files to get them to pass the unit tests of Sample Assessment 7 and Sample Assessment 8.