The purpose of this assignment is to create a fairly complex application. You will need to use networking classes, ArrayLists, string manipulation, exception handling, etc.
I would like a web crawler that can find bad links in a web site. The application has three main functions.
If an .htm file is referenced more than once, search its contents only once.
For example...
Assume the file AAA.htm contains links to the files BBB.htm and CCC.htm. The file BBB.htm contains no links. The file CCC.htm links to DDD.htm and EEE.htm. In other words:
AAA
    link to BBB
    link to CCC
BBB
    no links
CCC
    link to DDD
    link to EEE
DDD and EEE
    they exist, but contents don't matter
If told to search the file AAA.htm, the application will print that AAA contains valid
links to BBB and CCC. Additionally, the application will print that BBB contains no
links, and that CCC contains valid links to DDD and EEE.
Important Note: stop there! We don't want this application to become a big beast eating network space. So, we will only go down one level from AAA to BBB and CCC, but will not also search for links in the files DDD and EEE, and files that they link to, and files that those files link to, and ... If this were a real web crawler, then we might make it recursive. But a recursive version of this program would quickly create a huge list of links to search.
Here is an example run when searching faculty.winthrop.edu/dannelly/csci392/. Note that I printed the list of external links (which are not validated), but you can just ignore external links.
To make programming this project manageable, we will impose a few limitations.
First, hard-code your application to connect only to faculty.winthrop.edu, although you may want to test the program on your own web site.
Second, don't worry about links to external sites. Note that the first link below is local and needs to be processed, but the second is external and should be ignored.
<a href="bob.htm"> a link to bob </a>
<a href="http://www.cnn.com"> an external link </a>
Third, ignore email links.
Fourth, it is okay to ignore (or give an error for) lines that contain two links.
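As a rough illustration of the second and third limitations, here is one way the sorting of links might look. The class name LinkKind, the method classify, and the labels it returns are all made up for this sketch, and it assumes email links start with "mailto:" and external links start with "http://", "https://", or "//".

```java
import java.util.Locale;

class LinkKind {
    // Hypothetical helper: labels the value of an href attribute.
    // Assumes email links begin with "mailto:" and external links
    // begin with "http://", "https://", or "//".
    static String classify(String href) {
        String h = href.trim().toLowerCase(Locale.ROOT);
        if (h.startsWith("mailto:")) {
            return "email";        // third limitation: ignore
        }
        if (h.startsWith("http://") || h.startsWith("https://") || h.startsWith("//")) {
            return "external";     // second limitation: ignore
        }
        if (h.endsWith(".htm")) {
            return "local-htm";    // the only kind that gets processed
        }
        return "other";            // images, style sheets, and so on
    }

    public static void main(String[] args) {
        String[] samples = { "bob.htm", "http://www.cnn.com", "mailto:someone@example.com" };
        for (String s : samples) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Only links labeled "local-htm" would go on to be checked and searched; everything else can be skipped or just printed.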
Where do I get started?
The first thing to do is put all the links that version 1 found into an ArrayList. Just print the contents of the list. If it contains the right stuff, then move on. After you put every link into the list, try to put just the local .htm files into that list.
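A minimal sketch of that first step might look like the following. The names LinkExtractor, firstHref, and collectLinks are invented for illustration, and the code assumes each link is written with a lowercase href=" and double quotes, with at most one link per line (as the limitations above allow).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LinkExtractor {
    // Returns the href value of the first link on the line, or null if
    // the line has no link. Assumes the form href="..." in lowercase.
    static String firstHref(String line) {
        String marker = "href=\"";
        int start = line.indexOf(marker);
        if (start < 0) {
            return null;
        }
        start += marker.length();
        int end = line.indexOf('"', start);
        if (end < 0) {
            return null;   // no closing quote, so treat it as no link
        }
        return line.substring(start, end);
    }

    // Puts every link found in the given lines into an ArrayList.
    static ArrayList<String> collectLinks(List<String> lines) {
        ArrayList<String> links = new ArrayList<>();
        for (String line : lines) {
            String href = firstHref(line);
            if (href != null) {
                links.add(href);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList(
            "<a href=\"bob.htm\"> a link to bob </a>",
            "<p>just some text</p>",
            "<a href=\"http://www.cnn.com\"> an external link </a>");
        // Print the list first; narrow it to local .htm files afterwards.
        System.out.println(collectLinks(page));
    }
}
```

Here the lines come from a hard-coded list so the sketch runs by itself; in the assignment they would come from the file being searched.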
The second task that I did was write a function that returns true if a file exists and false if it does not. So, my version 1.1 just printed the list of htm links and whether each link was good or bad. This is easy to do. Just open a socket to the server, send the request, and read just the first line. (The first line contains status info that will tell us if the file exists. We don't care (yet) about the contents of the file.) If the first line contains a code 200 (an "okay" message), then return true, otherwise return false.
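One possible shape for that function is sketched below. The names LinkChecker, isOkStatusLine, and fileExists are placeholders, and the sketch assumes a plain HTTP GET request on port 80. Note that a server that redirects to another address answers with a 3xx code, which this sketch reports as false.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

class LinkChecker {
    // True when a status line such as "HTTP/1.1 200 OK" carries code 200.
    static boolean isOkStatusLine(String statusLine) {
        if (statusLine == null) {
            return false;
        }
        String[] parts = statusLine.trim().split("\\s+");
        return parts.length >= 2 && parts[0].startsWith("HTTP/") && parts[1].equals("200");
    }

    // Opens a socket, sends the request, and reads only the first line.
    // Assumption: plain HTTP on port 80; any I/O problem counts as "bad link".
    static boolean fileExists(String host, String path) {
        try (Socket socket = new Socket(host, 80);
             PrintWriter out = new PrintWriter(socket.getOutputStream());
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.print("GET " + path + " HTTP/1.0\r\n");
            out.print("Host: " + host + "\r\n\r\n");
            out.flush();
            return isOkStatusLine(in.readLine());
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length == 2) {
            // e.g. java LinkChecker <host> <path>
            System.out.println(fileExists(args[0], args[1]) ? "good" : "bad");
        } else {
            System.out.println(isOkStatusLine("HTTP/1.1 200 OK"));
            System.out.println(isOkStatusLine("HTTP/1.1 404 Not Found"));
        }
    }
}
```

Keeping the status-line test in its own small function makes it easy to try out with print statements before any networking is involved.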
I originally had all my code in main. So, for me, the next task was to use much of the code inside main to write a function that takes a file name as a parameter and reads each line of the file to find the links (all the link names go into an ArrayList). Once this was a function, main could call the function once passing in arg[0], and then again in a loop passing in the contents of the ArrayList of links.
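The overall flow of main might then look something like this sketch. To keep it runnable without a network, the link-finding function is replaced by a stand-in (linksIn) that looks names up in a hard-coded copy of the AAA example; the names OneLevelCrawl, crawl, and linksIn are hypothetical, and the repeated BBB link is added only to show that a file is searched once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class OneLevelCrawl {
    // Stand-in for the real link-finding function: a fake site in memory.
    static List<String> linksIn(String file) {
        Map<String, List<String>> site = new HashMap<>();
        // BBB.htm is listed twice on purpose to show de-duplication.
        site.put("AAA.htm", Arrays.asList("BBB.htm", "CCC.htm", "BBB.htm"));
        site.put("BBB.htm", Collections.<String>emptyList());
        site.put("CCC.htm", Arrays.asList("DDD.htm", "EEE.htm"));
        return site.getOrDefault(file, Collections.<String>emptyList());
    }

    // Searches the start file, then each file it links to, and stops there.
    // Returns the files whose contents were searched, in order.
    static List<String> crawl(String start, Function<String, List<String>> findLinks) {
        List<String> searched = new ArrayList<>();
        HashSet<String> seen = new HashSet<>();
        seen.add(start);
        searched.add(start);
        List<String> firstLevel = findLinks.apply(start);
        for (String link : firstLevel) {
            if (seen.add(link)) {          // add() is false for a repeat
                searched.add(link);
                List<String> found = findLinks.apply(link);
                System.out.println(link + " links to " + found);
                // One level only: the files in 'found' are not searched.
            }
        }
        return searched;
    }

    public static void main(String[] args) {
        System.out.println(crawl("AAA.htm", OneLevelCrawl::linksIn));
        // prints [AAA.htm, BBB.htm, CCC.htm] as the last line
    }
}
```

Passing the link finder in as a parameter is just a convenience for this sketch; in the assignment, main would call the real function directly.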
My final word of advice is, yet again, ADD LOTS OF COMMENTS AND PRINT STATEMENTS. Just to illustrate this point, here is some debugging output from one of my versions.
To make grading a bit easier, please name the class (and hence the file name for your code) hw8.
Email your one file to dannellys@winthrop.edu by the beginning of class on Nov 3rd.