Web Crawler 2.0

Java Homework Ten

The purpose of this assignment is to create a fairly complex application. You will need networking classes, Jsoup, string manipulation, and exception handling, and you will probably need ArrayLists as well.

Requirements

I would like a web crawler that can find bad links in a web site. The application has three main functions.

1. I would like to specify an HTML file via the command line (args[0]) and have the application list all links within that file (that was version 1.0).
2. For each of the links that are local, the application should print a message indicating whether the link is okay or broken.
3. For each of the links that point to local HTML files, the application should list all links inside those sub-files and determine whether those links are valid.

Do NOT check external links to determine if they are okay or broken. Just list external links as external. We will assume that any href beginning with http:// or https:// is an external link.
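
The external-link rule above can be sketched as a single test on the href. This is only an illustration: the class name LinkKind and the method name isExternal are made up for this sketch, and ignoring upper/lower case in the prefix is an assumption the assignment does not state.

```java
import java.util.Locale;

class LinkKind {
    // Per the assignment's rule: any href beginning with http:// or https://
    // is treated as external. Lower-casing first is an assumption of this sketch.
    static boolean isExternal(String href) {
        String h = href.trim().toLowerCase(Locale.ROOT);
        return h.startsWith("http://") || h.startsWith("https://");
    }

    public static void main(String[] args) {
        // Sample hrefs invented for the demonstration.
        String[] samples = {
            "homework/hw01.htm",
            "http://www.winthrop.edu/",
            "https://example.com/a.html"
        };
        for (String href : samples) {
            System.out.println(href + " : " + (isExternal(href) ? "external" : "local"));
        }
    }
}
```

Everything that is not external is treated as local, so only local links go on to the okay/broken check.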

Assume the web server is faculty.winthrop.edu. For example, if the user specified "dannellys", your program would search http://faculty.winthrop.edu/dannellys/.

Make your output clear and easily understood. It should be clear which links belong to the original URL and which links belong to different sub-URLs. Just a long list of links and whether they are good or bad is not very helpful to someone maintaining a web site.

Spaces inside an href will probably confuse your program, but that's okay. For example, <a href="file one.ppt"> will probably confuse your program.

Depth of the Search

Assume the file AAA.htm contains links to the files BBB.htm and CCC.htm. The file BBB.htm contains no links. The file CCC.htm links to DDD.htm and EEE.htm. In other words:

   AAA:
      link to BBB
      link to CCC
   BBB
      no links
   CCC
      link to DDD
      link to EEE
   DDD and EEE
      they exist, but contents don't matter

If told to search the file AAA.htm, the application will print that AAA contains valid links to BBB and CCC. Additionally, the application will print that BBB contains no links, and that CCC contains valid links to DDD and EEE.

Important Note: stop there! We don't want this application to become a big beast eating network bandwidth. So, we will only go down one level from AAA to BBB and CCC, but will not also search for links in the files DDD and EEE, and files that they link to, and files that those files link to, and so on.

Example Run

Here is an example run of one of my versions of this program. This output came from a solution where I used recursion. The output from my non-recursive version looks a lot different.

Hints

Where do I get started?

The first thing to do is change version 1 to report whether each link is local or external.

Second, you will need a function that can determine if a link is good or bad. While Jsoup does have a way to do this, I found it much easier to use Java's networking functions to connect to faculty.winthrop.edu, GET the file, and read the first line of the reply. A code of 200 in the first line means the link is good.
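
A minimal sketch of that idea follows; it is not the instructor's solution. The names LinkChecker, isGoodLink, and isOkStatusLine are hypothetical, and the sketch assumes plain HTTP on port 80 and a five-second read timeout. Any reply other than 200 (including a redirect) is reported as not okay.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class LinkChecker {
    // A status line looks like "HTTP/1.1 200 OK"; the second token is the code.
    static boolean isOkStatusLine(String statusLine) {
        if (statusLine == null) {
            return false;
        }
        String[] parts = statusLine.trim().split("\\s+");
        return parts.length >= 2 && parts[0].startsWith("HTTP/") && parts[1].equals("200");
    }

    // Connects to the host, sends a GET for the path, and reads only the
    // first line of the reply. Any I/O problem counts as a broken link.
    static boolean isGoodLink(String host, String path) {
        try (Socket socket = new Socket(host, 80)) {
            socket.setSoTimeout(5000); // assumption: give up after 5 seconds
            PrintWriter out = new PrintWriter(
                new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.US_ASCII));
            BufferedReader in = new BufferedReader(
                new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
            out.print("GET " + path + " HTTP/1.0\r\n");
            out.print("Host: " + host + "\r\n");
            out.print("\r\n");
            out.flush();
            return isOkStatusLine(in.readLine());
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String path = "/dannellys/";
        System.out.println(path + " : "
            + (isGoodLink("faculty.winthrop.edu", path) ? "okay" : "broken"));
    }
}
```

Keeping the status-line test in its own small function makes it easy to check without touching the network.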

The tricky part is getting the URL correct. If you are processing the file
    /dannellys/csci392/default.htm
and you find a reference to
    homework/hw01.htm
then the link you want to check is
    /dannellys/csci392/homework/hw01.htm
That link combines the directory portion of the current file's name with the reference.

That problem can be solved with either careful string manipulation or with Jsoup methods.
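
As an illustration of the string-manipulation route, here is a small sketch, with java.net.URI shown as a standard-library alternative (it is not one of the Jsoup methods mentioned above). The class and method names are invented for the example, and the URI version assumes the faculty.winthrop.edu host only so that it has an absolute base to resolve against.

```java
import java.net.URI;

class LinkResolver {
    // String manipulation: keep everything up to and including the last '/'
    // of the current file, then append the reference.
    static String resolveByString(String currentFile, String href) {
        if (href.startsWith("/")) {
            return href; // already rooted at the server, nothing to combine
        }
        int lastSlash = currentFile.lastIndexOf('/');
        return currentFile.substring(0, lastSlash + 1) + href;
    }

    // Alternative: let java.net.URI do the combining. It also handles "../".
    // An href containing a space makes resolve() throw IllegalArgumentException.
    static String resolveByUri(String currentFile, String href) {
        URI base = URI.create("http://faculty.winthrop.edu" + currentFile);
        return base.resolve(href).getPath();
    }

    public static void main(String[] args) {
        String current = "/dannellys/csci392/default.htm";
        System.out.println(resolveByString(current, "homework/hw01.htm"));
        System.out.println(resolveByUri(current, "homework/hw01.htm"));
    }
}
```

Both calls in main print /dannellys/csci392/homework/hw01.htm. The simple string version does not collapse "../" segments, which is one reason to consider a library method.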

Third, for all local links that are good and end with either .htm or .html, you will need to search them for links and determine which links are okay, broken, or external. In other words, you need to figure out how to go down one level.

To solve the third problem, you could build a list of HTML links that you need to process. You should use an ArrayList for this. For example, while processing /dannellys/csci392/, each good local link to an HTML file could be placed into an ArrayList. After processing /dannellys/csci392/, process each link in the list.
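
The list-based approach might look roughly like the sketch below. To stay self-contained it does no networking: a hypothetical in-memory Map stands in for "fetch the page and extract its hrefs", using the AAA/BBB/CCC example from the Depth of the Search section. A link counts as okay here if the map has an entry for it. All names are invented for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class OneLevelCrawl {
    // Hypothetical stand-in for the real site: file name -> links it contains.
    static Map<String, List<String>> exampleSite() {
        return Map.of(
            "AAA.htm", List.of("BBB.htm", "CCC.htm"),
            "BBB.htm", List.of(),
            "CCC.htm", List.of("DDD.htm", "EEE.htm"),
            "DDD.htm", List.of(),
            "EEE.htm", List.of());
    }

    static boolean isHtml(String link) {
        String l = link.toLowerCase(Locale.ROOT);
        return l.endsWith(".htm") || l.endsWith(".html");
    }

    static List<String> crawl(String start, Map<String, List<String>> site) {
        List<String> report = new ArrayList<>();
        List<String> toVisit = new ArrayList<>(); // good local HTML links found in start

        // Pass 1: check every link in the starting file and remember the HTML ones.
        report.add(start + ":");
        for (String link : site.getOrDefault(start, List.of())) {
            boolean ok = site.containsKey(link);
            report.add("  " + link + (ok ? " okay" : " broken"));
            if (ok && isHtml(link) && !toVisit.contains(link)) {
                toVisit.add(link);
            }
        }

        // Pass 2: check the links inside each remembered file, then stop.
        for (String page : toVisit) {
            report.add(page + ":");
            List<String> links = site.get(page);
            if (links.isEmpty()) {
                report.add("  no links");
            }
            for (String link : links) {
                report.add("  " + link + (site.containsKey(link) ? " okay" : " broken"));
            }
        }
        return report;
    }

    public static void main(String[] args) {
        for (String line : crawl("AAA.htm", exampleSite())) {
            System.out.println(line);
        }
    }
}
```

Because the second pass never adds anything to the list, the search cannot go deeper than one level.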

Another approach to solving the third problem is to write a recursive function. Be careful to add a stopping condition; otherwise, you might end up accidentally creating a Denial of Service attack.
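
A recursive version with an explicit stopping condition could be sketched as follows. As an assumption made to keep the example self-contained, a hypothetical in-memory Map replaces the real network fetch, and a link counts as okay if the map has an entry for it. The names RecursiveCrawl, crawl, depth, and maxDepth are invented for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class RecursiveCrawl {
    // Hypothetical stand-in for the real site: file name -> links it contains.
    static Map<String, List<String>> exampleSite() {
        return Map.of(
            "AAA.htm", List.of("BBB.htm", "CCC.htm"),
            "BBB.htm", List.of(),
            "CCC.htm", List.of("DDD.htm", "EEE.htm"),
            "DDD.htm", List.of(),
            "EEE.htm", List.of());
    }

    static boolean isHtml(String link) {
        String l = link.toLowerCase(Locale.ROOT);
        return l.endsWith(".htm") || l.endsWith(".html");
    }

    // depth is how far below the starting file we are; maxDepth is the limit.
    static void crawl(String page, int depth, int maxDepth,
                      Map<String, List<String>> site, List<String> report) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        report.add(indent + page + ":");
        List<String> links = site.getOrDefault(page, List.of());
        if (links.isEmpty()) {
            report.add(indent + "  no links");
        }
        for (String link : links) {
            boolean ok = site.containsKey(link);
            report.add(indent + "  " + link + (ok ? " okay" : " broken"));
            // Stopping condition: recurse only while above the depth limit.
            if (ok && isHtml(link) && depth < maxDepth) {
                crawl(link, depth + 1, maxDepth, site, report);
            }
        }
    }

    public static void main(String[] args) {
        List<String> report = new ArrayList<>();
        crawl("AAA.htm", 0, 1, exampleSite(), report); // go down one level only
        for (String line : report) {
            System.out.println(line);
        }
    }
}
```

With maxDepth set to 1, the links inside BBB and CCC are checked, but DDD and EEE are never opened. Indenting by depth is one way to show which links belong to which file.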

General Advice

My main() was very short. It worked best to write several short functions.

Most importantly, ADD LOTS OF COMMENTS AND PRINT STATEMENTS.

Submission Instructions

To make grading a bit easier, please name the class (and hence the file name for your code) hw10.

Email your one file to dannellys@winthrop.edu by the beginning of the final day of class.