The purpose of this assignment is to create a fairly complex application. You will need to use networking classes, Jsoup, string manipulation, and exception handling, and you will probably need ArrayLists.
I would like a web crawler that can find bad links in a web site. The application has three main functions.
Do NOT check external links to determine if they are okay or broken. Just list external links as external. We will assume that any href beginning with http:// or https:// is an external link.
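Under that rule, the external-vs-local test is a one-line string check. A minimal sketch (the class and method names here are made up for illustration):

```java
public class LinkKind {
    // Assignment rule: any href beginning with http:// or https://
    // is external; everything else is treated as local.
    static boolean isExternal(String href) {
        return href.startsWith("http://") || href.startsWith("https://");
    }

    public static void main(String[] args) {
        System.out.println(isExternal("https://www.winthrop.edu/")); // true
        System.out.println(isExternal("homework/hw01.htm"));         // false
    }
}
```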
Assume the web server is faculty.winthrop.edu. For example, if the user specified "dannellys", your program would search http://faculty.winthrop.edu/dannellys/.
Make your output clear and easily understood. It should be clear which links belong to the original URL and which belong to different sub-URLs. Just a long list of links marked good or bad is not very helpful to someone maintaining a web site.
Spaces inside an href, such as <a href="file one.ppt">, will probably confuse your program, but that's okay.
Suppose the files link to each other like this:

AAA: links to BBB and CCC
BBB: no links
CCC: links to DDD and EEE
DDD and EEE: they exist, but their contents don't matter

If told to search the file AAA.htm, the application will print that AAA contains valid links to BBB and CCC. Additionally, the application will print that BBB contains no links, and that CCC contains valid links to DDD and EEE.
Important Note: stop there! We don't want this application to become a big beast eating network bandwidth. So, we will only go down one level from AAA to BBB and CCC, but will not also search for links in the files DDD and EEE, and files that they link to, and files that those files link to, and so on.
The first thing to do is change version 1 to report which links are local or external.
Second, you will need a function that can determine if a link is good or bad. While Jsoup does have a way to do this, I found it much easier to use Java's networking functions to connect to faculty.winthrop.edu, GET the file, and read the first line of the reply. A code of 200 in the first line means the link is good.
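One way to sketch that check, assuming the server answers plain HTTP on port 80 (the class and method names are made up, and main only demonstrates the status-line test without touching the network):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class LinkStatus {
    // Does the first line of the HTTP reply report success?
    // e.g. "HTTP/1.1 200 OK" is good, "HTTP/1.1 404 Not Found" is bad.
    static boolean statusIsOk(String statusLine) {
        return statusLine != null && statusLine.contains(" 200");
    }

    // Connect to the server, GET the file, and read the first reply line.
    static boolean isGood(String host, String path) throws IOException {
        try (Socket sock = new Socket(host, 80);
             PrintWriter out = new PrintWriter(sock.getOutputStream());
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(sock.getInputStream()))) {
            out.print("GET " + path + " HTTP/1.0\r\n");
            out.print("Host: " + host + "\r\n\r\n");
            out.flush();
            return statusIsOk(in.readLine());
        }
    }

    public static void main(String[] args) {
        System.out.println(statusIsOk("HTTP/1.1 200 OK"));        // true
        System.out.println(statusIsOk("HTTP/1.1 404 Not Found")); // false
    }
}
```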
The tricky part is getting the URL correct. If you are processing the file
/dannellys/csci392/default.htm
and you find a reference to
homework/hw01.htm
then the link you want to check is
/dannellys/csci392/homework/hw01.htm
In other words, the link to check is built from part of the current file's name and part of the reference.
That problem can be solved with either careful string manipulation or with Jsoup methods.
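The string-manipulation route can be sketched as below (a sketch only; the names are made up, and it handles just the simple cases shown above, not things like "../" or query strings):

```java
public class LinkResolve {
    // If the href is already absolute (starts with "/"), use it as-is;
    // otherwise append it to the directory portion of the file
    // currently being processed.
    static String resolve(String currentFile, String href) {
        if (href.startsWith("/")) {
            return href;
        }
        int lastSlash = currentFile.lastIndexOf('/');
        return currentFile.substring(0, lastSlash + 1) + href;
    }

    public static void main(String[] args) {
        // The example from the assignment:
        System.out.println(resolve("/dannellys/csci392/default.htm",
                                   "homework/hw01.htm"));
        // prints /dannellys/csci392/homework/hw01.htm
    }
}
```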
Third, for all local links that are good and end with either .htm or .html, you will need to search them for links and determine which links are okay, broken, or external. In other words, you need to figure out how to go down one level.
To solve the third problem, you could build a list of HTML links that you need to process. You should use an ArrayList for this. For example, while processing /dannellys/csci392, each good local link to an HTML file could be placed into an ArrayList. After processing /dannellys/csci392/, process each link in the list.
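The worklist idea looks roughly like this (a sketch; processPage here is a stand-in for your real page processor, which would fetch the page, report on its links, and return the good local HTML links it found):

```java
import java.util.ArrayList;

public class CrawlList {
    // Stand-in for the real page processor.
    static ArrayList<String> processPage(String path) {
        System.out.println("Processing " + path);
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        ArrayList<String> toVisit = new ArrayList<>();

        // Level 0: process the starting page and queue its HTML links.
        toVisit.addAll(processPage("/dannellys/csci392/"));

        // Level 1: process each queued page, but queue nothing further,
        // so the crawl stops one level down.
        for (String page : toVisit) {
            processPage(page);
        }
    }
}
```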
Another approach to solving the third problem is to write a recursive function. Be careful to add a stopping condition, otherwise you might end up accidentally creating a Denial of Service attack.
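A depth parameter is one simple stopping condition for the recursive version. In this sketch (names and sample data made up; linksIn is a stand-in for real link extraction, wired to mimic the AAA example), crawl returns how many pages it visited:

```java
import java.util.List;

public class CrawlRecursive {
    static final int MAX_DEPTH = 1;  // only go one level below the start

    // The depth parameter is the stopping condition that keeps the
    // crawler from running away and hammering the server.
    static int crawl(String path, int depth) {
        System.out.println("Checking " + path);
        int visited = 1;
        if (depth >= MAX_DEPTH) {
            return visited;  // stop: do not follow links any deeper
        }
        for (String link : linksIn(path)) {
            visited += crawl(link, depth + 1);
        }
        return visited;
    }

    // Stand-in for real link extraction; mimics the AAA example.
    static List<String> linksIn(String path) {
        if (path.equals("AAA.htm")) {
            return List.of("BBB.htm", "CCC.htm");
        }
        return List.of();
    }

    public static void main(String[] args) {
        // Visits AAA, BBB, and CCC, then stops.
        System.out.println(crawl("AAA.htm", 0));  // 3
    }
}
```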
My main() was very short. It worked best to write several short functions.
Most importantly, ADD LOTS OF COMMENTS AND PRINT STATEMENTS.
To make grading a bit easier, please name the class (and hence the file name for your code) hw10.
Email your one file to dannellys@winthrop.edu by the beginning of the final day of class.