Loading... Cancel

JavaFX: Parse HTML content with DOMParser R

May 18th, 2007

I have started an interesting project. I will illustrate a crawler based on JavaFX scripting. Though there is no specific reason beyond using JavaFX except to enhance my knowledge on JavaFX scripting.

Here is some preliminary code that I have already written:

Import statement and package declaration –

———————————————————

package com.we4tech.linkcrawler.fx.script;

import java.lang.*;

import java.io.*;

import java.net.*;

import java.util.*;

import org.xml.sax.SAXException;

import org.w3c.dom.Document;

import org.w3c.dom.NodeList;

import org.w3c.dom.Node;

import org.w3c.dom.NamedNodeMap;

import com.sun.org.apache.xerces.internal.parsers.DOMParser;

Sample class skeleton –

————————————————————–

/**

* Define structure for Crawler class.

* @author hasan (hasan -AT- somewherein.net)

*/

public class Crawler {

/**

* Define the target url.

*/

public attribute url: String;

public attribute debug: Boolean;

private attribute foundLinks: String*;

/**

* Start crawling process.

*/

public operation start();

private operation followupLink(link: String);

/**

* Return a list of all available links.

*/

public function getLinks(): String*;

/**

* Print out debug output without timestamp and [LEVEL] prefix.

*/

private operation debug(message: String);

private operation error(message: String);

}

Implementation –

————————————————————————-

operation Crawler.start() {

debug(”Initiating crawler process..”);

debug(”URL is set to - {url}”);

debug(”Crawling just initiated.”);

followupLink(url);

}

operation Crawler.followupLink(link: String) {

debug(”Follow up link - {link}”);

if (link == null or link.length() == 0) {

error(”URL is empty”);

return;

} else {

try {

// Run in background thread through EDT.

do {

var parse = new DOMParser();

parse.parse(link);

var document = parse.getDocument();

var nodes = document.getElementsByTagName(”a”);

for (i in [0..nodes.getLength()]) {

var node = nodes.item(i);

var attributes = node.getAttributes();

if (attributes <> null) {

for (j in [0..attributes.getLength()]) {

var attr = attributes.item(j);

if (attr <> null) {

var nodeName = attr.getNodeName();

if (”href” == nodeName) {

var nodeValue = attr.getNodeValue();

debug(”{nodeName} link - {nodeValue}”);

insert nodeValue into foundLinks;

// start new follow up process.

followupLink(

}

}

}

}

}

}

} catch (e) {

error(”Error found during opening up new URLConnection - {e}”);

}

}

}

operation Crawler.debug(message: String) {

if (debug) {

System.out.println(”[DEBUG] - [{new Date()}] - {message}”);

}

}

operation Crawler.error(message: String) {

System.out.println(”[ERROR] - [{new Date()}] - {message}”);

}

function Crawler.getLinks() {

return foundLinks;

}

NOTE: I have published these codes to make some sense to those people who have started learning JavaFX. This might help other people to know more on it.

Best wishes,

A webmaster may or may not have qualms about pay per click. However majority of the search engine optimization experts advise to go with it and use other modalities like email marketing as well. In page optimization specially with the web design is encouraged too.

Total 1 response found

Close
  •   GGner

    Thu Jan 70 06:33

    That's what i'm looking for. I'm a jfx learner and trying to write a javafx html parser.
    But I couldn't see the definiton of followupLink() function in your code.
    Is it right?
    Thanks...

Posted in Java, JavaFX