Screenscraping made easy using jsoup and Maven

August 30th, 2011 by Micha Kops

Sometimes in a developer’s life there is no clean API available to gather information from a web application .. no SOAP, no XML-RPC and no REST .. just a website hiding the information we’re looking for somewhere in its DOM hierarchy – so the only solution is screenscraping.

Screenscraping always leaves me with a bad feeling – but luckily there is a tool that makes this job at least a bit easier for a developer .. jsoup to the rescue!

Prerequisites
Creating a new Project
Screenscraping a Website
Parsing HTML Fragments
Tutorial Sources
Resources

Prerequisites

Nothing special here .. just a JDK and good ole’ Maven ..

Creating a new Project

First we need a new Maven project …

Create it with your IDE or from the console using mvn archetype:generate

We need just one dependency, jsoup. With it added, my pom.xml looks like this:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.hascode.samples</groupId>
    <artifactId>jsoup-example</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.6.1</version>
        </dependency>
    </dependencies>
</project>

Screenscraping a Website

In the following example, we’re going to fetch the content of www.hascode.com and parse its title, the heading of the latest article and some of the metadata available …

That’s what my screenscraping class looks like:

package com.hascode.samples.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class WebScraper {
    public static void main(final String[] args) throws IOException {
        // fetch the page with a browser-like user agent and a 6 second timeout
        Document doc = Jsoup.connect("https://www.hascode.com/")
                .userAgent("Mozilla").timeout(6000).get();
        String title = doc.title(); // the page's title
        System.out.println("The title of www.hascode.com is: " + title);
        Elements heading = doc.select("h2 > a"); // the latest article's heading
        System.out.println("The latest article is: " + heading.text());
        System.out.println("The article's URL is: " + heading.attr("href"));
        Elements editorial = doc.select("div.BlockContent-body small");
        System.out.println("The article was created: " + editorial.text());
    }
}

Running the class produces the following output:

The title of www.hascode.com is: hasCode.com
The latest article is: Contract-First Web-Services using JAX-WS, JAX-B, Maven and Eclipse
The article's URL is: https://www.hascode.com/2011/08/contract-first-web-services-using-jax-ws-jax-b-maven-and-eclipse/
The article was created: August 23rd, 2011 by micha kops
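One thing worth knowing about selectors like the one above: select returns every matching element, Elements.text() combines the text of all of them, and first() returns null when nothing matches. Here is a minimal sketch that runs without a network connection. The markup in SAMPLE_HTML and the names LatestArticleExample and latestHeading are made up for illustration and are not part of the tutorial sources.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LatestArticleExample {
    // Hypothetical markup standing in for a fetched page (assumption, not the real site).
    static final String SAMPLE_HTML =
            "<html><head><title>Sample blog</title></head><body>"
            + "<h2><a href=\"/first\">First article</a></h2>"
            + "<h2><a href=\"/second\">Second article</a></h2>"
            + "</body></html>";

    /** Returns the text of the first "h2 > a" link, or null if none matches. */
    static String latestHeading(final String html) {
        Document doc = Jsoup.parse(html);
        // first() yields null on an empty result, so check before using it
        Element link = doc.select("h2 > a").first();
        return link == null ? null : link.text();
    }

    public static void main(final String[] args) {
        System.out.println("Latest article: " + latestHeading(SAMPLE_HTML));
        System.out.println("Without headings: " + latestHeading("<p>no headings</p>"));
    }
}
```

Picking the first element explicitly keeps the result to a single article even if the page lists several headings.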

Parsing HTML Fragments

Sometimes we get a single fragment of HTML code from an API .. no problem with jsoup …

The HTML fragment parser:

package com.hascode.samples.jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FragmentParser {
    public static void main(final String[] args) {
        String htmlFragment = "<div class=\"breadcrumb\">";
        htmlFragment += "<ul><li><a href=\"/\">Home</a></li>";
        htmlFragment += "<li><a href=\"#cat1\">Category 1</a></li>";
        htmlFragment += "</ul></div>";
        // parseBodyFragment places the fragment inside the body of a new document
        Document doc = Jsoup.parseBodyFragment(htmlFragment);
        Element div = doc.body().select("div").first();
        Element a1 = div.select("ul a").first();
        Element a2 = div.select("ul a").get(1);
        System.out.println(String.format("The div has the class '%s'",
                div.attr("class")));
        System.out.println(String.format(
                "The first link in the breadcrumb has the text '%s' and links to '%s'.",
                a1.text(), a1.attr("href")));
        System.out.println(String.format(
                "The second link in the breadcrumb has the text '%s' and links to '%s'",
                a2.text(), a2.attr("href")));
    }
}

And the output it produces:

The div has the class 'breadcrumb'
The first link in the breadcrumb has the text 'Home' and links to '/'.
The second link in the breadcrumb has the text 'Category 1' and links to '#cat1'
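The href values printed above are relative ('/' and '#cat1'). If absolute URLs are needed, one option is to resolve them against the URL of the page the fragment came from. The following is a small sketch using only the JDK's java.net.URI; the base URL and the names LinkResolver and resolve are assumptions for illustration. jsoup can also do this itself via absUrl("href") when a base URI is passed to the parser.

```java
import java.net.URI;

class LinkResolver {
    /** Resolves an href value against the (assumed) URL of the page it was found on. */
    static String resolve(final String baseUrl, final String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }

    public static void main(final String[] args) {
        String base = "https://www.hascode.com/"; // assumed origin of the fragment
        System.out.println(resolve(base, "/"));     // https://www.hascode.com/
        System.out.println(resolve(base, "#cat1")); // https://www.hascode.com/#cat1
    }
}
```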

Tutorial Sources

I have put the sources from this tutorial in my Bitbucket repository. Download them there or check them out using Mercurial:

hg clone https://bitbucket.org/hascode/hascode-tutorials

Resources

Tags: crawler, dom, extraction, html, jsoup, maven, parser, screenscraping, tutorial

This entry was posted on Tuesday, August 30th, 2011 at 8:03 pm and is filed under Java.
