Screenscraping made easy using jsoup and Maven

August 30th, 2011 by

Sometimes in a developer’s life there is no clean API available to gather information from a web application: no SOAP, no XML-RPC and no REST, just a website hiding the information we’re looking for somewhere in its DOM hierarchy. In that case the only solution is screenscraping.

Screenscraping always leaves me with a bad feeling, but luckily there is a tool that makes this job at least a bit easier for a developer: jsoup to the rescue!

 

Prerequisites

Nothing special here, just a JDK and good ole’ Maven.

Creating a new Project

First we need a new Maven project …

  • Create a new Maven project using your IDE or on the console with mvn archetype:generate
  • We need just one dependency, jsoup. With it added, my pom.xml looks like this:
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>com.hascode.samples</groupId>
        <artifactId>jsoup-example</artifactId>
        <version>0.0.1-SNAPSHOT</version>
        <dependencies>
            <dependency>
                <groupId>org.jsoup</groupId>
                <artifactId>jsoup</artifactId>
                <version>1.6.1</version>
            </dependency>
        </dependencies>
    </project>

Screenscraping a Website

In the following example, we’re going to fetch the content of www.hascode.com and extract its title, the heading of the latest article and some of the available metadata …

  • That’s what my screenscraping class looks like
    package com.hascode.samples.jsoup;

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    public class WebScraper {
        public static void main(final String[] args) throws IOException {
            // fetch the page with a custom user agent and a timeout of 6000 ms
            Document doc = Jsoup.connect("https://www.hascode.com/")
                    .userAgent("Mozilla").timeout(6000).get();
            String title = doc.title(); // the page's title
            System.out.println("The title of www.hascode.com is: " + title);
            // the latest article's heading
            Elements heading = doc.select("h2 > a");
            System.out.println("The latest article is: " + heading.text());
            System.out.println("The article's URL is: " + heading.attr("href"));
            // the article's metadata (date and author)
            Elements editorial = doc.select("div.BlockContent-body small");
            System.out.println("The article was created: " + editorial.text());
        }
    }
  • Running the class, we see the following output
    The title of www.hascode.com is: hasCode.com
    The latest article is: Contract-First Web-Services using JAX-WS, JAX-B, Maven and Eclipse
    The article's URL is: https://www.hascode.com/2011/08/contract-first-web-services-using-jax-ws-jax-b-maven-and-eclipse/
    The article was created: August 23rd, 2011 by micha kops
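
A selector such as "h2 > a" can match more than one element, and select returns all of the matches. The following is only a minimal sketch of how the matches could be walked one by one or narrowed down to the first one. The sample markup, the class name HeadingLister and both helper methods are made up for illustration and are not part of the tutorial sources; the sketch parses a string instead of fetching a live page, so it runs without network access.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class HeadingLister {
    // Hypothetical markup standing in for a fetched page (an assumption, not real site content).
    static final String SAMPLE_HTML =
            "<html><head><title>Sample blog</title></head><body>"
            + "<h2><a href=\"/post-2\">Second post</a></h2>"
            + "<h2><a href=\"/post-1\">First post</a></h2>"
            + "</body></html>";

    // Returns one "text -> href" line per link that matches the selector.
    static List<String> describeHeadings(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("h2 > a");
        List<String> lines = new ArrayList<String>();
        for (Element link : links) {
            lines.add(link.text() + " -> " + link.attr("href"));
        }
        return lines;
    }

    // Returns the text of the first matching link, or null if nothing matches.
    static String latestHeading(String html) {
        Element first = Jsoup.parse(html).select("h2 > a").first();
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        for (String line : describeHeadings(SAMPLE_HTML)) {
            System.out.println(line);
        }
        System.out.println("Latest: " + latestHeading(SAMPLE_HTML));
    }
}
```

Checking the result of first() for null keeps the sketch from failing with a NullPointerException when the page layout changes and the selector no longer matches anything.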

Parsing HTML Fragments

Sometimes we get just a single fragment of HTML code from an API. No problem with jsoup …

  • The HTML fragment parser
    package com.hascode.samples.jsoup;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class FragmentParser {
        public static void main(final String[] args) {
            String htmlFragment = "<div class=\"breadcrumb\">";
            htmlFragment += "<ul><li><a href=\"/\">Home</a></li>";
            htmlFragment += "<li><a href=\"#cat1\">Category 1</a></li>";
            htmlFragment += "</ul></div>";
            // parse the fragment into a document so that it can be queried
            Document doc = Jsoup.parseBodyFragment(htmlFragment);
            Element div = doc.body().select("div").first();
            Element a1 = div.select("ul a").first(); // first link in the list
            Element a2 = div.select("ul a").get(1); // second link in the list
            System.out.println(String.format("The div has the class '%s'.",
                    div.attr("class")));
            System.out.println(String.format(
                    "The first link in the breadcrumb has the text '%s' and links to '%s'.",
                    a1.text(), a1.attr("href")));
            System.out.println(String.format(
                    "The second link in the breadcrumb has the text '%s' and links to '%s'.",
                    a2.text(), a2.attr("href")));
        }
    }
  • And the output it produces
    The div has the class 'breadcrumb'.
    The first link in the breadcrumb has the text 'Home' and links to '/'.
    The second link in the breadcrumb has the text 'Category 1' and links to '#cat1'.
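
The links in this fragment are relative ("/" and "#cat1"). If absolute URLs are needed, one option is to pass a base URI when parsing the fragment and read the links with absUrl. This is only a minimal sketch: the base URI https://www.example.com/shop/ is a made-up assumption, and the class name AbsoluteLinks and its helper method are hypothetical and not part of the tutorial sources.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsoluteLinks {
    // Hypothetical base URI; in practice this would be the address the fragment came from.
    static final String BASE_URI = "https://www.example.com/shop/";

    // The same breadcrumb fragment as in the tutorial example.
    static final String FRAGMENT = "<div class=\"breadcrumb\"><ul>"
            + "<li><a href=\"/\">Home</a></li>"
            + "<li><a href=\"#cat1\">Category 1</a></li>"
            + "</ul></div>";

    // Parses the fragment against the base URI and returns every link as an absolute URL.
    static List<String> absoluteLinks(String htmlFragment, String baseUri) {
        Document doc = Jsoup.parseBodyFragment(htmlFragment, baseUri);
        List<String> urls = new ArrayList<String>();
        for (Element a : doc.select("a[href]")) {
            urls.add(a.absUrl("href")); // resolved against the base URI
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : absoluteLinks(FRAGMENT, BASE_URI)) {
            System.out.println(url);
        }
    }
}
```

Without a base URI, as in the tutorial example, attr("href") still returns the value exactly as it is written in the markup.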

Tutorial Sources

I have put the source from this tutorial on my Bitbucket repository – download it there or check it out using Mercurial:

hg clone https://bitbucket.org/hascode/hascode-tutorials

2 Responses to “Screenscraping made easy using jsoup and Maven”

  1. mahesh Says:

    can i take a screenshot of url i given using jsoup

  2. rushang Says:

    nice
