How to Do Web Scraping in Java

Java isn’t the first language most engineers think of for scraping, but it’s a solid choice: mature HTTP clients, robust HTML parsers, excellent concurrency, and enterprise-grade tooling. With Java 17+ you also get records, text blocks, and the built-in java.net.http client on top of a mature library ecosystem, which makes crawling, parsing, and exporting data straightforward, even for large jobs.

Picking the Right Tool for the Job

For static HTML, Jsoup is the workhorse: it fetches pages and exposes a jQuery-like selector API for parsing. When you need headless browsing because content is rendered by JavaScript, Selenium WebDriver (with Chrome/Firefox in headless mode) steps in. For high-volume crawling with retries, backoff, and connection pools, Apache HttpClient or OkHttp give you low-level control. And for JSON APIs that sit behind the page, Jackson cleanly maps responses into POJOs.

Fast Path: Static Pages with Jsoup

Add the dependency (Maven):

<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.17.2</version>
</dependency>

Fetch and parse:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
  public static void main(String[] args) throws Exception {
    Document doc = Jsoup
        .connect("https://example.com/products?page=1")
        .userAgent("Mozilla/5.0 (compatible; JavaScraper/1.0)")
        .timeout(10_000)
        .get();

    for (Element card : doc.select(".product-card")) {
      String title = card.select(".title").text();
      String price = card.select(".price").text();
      String url   = card.select("a").attr("abs:href");
      System.out.printf("%s | %s | %s%n", title, price, url);
    }
  }
}

Two details matter: set a realistic User-Agent and use abs:href so relative links are resolved automatically. Pagination is simply another request with the next page URL. Throttle requests (e.g., Thread.sleep(500)) to avoid hammering a site.
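
As a rough sketch of the pagination-plus-throttling idea, the loop below requests page 1, 2, 3 and so on until a page comes back empty or a cap is reached, pausing between requests. `Paginator` and `PageFetcher` are hypothetical names introduced for illustration; the fetcher stands in for whatever Jsoup call retrieves and parses one page.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class Paginator {
  /** Hypothetical callback: fetches and parses one page, returning its items. */
  interface PageFetcher<T> {
    List<T> fetch(int page) throws IOException;
  }

  /**
   * Fetches pages 1..maxPages, stopping early at the first empty page,
   * and sleeps delayMillis between consecutive requests.
   */
  static <T> List<T> crawl(PageFetcher<T> fetcher, int maxPages, long delayMillis)
      throws IOException, InterruptedException {
    List<T> all = new ArrayList<>();
    for (int page = 1; page <= maxPages; page++) {
      List<T> items = fetcher.fetch(page);
      if (items.isEmpty()) break;                        // no more results
      all.addAll(items);
      if (page < maxPages) Thread.sleep(delayMillis);    // throttle before the next request
    }
    return all;
  }
}
```

With Jsoup, the fetcher could be a lambda that builds the URL for the given page number, calls connect(...).get(), and returns the selected elements. The hard cap on pages is a safety net for sites that never return an empty page.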

When HTML Isn’t There Yet: Selenium for Dynamic Pages

If the DOM is empty until JavaScript runs, use headless Chrome with Selenium.

Maven:

<dependency>
  <groupId>org.seleniumhq.selenium</groupId>
  <artifactId>selenium-java</artifactId>
  <version>4.23.0</version>
</dependency>

Basic headless scrape:

import org.openqa.selenium.*;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.WebDriverWait;
import org.openqa.selenium.support.ui.ExpectedConditions;
import java.time.Duration;

public class SeleniumExample {
  public static void main(String[] args) {
    ChromeOptions options = new ChromeOptions();
    options.addArguments("--headless=new", "--disable-gpu", "--window-size=1366,768");
    WebDriver driver = new ChromeDriver(options);

    try {
      driver.get("https://example.com/infinite-scroll");
      new WebDriverWait(driver, Duration.ofSeconds(10))
          .until(ExpectedConditions.presenceOfElementLocated(By.cssSelector(".item")));

      // Scroll to load more
      ((JavascriptExecutor) driver).executeScript("window.scrollTo(0, document.body.scrollHeight)");
      Thread.sleep(1500);

      for (WebElement el : driver.findElements(By.cssSelector(".item"))) {
        System.out.println(el.getText());
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // restore the interrupt flag rather than swallowing it
    } finally {
      driver.quit();
    }
  }
}

Selenium is heavier and slower than Jsoup; treat it as a scalpel for JS-heavy pages, not your default.
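
The example above scrolls once and waits a fixed 1.5 seconds. For infinite-scroll pages, one possible refinement is to keep scrolling until the item count stops growing. The sketch below keeps Selenium out of the picture so the loop can be read (and tested) on its own; `ScrollLoader` is a hypothetical helper, and the two callbacks are assumptions standing in for the executeScript scroll call and a findElements(...).size() count.

```java
import java.util.function.IntSupplier;

final class ScrollLoader {
  /**
   * Runs scrollStep repeatedly until countItems stops growing or maxScrolls
   * is reached. Returns the last item count observed.
   */
  static int scrollUntilStable(Runnable scrollStep, IntSupplier countItems,
                               int maxScrolls, long settleMillis)
      throws InterruptedException {
    int previous = countItems.getAsInt();
    for (int i = 0; i < maxScrolls; i++) {
      scrollStep.run();
      Thread.sleep(settleMillis);          // give lazy-loaded content time to arrive
      int current = countItems.getAsInt();
      if (current <= previous) break;      // nothing new appeared: assume the end
      previous = current;
    }
    return previous;
  }
}
```

A slow network can make the count look stable before new items arrive, so the settle time and scroll cap need tuning per site.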

Robust Networking: Apache HttpClient + Jackson for APIs

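Many “web scraping” tasks turn out to be API scraping once you inspect the browser’s Network tab: the page loads its data from a JSON endpoint, and calling that endpoint directly is faster and cleaner than parsing HTML.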

Dependencies:

<dependency>
  <groupId>org.apache.httpcomponents.client5</groupId>
  <artifactId>httpclient5</artifactId>
  <version>5.3.1</version>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>2.17.1</version>
</dependency>

Request + parse JSON:

import org.apache.hc.client5.http.classic.methods.HttpGet;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;

// Ignore JSON fields the record does not declare instead of failing on them.
@JsonIgnoreProperties(ignoreUnknown = true)
record Product(String id, String title, double price) {}

public class ApiScrape {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    try (CloseableHttpClient http = HttpClients.createDefault()) {
      HttpGet get = new HttpGet("https://example.com/api/products?page=1");
      get.addHeader("User-Agent", "JavaScraper/1.0");
      // The response-handler form closes the response and releases the connection.
      Product[] items = http.execute(get, resp -> {
        if (resp.getCode() != 200) {
          throw new IOException("Unexpected status: " + resp.getCode());
        }
        return mapper.readValue(resp.getEntity().getContent(), Product[].class);
      });
      for (Product p : items) {
        System.out.printf("%s | %s | %.2f%n", p.id(), p.title(), p.price());
      }
    }
  }
}

HttpClient also gives you configurable retries, timeouts, cookie stores, and proxy support, all of which help with resilience and compliance.

Concurrency, Rate Limits, and Backoff

Scrapers fail more often because of impolite concurrency than because of parsing. Use a bounded pool and exponential backoff:

import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.*;

public class CrawlPool {
  // Bounded pool: 4 core threads, growing to 8 only once the 200-slot queue is full.
  // When both are full, submit() throws RejectedExecutionException.
  private static final ExecutorService pool =
      new ThreadPoolExecutor(4, 8, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>(200));

  // Retries on I/O errors (SocketTimeoutException is a subclass of IOException),
  // up to 5 attempts, multiplying the delay by 1.7 each time and capping it at 8 s.
  static <T> T retry(Callable<T> task) throws Exception {
    int attempts = 0;
    long delay = 500;
    while (true) {
      try { return task.call(); }
      catch (IOException e) {
        if (++attempts >= 5) throw e;
        Thread.sleep(delay);
        delay = Math.min((long)(delay * 1.7), Duration.ofSeconds(8).toMillis());
      }
    }
  }
}
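
The pool above bounds the total number of threads, not the number hitting any one host. One way to cap per-host concurrency, sketched here with a semaphore per host name, is shown below. `HostLimiter` is a hypothetical helper, not part of any library mentioned in this article.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

final class HostLimiter {
  private final int permitsPerHost;
  private final ConcurrentHashMap<String, Semaphore> byHost = new ConcurrentHashMap<>();

  HostLimiter(int permitsPerHost) {
    this.permitsPerHost = permitsPerHost;
  }

  /** Runs task while holding one of the host's permits; blocks while the host is saturated. */
  <T> T call(String host, Callable<T> task) throws Exception {
    Semaphore permits = byHost.computeIfAbsent(host, h -> new Semaphore(permitsPerHost));
    permits.acquire();
    try {
      return task.call();
    } finally {
      permits.release();   // always hand the permit back, even on failure
    }
  }

  /** Permits currently free for a host (all of them if the host has not been seen yet). */
  int available(String host) {
    Semaphore s = byHost.get(host);
    return s == null ? permitsPerHost : s.availablePermits();
  }
}
```

Tasks submitted to the thread pool would wrap their request in limiter.call(host, ...), so a crawl spanning many sites can use all pool threads while each individual site sees only a few concurrent connections.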

Keep per-host concurrency low (4–8 threads), respect Retry-After headers, and randomise small delays to avoid patterns.
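
A minimal sketch of how Retry-After and randomised delays might fit together follows. It is a variation on the retry loop above, not the same algorithm: it doubles the delay instead of multiplying by 1.7, and the `Backoff` class and its constants are assumptions chosen for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

final class Backoff {
  static final long BASE_MILLIS = 500;
  static final long MAX_MILLIS = 8_000;

  /** Exponential ceiling for a 0-based attempt: 500, 1000, 2000, ... capped at 8000 ms. */
  static long ceilingMillis(int attempt) {
    int shift = Math.max(0, Math.min(attempt, 10));   // clamp to avoid overflow
    return Math.min(BASE_MILLIS << shift, MAX_MILLIS);
  }

  /**
   * Delay before the next retry. A numeric Retry-After value (seconds) takes
   * priority; otherwise a random delay in [ceiling/2, ceiling] is chosen.
   */
  static long nextDelayMillis(int attempt, String retryAfterHeader) {
    if (retryAfterHeader != null) {
      try {
        return Math.max(0, Long.parseLong(retryAfterHeader.trim())) * 1000;
      } catch (NumberFormatException e) {
        // Retry-After can also be an HTTP date; this sketch falls back to backoff.
      }
    }
    long ceiling = ceilingMillis(attempt);
    return ThreadLocalRandom.current().nextLong(ceiling / 2, ceiling + 1);
  }
}
```

Randomising within the upper half of the window keeps the delay growing with each attempt while preventing parallel workers from retrying in lockstep.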

Dealing with Anti-Bot Defences

Modern sites use rate limiting, JavaScript challenges, and behavioural detection. Simple mitigations include realistic headers, session reuse (cookies), and gradual navigation that mirrors a human. Headless browsers help render challenges, but always weigh terms of service and legal constraints. If a site disallows scraping via robots.txt or contract terms, don’t scrape it.
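
To make the robots.txt check concrete, here is a deliberately simplified sketch. It reads only the `User-agent: *` group and treats each Disallow value as a plain path prefix; Allow lines, wildcards, and agent-specific groups are ignored, so a production crawler would want a fuller parser. `RobotsRules` is a hypothetical class name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class RobotsRules {
  private final List<String> disallowed = new ArrayList<>();

  /** Parses the Disallow lines that apply to "User-agent: *" (simplified). */
  static RobotsRules parse(String robotsTxt) {
    RobotsRules rules = new RobotsRules();
    boolean inWildcardGroup = false;
    boolean lastWasAgent = false;
    for (String raw : robotsTxt.split("\\R")) {
      String line = raw.replaceFirst("#.*$", "").trim();   // strip comments
      int colon = line.indexOf(':');
      if (colon < 0) continue;                             // blank or malformed line
      String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
      String value = line.substring(colon + 1).trim();
      if (field.equals("user-agent")) {
        // Consecutive User-agent lines share one group of rules.
        if (!lastWasAgent) inWildcardGroup = false;
        inWildcardGroup |= value.equals("*");
        lastWasAgent = true;
      } else {
        lastWasAgent = false;
        if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
          rules.disallowed.add(value);
        }
      }
    }
    return rules;
  }

  /** True unless the path starts with one of the collected Disallow prefixes. */
  boolean isAllowed(String path) {
    for (String prefix : disallowed) {
      if (path.startsWith(prefix)) return false;
    }
    return true;
  }
}
```

Calling isAllowed before queuing each URL keeps the check in one place; contract terms still have to be reviewed by a person.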

Cleaning and Exporting Data

After extraction, normalise text (trim whitespace, decode entities), validate numbers/dates, and deduplicate by a stable key (product URL or ID). For output, CSV is simplest; for analytics pipelines, write Parquet via the Apache Parquet libraries (for example, the parquet-avro module).
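
The cleaning steps might look roughly like this using only the standard library. The `CleanExport` class and its `Row` record are hypothetical, and the price parser assumes a `.` decimal separator, which will not hold for every locale.

```java
import java.math.BigDecimal;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CleanExport {
  record Row(String url, String title, BigDecimal price) {}

  /** Collapses runs of whitespace (including non-breaking spaces) and trims the ends. */
  static String normalise(String s) {
    return s.replace('\u00A0', ' ').replaceAll("\\s+", " ").trim();
  }

  /** Parses prices such as "$1,299.00"; assumes '.' is the decimal separator. */
  static BigDecimal parsePrice(String s) {
    String digits = s.replaceAll("[^0-9.]", "");
    if (digits.isEmpty()) throw new IllegalArgumentException("No price in: " + s);
    return new BigDecimal(digits);
  }

  /** Keeps the first row seen for each URL, preserving input order. */
  static Collection<Row> dedupe(List<Row> rows) {
    Map<String, Row> byUrl = new LinkedHashMap<>();
    for (Row r : rows) byUrl.putIfAbsent(r.url(), r);
    return byUrl.values();
  }

  /** Quotes a field when it contains a comma, a double quote, or a line break. */
  static String csvField(String s) {
    if (s.contains(",") || s.contains("\"") || s.contains("\n") || s.contains("\r")) {
      return "\"" + s.replace("\"", "\"\"") + "\"";
    }
    return s;
  }

  /** Renders the rows as CSV with a header line. */
  static String toCsv(Collection<Row> rows) {
    StringBuilder out = new StringBuilder("url,title,price\n");
    for (Row r : rows) {
      out.append(csvField(r.url())).append(',')
         .append(csvField(r.title())).append(',')
         .append(r.price().toPlainString()).append('\n');
    }
    return out.toString();
  }
}
```

Using BigDecimal rather than double avoids rounding surprises when prices are later summed or compared.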

Compliance and Ethics

Check robots.txt, site policies, and local law. Identify your scraper with a contact email in the User-Agent, throttle requests, and avoid PII. Caching responses and using If-Modified-Since headers reduces load on target sites and keeps you a good citizen.
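
As an illustration of the conditional-request idea, the sketch below builds a GET with the JDK's built-in java.net.http API (rather than the Apache client used earlier) that carries an identifying User-Agent plus the validators saved from a previous response. `ConditionalGet` is a hypothetical helper, and ETag with If-None-Match is included as a companion to If-Modified-Since.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

final class ConditionalGet {
  /**
   * Builds a GET that lets the server answer 304 Not Modified when the cached
   * copy is still current. etag and lastModified are the ETag and Last-Modified
   * header values from the previous response; either may be null.
   */
  static HttpRequest build(URI uri, String userAgent, String etag, String lastModified) {
    HttpRequest.Builder builder = HttpRequest.newBuilder(uri)
        .timeout(Duration.ofSeconds(10))
        .header("User-Agent", userAgent)
        .GET();
    if (etag != null) builder.header("If-None-Match", etag);
    if (lastModified != null) builder.header("If-Modified-Since", lastModified);
    return builder.build();
  }
}
```

When the request is sent with java.net.http.HttpClient, a status code of 304 means the cached body can be reused without downloading or re-parsing the page.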

The Bottom Line

In Java, the happy path is: try the API first, fall back to Jsoup for static HTML, and reach for Selenium only when JavaScript blocks the way. Wrap it with polite concurrency, retries, and clear parsing, and you’ve got an industrial-strength scraper that’s fast, maintainable, and acceptable to run in production.

Author

Owen Savage