How to Browse a Website and Retrieve Data Using Rust




Web scraping is a popular technique for collecting large amounts of data from web pages quickly and efficiently. In the absence of an API, web scraping may be the next best approach.


Rust’s speed and memory safety make the language ideal for creating web scrapers. Rust hosts many powerful parsing and data extraction libraries, and its robust error-handling capabilities come in handy for efficient and reliable web data collection.


Web Scraping in Rust

Many popular libraries support web scraping in Rust, including reqwest, scraper, select, and html5ever. Most Rust developers combine reqwest and scraper features for their web scraping.

The reqwest library provides functions for sending HTTP requests to web servers. It is built on top of Rust's hyper crate while providing a high-level API for standard HTTP operations.
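As a quick illustration (using example.com as a stand-in URL, and relying on the blocking feature enabled in the setup below), a minimal reqwest call looks like this:

 use reqwest::blocking::Client;

fn main() -> Result<(), reqwest::Error> {
    // Build a client and send a blocking GET request
    let client = Client::new();
    let response = client.get("https://example.com").send()?;

    // Inspect the status line and a header before touching the body
    println!("Status: {}", response.status());

    if let Some(content_type) = response.headers().get("content-type") {
        println!("Content-Type: {:?}", content_type);
    }

    Ok(())
}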

Scraper is a powerful web scraping library that parses HTML documents and extracts data using CSS selectors.

After creating a new Rust project with the cargo new command, add the reqwest and scraper crates to the dependencies section of your Cargo.toml file:

 [dependencies]
reqwest = { version = "0.11", features = ["blocking"] }
scraper = "0.12.0"

You will use reqwest to send HTTP requests and scraper to parse the responses.

Retrieving Web Pages With reqwest

You'll send a request for the content of a web page before parsing it to retrieve specific data.

You can send a GET request and retrieve the HTML source of a page by calling the text method on the response returned by the get function of the reqwest library:

 use reqwest::blocking::get;

fn retrieve_html() -> String {
    get("https://news.ycombinator.com").unwrap().text().unwrap()
}

The get function sends the request to the web page, and the text method returns the HTML of the page as text.
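Calling unwrap will panic on any network or decoding failure. Here is a minimal sketch of the same fetch with explicit error handling, propagating reqwest::Error so the caller decides how to react:

 use reqwest::blocking::get;

fn retrieve_html() -> Result<String, reqwest::Error> {
    // The ? operator returns early with the error instead of panicking
    let response = get("https://news.ycombinator.com")?;
    response.text()
}

fn main() {
    match retrieve_html() {
        Ok(html) => println!("Fetched {} bytes of HTML", html.len()),
        Err(err) => eprintln!("Request failed: {}", err),
    }
}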

Parsing HTML With Scraper

The retrieve_html function returns the HTML as text; you need to parse that text to extract the specific data you need.

Scraper provides functionality for working with HTML through its Html and Selector types. Html provides functions for parsing the document, while Selector lets you select specific elements from the HTML.
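Here is a minimal sketch of how the two fit together, using a made-up HTML fragment rather than a live page:

 use scraper::{Html, Selector};

fn main() {
    // Parse a standalone fragment instead of a full document
    let fragment = Html::parse_fragment("<ul><li>First</li><li>Second</li></ul>");
    let selector = Selector::parse("li").unwrap();

    // select returns an iterator over the matching elements
    for element in fragment.select(&selector) {
        println!("{}", element.text().collect::<String>());
    }
}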

To get all titles on a page:

 use scraper::{Html, Selector};

fn main() {
    let response = reqwest::blocking::get(
        "https://news.ycombinator.com/").unwrap().text().unwrap();

    // Parse the HTML response into a document
    let doc_body = Html::parse_document(&response);

    // Select all elements with the titleline class
    let title_selector = Selector::parse(".titleline").unwrap();

    for title in doc_body.select(&title_selector) {
        let titles = title.text().collect::<Vec<_>>();
        println!("{}", titles[0]);
    }
}

The parse_document function of the Html type parses the HTML text, and the parse function of the Selector type selects the elements matching the specified CSS selector (in this case, the titleline class).

The for loop iterates over these elements and prints the first block of text from each one.

Here is the result of the operation:

[Screenshot: result of retrieving titles from a web page]
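If you want to reuse the titles rather than print them immediately, one option is to gather them into a Vec<String> first. This sketch does that with a collect_titles helper (a name introduced here for illustration):

 use scraper::{Html, Selector};

// Gather the full text of every .titleline element into a vector
fn collect_titles(doc: &Html) -> Vec<String> {
    let selector = Selector::parse(".titleline").unwrap();
    doc.select(&selector)
        .map(|element| element.text().collect::<String>())
        .collect()
}

fn main() {
    let response = reqwest::blocking::get("https://news.ycombinator.com/")
        .unwrap()
        .text()
        .unwrap();
    let doc = Html::parse_document(&response);

    for title in collect_titles(&doc) {
        println!("{}", title);
    }
}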

Selecting Attributes With Scraper

To select an attribute value, retrieve the required elements as before and call the attr method on the tag's value:

 use reqwest::blocking::get;
use scraper::{Html, Selector};

fn main() {
    let response = get("https://news.ycombinator.com").unwrap().text().unwrap();
    let html_doc = Html::parse_document(&response);
    let class_selector = Selector::parse(".titleline").unwrap();

    // Parse the link selector once, outside the loop
    let link_selector = Selector::parse("a").unwrap();

    for element in html_doc.select(&class_selector) {
        // Select the anchor tags inside each title element
        for link in element.select(&link_selector) {
            // attr returns an Option, so only print links that have an href
            if let Some(href) = link.value().attr("href") {
                println!("{}", href);
            }
        }
    }
}

After selecting the elements with the titleline class using the parse function, the for loop traverses them. Inside the loop, the code then selects each element's a tags and reads each one's href attribute with the attr function, printing it when present.

The main function prints these links, with a result like this:

[Screenshot: result of retrieving URLs from a web page]
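You can also combine the two techniques to pair each title with its link. This variation (not part of the original example) takes the first anchor inside each titleline element and prints the title alongside its URL:

 use reqwest::blocking::get;
use scraper::{Html, Selector};

fn main() {
    let response = get("https://news.ycombinator.com").unwrap().text().unwrap();
    let html_doc = Html::parse_document(&response);
    let class_selector = Selector::parse(".titleline").unwrap();
    let link_selector = Selector::parse("a").unwrap();

    for element in html_doc.select(&class_selector) {
        // Take the first anchor tag in each title element, if any
        if let Some(link) = element.select(&link_selector).next() {
            let title = link.text().collect::<String>();
            let href = link.value().attr("href").unwrap_or("(no href)");
            println!("{} -> {}", title, href);
        }
    }
}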

You Can Build Sophisticated Web Applications in Rust

Rust has increasingly become a language for web development, from the frontend to server-side application development.

You can use WebAssembly to build full-stack web applications with libraries like Yew and Percy, or build server-side applications with Actix, Rocket, and the many other libraries in the Rust ecosystem that provide functionality for building web apps.


