Racket Web Scraping

Hi, Racket Discourse.

I would like to ask about the current state of web-scraping libraries in Racket, and whether it would be better to just roll my own.

I am thinking about building some functionality to help my team gather IOCs (indicators of compromise) and metadata regarding current malware campaigns and the like. We're quite a small team, so nothing ostentatious, but it would be nice to be able to gather the data and process it into queries/formats that our SIEM and EDR systems can understand, for example.

I can see a couple of salient resources:

But I've never worked with SXML before, so I have no idea what to expect. I probably won't build a bona fide crawler, but perhaps something that can be run once when you come upon a site with some interesting information that you'd like to quickly scrape and process. Our EDR, SentinelOne, has a similar integration/plug-in called SentinelOne Hunter; however, I often find myself thinking that it's rather limited.

As I mentioned, this is still only a pipe dream, so any advice or commentary would be welcome.

Thanks!

1 Like

I'm not sure this is the best answer, but no one else has replied yet, so:

I've done some scraping. IIRC and FWIW: I used the html-parsing package, but not SXML; instead I used xexprs. After quickly discovering that it gets tedious and error-prone to "query" these using raw match, I rolled my own tiny system: a select function, plus various combinator functions like parent?, class?, tag?, etc., that could be combined with conjoin.

So I could write in a vaguely XPath style, e.g.

(select xexpr
        (parent? (conjoin (tag? 'div)
                          (class? "some-box")))
        (tag? 'a)
        (class? "some-class")
        (attr-val 'href))
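
Roughly, the leaf predicates were something along these lines (a sketch, not the exact code; it assumes plain xexpr-shaped elements like (tag ((attr "value") ...) child ...), and find-all here just stands in for the fancier select):

#lang racket/base

(require racket/function  ; conjoin
         racket/list)     ; append-map

;; (tag? 'a) => predicate that is true for elements whose tag is a
(define ((tag? name) elem)
  (and (pair? elem) (eq? name (car elem))))

;; (class? "some-class") => predicate that is true when the class attribute matches
(define ((class? cls) elem)
  (and (pair? elem)
       (pair? (cdr elem))
       (let ([attrs (cadr elem)])
         (and (list? attrs)
              (andmap pair? attrs)
              (let ([entry (assq 'class attrs)])
                (and entry (equal? cls (cadr entry))))))))

;; Walk an xexpr, collecting every element that satisfies pred.
(define (find-all pred elem)
  (if (pair? elem)
      (append (if (pred elem) (list elem) '())
              (append-map (λ (child) (find-all pred child)) (cdr elem)))
      '()))

;; e.g. (find-all (conjoin (tag? 'a) (class? "some-class")) some-xexpr)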

However. I was scraping mostly commercial web sites. So I learned the biggest challenge isn't parsing the HTML. Instead it's getting the HTML in the first place.

  • The default User-Agent won't cut it if you use Racket to fetch. (See the sketch after this list for one way to set your own.)

  • Even if you get that right, and respect robots.txt, many sites these days use third-party anti-bot products and block you. (If you ask the site nicely and have some apparent cred, they might give you a key to get around this... but maybe not.)
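
For example, with the net/http-easy package you can set your own User-Agent per request via #:headers. A minimal, untested sketch (the UA string is just a placeholder):

#lang racket/base

(require net/http-easy)

;; Fetch a page while sending a custom User-Agent header.
(define (fetch-html url)
  (response-body
   (get url
        #:headers (hasheq 'user-agent
                          "Mozilla/5.0 (compatible; my-scraper/0.1)"))))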

My impression is that "pro" scrapers these days drive a headless browser that is (hopefully) indistinguishable from a human visitor, along with various other tricks.

Or another solution is, just become as big as Google or Bing.

(Hooray for the open web, "information wants to be free". :wink:)

So when you say this:

That sounds like an easier plan. Maybe just focus on, "I already have the HTML bytes saved somehow from the browser", onward -- just the parsing/interpretation?
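
Starting from a saved file, the parsing step is then just something like this (a sketch; "page.html" is a placeholder, and html->xexp comes from the html-parsing package):

#lang racket/base

(require html-parsing)

;; Parse HTML that was already saved to disk from the browser.
(define page-xexp
  (call-with-input-file "page.html" html->xexp))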

p.s. If instead of commercial sites you're targeting malware sites, maybe some of the above is less of a problem; I don't know.

2 Likes

There isn't a “web scraping library” in Racket; rather, scraping is a straightforward application of other packages.

Try this in DrRacket (after installing the packages that provide each module in the require):

#lang racket/base

(require html-parsing
         net/http-easy
         threading
         txexpr)

(define (link? txpr)
  ;; true for any element whose tag is `a`
  (and (list? txpr) (eq? 'a (car txpr))))

(define (scrape-links url)
  ;; Fetch the page, parse it, and return a list of every <a> element
  ;; (findf*-txexpr returns #f if none are found).
  (~> (get url)
      response-body
      bytes->string/utf-8
      html->xexp
      (findf*-txexpr link?)))

I would recommend looking into the documentation for txexpr and http-easy in particular.

The above uses html->xexp from html-parsing in preference to string->xexpr because the latter will barf if the input isn’t pristine XML.

The threading module is simply a syntactic convenience.
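
If you then want just the URLs, note that html->xexp keeps attributes SXML-style under an (@ ...) node, so something like the following (a sketch; link-href and link-hrefs are just names I'm making up here) would pull the href values out of what scrape-links returns:

(require racket/list   ; filter-map -- add these to the require above
         racket/match)

;; The href of a single <a ...> node, or #f if it has none.
(define (link-href link)
  (match link
    [(list* 'a (list '@ attrs ...) _)
     (cond [(assq 'href attrs) => cadr]
           [else #f])]
    [_ #f]))

;; scrape-links yields #f when nothing matches, hence the (or ... '()).
(define (link-hrefs url)
  (filter-map link-href (or (scrape-links url) '())))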

2 Likes

Thanks for the reply, @greghendershott. I actually came upon one of your posts regarding scraping with Racket, and initially skimmed it because I am so unfamiliar with scraping in general.

But all of the limitations you mention are interesting. It seems like the kind of problem where you already have to know what you're looking for in order to find the solution. And as you say, being big doesn't hurt.

I like the idea of having something as simple as a select expression that can be used when the user is already visiting the site. My team isn't necessarily "programming literate" beyond certain scripting tasks and the like, although we are working towards that end. Thus, it would be nice to present an interface which is easy to understand, with just the "right" amount of magic in the background.

Not so many actual malware sites at present, but definitely reports and blogs and so on that describe investigations of breaches, malware campaigns, and that kind of thing. Often, people are kind enough to package their results in a format that can be easily ingested, but for the more elaborate reports, this can be challenging.

I mention this because my idea with the scraper is to paper over all of the copy-pasting that has to happen in this process, so that one can more easily keep a timeline while conducting research online.

In any case, I appreciate the input!

1 Like

Hi, @joeld! Wow, that was pretty cool. Your code snippet also provides a neat intuition for what this hypothetical select function might look like from the inside.

And the threading is particularly nifty. I just read about @lexi.lambda and threading in a forum somewhere earlier this week, regarding macros in Clojure vs. Racket, if I recall correctly.

Thumbs up for http-easy; I have been using it to make API requests because it was the easiest to understand. I will definitely have a closer look at the docs you mention.

Much obliged.

Do check out Qi for a more advanced take on threading. With judicious macros, you could probably swap conjoin for and in the select example.

I've seen in passing that there's quite a lot of activity around Qi; it looks very intriguing. Thanks, @benknoble.