Meetups/Infra/2025-01-20

From Noisebridge

Talked about scraping and tools for requesting and processing HTML web pages, especially from the command line: ddgr, pandoc, html2text, htmlq + hq. Caching command output with bkt (Rust) was also popular.


Introductions

* [name] - [background]. [goals for meetup, or interests to explore]
  • Loren - welcome to meetup -- want to talk about
  • Kevin -- like to code, like machine learning, have been working on spidering, scraping, fine-tuning LLMs, vector DBs, MCP. Also JS, TS, Java, golang, server side experience
  • Anup -- getting back into server side, interested in self-hosting.
  • Greg -- engineer & self-hosting. could talk about
  • Federico "fed" -- first time here. Python, Java, JavaScript, Rust. Late to the ML game, getting into it now. Some web3 apps.
  • Michael -- work at a hardware company that makes servers -- interest in self-hosting.
  • Sameer -- software developer, worked across much of the stack, getting back into infra after a long sojourn.
  • Jake -- do backend for work, video games on the side.


Lesson or Demo

https://meet.jit.si/nb-meetup-infra

  • federico finding coworking space -- last year --
  • michael -- found hackerspace, came once.
  • schema: personal interactive, personal non-interactive, project or business uses

Definition of terms:

  • scraping
  • just the crawling of a website
  • crawling & conversion
  • downloading & link extraction

There are many possible pipelines, involving some combination of downloading from URLs, extracting links (with deduplication & filtering), extracting content, and summarizing for later LLM use.

[url(s)] --> [download: webpage]

webpage --> extract links --> (deduplicate -->) (filter -->) urls to scrape

webpage --> content extraction

webpage --> llm use?
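The link-extraction branch of the pipeline above can be sketched roughly as follows. This is a minimal sketch using only the Python standard library, not something shown at the meetup. The names `LinkExtractor` and `urls_to_scrape` are hypothetical, and the "same host only" filter is just one assumed filtering rule among many possible ones.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def urls_to_scrape(page_url, html, seen=None):
    """webpage --> extract links --> deduplicate --> filter --> urls to scrape."""
    seen = set() if seen is None else seen
    parser = LinkExtractor()
    parser.feed(html)
    host = urlparse(page_url).netloc
    result = []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" so duplicates collapse.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # filter: skip mailto:, javascript:, etc.
        if parts.netloc != host:
            continue  # filter (assumed rule): stay on the same host
        if absolute in seen:
            continue  # deduplicate
        seen.add(absolute)
        result.append(absolute)
    return result
```

Passing the same `seen` set across calls keeps deduplication working over a whole crawl rather than a single page.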

Demos

  • Greg -- YouTube (yt) scraping -- "Pinchflat", a self-hosted channel export/backup tool. Written in Elixir.

Nix config to build a Docker container.

TrueNAS with ZFS.


  • Kevin

Has a project to release on GitHub later this week, for parsing data out of Common Crawl: Crawling the Web

Rotating proxies. $4-8/GB. Home connections, proxies.

    • keepdb -- amazon products, charge by product lookup
    • camelcamelcamel

lxml & beautifulsoup4 (bs4) -- "street html" parsers.
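To illustrate what "street HTML" parsing means in practice, here is a rough sketch of pulling visible text out of markup with unclosed and misnested tags. It deliberately uses the standard library's `html.parser` as a stand-in, since lxml and bs4 are third-party installs. The names `TextExtractor` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Gather visible text, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text content of html as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    return " ".join(parser.chunks)


# Unclosed <p> and <li>, misnested <b>/<i>: typical "street HTML".
messy = "<p>Unclosed paragraph<p>Another <b>bold<i>nested</b> text<script>var x = 1;</script><li>item"
print(extract_text(messy))
```

With bs4 installed, `BeautifulSoup(html, "html.parser").get_text()` covers similar ground, and bs4 can use lxml as its underlying parser.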


  • Sameer -- simonw's git-scraping pattern, to capture the dimension of temporal variation in data, for sites that just show current status.

https://simonwillison.net/2020/Oct/9/git-scraping/ e.g. https://github.com/simonw/ca-fires-history
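The core loop of git-scraping (fetch the current state, overwrite a tracked file, commit only when it changed) might look something like the sketch below. The linked post runs this as a scheduled workflow; this is the same idea in Python. It assumes `repo_dir` is an already-initialized git repository, and the function names and the default filename `snapshot.json` are hypothetical.

```python
import subprocess
from pathlib import Path
from urllib.request import urlopen


def write_if_changed(path, data):
    """Write data to path; return True only when the content differs."""
    path = Path(path)
    if path.exists() and path.read_bytes() == data:
        return False
    path.write_bytes(data)
    return True


def git_scrape(url, repo_dir, filename="snapshot.json"):
    """Fetch url into repo_dir/filename and commit when the content changed."""
    with urlopen(url, timeout=30) as response:
        data = response.read()
    if not write_if_changed(Path(repo_dir) / filename, data):
        return False  # nothing new, so no commit
    subprocess.run(["git", "add", filename], cwd=repo_dir, check=True)
    subprocess.run(
        ["git", "commit", "-m", f"Update {filename}"], cwd=repo_dir, check=True
    )
    return True
```

Run on a schedule (cron or a CI job), the commit history then becomes the time series: `git log -p` on the snapshot file shows how the data changed over time.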




    • Kevin


https://endoflife.date/ https://endoflife.date/recommendations

ssh terminal.shop

https://github.com/charmbracelet/bubbletea


bkt -- curl -s https://endoflife.date/kde-plasma | htmlq 'div[class=main-content]' -p | html2text | col -b
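The caching idea behind wrapping a command this way can be sketched in Python: key the cache on the command line and reuse stored output while it is fresh. This is not bkt's implementation, only a minimal illustration. The name `cached_run`, the cache location, and the 60-second default TTL are all assumptions.

```python
import hashlib
import json
import subprocess
import tempfile
import time
from pathlib import Path

# Hypothetical cache location, chosen only for this sketch.
CACHE_DIR = Path(tempfile.gettempdir()) / "cmd-cache-sketch"


def cached_run(argv, ttl_seconds=60.0, cache_dir=CACHE_DIR):
    """Return stdout of argv, reusing a cached copy younger than ttl_seconds."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # One cache entry per distinct command line.
    key = hashlib.sha256(json.dumps(argv).encode("utf-8")).hexdigest()
    entry = cache_dir / key
    if entry.exists() and time.time() - entry.stat().st_mtime < ttl_seconds:
        return entry.read_text(encoding="utf-8")
    # Cache miss or stale entry: run the command and store its output.
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    entry.write_text(result.stdout, encoding="utf-8")
    return result.stdout
```

For example, `cached_run(["curl", "-s", "https://endoflife.date/kde-plasma"])` would hit the network at most once a minute, which is kinder to the site while iterating on the rest of the pipeline.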


cht.sh domain. 


  • Jake -- adversarial scraping.

Takeaways

  • Read aloud: clarify for meetup. We are taking notes in a riseup pad (or I am--help appreciated, and links). We have meeting notes posted to the wiki. noisebridge.net, search Infra, or Meetups/Infra. (the Infrastructure page has a disambiguation link.)
  • Shell, web services, self-hosting, networking!

Questions, Discussion, or Coworking

  • [Issue]

For next time

Questions

Readings & Exercises

  • Readings
  • Exercises

Join online

  • Try it yourself!
    • Join libera.chat #nb-meetup-infra

https://www.noisebridge.net/wiki/Meetups/Infra