Meetups/Infra/2025-01-20
We talked about scraping and about tools for requesting and processing HTML web pages, especially from the command line: ddgr, pandoc, html2text, htmlq, and hq. Caching command output with bkt (written in Rust) was also popular.
Introductions
* [name] - [background]. [goals for meetup, or interests to explore]
- Loren - welcome to meetup -- want to talk about
- Kevin -- like to code, like machine learning, have been working on spidering, scraping, fine-tuning LLMs, vector DBs, MCP. Also JS, TS, Java, golang, server side experience
- Anup -- getting back into server-side work, interested in self-hosting.
- Greg -- engineer & self-hosting. could talk about
- Federico "fed" -- first time here. Python, Java, JavaScript, Rust. Late to the ML game, but in it now. Some web3 apps.
- Michael -- work at a hardware company that makes servers -- interest in self-hosting.
- Sameer -- software developer, worked across much of the stack, getting back into infra after a long sojourn.
- Jake -- do backend for work, video games on the side.
Lesson or Demo
https://meet.jit.si/nb-meetup-infra
- federico finding coworking space -- last year --
- michael -- found hackerspace, came once.
- schema: personal interactive, personal non-interactive, project or business uses
Definition of terms:
- scraping
- just the crawling of a website
- crawling & conversion
- downloading & link extraction
There are many possible pipelines, involving some combination of downloading from URLs, extracting links (with deduplication and filtering), extracting content, and summarizing for LLM use.
[url(s)] --> [download: webpage]
webpage --> extract links --> (deduplicate -->) (filter -->) urls to scrape
webpage --> content extraction
webpage --> llm use?
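As one possible reading of the "extract links, deduplicate, filter" branch, here is a minimal Python sketch using only the standard library. The names `LinkCollector` and `extract_links`, and the same-host filter, are illustrative assumptions and were not shown at the meetup. A more forgiving parser such as lxml or bs4 could replace `html.parser` for messy pages.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect raw href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url, same_host_only=True):
    """Return absolute http(s) URLs found in html, deduplicated in page order."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    seen = set()
    urls = []
    for href in collector.hrefs:
        # Resolve relative links and drop #fragments so duplicates collapse.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        # Filter step: skip mailto:, javascript:, and similar schemes.
        if parsed.scheme not in ("http", "https"):
            continue
        # Filter step (an assumption): optionally stay on the starting host.
        if same_host_only and parsed.netloc != base_host:
            continue
        # Deduplicate step.
        if absolute in seen:
            continue
        seen.add(absolute)
        urls.append(absolute)
    return urls
```

Calling `extract_links(page_html, "https://example.org/dir/index.html")` yields the list of URLs to scrape next. Downloading and content extraction are left out here, since those choices (curl, pandoc, html2text, and so on) vary by pipeline.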
Demos
- Greg -- yt scraping -- "Pinchflat" self-hosted Channel export-backup. Elixir.
nix config to build a docker container.
TrueNAS with ZFS.
- Kevin
Has a project to release on GitHub later this week, for parsing data out of Common Crawl: Crawling the Web
Rotating proxies. $4-8/GB. Home connections, proxies.
- keepdb -- amazon products, charge by product lookup
- camelcamelcamel
lxml & beautifulsoup4 (bs4) -- "street html" parsers.
- Sameer -- simonw's git-scraping pattern, to capture the dimension of temporal variation in data, for sites that just show current status.
https://simonwillison.net/2020/Oct/9/git-scraping/ e.g. https://github.com/simonw/ca-fires-history
- scraping, scraping infrastructure, scripting for it
- llms.txt (https://llmstxt.org/) (https://llmstxt.site/)
- Kevin
https://endoflife.date/
https://endoflife.date/recommendations
ssh terminal.shop
https://github.com/charmbracelet/bubbletea
bkt -- curl -s https://endoflife.date/kde-plasma | htmlq 'div[class=main-content]' -p | html2text | col -b
cht.sh domain.
- Jake -- adversarial scraping.
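The caching idea behind bkt in Kevin's pipeline above can be approximated in Python. This is a rough sketch of the concept only, not bkt's implementation: `cached_run`, its parameters, and the one-file-per-command cache layout are all hypothetical. The demo uses a stand-in command so it runs without network access.

```python
import hashlib
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def cached_run(argv, ttl_seconds, cache_dir):
    """Return stdout of argv, reusing a cached copy younger than ttl_seconds."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache entry on the exact argument vector.
    key = hashlib.sha256("\0".join(argv).encode("utf-8")).hexdigest()
    entry = cache_dir / key
    if entry.exists():
        age = time.time() - entry.stat().st_mtime
        # A TTL of 0 never matches, so the command always reruns.
        if 0 <= age < ttl_seconds:
            return entry.read_text(encoding="utf-8")
    # Cache miss or stale entry: run the command and store its output.
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    entry.write_text(result.stdout, encoding="utf-8")
    return result.stdout


if __name__ == "__main__":
    # Stand-in command; a real use might pass
    # ["curl", "-s", "https://endoflife.date/kde-plasma"] instead.
    cmd = [sys.executable, "-c", "print('hello')"]
    with tempfile.TemporaryDirectory() as tmp:
        print(cached_run(cmd, 3600, tmp), end="")
```

Caching the download step this way makes it cheap to iterate on the later stages (selecting, converting to text) without refetching the page each time.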
Takeaways
- Read aloud: clarify for meetup. We are taking notes in a riseup pad (or I am--help appreciated, and links). We have meeting notes posted to the wiki. noisebridge.net, search Infra, or Meetups/Infra. (the Infrastructure page has a disambiguation link.)
- Shell, web services, self-hosting, networking!
Questions, Discussion, or Coworking
- [Issue]
For next time
Questions
Readings & Exercises
- Readings
- Exercises
Join online
- Try it yourself!
- Join libera.chat #nb-meetup-infra