Meetups/Infra/2024-03-11

From Noisebridge

Regulars and some newcomers took time to talk in detail about database systems, in particular Change Data Capture (CDC) systems, which are often used to export data from transactional to analytical databases. We also discussed some perils of high-volume email sending, the limits of text as a way of learning about the world, and plans for possible future presentations, both during introductions and at the end of the meeting.

Introductions

  • [name] - [background]. [goals for meetup, or interests to explore]. [what present]
  • Gull - interest in laser cutting, 3d printing. SW infra, training. portfolio graphics, art, 3d print. Adobe creative suite, InDesign. Fusion 360.
  • Doug - lots of little projects, still trying to sell retail AI. Panama City and Lima tix again - Peru's largest direct-to-consumer company. Selling csv file. Present: pitching robocoworker.com (send spam).
  • Loren - infra. have conversations about infra weekly. Learn what's difficult to learn, what's relevant. Have these conversations weekly. Could present: cli tools demo - fzf, pipes. Or .... Or most likely
  • Emmanuel - Software dev, started with ML c. 2017, once part of Rust org on GitHub, now web3 contrib to Rust Async WG
  • Matt - Loren and I both sysadmins at student computer lab at UC Berkeley. Sys admin, now work at startup, not quite. Here for the vibes. Present: self-host something using docker. Have sites in it.
  • Jordan - software engineer, work on databases & storage. No personal projects recently, just working a lot. Would present: how to stream information from a database. CDC: read-once, read-many. Distributed Postgres shards: streaming, but no globally coherent view of the data. Citus (shard yourself), Cockroach, YugaByte.
  • Ben, Benjamin (more sophisticated, preferred) - just moved here from Boston, bouncing around hostels, working on smart contract allowing exchange between any two tokens.


Lesson or Demo

  • re: Jordan - issue streaming from Amazon: it will kill the connection after 30s.

Why aren't all databases streaming? Stateless? Well, for log-structured storage (e.g. LSM), a transaction might not commit for an hour; it can take a long time to reach committed status, so you have to scan ahead in the log. Debezium (OSS) provides standardized layers, with connectors to backends. CDC (change data capture) is often used with Kafka, e.g. to stream into a data warehouse. There is a really good Uber blog post on CDC, about building caches of their database. MySQL exposes its changes through the binlog.
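As a rough illustration of the CDC idea above, the sketch below replays an ordered stream of change events into a downstream copy such as a cache or a warehouse table. The event shape (`lsn`, `op`, `key`, `row`) and the name `apply_changes` are hypothetical, chosen for this example; they are not Debezium's or any other tool's actual format.

```python
from typing import Any, Dict, Iterable, Optional, TypedDict


class ChangeEvent(TypedDict):
    """Hypothetical change event; real CDC tools define their own formats."""
    lsn: int                       # position of the change in the source log
    op: str                        # "insert", "update", or "delete"
    key: str                       # primary key of the affected row
    row: Optional[Dict[str, Any]]  # new row contents (None for deletes)


def apply_changes(replica: Dict[str, Dict[str, Any]],
                  events: Iterable[ChangeEvent],
                  last_applied_lsn: int = 0) -> int:
    """Apply events in log order, skipping any that were already applied."""
    for event in sorted(events, key=lambda e: e["lsn"]):
        if event["lsn"] <= last_applied_lsn:
            continue  # already seen, e.g. after a dropped connection and retry
        if event["op"] == "delete":
            replica.pop(event["key"], None)
        else:
            # Inserts and updates are both treated as upserts.
            replica[event["key"]] = dict(event["row"] or {})
        last_applied_lsn = event["lsn"]
    return last_applied_lsn


replica: Dict[str, Dict[str, Any]] = {}
events = [
    {"lsn": 1, "op": "insert", "key": "u1", "row": {"name": "Ada"}},
    {"lsn": 2, "op": "update", "key": "u1", "row": {"name": "Ada L."}},
    {"lsn": 3, "op": "insert", "key": "u2", "row": {"name": "Bob"}},
    {"lsn": 4, "op": "delete", "key": "u2", "row": None},
]
checkpoint = apply_changes(replica, events)
```

Keeping the last applied LSN as a checkpoint is one way to make replay safe when a connection is dropped and the stream is re-read from an earlier point.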

Postgres internally does CDC for its read-only replicas. Transaction-log-based databases: in our database, transactions can commit out of order. Even if the LSN (log sequence number) is higher, you have to check what's actually committed. You want to see the log as of a given LSN; when using 2PC (2-phase commit), the commit message might come 2GB later, a long scan later. The goal is to be able to observe at any point in the log.
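To make the "as of LSN" point concrete, here is a minimal sketch, assuming a toy log of `(lsn, transaction id, kind)` records ordered by LSN. The record layout and the function names are hypothetical and do not reflect any particular database's log format. A write only becomes visible once its transaction's commit record has been seen, which may be far later in the log than the write itself.

```python
from typing import List, Set, Tuple

# Hypothetical log record: (lsn, transaction id, kind), where kind is
# "write", "prepare", "commit", or "abort".
LogRecord = Tuple[int, str, str]


def committed_as_of(log: List[LogRecord], as_of_lsn: int) -> Set[str]:
    """Return ids of transactions whose commit record is at or before as_of_lsn."""
    committed: Set[str] = set()
    for lsn, txn, kind in log:
        if lsn > as_of_lsn:
            break  # assumes the log is ordered by LSN
        if kind == "commit":
            committed.add(txn)
    return committed


def visible_writes(log: List[LogRecord], as_of_lsn: int) -> List[LogRecord]:
    """Return write records that belong to transactions committed as of as_of_lsn."""
    committed = committed_as_of(log, as_of_lsn)
    return [
        record for record in log
        if record[0] <= as_of_lsn and record[2] == "write" and record[1] in committed
    ]


# t1 writes first but commits last, so the transactions commit out of order.
log: List[LogRecord] = [
    (1, "t1", "write"),
    (2, "t2", "write"),
    (3, "t2", "commit"),
    (4, "t1", "prepare"),
    (5, "t1", "commit"),
]
view_at_3 = visible_writes(log, as_of_lsn=3)
view_at_5 = visible_writes(log, as_of_lsn=5)
```

In this toy log, the view as of LSN 3 contains only the write from `t2`, even though `t1` wrote at a lower LSN; the `t1` write appears only once the scan reaches its commit at LSN 5.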

  • This is in the service of ensuring Atomicity in databases (A of ACID), using 2 Phase Commit (2PC).
    • Coordinator and acceptor roles. C->A: Write, A->C: Prepare(d), C->A: Commit. In the log, you'll have Prepares, followed by an abort or a commit.
    • Consensus (e.g. Raft) solves safely having new coordinator nodes.
    • CAP: Consistency, Availability, Partition-tolerance; pick 2, must include P.
    • 2PC is how transactions proceed across multiple shards.
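The message flow in the notes above (Write, Prepared, Commit) can be sketched as a toy in-process simulation. This is a minimal sketch under strong assumptions: the `Acceptor` class, the `can_prepare` flag, and `two_phase_commit` are hypothetical names for this example, the "log" is an in-memory list rather than durable storage, and coordinator failure, timeouts, and recovery are ignored entirely.

```python
from typing import Dict, List


class Acceptor:
    """Toy participant (one shard). `can_prepare` simulates a local failure."""

    def __init__(self, name: str, can_prepare: bool = True) -> None:
        self.name = name
        self.can_prepare = can_prepare
        self.log: List[str] = []          # stand-in for a durable log
        self.staged: Dict[str, str] = {}  # writes not yet committed
        self.data: Dict[str, str] = {}    # committed state

    def write(self, key: str, value: str) -> bool:
        """Phase 1: stage the write and vote. True means 'prepared'."""
        if not self.can_prepare:
            self.log.append("abort")
            return False
        self.staged[key] = value
        self.log.append("prepare")
        return True

    def commit(self) -> None:
        """Phase 2 (success): make the staged writes visible."""
        self.data.update(self.staged)
        self.staged.clear()
        self.log.append("commit")

    def abort(self) -> None:
        """Phase 2 (failure): discard the staged writes."""
        self.staged.clear()
        if self.log[-1:] != ["abort"]:
            self.log.append("abort")


def two_phase_commit(acceptors: List[Acceptor], key: str, value: str) -> bool:
    """Coordinator: commit only if every acceptor reports 'prepared'."""
    votes = [acceptor.write(key, value) for acceptor in acceptors]
    if all(votes):
        for acceptor in acceptors:
            acceptor.commit()
        return True
    for acceptor in acceptors:
        acceptor.abort()
    return False


shard_a, shard_b = Acceptor("a"), Acceptor("b")
ok = two_phase_commit([shard_a, shard_b], "k", "v")

shard_c, shard_d = Acceptor("c"), Acceptor("d", can_prepare=False)
failed = two_phase_commit([shard_c, shard_d], "k", "v")
```

In the failing run, `shard_c` logs a prepare followed by an abort and never exposes the write, which is the atomicity property the protocol is meant to provide across shards.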


  • little failure stories from introduction
  • email - medium severity alert
  • Jordan, gumdrop - 2k emails in 2 minutes.
  • 11k in 2hours.
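For scale, 2k emails in 2 minutes is about 1,000 per minute, and 11k in 2 hours is roughly 92 per minute. One way to avoid bursts like these is to put a throttle in front of the sender. The sketch below is a simple sliding-window limiter; `SendThrottle`, its parameters, and the limits used in the example are hypothetical and are not taken from any provider's documented policy.

```python
from collections import deque
from typing import Deque


class SendThrottle:
    """Allow at most `max_sends` sends within any `window_seconds` window."""

    def __init__(self, max_sends: int, window_seconds: float) -> None:
        self.max_sends = max_sends
        self.window_seconds = window_seconds
        self._sent_at: Deque[float] = deque()  # timestamps of recent sends

    def try_send(self, now: float) -> bool:
        """Record a send at time `now` if allowed; return whether it was allowed."""
        # Drop timestamps that have aged out of the window.
        while self._sent_at and now - self._sent_at[0] >= self.window_seconds:
            self._sent_at.popleft()
        if len(self._sent_at) >= self.max_sends:
            return False  # over the cap: the caller should queue or drop
        self._sent_at.append(now)
        return True


# Tiny illustrative limits: 3 sends per 60 seconds.
throttle = SendThrottle(max_sends=3, window_seconds=60.0)
results = [throttle.try_send(t) for t in (0.0, 1.0, 2.0, 3.0, 60.0, 60.5)]

# A daily cap would use the same class with a larger window, for example
# SendThrottle(max_sends=10_000, window_seconds=24 * 3600).
```

Timestamps are passed in explicitly so the behavior is easy to test; a real sender would likely pass `time.monotonic()` and decide separately whether rejected messages are queued or dropped.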


  • topics today
    • email -
      • don't try to send more than 10k emails a day (Amazon disables your account, Microsoft sends angry messages). Google Mail: limit of 2k, then they clearly warn you.
      • AT&T, Comcast - complaints early. SimpleOptOutCompliance.com -
      • MailChimp 100k/mo $1k
    • rust async ? or dev tool from Emmanuel?
    • Yann LeCun -

[Language is low bandwidth: less than 12 bytes/second. A person can read 270 words/minutes, or 4.5 words/second, which is 12 bytes/s (assuming 2 bytes per token and 0.75 words per token). A modern LLM is typically trained with 1x10^13 two-byte tokens, which is 2x10^13 bytes. This would take about 100,000 years for a person to read (at 12 hours a day).

Vision is much higher bandwidth: about 20MB/s. Each of the two optical nerves has 1 million nerve fibers, each carrying about 10 bytes per second. A 4 year-old child has been awake a total 16,000 hours, which translates into 1x10^15 bytes.

In other words:
- The data bandwidth of visual perception is roughly 16 million times higher than the data bandwidth of written (or spoken) language.
- In a mere 4 years, a child has seen 50 times more data than the biggest LLMs trained on all the text publicly available on the internet.

This tells us three things:
1. Yes, text is redundant, and visual signals in the optical nerves are even more redundant (despite being 100x compressed versions of the photoreceptor outputs in the retina). But redundancy in data is *precisely* what we need for Self-Supervised Learning to capture the structure of the data. The more redundancy, the better for SSL.
2. Most of human knowledge (and almost all of animal knowledge) comes from our sensory experience of the physical world. Language is the icing on the cake. We need the cake to support the icing.
3. There is *absolutely no way in hell* we will ever reach human-level AI without getting machines to learn from high-bandwidth sensory inputs, such as vision.] Yann LeCun
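The arithmetic in the quote can be re-run directly from the figures it states. The sketch below only recomputes those numbers; the variable names are ours, and a 365-day year is assumed. Most results land close to the quoted values, but the bandwidth ratio computed from 20 MB/s and 12 bytes/s comes out near 1.7 million rather than the quoted 16 million.

```python
# Figures as stated in the quoted post.
WORDS_PER_MINUTE = 270
WORDS_PER_TOKEN = 0.75
BYTES_PER_TOKEN = 2
LLM_TRAINING_TOKENS = 1e13
READING_HOURS_PER_DAY = 12
OPTIC_NERVES = 2
FIBERS_PER_NERVE = 1_000_000
BYTES_PER_FIBER_PER_SECOND = 10
CHILD_HOURS_AWAKE = 16_000

# Reading: words/s -> tokens/s -> bytes/s.
reading_bytes_per_s = WORDS_PER_MINUTE / 60 / WORDS_PER_TOKEN * BYTES_PER_TOKEN

# Time for a person to read an LLM-sized training set (365-day year assumed).
llm_bytes = LLM_TRAINING_TOKENS * BYTES_PER_TOKEN
reading_seconds_per_year = READING_HOURS_PER_DAY * 3600 * 365
years_to_read = llm_bytes / reading_bytes_per_s / reading_seconds_per_year

# Vision: both optic nerves combined, then total over a child's waking hours.
vision_bytes_per_s = OPTIC_NERVES * FIBERS_PER_NERVE * BYTES_PER_FIBER_PER_SECOND
child_bytes = CHILD_HOURS_AWAKE * 3600 * vision_bytes_per_s

# The two comparisons made in the quote.
bandwidth_ratio = vision_bytes_per_s / reading_bytes_per_s
child_vs_llm = child_bytes / llm_bytes

print(f"reading: {reading_bytes_per_s:.1f} bytes/s")
print(f"years to read the training set: {years_to_read:,.0f}")
print(f"vision: {vision_bytes_per_s / 1e6:.0f} MB/s")
print(f"child visual data: {child_bytes:.2e} bytes")
print(f"vision/reading bandwidth ratio: {bandwidth_ratio:,.0f}")
print(f"child data vs LLM data: {child_vs_llm:.1f}x")
```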



New models to learn about, in the last few months? Tokenize UPCs? Mamba, SSMs. Sora - ensemble of many. See the Stable Diffusion paper for a fully out-in-the-open discussion.


Presentations:

   - nb infra
      - (door - gate)
   - 


  • Shell, web services, self-hosting, networking!

Questions, Discussion, or Coworking

  • [Issue]

For next time

Questions

Readings & Exercises

  • Readings
  • Exercises

Join online

  • Try it yourself!
    • Join libera.chat #nb-meetup-infra

https://www.noisebridge.net/wiki/Meetups/Infra