distributed-systems

DEV Community

I have used job queues my whole developer life without knowing what was inside them. So I built one. Not a wrapper around an existing queue. A full implementation from scratch with Redis, PostgreSQL, goroutines, and real failure handling. Here is everything I learned. Why Dual Storage: Most job queues use one store. Redis is fast. PostgreSQL is durable. I wanted both. Redis handles dispatch via a …

computer-science, distributed-systems
DEV Community
Richard Shade
2d ago

Years ago at RightScale I learned more about distributed systems from broken log files than from any design doc. A request came in the front door, fanned out through a workflow service, hit a plugin, and the plugin called some cloud API. When it failed, the only way to find where was to line up the logs of every service it passed through. So we threaded a trace ID from the frontend all the way to…

computer-science, distributed-systems
Frontiers in Artificial Intelligence | New and Recent Articles

The expansion of Internet of Things (IoT) and Internet of Medical Things (IoMT) infrastructures has increased the generation of multivariate sensor streams that reflect complex operational behaviors in industrial and clinical environments. Centralized anomaly detection approaches face limitations in IoMT due to privacy constraints, latency, and device heterogeneity. Federated learning (FL) enable…

ai, computer-science, distributed-systems, machine-learning
DEV Community

Every non-trivial business operation touches more than one system. An e-commerce order reserves inventory, charges a payment method, and schedules a shipment — three services, three databases. A bank transfer debits one account and credits another across two ledgers that may not even be in the same data center. A cloud VM provisioning workflow reserves a network port, allocates storage, starts th…

computer-science, distributed-systems
DEV Community

Your backend Pod just crashed. Kubernetes created a new Pod with a completely different IP address. Yet your application didn't notice anything changed. How? Because applications don't talk directly to Pods. They talk to Kubernetes Services. A Service provides a stable virtual IP and DNS name, while kube-proxy quietly programs the networking rules that route traffic to the right Pods. In this pos…

computer-science, distributed-systems, software-engineering
DEV Community

Not every failure is permanent. This is something I didn't think about before. When something failed in my app, my first thought was: something broke, fix it. But when I started learning how distributed systems actually work, I realized that some failures are not really failures. They're just temporary. Network glitch. API timeout. A service that just restarted. Rate limiting kicking in. These are …

computer-science, distributed-systems
Semiconductor Engineering

Researchers from University of Wisconsin-Madison and AMD Research and Advanced Development published a technical paper titled “Eidola: Modeling Multi-GPU Network Communication Traffic in Distributed AI Workloads.” Abstract: “As distributed AI workloads grow in scale, multi-GPU systems have become essential for training large models. Although techniques like kernel fusion and overlapping communica…

ai, computer-science, distributed-systems, machine-learning
DEV Community
Stéphane Derosiaux
6d ago

At Current 2026, I realized that nobody knows exactly what a Kafka proxy can do. Most engineers and architects think it's just some kind of reverse proxy for Kafka (think nginx) that handles routing and bridges legacy or non-native clients to the cluster. That's not it. It's barely the start of it. Encryption: For instance, an engineer at a UK building society had a hard requirement: encrypt pers…

computer-science, distributed-systems
DEV Community

When most engineers think about rolling out a modern streaming or web application, they visualize a standard cloud-native environment: a global CDN, elastic load balancers, and a continuous pipeline pushing updates to infinite resources. But what happens when your deployment target is an isolated, battery-powered hardware device flying inside a metal tube at 35,000 feet? At AirFi, operating a ne…

computer-science, distributed-systems
DEV Community

Observability isn't just a buzzword; it's a necessity, especially when diving into distributed systems. If you're using NestJS, you might want to take a look at RedisX. It's a modular toolkit that can boost the observability of your applications. A standout feature? The Metrics Plugin. It meshes well with Prometheus, delivering insights into Redis operations in your NestJS setup. Getting RedisX M…

computer-science, distributed-systems, software-engineering
DEV Community

The 7 People Who Control the Internet Clock – A Deep‑Dive Companion to The Pattern Episode. Welcome back, fellow engineers and curious minds. I’m The Systems Analyst, and after you’ve listened to the latest The Pattern episode, I wanted to give you a tangible, on‑the‑ground look at the invisible heartbeat that keeps everything from your phone’s alarm to high‑frequency trading platforms humming in…

computer-science, distributed-systems
Capgemini

As distributed cloud adoption accelerates, many organizations find themselves stuck between experimentation and scale. The post "Insights from the field: Lessons from real world distributed Cloud deployments" appeared first on Capgemini.

cloud-computing, computer-science, distributed-systems, technology
DEV Community

If you run vLLM, Triton, or any other inference server on Kubernetes, you have probably noticed that the HPA cannot see the GPU. Autoscaling decisions are driven by CPU and memory, while the resource that actually determines inference capacity remains invisible. A CNCF blog post published in May 2026 describes how to fix this by building a KEDA external scaler. The problem with default autoscalin…

cloud-computing, computer-science, distributed-systems
DEV Community
Sai Chakradhar Rao Mahendrakar
14d ago

Consensus Protocols in Distributed Systems: A Complete Learning Guide, Intermediate to Advanced. Table of Contents: Foundation (What Is Consensus and Why Is It Hard?); Core Concepts & Terminology; Paxos (The Classic Protocol); Raft (The Understandable Protocol); Byzantine Fault Tolerance (BFT); Other Notable Protocols; Real-World Systems; Scalability & Performance Considerations; Trade-Off Analysis; Design …

computer-science, distributed-systems
DEV Community

Squirix 0.1.0 is an early preview of a .NET distributed cache. A typed client SDK talks to a remote server over gRPC; the server owns state, routing, durability, and operational endpoints. This is the direction I am validating in 0.1.0 — not a claim that every cache must work this way. Embedded designs are fine for many workloads. Squirix targets a different shape: the application stays a client;…

computer-science, distributed-systems
DEV Community

Your typical clock synchronization protocol like NTP provides a timestamp, but it can't guarantee that event A truly happened before event B if they occurred on different machines. Spanner's TrueTime solves this by providing time as an interval, not a point, ensuring global serializability even across continents. When your distributed system relies on timestamps from different servers, you're bui…

computer-science, distributed-systems
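
The TrueTime teaser above turns on one idea: if each timestamp is an interval rather than a point, two events can be ordered with certainty only when their intervals do not overlap. The following is a minimal sketch of that comparison, not Spanner's API. The names `TimeInterval`, `interval_at`, `definitely_before`, and `order` are hypothetical, and the fixed uncertainty bound in milliseconds is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    """Uncertainty interval: the true time lies in [earliest, latest] (ms)."""
    earliest: int
    latest: int


def interval_at(local_clock_ms: int, uncertainty_ms: int) -> TimeInterval:
    """Wrap a local clock reading in an assumed symmetric error bound."""
    return TimeInterval(local_clock_ms - uncertainty_ms,
                        local_clock_ms + uncertainty_ms)


def definitely_before(a: TimeInterval, b: TimeInterval) -> bool:
    """True only if a ended before b could possibly have started."""
    return a.latest < b.earliest


def order(a: TimeInterval, b: TimeInterval) -> str:
    """Classify two events; overlapping intervals cannot be ordered."""
    if definitely_before(a, b):
        return "a before b"
    if definitely_before(b, a):
        return "b before a"
    return "uncertain"


if __name__ == "__main__":
    a = interval_at(100_000, 4)   # [99_996, 100_004]
    b = interval_at(100_005, 4)   # [100_001, 100_009], overlaps a
    c = interval_at(100_020, 4)   # [100_016, 100_024], clear of a
    print(order(a, b))
    print(order(a, c))
```

With point timestamps, `a` and `b` would look ordered even though their error bounds overlap; the interval form makes that uncertainty explicit so a system can wait it out or refuse to order the events.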
DEV Community

Refresher: I'm building a distributed chunked filestore in Go, and I set up a post for Part 1 here. That part dealt with uploading a file; this post is about downloads. Setup Requirements: (1) the user hits our endpoint with the filename/fileid; (2) we use this fileid to get a list of chunks; (3) our retrieve mechanism depends only on this list of chunks; (4) we want to be able to retrieve the associated chunks in par…

computer-science, distributed-systems, programming-languages
DEV Community

Building a Scalable Edge: A Practical Guide to Real-Time Geo-Distributed Data Ingestion for Global IoT. Edge computing is not just about pushing logic to the far end; it’s about orchestrating a cohesive flow where data is ingested, processed, and acted upon with millisecond latency, while preservin…

computer-science, distributed-systems
PhilPapers: Recent additions to PhilArchive

_Cross-Cloud Systems Measurement Report_. 2021. Public cloud providers expose similar high-level resources but differ in processor generations, storage paths, network locality, virtualization overhead, accelerator availability, and pricing rules. These differences make direct comparison difficult for teams that operate analytics, web services, and machine-learning pipelines across providers. This p…

computer-science, distributed-systems
PhilPapers: Recent additions to PhilArchive

_Autonomic Distributed Systems Governance Bulletin_. 2021. Large distributed systems are now operated through layers of schedulers, container controllers, service meshes, monitoring pipelines, and human runbooks. These mechanisms improve scale, but they also create governance problems: local controllers can fight one another, remediation rules may violate service-level or compliance constraints, an…

computer-science, distributed-systems