distributed-systems

DEV Community

I built a distributed job queue in Go to understand how they actually work

Uthman Oladele

1d ago

I have used job queues my whole developer life without knowing what was inside them. So I built one. Not a wrapper around an existing queue. A full implementation from scratch with Redis, PostgreSQL, goroutines, and real failure handling. Here is everything I learned. Why Dual Storage: Most job queues use one store. Redis is fast. PostgreSQL is durable. I wanted both. Redis handles dispatch via a …

computer-science, distributed-systems

DEV Community

agentic experience for Go

Richard Shade

2d ago

Years ago at RightScale I learned more about distributed systems from broken log files than from any design doc. A request came in the front door, fanned out through a workflow service, hit a plugin, and the plugin called some cloud API. When it failed, the only way to find where was to line up the logs of every service it passed through. So we threaded a trace ID from the frontend all the way to…

computer-science, distributed-systems

Frontiers in Artificial Intelligence | New and Recent Articles

Structural impact of non-IID heterogeneity on federated behavioral anomaly detection in IoT and IoMT systems

William Villegas-Ch

4d ago

The expansion of Internet of Things (IoT) and Internet of Medical Things (IoMT) infrastructures has increased the generation of multivariate sensor streams that reflect complex operational behaviors in industrial and clinical environments. Centralized anomaly detection approaches face limitations in IoMT due to privacy constraints, latency, and device heterogeneity. Federated learning (FL) enable…

ai, computer-science, distributed-systems, machine-learning

DEV Community

Saga Orchestration in Go: Distributed Workflows That Actually Roll Back

telegrapher

4d ago

Every non-trivial business operation touches more than one system. An e-commerce order reserves inventory, charges a payment method, and schedules a shipment — three services, three databases. A bank transfer debits one account and credits another across two ledgers that may not even be in the same data center. A cloud VM provisioning workflow reserves a network port, allocates storage, starts th…

computer-science, distributed-systems

DEV Community

How Does Traffic Actually Reach Your Pods? Kubernetes Services & kube-proxy Explained

Sreekanth Kuruba

5d ago

Your backend Pod just crashed. Kubernetes created a new Pod with a completely different IP address. Yet your application didn't notice anything changed. How? Because applications don't talk directly to Pods. They talk to Kubernetes Services. A Service provides a stable virtual IP and DNS name, while kube-proxy quietly programs the networking rules that route traffic to the right Pods. In this pos…

computer-science, distributed-systems, software-engineering

DEV Community

Retry in Distributed Systems — How Production Systems Recover From Temporary Failures

Neel-Vekariya

6d ago

Not every failure is permanent. This is something I didn't think about before. When something failed in my app, my first thought was: something broke, fix it. But when I started learning how distributed systems actually work, I realized that some failures are not really failures. They're just temporary. Network glitch. API timeout. A service that just restarted. Rate limiting kicking in. These are …

computer-science, distributed-systems

Semiconductor Engineering

Modeling Multi-GPU Traffic For Distributed AI Workloads (UW Madison, AMD)

Technical Paper Link

6d ago

Researchers from University of Wisconsin-Madison and AMD Research and Advanced Development published a technical paper titled “Eidola: Modeling Multi-GPU Network Communication Traffic in Distributed AI Workloads.” Abstract: “As distributed AI workloads grow in scale, multi-GPU systems have become essential for training large models. Although techniques like kernel fusion and overlapping communica…

ai, computer-science, distributed-systems, machine-learning

DEV Community

You can do WHAT with a Kafka proxy?

Stéphane Derosiaux

6d ago

At Current 2026, I realized that nobody knows exactly what a Kafka proxy can do. Most engineers and architects think it's just some kind of reverse proxy for Kafka (think nginx) that handles routing and bridges a legacy or non-native client to the cluster. That's not it. It's barely the start of it. Encryption: For instance, an engineer at a UK building society had a hard requirement: encrypt pers…

computer-science, distributed-systems

DEV Community

The Disconnected Edge: How We Solved In-Flight Data Sync at 35,000 Feet

Shubham

8d ago

When most engineers think about rolling out a modern streaming or web application, they visualize a standard cloud-native environment: a global CDN, elastic load balancers, and a continuous pipeline pushing updates to infinite resources. But what happens when your deployment target is an isolated, battery-powered hardware device flying inside a metal tube at 35,000 feet? At AirFi, operating a ne…

computer-science, distributed-systems

DEV Community

Boosting Observability in NestJS with RedisX Metrics

Suren Krmoian

8d ago

Observability isn't just a buzzword; it's a necessity, especially when diving into distributed systems. If you're using NestJS, you might want to take a look at RedisX. It's a modular toolkit that can boost the observability of your applications. A standout feature? The Metrics Plugin. It meshes well with Prometheus, delivering insights into Redis operations in your NestJS setup. Getting RedisX M…

computer-science, distributed-systems, software-engineering

DEV Community

The 7 People Who Control The Internet Clock

Sam Chen

9d ago

The 7 People Who Control the Internet Clock – A Deep‑Dive Companion to The Pattern Episode. Welcome back, fellow engineers and curious minds. I’m The Systems Analyst, and after you’ve listened to the latest The Pattern episode, I wanted to give you a tangible, on‑the‑ground look at the invisible heartbeat that keeps everything from your phone’s alarm to high‑frequency trading platforms humming in…

computer-science, distributed-systems

Capgemini

Insights from the field: Lessons from real world distributed Cloud deployments

sharmisthanaskar

11d ago

As distributed cloud adoption accelerates, many organizations find themselves stuck between experimentation and scale.

cloud-computing, computer-science, distributed-systems, technology

DEV Community

GPU autoscaling on Kubernetes with KEDA: building an external scaler with NVML

Bruno Santos

14d ago

If you run vLLM, Triton, or any other inference server on Kubernetes, you have probably noticed that the HPA cannot see the GPU. Autoscaling decisions are driven by CPU and memory, while the resource that actually determines inference capacity remains invisible. A CNCF blog post published in May 2026 describes how to fix this by building a KEDA external scaler. The problem with default autoscalin…

cloud-computing, computer-science, distributed-systems

DEV Community

Consensus Protocols in Distributed Systems

Sai Chakradhar Rao Mahendrakar

14d ago

Consensus Protocols in Distributed Systems: A Complete Learning Guide, Intermediate to Advanced. Table of Contents: Foundation (What is Consensus and Why Is It Hard?); Core Concepts & Terminology; Paxos (The Classic Protocol); Raft (The Understandable Protocol); Byzantine Fault Tolerance (BFT); Other Notable Protocols; Real-World Systems; Scalability & Performance Considerations; Trade-Off Analysis; Design …

computer-science, distributed-systems

DEV Community

Why Squirix uses a strict client/server architecture for a .NET distributed cache

Alex E

15d ago

Squirix 0.1.0 is an early preview of a .NET distributed cache. A typed client SDK talks to a remote server over gRPC; the server owns state, routing, durability, and operational endpoints. This is the direction I am validating in 0.1.0 — not a claim that every cache must work this way. Embedded designs are fine for many workloads. Squirix targets a different shape: the application stays a client;…

computer-science, distributed-systems

DEV Community

TrueTime: Bounding Clock Uncertainty

rishabh pahwa

15d ago

Your typical clock synchronization protocol like NTP provides a timestamp, but it can't guarantee that event A truly happened before event B if they occurred on different machines. Spanner's TrueTime solves this by providing time as an interval, not a point, ensuring global serializability even across continents. When your distributed system relies on timestamps from different servers, you're bui…

computer-science, distributed-systems

DEV Community

Learning, Experimenting - Concurrency in Go - Part 2

Manish

15d ago

Refresher: I'm building a distributed chunked filestore in Go, and I set up a post for Part 1 here. That part dealt with uploading a file; this post is about downloads. Setup requirements: the user hits our endpoint with the filename/fileid; we use this fileid to get a list of chunks; our retrieve mechanism only depends on this list of chunks; we want to be able to retrieve the associated chunks in par…

computer-science, distributed-systems, programming-languages

DEV Community

Building a Scalable Edge: A Practical Guide to Real-Time Geo-Distributed Data Ingestion for Global I

Rizwan Saleem

19d ago

Building a Scalable Edge: A Practical Guide to Real-Time Geo-Distributed Data Ingestion for Global IoT. Edge computing is not just about pushing logic to the far end; it’s about orchestrating a cohesive flow where data is ingested, processed, and acted upon with millisecond latency, while preservin…

computer-science, distributed-systems

PhilPapers: Recent additions to PhilArchive

Bitla, Narender ; Deshpande, Akshay ; Dulam, Murali Shankar & Saha, Sumit: Cross-Cloud Performance Benchmarking and Optimization

19d ago

_Cross-Cloud Systems Measurement Report_. 2021. Public cloud providers expose similar high-level resources but differ in processor generations, storage paths, network locality, virtualization overhead, accelerator availability, and pricing rules. These differences make direct comparison difficult for teams that operate analytics, web services, and machine-learning pipelines across providers. This p…

computer-science, distributed-systems

PhilPapers: Recent additions to PhilArchive

Annamali Sekar, Mythili ; Dulam, Murali Shankar ; Mazumder, Abhirup & Kannan, Kabilan: Governing Distributed Systems with Intelligent Agents

19d ago

_Autonomic Distributed Systems Governance Bulletin_. 2021. Large distributed systems are now operated through layers of schedulers, container controllers, service meshes, monitoring pipelines, and human runbooks. These mechanisms improve scale, but they also create governance problems: local controllers can fight one another, remediation rules may violate service-level or compliance constraints, an…

computer-science, distributed-systems
