Open-source 7MB autonomous driving model that learns visual navigation, lane following, and drift recovery for edge deployment on lightweight hardware. Demonstrates practical real-time inference optimization for complex perception tasks without cloud infrastructure, valuable for understanding model compression and embedded AI systems.
Tomesphere is a free research paper discovery platform indexing 3M arXiv/OpenAlex papers with AI-generated TLDRs, peer reviews, GitHub repos, HuggingFace models, and semantic similarity search using SPECTER2 embeddings in pgvector. The semantic graph approach enables discovery of topically related papers beyond citation networks, with a Chrome extension for arXiv integration and multiple ranking modes (influential, recent, hidden gems, nearest neighbors).
WAVE is a portable GPU kernel abstraction layer that compiles to a unified binary compatible with Metal, PTX, HIP, and SYCL across Apple, NVIDIA, and AMD hardware. This solves a critical pain point for AI engineers building cross-platform systems—write kernels once and deploy identically across diverse GPU architectures with verified PyTorch integration.
Practical guide covering multiple inference frameworks (Transformers, llama-cpp-python, vLLM, SGLang, Ollama, etc.) for running a 27B quantized Qwen model. Includes GGUF quantization options and benchmark comparisons showing minimal accuracy degradation, useful for engineers optimizing local model deployment.
Guide for using a fine-tuned Qwen 3.5-35B variant (with reduced content restrictions) across multiple inference frameworks including Transformers, vLLM, and SGLang, with MMLU benchmark results (83.72% accuracy) and multiple quantization options available. Practical for engineers looking to deploy modified open-source models with different inference backends.
Aiki is a lightweight local tool for querying Wikipedia with custom TF-IDF retrieval and optional LLM answer generation. It demonstrates practical RAG implementation with minimal dependencies, featuring query expansion via Wikipedia links and flexible article selection—useful reference for building local knowledge systems.
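As a rough illustration of the TF-IDF retrieval step mentioned above, here is a minimal stdlib-only sketch. It is not Aiki's code: the function names, the smoothed IDF formula, and the cosine ranking are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Return (per-document TF-IDF vectors, IDF table) for a list of strings."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))
    n = len(docs)
    # Smoothed IDF (an assumption; other variants are equally valid).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, docs, vectors, idf, k=3):
    """Return up to k (document, score) pairs with a nonzero score."""
    q_tf = Counter(t for t in tokenize(query) if t in idf)
    total = sum(q_tf.values())
    q_vec = {t: (c / total) * idf[t] for t, c in q_tf.items()} if total else {}
    scored = sorted(((cosine(q_vec, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [(docs[i], score) for score, i in scored[:k] if score > 0]
```

In a retrieval-augmented setup, the top-ranked passages returned by `search` would then be placed in the prompt of the optional LLM step.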
Novel implementation of DCGAN inference on a resource-constrained RISC-V microcontroller (CH32H417) with 512KB shared SRAM, using int8 quantization, SD card weight streaming with double buffering, and a custom C inference engine that achieves bit-identical PyTorch outputs. Demonstrates practical techniques for embedded generative models on non-ARM architectures where ecosystem tools like CMSIS-NN don't exist, with creative integration of quantum entropy for latent vector seeding.
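For readers unfamiliar with int8 quantization, the sketch below shows one common variant: symmetric, per-tensor scaling. The summary does not say which scheme the project uses, so the scheme and the function names here are assumptions for illustration only.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from int8 codes and the shared scale."""
    return [q * scale for q in quantized]
```

Each weight then occupies one byte instead of four, which is what makes streaming weights from slow storage into a small SRAM buffer practical.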
Spice is an open-source decision layer framework that sits above execution agents to make agent decision-making explicit and interpretable. It captures what was observed, options considered, reasoning for selection, trade-offs rejected, and execution outcomes—addressing a key gap where agents excel at execution but lack transparent decision-making processes. The project is early-stage but functional, installable, and designed to work with existing agents like Claude Code and other tools.
Discussion of FWHT (Fast Walsh-Hadamard Transform) CUDA kernel implementation for quantized KV-cache in LLM inference, with performance benchmarks across different model architectures and head sizes. Shows practical optimization work for inference speed-ups when using q8_0 quantization on different GPU architectures (RTX 5090, CDNA).
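For reference, the transform itself can be sketched in a few lines. This is a plain CPU version of the unnormalized fast Walsh-Hadamard transform, not the CUDA kernel under discussion; the function name is illustrative.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform of a power-of-two-length sequence."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly stage: combine elements that are h apart within blocks of 2h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out
```

The transform takes O(n log n) additions and is its own inverse up to a factor of n, so applying it twice returns the input scaled by the length.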
MiniCPM5-1B is a new 1B-class open-source model achieving SOTA in its weight class with built-in hybrid reasoning modes, designed for on-device deployment and resource-constrained scenarios. The release includes deployment guides for Transformers, vLLM, and SGLang, plus fine-tuning resources and newly released training datasets (Ultra-FineWeb, UltraData-Math, UltraData-SFT).
Production-tested solution for enforcing tool-call constraints in LangGraph agents using a YAML-based contract layer that validates rules deterministically before execution. Addresses a critical failure mode in which prompt engineering and post-hoc auditing fail to prevent compliance violations; the approach is open-sourced as Sponsio for community feedback.
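The idea of checking a tool call against declarative rules before it runs can be sketched as follows. This is not Sponsio's API or rule schema: the rule keys (`requires_prior`, `max_arg`, `forbidden`), the tool names, and `check_call` are hypothetical, and the contract is written as the Python dict a YAML loader might produce.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


# Hypothetical contract; in practice this would be parsed from a YAML file.
CONTRACT = {
    "rules": [
        {"tool": "send_refund", "requires_prior": "verify_identity"},
        {"tool": "send_refund", "max_arg": {"amount": 500}},
        {"tool": "delete_account", "forbidden": True},
    ]
}


def check_call(call, history, contract):
    """Return a list of violations; an empty list means the call may proceed."""
    violations = []
    seen = {c.name for c in history}
    for rule in contract["rules"]:
        if rule["tool"] != call.name:
            continue
        if rule.get("forbidden"):
            violations.append(f"{call.name} is forbidden")
        prior = rule.get("requires_prior")
        if prior and prior not in seen:
            violations.append(f"{call.name} requires a prior {prior} call")
        for arg, limit in rule.get("max_arg", {}).items():
            if call.args.get(arg, 0) > limit:
                violations.append(f"{arg} exceeds limit {limit}")
    return violations
```

Because the check is plain code over structured data, the same call and history always produce the same verdict, which is the property that prompt-level instructions cannot guarantee.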
MergeNB is a VS Code extension that improves Jupyter Notebook merging for collaborative workflows, addressing pain points with existing tools like nbdime. The tool features a web UI and plans to expand as a git mergetool, offering practical improvements for teams managing notebook-based research and development.
Thermocompute is a PyTorch library that emulates thermodynamic probabilistic computing, offering stochastic neural layers (p-bits, samplers, generative models) designed to exploit parallel hardware where inference time remains constant as layer width increases. The key technical insight is that on GPUs with available parallel capacity, thermodynamic layers can achieve flat wall-clock time scaling with width, potentially outperforming classical dense FFNs for certain workloads.
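To make the p-bit idea concrete, here is a minimal stdlib sketch of the usual textbook formulation, in which a p-bit outputs +1 or -1 with a mean equal to tanh of its input. It is an assumption that the library follows this formulation, and `pbit_sample` is an illustrative name, not part of Thermocompute.

```python
import math
import random


def pbit_sample(inputs, rng):
    """Sample one +1/-1 value per input so that the expected value is tanh(input)."""
    # P(+1) = (1 + tanh(I)) / 2, obtained by comparing against uniform noise.
    return [1 if math.tanh(i) > rng.uniform(-1.0, 1.0) else -1 for i in inputs]
```

Each unit samples independently of the others, which is why a layer of such units can in principle be evaluated in parallel regardless of width.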
A Go developer created a pure Go CUDA binding library (gocudrv) that eliminates cgo dependencies by loading libcuda.so at runtime using purego, enabling cross-compilation and smaller Docker images for ML workloads. The implementation uses OS thread locking to handle CUDA's per-thread context model via goroutine channels, with early support for memory allocation, kernel launches, and GPU event timing.
Papers with Code has been revived with new features for tracking AI SOTA across domains, including multi-metric leaderboards, paper lineage tracking, method taxonomy, and ~3k model evaluations. The platform now supports external paper submissions (non-arXiv) with auto-enrichment via AI, making it a useful reference tool for staying current with model releases and benchmarks.
Deep dive into WordDetectorNN, a handwritten word detection model that uses per-pixel distance regression to bounding boxes instead of anchor-based detection, followed by DBSCAN clustering with an IoU-based distance metric. The architecture uses a ResNet18 + FPN decoder with 6-channel pixel-level outputs, offering no-tuning detection at the cost of an O(n²) clustering bottleneck and non-differentiable post-processing.
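The clustering step depends on a pairwise distance between candidate boxes. A minimal sketch follows, assuming the distance is defined as 1 - IoU and boxes are `(x0, y0, x1, y1)` tuples; both are assumptions, since the summary does not give the exact definition.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_distance_matrix(boxes):
    """Pairwise 1 - IoU distances; the nested loop is the O(n^2) cost."""
    n = len(boxes)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = 1.0 - iou(boxes[i], boxes[j])
    return dist
```

A matrix like this can be handed to a DBSCAN implementation that accepts precomputed distances, so boxes predicted by pixels of the same word end up in one cluster.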
AgentLantern is an open-source devtool that provides visibility into AI agent project structure and execution, addressing the debugging and observability challenges in multi-agent systems. It offers three components: static documentation generation, linting for design issues, and a runtime viewer for observing agent behavior—currently supporting CrewAI with plans for broader framework support.
A software engineer describes a novel Hebbian learning architecture that achieves CIFAR-10 results without backpropagation, using only 5-7% of parameters through emergent sparse connectivity on a consumer GPU. The system exhibits interesting emergent behaviors including self-recovery after targeted neuron damage and performance jumps, suggesting biological plausibility might yield practical insights for efficient model design.
Spice is an open-source decision layer framework that sits above execution agents, providing context-aware task routing and decision-making through a perception → simulation → decision → execution → reflection loop. Rather than replacing agents like Claude or Codex, it adds orchestration capabilities including state modeling, option simulation, and outcome reflection to coordinate multi-agent workflows.
SM1 (Scalar Mamba1) implements a closed-form solution for state-space models with d_state=1 using pure PyTorch operations, eliminating the selective scan bottleneck and reducing memory by 16x compared to standard Mamba implementations. The author demonstrates practical benefits: training a 130M parameter model on MIDI data with minimal memory footprint (56KB state, no KV cache) on consumer hardware, highlighting that scalar state dimensions can be sufficient when token representations already encode structure.
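The summary does not give the formula, so the following is only an assumption about what a closed form for a scalar state could look like: for the recurrence h_t = a_t * h_{t-1} + b_t with h_0 = 0, the state equals P_t times the running sum of b_s / P_s, where P_t is the cumulative product of the a values. The sketch uses plain Python lists and assumes every a_t is positive; the function names are illustrative.

```python
import math
from itertools import accumulate


def scan_sequential(a, b):
    """Reference loop: h_t = a_t * h_{t-1} + b_t, starting from h = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def scan_closed_form(a, b):
    """Same result using only cumulative sums (assumes every a_t > 0)."""
    # log P_t = log(a_1) + ... + log(a_t), computed as a cumulative sum.
    log_p = list(accumulate(math.log(a_t) for a_t in a))
    # Running sum of b_s / P_s.
    scaled = list(accumulate(b_t * math.exp(-lp) for b_t, lp in zip(b, log_p)))
    # h_t = P_t * sum_{s <= t} b_s / P_s
    return [math.exp(lp) * s for lp, s in zip(log_p, scaled)]
```

With tensors, the two `accumulate` calls would correspond to cumulative-sum operations over the sequence dimension, which is how a loop-free version avoids a custom scan kernel. Note that dividing by P_s can overflow on long sequences with small a values, so a practical version would need chunking or a more careful log-space treatment.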