Johanes Alexander

Data & AI Architect

Experienced architect specializing in agentic AI systems and large-scale data platforms. Deep focus on multi-agent architectures, real-time data pipelines, and cloud-native solutions, combining data engineering foundations with cutting-edge AI to design systems that operate autonomously at scale.

🤖 Current Focus: AI & Agentic Systems

Data Agents Development

Creating intelligent agents for data processing and analysis that combine traditional data engineering with autonomous decision-making capabilities.

Agentic Workflows

Designing autonomous systems for data pipeline management that can adapt and optimize themselves based on changing requirements.

AI-Powered Data Solutions

Integrating LLMs with traditional data engineering patterns to create next-generation data processing systems.

🔬 Current Project

Developing custom data agents based on ADK samples, combining my data expertise with agentic AI capabilities to create intelligent data processing workflows.

🎤 Community Contributions & Speaking

As a thought leader in data analytics and AI, I actively contribute to the developer community through speaking engagements and knowledge sharing.

Singapore Technology Week 2025

October 9, 2025
"Unleash the Power of Generative AI in BigQuery with Colab Data Science Agents and BigFrames"

Demonstrated practical applications of Generative AI in BigQuery, showcasing how to leverage Colab Data Science Agents and BigFrames for advanced data analytics workflows. Explored the integration of AI-powered tools with BigQuery to enable data scientists and analysts to build intelligent data processing pipelines with natural language interfaces and automated insights generation.

Generative AI, BigQuery, Colab Data Science Agents, BigFrames, Data & AI workshops

Google Cloud Next Extended Singapore 2025

June 14, 2025
"Metadata: The Key to Unlocking Data Analytics in the Agentic Era"

Presented insights on Google Cloud's latest data analytics innovations from Next '25, focusing on AI integration with BigQuery and the crucial role of metadata in enabling AI agents. Covered specialized AI agents for various user roles, AI-assisted notebooks, and the BigQuery AI Query Engine's capabilities with both structured and unstructured data.

BigQuery metadata, AI agents, Data governance, Query optimization, Autonomous data processing

GDG Monthly Meetup #10

October 24, 2024
"Harnessing Real-Time Insights: LLM Inference for Streaming Data with SQL"

Explored practical techniques for performing real-time inference on streaming data using large language models (LLMs) and SQL. Demonstrated seamless integration of LLMs into existing application workflows, enabling real-time insights, predictions, and classifications directly within familiar SQL environments.

Real-time data processing, LLM integration, Streaming analytics, SQL-based AI inference

🛠️ Tech Stack

Go
Python
Java
Apache Spark
BigQuery
Google Cloud
PostgreSQL
Redis
Docker
Cassandra
Kafka

🚀 Featured Projects

🤖 AI & Agentic Systems

bq-agent-app - Multi-Agent BigQuery System

An AI-powered data analysis system combining BigQuery with Google Agent Development Kit (ADK). Features multi-agent orchestration with specialized sub-agents for data retrieval, data science workflows, and BQML operations. Includes RAG corpus integration for BQML documentation and MCP (Model Context Protocol) support.

Tech Stack: Python, ADK, MCP, Gemini 2.5, BigQuery, Vertex AI
Key Features: Multi-agent architecture, Python code execution, Statistical analysis, BQML with RAG, MCP integration
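As a rough illustration of the orchestration idea described above, the sketch below routes a request to one of three specialist sub-agents. It is not the project's code and does not use the ADK API: `SubAgent`, `make_agent`, `route`, `orchestrate`, and the keyword lists are all hypothetical, and simple keyword matching stands in for whatever model-driven delegation the real system performs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SubAgent:
    """A hypothetical specialist agent: a name, trigger keywords, and a handler."""
    name: str
    keywords: tuple[str, ...]
    handle: Callable[[str], str]


def make_agent(name: str, keywords: tuple[str, ...]) -> SubAgent:
    # The handler only tags the request; a real sub-agent would call a model or tool.
    return SubAgent(name, keywords, lambda request: f"[{name}] {request}")


# Order matters: more specific agents are checked before the general one.
BQML = make_agent("bqml", ("train", "model", "forecast"))
DATA_SCIENCE = make_agent("data_science", ("plot", "correlation", "statistic"))
RETRIEVAL = make_agent("data_retrieval", ("select", "table", "query"))
AGENTS = [BQML, DATA_SCIENCE, RETRIEVAL]


def route(request: str, agents: list[SubAgent] = AGENTS,
          default: SubAgent = RETRIEVAL) -> SubAgent:
    """Return the first agent whose keyword appears in the request."""
    text = request.lower()
    for agent in agents:
        if any(keyword in text for keyword in agent.keywords):
            return agent
    return default


def orchestrate(request: str) -> str:
    """Root-agent entry point: pick a sub-agent and delegate to it."""
    return route(request).handle(request)
```

Calling `orchestrate("Train a model")` delegates to the hypothetical `bqml` agent; requests that match no keyword fall back to data retrieval.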

mcp-cr - Model Context Protocol Server

A comprehensive tutorial for deploying MCP (Model Context Protocol) servers to Google Cloud Run, featuring a zoo animal database with interactive tools. Demonstrates modern AI integration patterns with cloud-native deployment.

Tech Stack: Python, FastMCP, Google Cloud Run, Docker
Key Features: MCP server implementation, Cloud Run deployment, Interactive AI tools, RESTful API

📊 Data Engineering & Analytics

mdm-gcp - Master Data Management with AI

Production-ready MDM solution with 5-strategy AI matching for batch processing and 4-strategy real-time matching for streaming. Features vector embeddings with Gemini, fuzzy matching, business rules, and AI natural language reasoning. Unified batch and streaming architecture with BigQuery and Spanner.

Tech Stack: Python, BigQuery, Spanner, Gemini, Vertex AI Vector Search
Key Features: 5-strategy AI matching, Vector embeddings, Real-time streaming, Unified batch+streaming architecture
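To make the multi-strategy matching idea concrete, here is a minimal sketch that blends three of the strategies named above (fuzzy matching, vector similarity, and a business rule) into one score. Everything in it is an assumption for illustration: the function names, the `WEIGHTS` values, the same-email rule, and the toy `embedding` vectors, which stand in for model-generated embeddings. It is not the project's matching logic.

```python
import math
from difflib import SequenceMatcher


def fuzzy_score(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cosine_score(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rule_score(a: dict, b: dict) -> float:
    """Hypothetical business rule: identical non-empty emails count as a match."""
    return 1.0 if a.get("email") and a.get("email") == b.get("email") else 0.0


# Hypothetical weights; they sum to 1 so the combined score stays comparable.
WEIGHTS = {"fuzzy": 0.4, "vector": 0.4, "rule": 0.2}


def match_score(a: dict, b: dict) -> float:
    """Weighted blend of the per-strategy scores for two candidate records."""
    scores = {
        "fuzzy": fuzzy_score(a["name"], b["name"]),
        "vector": cosine_score(a["embedding"], b["embedding"]),
        "rule": rule_score(a, b),
    }
    return sum(WEIGHTS[name] * score for name, score in scores.items())
```

A weighted sum is only one way to combine strategies; a system could equally short-circuit on a decisive rule or take the maximum across strategies.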

data-clean-room-demo - BigQuery Data Clean Rooms

Comprehensive BigQuery Data Clean Room implementation with Analytics Hub integration. Demonstrates privacy-preserving analytics, BQML collaborative ML, and secure data sharing patterns with automated setup scripts for both DCR and DCX deployments.

Tech Stack: Python, BigQuery, Analytics Hub, BQML, Vertex AI
Key Features: Privacy-preserving analytics, BQML collaborative ML, Analytics Hub automation, Data exchange patterns

random-stuff - BigQuery Analytics Toolkit

Production-ready BigQuery tools and demos covering advanced analytics patterns. Includes FinOps cost optimization, geospatial routing, Places Insights competitive analysis, RLS/CLS security with Dataform, Firebase Analytics integration, Streaming CDC pipelines, and dbt migration workflows.

Tech Stack: Python, BigQuery, Dataform, dbt, PySpark, Jupyter
Key Features: FinOps cookbook, Geospatial analysis, Places Insights, RLS/CLS security, Streaming CDC, dbt+Spark+BQ, Firebase Analytics

random-stuff/agent_stuff - AI Agent Configs & Guides

Curated collection of AI agent configurations, coding standards, and workspace architecture guides for multi-model agentic workflows. Includes OpenClaw workspace architecture guides for Anthropic and Gemini, Google-style coding standards for AI-generated code, BigQuery data science agent prompt libraries, and opencode configuration scripts.

Tech Stack: Python, OpenClaw, Anthropic Claude, Gemini, Google Cloud
Key Features: OpenClaw workspace architecture guides (Anthropic + Gemini), Google-style AI coding standards (Python/Go/Java), BQ agent prompt library, opencode config + sync scripts, dbt migration agents

spark-hybrid-compute - Advanced Spark Integration

Comprehensive solution for Spark integration with BigLake Metastore and Apache Iceberg, supporting both Dataproc and Docker-based deployments. Demonstrates hybrid cloud computing patterns for modern data lakes.

Tech Stack: Apache Spark, BigLake, Apache Iceberg, Dataproc, Docker, Jupyter
Key Features: Hybrid cloud architecture, Iceberg table management, BigQuery integration, Multiple deployment options

bigquery-antipattern-recognition - BigQuery SQL Optimization

Enhanced fork of Google Cloud Platform's utility for identifying and rewriting common anti-patterns in BigQuery SQL. Added query grouping functionality and clustering optimization patterns for improved performance analysis.

Tech Stack: Java, BigQuery, Maven, Docker, Cloud Run, Vertex AI
Key Features: 15+ antipattern detections, AI-powered SQL rewriting, Query grouping analysis, Remote UDF deployment
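For a sense of what an antipattern check looks like, the sketch below flags `SELECT *`, a commonly cited BigQuery antipattern because it scans every column. This is an assumption-laden toy, not the tool's implementation: `find_select_star` is a hypothetical name, and a regular expression is used here in place of real SQL parsing, so it will also match inside comments, string literals, and `SELECT * EXCEPT (...)`.

```python
import re

# Matches the keyword SELECT followed directly by a bare asterisk.
SELECT_STAR = re.compile(r"\bselect\s+\*", re.IGNORECASE)


def find_select_star(sql: str) -> list[int]:
    """Return the character offsets of every `SELECT *` found in the query text."""
    return [match.start() for match in SELECT_STAR.finditer(sql)]
```

`SELECT COUNT(*)` is not flagged, because the asterisk does not directly follow the `SELECT` keyword.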

sheets-pyspark - Google Sheets with PySpark

Integration of Google Sheets as a data source for PySpark on Dataproc Serverless. Includes Airflow demo for scheduling notebook execution with three deployment options: PythonVirtualenvOperator, Vertex AI Custom Training, and Dataproc Serverless.

Tech Stack: Python, PySpark, Dataproc Serverless, Airflow, Google Sheets API, Jupyter
Key Features: Sheets as data source, Dataproc Serverless, Airflow orchestration, Multiple execution options

🔄 Real-Time Data Pipelines

dataflow-kafka-bq-examples - Kafka to BigQuery Streaming

Comprehensive Dataflow examples for streaming Kafka data to BigQuery. Features multi-branch processing, Beam SQL aggregations, multi-stream joins, and both custom Java pipelines and Flex Templates for different deployment scenarios.

Tech Stack: Java, Apache Beam, Kafka, Dataflow, BigQuery, Beam SQL
Key Features: Multi-branch processing, Beam SQL joins, Real-time aggregations, Flex Template deployment

beam-dataflow-iceberg-bqms - Beam with Iceberg Tables

Demonstration of Apache Beam with standard BigQueryIO and Managed I/O for BigQuery operations. Showcases 8 pipeline patterns including BigQuery Iceberg and BigLake Iceberg table operations with automatic schema handling.

Tech Stack: Python, Apache Beam, Dataflow, Apache Iceberg, BigQuery, BigLake
Key Features: Managed I/O, BigQuery Iceberg tables, BigLake integration, Multiple pipeline patterns

cf-pubsub-to-bq - Real-Time Data Ingestion

Complete real-time data pipeline solution from Pub/Sub to BigQuery using Cloud Run Functions. Includes data generation, streaming processing, and automated table management.

Tech Stack: Go, Pub/Sub, BigQuery, Cloud Run Functions, Dataflow
Key Features: Real-time processing, Automated data generation, Partitioned tables, End-to-end pipeline

dataflow-pubsub-to-bq-examples-py - Pub/Sub to BigQuery Streaming

Python streaming pipeline from Pub/Sub to BigQuery using BigQuery Storage Write API. Features micro-batching, Pub/Sub metadata capture, and partitioned tables with DirectRunner and DataflowRunner V2 support.

Tech Stack: Python, Apache Beam, Dataflow, Pub/Sub, BigQuery
Key Features: Storage Write API, Micro-batching, Pub/Sub metadata capture, Runner V2 support
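The micro-batching mentioned above can be illustrated without any streaming framework: group incoming messages into small fixed-size batches so that each write carries several rows. The helper below is a hypothetical, size-only sketch (`micro_batches` is not a name from the project); a real streaming pipeline would normally also flush on a time limit, and in Apache Beam this role is typically played by a batching transform rather than hand-written code.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def micro_batches(items: Iterable[T], max_size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most `max_size` items from `items`."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls up to max_size items without reading the whole stream.
        batch = list(islice(iterator, max_size))
        if not batch:
            return
        yield batch
```

Because it reads lazily, the helper works on unbounded iterators as well as lists; the final batch may be smaller than `max_size`.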

dataflow-pubsub-perf-test - Dataflow/BigQuery Performance Testing

Test infrastructure for diagnosing the Dataflow/BigQuery "Noisy Neighbor" throughput degradation pattern. Six rounds of testing across Pub/Sub and Kafka sources (Python + Java SDKs): 2.2 billion rows, 2.4 TB, 901k rows/sec peak, zero errors. Confirmed linear scaling and identified a shared Kafka consumer group as the root cause of production degradation. Exceeded the BigQuery Storage Write API regional quota and sustained it.

Tech Stack: Java, Python, Apache Beam, Dataflow, Pub/Sub, Kafka (Google Managed), BigQuery Storage Write API
Key Features: 2.2B rows / 2.4 TB scale testing, 901k rows/sec peak throughput, Noisy Neighbor root-cause diagnosis, Multi-source testing (Pub/Sub + Kafka), Python + Java SDK coverage

🧪 AI Experiments & Tools

gemini-cli-1c - One-Click Gemini CLI Setup

Automated one-command installation script for a complete development environment with NVM, Node.js, and Google's Gemini CLI. Streamlines developer onboarding for AI-powered workflows.

Tech Stack: Shell, Node.js, NVM, Gemini CLI
Key Features: One-command installation, Environment configuration, Developer productivity tools

vision-sandbox - Agentic Vision Tool

Agentic vision tool built as an OpenClaw skill, leveraging Gemini's native code execution sandbox for spatial grounding, visual math, and UI auditing tasks. Demonstrates OpenClaw skill architecture for vision-based agentic workflows.

Tech Stack: Python, Gemini, Google Cloud, OpenClaw
Key Features: Agentic vision analysis, Spatial grounding, Visual math, UI auditing, OpenClaw skill architecture

💼 Core Competencies

Data Architecture & Engineering

  • Big Data Processing: Apache Spark, Dataproc, distributed computing, Iceberg tables
  • Data Warehousing: BigQuery, data modeling, partitioning strategies, performance optimization
  • Real-Time Streaming: Pub/Sub, Kafka, Apache Beam, event-driven architectures
  • Database Technologies: PostgreSQL, Spanner, Redis, Cassandra
  • Master Data Management: AI-powered entity resolution, vector embeddings, multi-strategy matching

AI & Machine Learning

  • AI Agents: Multi-agent systems, agentic workflows, autonomous data processing
  • LLM Integration: Gemini AI, prompt engineering, RAG systems, AI-powered analytics
  • ML Engineering: Model deployment, MLOps, BQML, production ML systems
  • Vector Search: Semantic similarity, embeddings generation, hybrid search strategies

Cloud Architecture

  • Google Cloud Platform: Comprehensive expertise across data, AI, and compute services
  • Serverless Computing: Cloud Functions, Cloud Run, event-driven architectures
  • Infrastructure as Code: Terraform, deployment automation
  • Data Governance: Data Clean Rooms, Analytics Hub, privacy-preserving analytics
