Cloud Architect (AI Native)
Shanda Group is a global investment firm and pioneer in Artificial Intelligence, committed to revolutionizing enterprise intelligence through our vision of “Discoverable AI.”
As a Cloud Architect, you will be instrumental in building the foundational infrastructure that supports our diverse portfolio of cutting-edge AI companies.
The Role
This is not a role for someone who wants to maintain infrastructure. It is a role for someone who wants to reinvent it.
We are looking for a Cloud Architect who thinks in systems, moves with urgency, and embraces AI as a first-order tool — not just something they build for, but something they build with. You will design and own the cloud foundation that powers AGI research, real-time AI products, and petabyte-scale (PB-scale) data platforms across our entire portfolio.
If you believe that a small, AI-empowered infrastructure team can outperform a traditional team ten times its size — and you want to prove it — this is your role.
Responsibilities
- AI-assisted cloud architecture design and governance: Lead the end-to-end design of multi-cloud and hybrid cloud architectures (AWS, GCP, Azure) across Shanda’s portfolio companies — using AI coding assistants (e.g., GitHub Copilot, Cursor) and LLM-powered design tools to accelerate architecture documentation, diagram generation, and standards authoring. Establish clear, AI-maintained standards for cloud resource organization, naming conventions, tagging policies, network topology, and environment segmentation (dev/staging/production), with AI agents continuously auditing drift and flagging violations.
- AI-augmented data platform infrastructure: Design and maintain the cloud infrastructure underpinning Shanda’s unified data platform, including PB-scale data lakes, streaming pipelines, and analytical warehouses. Leverage AI-powered data quality monitoring, anomaly detection, and automated pipeline repair to ensure high-reliability data access for AI training pipelines and real-time product features (e.g., MemGraph for Tanka, EverMemOS for EverMind) — with the goal of self-healing pipelines that require minimal human intervention.
- Intelligent networking and connectivity: Design and manage enterprise-grade cloud networking across geographically distributed teams and products spanning the US, Singapore, and other regions. Apply AI-based traffic analysis and anomaly detection to proactively identify latency bottlenecks, routing inefficiencies, and DDoS patterns — shifting from reactive troubleshooting to predictive network operations.
- AI-driven security, compliance, and access control: Define and enforce cloud security architecture across all portfolio companies using AI-powered threat detection, automated policy enforcement, and continuous compliance monitoring tools (e.g., Wiz, Orca, AWS Security Hub with AI-enriched findings). Leverage LLMs to accelerate security review of IaC changes, auto-generate compliance evidence for SOC 2 and HIPAA audits, and surface remediation recommendations in real time.
- AI-generated Infrastructure as Code and platform engineering: Champion an AI-native IaC workflow where engineers use LLM-assisted code generation (Copilot, Amazon Q, or equivalent) to author, review, and refactor Terraform/Pulumi modules at speed. Build self-service infrastructure platforms where AI agents can interpret natural-language requests from product teams and translate them into validated, policy-compliant infrastructure changes — dramatically reducing the toil of manual provisioning.
- Autonomous reliability, observability, and incident response: Define and own SLOs/SLAs for cloud infrastructure. Implement AI-enhanced observability stacks that go beyond dashboards — using anomaly detection models, AIOps platforms (e.g., Dynatrace, Moogsoft), and LLM-powered runbook execution to enable autonomous or semi-autonomous incident triage, root cause analysis, and remediation. Drive toward a posture where the majority of P2/P3 incidents are resolved without human escalation.
- AI-powered FinOps and cost optimization: Establish a FinOps practice augmented by AI forecasting models that predict spend trends, identify rightsizing opportunities, and recommend optimal reserved/spot instance strategies across all business units. Use AI agents to continuously scan for idle resources, orphaned assets, and cost anomalies — turning cloud cost management from a periodic review into a continuous, automated discipline.
- Continuous technology evaluation with AI-assisted research: Continuously evaluate emerging cloud technologies, managed AI services, and infrastructure tooling — using AI research assistants to synthesize vendor documentation, benchmark reports, and community feedback at scale. Provide well-reasoned build-vs-buy recommendations and lead proof-of-concept initiatives, with AI agents helping to automate test harness setup, results analysis, and decision documentation.
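To make the governance responsibilities above concrete, here is a minimal sketch of the kind of tagging-policy drift audit an AI agent might run continuously. The required-tag set, environment names, and resource records are hypothetical illustrations; a real implementation would pull inventory from a cloud asset API (e.g., AWS Config) rather than an in-memory list.

```python
# Minimal sketch of a tagging-policy drift audit (hypothetical policy values).
# A production version would read resources from a cloud inventory service
# instead of the hard-coded records below.

REQUIRED_TAGS = {"owner", "environment", "cost-center"}  # assumed policy
VALID_ENVIRONMENTS = {"dev", "staging", "production"}    # assumed segmentation


def audit_resource(resource: dict) -> list[str]:
    """Return the list of policy violations for one resource record."""
    violations = []
    tags = resource.get("tags", {})
    # Flag any required tag that is absent (sorted for stable output).
    for tag in sorted(REQUIRED_TAGS - tags.keys()):
        violations.append(f"missing required tag: {tag}")
    # Flag environments outside the approved dev/staging/production set.
    env = tags.get("environment")
    if env is not None and env not in VALID_ENVIRONMENTS:
        violations.append(f"invalid environment: {env}")
    return violations


def audit_inventory(resources: list[dict]) -> dict[str, list[str]]:
    """Map resource id -> violations, keeping only non-compliant resources."""
    report = {}
    for resource in resources:
        violations = audit_resource(resource)
        if violations:
            report[resource["id"]] = violations
    return report


if __name__ == "__main__":
    inventory = [
        {"id": "vm-1", "tags": {"owner": "tanka", "environment": "production",
                                "cost-center": "ai-products"}},
        {"id": "vm-2", "tags": {"owner": "evermind", "environment": "qa"}},
    ]
    for resource_id, issues in audit_inventory(inventory).items():
        print(resource_id, issues)
```

The same check can run on a schedule or as a policy gate in CI, which is one way the "AI agents continuously auditing drift" goal could be grounded in an ordinary, testable control.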
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- Cloud architecture expertise: 10+ years of hands-on experience in cloud architecture or infrastructure engineering, with deep expertise in at least one major cloud provider (AWS, GCP, or Azure) and working knowledge of multi-cloud environments. Relevant certifications (e.g., AWS Solutions Architect Professional, GCP Professional Cloud Architect) are a strong plus.
- AI/ML infrastructure: Proven experience designing and operating infrastructure for large-scale AI and machine learning workloads, including GPU cluster management, distributed training (e.g., PyTorch DDP, DeepSpeed, Megatron-LM), and high-throughput model inference serving (e.g., NVIDIA Triton Inference Server, vLLM, TensorRT).
- Networking and distributed systems: Strong understanding of cloud networking fundamentals — VPCs, subnetting, routing, load balancing, service mesh (e.g., Istio), and global traffic management. Experience designing low-latency, high-availability architectures for globally distributed systems.
- Infrastructure as Code and DevOps: Proficiency in IaC tools (Terraform, Pulumi, or CloudFormation) and CI/CD pipelines for infrastructure delivery. Experience with GitOps workflows and platform engineering practices that enable self-service infrastructure for development teams.
- Containerization and orchestration: Deep expertise in Docker and Kubernetes, including cluster administration, autoscaling, resource quotas, and multi-tenant cluster design. Experience with managed Kubernetes services (EKS, GKE, AKS) and GPU-aware scheduling (e.g., NVIDIA device plugins, MIG partitioning).
- Data and storage systems: Solid understanding of cloud-native storage solutions (object storage, block storage, distributed file systems such as Lustre or GPFS for HPC workloads) and modern data architectures including data lakes, lakehouses, and real-time streaming platforms (Kafka, Flink, Spark).
- Security and compliance: Strong command of cloud security principles — IAM, RBAC, secrets management (Vault, AWS Secrets Manager), network security groups, and encryption. Hands-on experience with compliance frameworks such as SOC 2, HIPAA, or ISO 27001, and familiarity with data residency and cross-border data transfer regulations.
- Observability and reliability engineering: Experience building and operating production observability stacks and defining SLOs. Familiarity with chaos engineering principles and tools (e.g., Chaos Monkey, LitmusChaos) to proactively validate system resilience.
- FinOps and cost optimization: Demonstrated ability to manage and optimize cloud costs at scale, including experience with reserved instances, spot/preemptible instances, and cloud cost allocation frameworks across multiple teams or business units.
- Communication and leadership: Excellent written and verbal communication skills, with the ability to present complex architectural decisions clearly to both technical peers and executive stakeholders. Experience mentoring engineers and driving cross-functional alignment on infrastructure strategy.
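As a concrete illustration of the SLO ownership referenced in the observability qualification, the core error-budget arithmetic is small; here is a minimal sketch in which the 99.9% target and the request counts are hypothetical example values:

```python
# Minimal sketch of an SLO error-budget calculation. The 99.9% target and
# request counts below are hypothetical values chosen for illustration.


def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent over a rolling window.

    slo_target: availability objective, e.g. 0.999 for "three nines".
    total / failed: request counts observed in the window.
    Returns 1.0 when nothing is spent; negative means the budget is blown.
    """
    if total == 0:
        return 1.0  # no traffic, nothing spent
    budget = (1.0 - slo_target) * total  # failures the SLO allows this window
    if budget == 0:
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / budget


# Example: 10M requests at a 99.9% SLO allow ~10,000 failures;
# 2,500 observed failures leave ~75% of the budget unspent.
remaining = error_budget_remaining(0.999, 10_000_000, 2_500)
print(f"{remaining:.0%} of error budget remaining")
```

Burn-rate alerts and the "resolve P2/P3 incidents without escalation" posture described earlier are typically built on exactly this quantity, tracked over multiple window lengths.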
Why Join Us?
- Impact: Play a critical role in building the infrastructure that powers the next generation of AI-native enterprises.
- Innovation: Work with cutting-edge technologies and collaborate with world-class researchers and engineers across our portfolio companies.
- Growth: Be part of a rapidly growing ecosystem with significant investment and a bold vision for the future of AI.
- Culture: Join a dynamic, forward-thinking team that values continuous learning, innovation, and cross-functional collaboration.