Cloud Architect (AI Native)
Shanda Group is a global investment firm and pioneer in Artificial Intelligence, committed to revolutionizing enterprise intelligence through our vision of “Discoverable AI.”
As a Cloud Architect, you will be instrumental in building the foundational infrastructure that supports our diverse portfolio of cutting-edge AI companies.
The Role
This is not a role for someone who wants to maintain infrastructure. It is a role for someone who wants to reinvent it.
We are looking for a Cloud Architect who thinks in systems, moves with urgency, and embraces AI as a first-order tool — not just something they build for, but something they build with. You will design and own the cloud foundation that powers AGI research, real-time AI products, and petabyte-scale (PB-scale) data platforms across our entire portfolio.
If you believe that a small, AI-empowered infrastructure team can outperform a traditional team ten times its size — and you want to prove it — this is your role.
Responsibilities
- AI-assisted cloud architecture design and governance: Lead the end-to-end design of multi-cloud and hybrid cloud architectures (AWS, GCP, Azure) across Shanda’s portfolio companies — using AI coding assistants (e.g., GitHub Copilot, Cursor) and LLM-powered design tools to accelerate architecture documentation, diagram generation, and standards authoring. Establish clear, AI-maintained standards for cloud resource organization, naming conventions, tagging policies, network topology, and environment segmentation (dev/staging/production), with AI agents continuously auditing drift and flagging violations.
- AI-augmented data platform infrastructure: Design and maintain the cloud infrastructure underpinning Shanda’s unified data platform, including PB-scale data lakes, streaming pipelines, and analytical warehouses. Leverage AI-powered data quality monitoring, anomaly detection, and automated pipeline repair to ensure high-reliability data access for AI training pipelines and real-time product features (e.g., MemGraph for Tanka, EverMemOS for EverMind) — with the goal of self-healing pipelines that require minimal human intervention.
- Intelligent networking and connectivity: Design and manage enterprise-grade cloud networking across geographically distributed teams and products spanning the US, Singapore, and other regions. Apply AI-based traffic analysis and anomaly detection to proactively identify latency bottlenecks, routing inefficiencies, and DDoS patterns — shifting from reactive troubleshooting to predictive network operations.
- AI-driven security, compliance, and access control: Define and enforce cloud security architecture across all portfolio companies using AI-powered threat detection, automated policy enforcement, and continuous compliance monitoring tools (e.g., Wiz, Orca, AWS Security Hub with AI-enriched findings). Leverage LLMs to accelerate security review of IaC changes, auto-generate compliance evidence for SOC 2 and HIPAA audits, and surface remediation recommendations in real time.
- AI-generated Infrastructure as Code and platform engineering: Champion an AI-native IaC workflow where engineers use LLM-assisted code generation (Copilot, Amazon Q, or equivalent) to author, review, and refactor Terraform/Pulumi modules at speed. Build self-service infrastructure platforms where AI agents can interpret natural-language requests from product teams and translate them into validated, policy-compliant infrastructure changes — dramatically reducing the toil of manual provisioning.
- Autonomous reliability, observability, and incident response: Define and own SLOs/SLAs for cloud infrastructure. Implement AI-enhanced observability stacks that go beyond dashboards — using anomaly detection models, AIOps platforms (e.g., Dynatrace, Moogsoft), and LLM-powered runbook execution to enable autonomous or semi-autonomous incident triage, root cause analysis, and remediation. Drive toward a posture where the majority of P2/P3 incidents are resolved without human escalation.
- AI-powered FinOps and cost optimization: Establish a FinOps practice augmented by AI forecasting models that predict spend trends, identify rightsizing opportunities, and recommend optimal reserved/spot instance strategies across all business units. Use AI agents to continuously scan for idle resources, orphaned assets, and cost anomalies — turning cloud cost management from a periodic review into a continuous, automated discipline.
- Continuous technology evaluation with AI-assisted research: Continuously evaluate emerging cloud technologies, managed AI services, and infrastructure tooling — using AI research assistants to synthesize vendor documentation, benchmark reports, and community feedback at scale. Provide well-reasoned build-vs-buy recommendations and lead proof-of-concept initiatives, with AI agents helping to automate test harness setup, results analysis, and decision documentation.
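To make the governance responsibilities above concrete, here is a minimal sketch of the kind of tagging-policy drift audit an AI agent might run continuously. The required-tag set, environment names, and resource records are hypothetical illustrations; a real implementation would pull inventory from a cloud asset API (e.g., AWS Config) rather than an in-memory list.

```python
# Minimal sketch of a tagging-policy drift audit (hypothetical policy values).
# A production version would read resources from a cloud inventory service
# instead of the hard-coded records below.

REQUIRED_TAGS = {"owner", "environment", "cost-center"}  # assumed policy
VALID_ENVIRONMENTS = {"dev", "staging", "production"}    # assumed segmentation


def audit_resource(resource: dict) -> list[str]:
    """Return the list of policy violations for one resource record."""
    violations = []
    tags = resource.get("tags", {})
    # Flag any required tag that is absent (sorted for stable output).
    for tag in sorted(REQUIRED_TAGS - tags.keys()):
        violations.append(f"missing required tag: {tag}")
    # Flag environments outside the approved dev/staging/production set.
    env = tags.get("environment")
    if env is not None and env not in VALID_ENVIRONMENTS:
        violations.append(f"invalid environment: {env}")
    return violations


def audit_inventory(resources: list[dict]) -> dict[str, list[str]]:
    """Map resource id -> violations, keeping only non-compliant resources."""
    report = {}
    for resource in resources:
        violations = audit_resource(resource)
        if violations:
            report[resource["id"]] = violations
    return report


if __name__ == "__main__":
    inventory = [
        {"id": "vm-1", "tags": {"owner": "tanka", "environment": "production",
                                "cost-center": "ai-products"}},
        {"id": "vm-2", "tags": {"owner": "evermind", "environment": "qa"}},
    ]
    for resource_id, issues in audit_inventory(inventory).items():
        print(resource_id, issues)
```

The same check can run on a schedule or as a policy gate in CI, which is one way the "AI agents continuously auditing drift" goal could be grounded in an ordinary, testable control.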
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- Cloud architecture expertise: 10+ years of hands-on experience in cloud architecture or infrastructure engineering, with deep expertise in at least one major cloud provider (AWS, GCP, or Azure) and working knowledge of multi-cloud environments. Relevant certifications (e.g., AWS Solutions Architect Professional, GCP Professional Cloud Architect) are a strong plus.
- AI/ML infrastructure: Proven experience designing and operating infrastructure for large-scale AI and machine learning workloads, including GPU cluster management, distributed training (e.g., PyTorch DDP, DeepSpeed, Megatron-LM), and high-throughput model inference serving (e.g., NVIDIA Triton Inference Server, vLLM, TensorRT).
- Networking and distributed systems: Strong understanding of cloud networking fundamentals — VPCs, subnetting, routing, load balancing, service mesh (e.g., Istio), and global traffic management. Experience designing low-latency, high-availability architectures for globally distributed systems.
- Infrastructure as Code and DevOps: Proficiency in IaC tools (Terraform, Pulumi, or CloudFormation) and CI/CD pipelines for infrastructure delivery. Experience with GitOps workflows and platform engineering practices that enable self-service infrastructure for development teams.
- Containerization and orchestration: Deep expertise in Docker and Kubernetes, including cluster administration, autoscaling, resource quotas, and multi-tenant cluster design. Experience with managed Kubernetes services (EKS, GKE, AKS) and GPU-aware scheduling (e.g., NVIDIA device plugins, MIG partitioning).
- Data and storage systems: Solid understanding of cloud-native storage solutions (object storage, block storage, distributed file systems such as Lustre or GPFS for HPC workloads) and modern data architectures including data lakes, lakehouses, and real-time streaming platforms (Kafka, Flink, Spark).
- Security and compliance: Strong command of cloud security principles — IAM, RBAC, secrets management (Vault, AWS Secrets Manager), network security groups, and encryption. Hands-on experience with compliance frameworks such as SOC 2, HIPAA, or ISO 27001, and familiarity with data residency and cross-border data transfer regulations.
- Observability and reliability engineering: Experience building and operating production observability stacks and defining SLOs. Familiarity with chaos engineering principles and tools (e.g., Chaos Monkey, LitmusChaos) to proactively validate system resilience.
- FinOps and cost optimization: Demonstrated ability to manage and optimize cloud costs at scale, including experience with reserved instances, spot/preemptible instances, and cloud cost allocation frameworks across multiple teams or business units.
- Communication and leadership: Excellent written and verbal communication skills, with the ability to present complex architectural decisions clearly to both technical peers and executive stakeholders. Experience mentoring engineers and driving cross-functional alignment on infrastructure strategy.
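As a concrete illustration of the SLO ownership referenced in the observability qualification, the core error-budget arithmetic is small; here is a minimal sketch in which the 99.9% target and the request counts are hypothetical example values:

```python
# Minimal sketch of an SLO error-budget calculation. The 99.9% target and
# request counts below are hypothetical values chosen for illustration.


def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent over a rolling window.

    slo_target: availability objective, e.g. 0.999 for "three nines".
    total / failed: request counts observed in the window.
    Returns 1.0 when nothing is spent; negative means the budget is blown.
    """
    if total == 0:
        return 1.0  # no traffic, nothing spent
    budget = (1.0 - slo_target) * total  # failures the SLO allows this window
    if budget == 0:
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / budget


# Example: 10M requests at a 99.9% SLO allow ~10,000 failures;
# 2,500 observed failures leave ~75% of the budget unspent.
remaining = error_budget_remaining(0.999, 10_000_000, 2_500)
print(f"{remaining:.0%} of error budget remaining")
```

Burn-rate alerts and the "resolve P2/P3 incidents without escalation" posture described earlier are typically built on exactly this quantity, tracked over multiple window lengths.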
Why Join Us?
- Impact: Play a critical role in building the infrastructure that powers the next generation of AI-native enterprises.
- Innovation: Work with cutting-edge technologies and collaborate with world-class researchers and engineers across our portfolio companies.
- Growth: Be part of a rapidly growing ecosystem with significant investment and a bold vision for the future of AI.
- Culture: Join a dynamic, forward-thinking team that values continuous learning, innovation, and cross-functional collaboration.