AI Systems Engineer · ML • GenAI • Agentic AI

Building production AI systems that scale from machine learning to autonomous agents.

Senior AI Engineer with 7+ years of experience building machine learning, RAG, and agentic AI systems across fintech and regulated enterprise domains using AWS-native cloud infrastructure.

7+
Years Experience
3
Companies
ML Models in Production
AWS
Cloud-Native
Portrait of Vikash Agrawal, Senior AI Engineer
Python • AWS • LangGraph • Bedrock • XGBoost

Experience

7+ years building scalable software, ML, and AI systems

My journey across software engineering, machine learning, and production AI systems.

  1. 2026 – Present

    Solstice Intelligence

    Senior AI Engineer

    Architecting production-grade agentic AI systems, enterprise RAG platforms, and AWS-native AI infrastructure for regulatory intelligence in life sciences.

    • LangGraph
    • Bedrock
    • AWS
    • RAG
    • Agentic AI
  2. 2021 – 2026

    Scienaptic AI

    Senior Data Scientist / Generative AI Engineer

    Built credit risk ML systems, anomaly detection pipelines, fairness evaluation frameworks, and enterprise generative AI platforms for large-scale financial institutions.

    • XGBoost
    • SHAP
    • ML
    • LLM
    • Risk AI
  3. 2019 – 2021

    TCS

    System Engineer

    Built backend APIs and scalable distributed applications using Python, Django, PostgreSQL, and React.

    • Python
    • Django
    • React
    • PostgreSQL

Case Studies

Featured Case Studies

Real-world AI systems built at enterprise scale.

Agentic AI / Production GenAI / Enterprise RAG

Agentic Regulatory Intelligence Platform

A production agentic AI platform for life-science compliance teams that continuously monitors global regulators such as the FDA and EMA, performs semantic diffing across regulatory document versions, and generates citation-backed compliance impact assessments through a stateful multi-agent LangGraph workflow.

  • 10K+ documents / month
  • 6-agent LangGraph workflow
  • Citation-backed reports

Semantic change detection

Structural diff first, then an LLM compares old and new text, then a final pass scores business impact. It catches small edits like “annually” turning into “quarterly” that carry big compliance consequences.
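
A minimal sketch of the structural-diff stage, assuming plain-text versions and a word-level comparison with Python's standard `difflib`. The function name and sample sentences are illustrative, not the platform's code; the LLM comparison and impact scoring would run on the spans this returns.

```python
import difflib


def changed_spans(old_text: str, new_text: str) -> list[tuple[str, str]]:
    """Return (old, new) word-level spans that differ between two versions."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep replacements, insertions, and deletions
            spans.append((" ".join(old_words[i1:i2]), " ".join(new_words[j1:j2])))
    return spans


# Hypothetical sentences, written only for this example.
old = "Sponsors must submit safety reports annually to the agency."
new = "Sponsors must submit safety reports quarterly to the agency."
print(changed_spans(old, new))  # [('annually', 'quarterly')]
```

Diffing first keeps the expensive LLM pass focused on the handful of spans that actually changed.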

Multi-agent reasoning

Six LangGraph agents split the work: a planner, a retrieval agent, a change analyzer, a domain expert that rates severity, a client-impact agent, and a report generator.
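
To make the hand-offs concrete, here is a rough sketch in plain Python rather than LangGraph's API: each agent is a function that reads a shared state and returns the keys it adds. The six roles come from the description above; every function body, key name, and sample value is a placeholder assumption.

```python
from typing import Callable

State = dict[str, object]          # shared state passed from node to node
Node = Callable[[State], State]    # each node returns only the keys it adds


def planner(state: State) -> State:
    return {"plan": ["retrieve", "analyze", "rate", "assess", "report"]}


def retrieval_agent(state: State) -> State:
    return {"documents": [f"versions of {state['doc_id']}"]}


def change_analyzer(state: State) -> State:
    return {"changes": ["annually -> quarterly"]}  # placeholder finding


def domain_expert(state: State) -> State:
    return {"severity": "high" if state["changes"] else "low"}


def client_impact_agent(state: State) -> State:
    action = "review required" if state["severity"] == "high" else "no action"
    return {"impact": action}


def report_generator(state: State) -> State:
    return {"report": f"[{state['severity']}] {state['impact']}: {state['changes']}"}


PIPELINE: list[Node] = [
    planner, retrieval_agent, change_analyzer,
    domain_expert, client_impact_agent, report_generator,
]


def run(state: State, pipeline: list[Node] = PIPELINE) -> State:
    """Run nodes in order, merging each node's output into the state."""
    for node in pipeline:
        state = {**state, **node(state)}
    return state


result = run({"doc_id": "guidance-123"})  # hypothetical document id
print(result["report"])
```

A real graph would add branching and checkpoints; the point here is only that each agent owns one narrow slice of the state.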

Hybrid retrieval

Vector similarity runs alongside keyword search, with metadata filters for agency, version, and effective date, all over OpenSearch. Pure vector search kept missing exact regulatory terms.
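
One common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion. The sketch below assumes that technique purely for illustration (the description above does not say how scores are combined), and the document ids, metadata, and helper names are made up.

```python
def filter_by_metadata(hits: list[str], metadata: dict[str, dict], **required) -> list[str]:
    """Keep only hits whose metadata matches every required field."""
    return [
        doc_id for doc_id in hits
        if all(metadata[doc_id].get(key) == value for key, value in required.items())
    ]


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical index contents and search results.
metadata = {
    "d1": {"agency": "FDA"}, "d2": {"agency": "EMA"},
    "d3": {"agency": "FDA"}, "d4": {"agency": "FDA"},
}
vector_hits = ["d1", "d2", "d3"]
keyword_hits = ["d3", "d4", "d1"]

fused = reciprocal_rank_fusion([
    filter_by_metadata(vector_hits, metadata, agency="FDA"),
    filter_by_metadata(keyword_hits, metadata, agency="FDA"),
])
print(fused)  # ['d3', 'd1', 'd4']
```

The exact-term match from keyword search lifts `d3` above `d1`, which is the behavior pure vector search was missing.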

Human in the loop

Anything flagged high-severity pauses for a compliance reviewer to sign off before it goes out. That one checkpoint did more for trust than any accuracy gain.
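
The checkpoint itself can be very small. A sketch, assuming a three-level severity label and an `approved_by` field; both names are hypothetical and stand in for whatever the real workflow stores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Assessment:
    doc_id: str
    severity: str                      # assumed labels: "low", "medium", "high"
    approved_by: Optional[str] = None  # reviewer id once signed off


def route(assessment: Assessment) -> str:
    """High-severity items wait for a reviewer; everything else publishes."""
    if assessment.severity == "high" and assessment.approved_by is None:
        return "hold_for_review"
    return "publish"


pending = Assessment(doc_id="guidance-123", severity="high")
print(route(pending))        # hold_for_review
pending.approved_by = "reviewer-1"
print(route(pending))        # publish
```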

  • LangGraph
  • LangChain
  • Bedrock
  • OpenSearch
  • ECS Fargate
  • DynamoDB
  • SQS
  • EventBridge
  • Terraform
  • LangSmith

Machine Learning / FinTech AI / Risk Modeling

Enterprise Credit Risk Intelligence Platform

Credit underwriting and portfolio risk models for large lenders, scoring millions of applicants on a blend of bureau and alternative data. The job was always the same balancing act: approve more good borrowers without quietly taking on more default risk, and keep every decision explainable enough to defend to a regulator.

  • Millions of applicants scored
  • KS and Gini for risk separation
  • PSI / CSI drift monitoring in production
  • AIR fairness checks on protected groups
  • Real-time and batch scoring pipelines

  • XGBoost
  • Random Forest
  • SHAP
  • Optuna
  • PSI
  • CSI
  • AIR
  • LexisNexis
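
For readers who have not met these metrics, here is a dependency-free sketch of three of them using their standard textbook definitions. The function names and sample numbers are illustrative only; a production pipeline would compute them over binned score distributions and real decision logs.

```python
import math


def ks_statistic(scores: list[float], labels: list[int]) -> float:
    """Max gap between cumulative bad (label 1) and good (label 0) rates by score."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("need at least one good and one bad outcome")
    pairs = sorted(zip(scores, labels))
    cum_bad = cum_good = 0
    best = 0.0
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            cum_bad += 1
        else:
            cum_good += 1
        # Only measure the gap once all records sharing this score are counted.
        if i == len(pairs) - 1 or pairs[i + 1][0] != score:
            best = max(best, abs(cum_bad / n_bad - cum_good / n_good))
    return best


def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index over bin proportions that each sum to 1."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total


def adverse_impact_ratio(approved_protected: int, total_protected: int,
                         approved_reference: int, total_reference: int) -> float:
    """Approval rate of the protected group divided by the reference group's."""
    return (approved_protected / total_protected) / (approved_reference / total_reference)


print(ks_statistic([0.1, 0.2, 0.3, 0.8, 0.9], [0, 0, 0, 1, 1]))  # 1.0 (perfect separation)
print(psi([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]))   # 0.0 (no drift)
print(round(adverse_impact_ratio(30, 100, 50, 100), 2))          # 0.6
```

An AIR below 0.8 is the commonly cited four-fifths threshold for a closer fairness review, and Gini follows from AUC as 2 × AUC − 1.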

AI Reliability / Observability

LLM Evaluation Framework

Evaluation harnesses for enterprise LLM features, built so quality is something you can actually see. They score retrieval and generation on real traffic, track groundedness and hallucination rate, and run as regression gates every time a prompt or model changes.

  • Recall@K and MRR on retrieval
  • Groundedness and hallucination scoring
  • Latency and cost tracking

  • Evaluation
  • LangSmith
  • Groundedness
  • Recall@K
  • Observability
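
The two retrieval metrics are simple enough to show in full. A sketch using the standard definitions, with made-up document ids; it is not the harness itself.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1 / rank of the first relevant hit; 0 for a query with none."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


print(recall_at_k(["a", "b", "c"], {"a", "c", "d"}, k=2))           # one of three relevant docs found
print(mean_reciprocal_rank([(["x", "a"], {"a"}), (["b"], {"b"})]))  # 0.75
```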

Architecture

Architecture Gallery

System designs behind production AI platforms.

Writing

Technical Writing

Thoughts on AI systems, machine learning, and production engineering.

Retrieval / Essay

Production RAG at Scale

Naïve top-k retrieval quietly degrades past a few thousand documents. The fix is rarely a bigger model. It's getting the right context in front of it.

  • Chunk on document structure, not fixed token windows
  • Run keyword and vector search together, then rerank the top hits
  • Filter on metadata before the query ever reaches the model

Retrieval quality, not the model, is the ceiling on RAG accuracy.
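
As an example of the first point, a minimal sketch of structure-aware chunking. It assumes the source has already been converted to text with Markdown-style `#` headings, which is an assumption for the example; real regulatory documents need a parser for their own section numbering.

```python
def chunk_by_headings(text: str) -> list[dict[str, str]]:
    """Split a document into one chunk per heading, keeping the heading as a title."""
    chunks: list[dict[str, str]] = []
    title, body = "preamble", []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:  # skip headings with no text under them
            chunks.append({"title": title, "text": content})

    for line in text.splitlines():
        if line.startswith("#"):
            flush()
            title, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()
    return chunks


# Hypothetical two-section document.
doc = "# Scope\nApplies to sponsors.\n\n# Reporting\nSubmit quarterly.\nKeep records."
for chunk in chunk_by_headings(doc):
    print(chunk["title"], "->", repr(chunk["text"]))
```

Each chunk carries its section title, so the retriever can filter or boost on it instead of guessing from a fixed token window.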

Agents / Essay

Designing Reliable Agentic AI Systems

Agents don't fail in the demo. They fail on the long tail, three steps deep, when a tool returns something unexpected. Reliability is mostly about closing those gaps.

  • Give every tool a strict schema and one narrow job
  • Cap planning loops so an agent can't wander or stall
  • Add a critic pass before anything acts on the result

Reliability comes from constraints, not larger models.
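
The three constraints fit in a short loop. This is a sketch under stated assumptions: `propose_step` stands in for the model call, tools are registered as a schema plus a function, and the critic is any callable that accepts or rejects a final answer. All names are hypothetical.

```python
from typing import Callable


def validate_args(args: dict, schema: dict[str, type]) -> None:
    """Reject calls with missing, extra, or wrongly typed arguments."""
    if args.keys() != schema.keys():
        raise ValueError(f"expected arguments {sorted(schema)}, got {sorted(args)}")
    for name, expected_type in schema.items():
        if not isinstance(args[name], expected_type):
            raise TypeError(f"{name} must be {expected_type.__name__}")


def run_agent(propose_step: Callable[[list], dict],
              tools: dict[str, tuple[dict[str, type], Callable]],
              critic: Callable[[str, list], bool],
              max_steps: int = 5) -> str:
    """Run a capped plan/act loop; only critic-approved answers are returned."""
    history: list = []
    for _ in range(max_steps):
        step = propose_step(history)
        if "final" in step:
            if critic(step["final"], history):
                return step["final"]
            history.append(("critic_rejected", step["final"]))
            continue
        schema, tool_fn = tools[step["tool"]]
        validate_args(step["args"], schema)  # strict schema before any side effect
        history.append((step["tool"], tool_fn(**step["args"])))
    raise RuntimeError("step budget exhausted")  # fail loudly instead of wandering


# Scripted stand-ins for the model and a single narrow tool.
tools = {"lookup": ({"doc_id": str}, lambda doc_id: f"text of {doc_id}")}


def scripted_model(history: list) -> dict:
    if not history:
        return {"tool": "lookup", "args": {"doc_id": "d1"}}
    return {"final": f"summary of {history[-1][1]}"}


print(run_agent(scripted_model, tools, critic=lambda answer, history: bool(answer)))
# summary of text of d1
```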

Evaluation / Essay

LLM Evaluation in Enterprise Systems

You can't ship what you can't measure. A vibe check on ten prompts doesn't survive a model swap, so evaluation has to be built in, not bolted on.

  • Score retrieval and generation as separate stages
  • Measure groundedness and hallucination rate on real traffic
  • Gate every prompt or model change behind the eval suite

Treat evals like CI: every prompt change is a regression test.
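
A regression gate can be as plain as comparing a candidate run against a stored baseline. A sketch, assuming metrics arrive as a flat dict; the metric names, tolerance, and scores are invented for the example.

```python
def regression_failures(baseline: dict[str, float],
                        candidate: dict[str, float],
                        tolerance: float = 0.02,
                        lower_is_better: frozenset = frozenset({"hallucination_rate"})) -> list[str]:
    """Return the metrics that got worse than the baseline by more than tolerance."""
    failures = []
    for metric, base in baseline.items():
        new = candidate[metric]
        # Positive `worse_by` always means the candidate regressed.
        worse_by = (new - base) if metric in lower_is_better else (base - new)
        if worse_by > tolerance:
            failures.append(metric)
    return failures


# Illustrative numbers only, not results from a real system.
baseline = {"recall_at_5": 0.82, "groundedness": 0.91, "hallucination_rate": 0.04}
candidate = {"recall_at_5": 0.83, "groundedness": 0.85, "hallucination_rate": 0.05}

failed = regression_failures(baseline, candidate)
print(failed)  # ['groundedness']
```

In CI, a non-empty list would fail the build, so a prompt tweak that quietly costs groundedness never reaches production.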

Resume

Download my latest resume covering experience across AI systems engineering, machine learning, generative AI, and cloud-native architectures.

Download Resume (PDF)

Contact

Let's Build Something Meaningful

Open to AI engineering opportunities, consulting, and high-impact technical collaborations.