TalentAQ

Machine Learning Software Engineer - AI Ops & Model Infrastructure

Engineering · Full Time · 5-10 years · Bangalore Rural, Karnataka

Required Skills
11 skills

Python
C++
Go
TensorRT
ONNX Runtime
Kubernetes
Azure ML
Prometheus
Grafana
MLflow
Airflow

Job Description

Roles & Responsibilities:

Design, build, and maintain serving infrastructure for ONNX models, Augloop services, and future SLM/LLM pipelines.
Integrate AI models into scalable APIs for real-time inference and RAG-based applications.
Set up and run A/B experiments for Copilot features to evaluate model and system performance.
Implement robust logging, alerting, and telemetry to monitor for drift, regressions, and latency spikes.
Develop dashboards for automated monitoring, error detection, and inference quality tracking.
Optimize inference latency and compute cost across CPU and GPU environments.
Create internal tools for model performance analysis, comparison, and root-cause troubleshooting.
Work on both batch and streaming inference frameworks, ensuring adherence to SLA and uptime guarantees.
Implement orchestration and utilization monitoring for distributed CPU/GPU workloads.
Build tools to monitor uptime, throughput, job scaling, and container health metrics.
Ensure scalability, efficiency, and reliability of model-serving APIs with clear SLAs for performance metrics.
Profile and improve cold-start performance, concurrency handling, and system load capacity.
Integrate Responsible AI checks for fairness, explainability, and model behavior consistency.
Address AI injection threats, implement sandboxing techniques, and apply privacy guardrails.
Contribute to automated pipelines for SLA regression, PII validation, and compliance testing across Copilot systems.
