Job Description
# Roles & Responsibilities
Design, build, and maintain serving infrastructure for ONNX models, Augloop services, and future SLM/LLM pipelines.
Integrate AI models into scalable APIs for real-time inference and RAG-based applications.
Set up and run A/B experiments for Copilot features to evaluate model and system performance.
Implement robust logging, alerting, and telemetry to monitor for drift, regressions, and latency spikes.
Develop dashboards for automated monitoring, error detection, and inference quality tracking.
Optimize inference latency and compute cost across CPU and GPU environments.
Create internal tools for model performance analysis, comparison, and root-cause troubleshooting.
Work on both batch and streaming inference frameworks, ensuring adherence to SLAs and uptime guarantees.
Implement orchestration and utilization monitoring for distributed CPU/GPU workloads.
Build tools to monitor uptime, throughput, job scaling, and container health metrics.
Ensure scalability, efficiency, and reliability of model-serving APIs with clear SLAs for performance metrics.
Profile and improve cold-start performance, concurrency handling, and system load capacity.
Integrate Responsible AI checks for fairness, explainability, and model behavior consistency.
Address AI injection threats, implement sandboxing techniques, and apply privacy guardrails.
Contribute to automated pipelines for SLA regression, PII validation, and compliance testing across Copilot systems.