Introduction

Our team is working to improve the health and wealth of millions of current customers and to acquire more customers in the future. One of the most effective and efficient ways to achieve this goal is to get an app into the hands of millions of people. As it turns out, we already have a wonderful application, which has been downloaded by more than 3 million users as I write this post. The home page of the mobile application has a carousel in its bottom half where dynamic banners can be rendered. Each banner serves as a piece of information, a communication medium, or an entry point to an application feature. This is the first page seen by every user who successfully registers, and a portion of those users register their interest by clicking on a displayed banner. Our team's goal is to increase engagement within the app. The first step was to understand where the users who clicked the banners came from, and why they chose to explore the app via banners after registering while others explored it via other routes.

Problem statement

The goal was to increase user engagement within the app by understanding users’ interest in a variety of banners and then leveraging the results across the app.

We didn’t have existing data about user interaction with the app, nor did we have enough time to collect it. We were also expecting a huge inflow of new users in the near future due to planned marketing campaigns. We were essentially looking at a cold-start problem: we would know little about the new users, and time to market was a very important factor. We were expected to go live within two weeks with a solution that made the best of the data at hand.

Solution

Bayesian bandits with Thompson sampling ticked all the boxes:

  1. It requires little or no data to start with, compared to other options
  2. It learns from incoming users and data as they arrive and starts recommending banners
  3. It can handle new banners by configuring them as new arms

We also discussed the next phase of the project, in which we agreed to build contextual bandits. In this post, I will focus on the tools and technology that made deployment possible. I will not cover how the recommendation algorithm works or the technology stack used to implement it.
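The algorithm itself is out of scope for this post, but a toy example may help readers who are new to bandits see why the three properties listed above hold. Below is a minimal Beta-Bernoulli Thompson sampler, not the model we deployed; the class, method, and banner names and the uniform Beta(1, 1) prior are all assumptions made for illustration.

```python
import random


class BetaBernoulliBandit:
    """Toy Thompson sampler: one Beta(alpha, beta) posterior per banner (arm)."""

    def __init__(self, banners, seed=None):
        self._rng = random.Random(seed)
        # Beta(1, 1) is a uniform prior, so no historical data is needed to start.
        self.posteriors = {banner: [1.0, 1.0] for banner in banners}

    def add_banner(self, banner):
        """A new banner is just a new arm that starts from the same prior."""
        self.posteriors.setdefault(banner, [1.0, 1.0])

    def rank(self):
        """Sample a click rate per banner and order banners by that sample."""
        samples = {
            banner: self._rng.betavariate(alpha, beta)
            for banner, (alpha, beta) in self.posteriors.items()
        }
        return sorted(samples, key=samples.get, reverse=True)

    def update(self, banner, clicked):
        """Fold one observed impression into the banner's posterior."""
        # Index 0 counts clicks (alpha), index 1 counts non-clicks (beta).
        self.posteriors[banner][0 if clicked else 1] += 1.0


bandit = BetaBernoulliBandit(["banner_a", "banner_b"], seed=42)
bandit.update("banner_b", clicked=True)  # one user clicked banner_b
bandit.add_banner("banner_c")            # a new banner goes live
print(bandit.rank())                     # order varies with the sampled values
```

Because every ranking is drawn from the posteriors, banners with little data still get shown occasionally, while banners with many clicks rise to the top as evidence accumulates.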

Deployment

The build and deployment part of the project was broken down into two technical stages/phases:

  1. Test the model and document the results in the pre-prod environment with production data, and define the input and output schema for the model, which the data engineering team will use to create a streaming pipeline.
  2. Set up the model to consume a live stream of event data and respond via a REST endpoint with the recommended list of banners for each user.

The front end of the mobile app is configured to wait at most one second for a response from the back end. This meant that the app would render dynamic banners on the user's screen based on our recommendations, or fall back to static banners if we failed to deliver a response within a second, which added another layer of complexity to the second stage. Our APIs were expected to support a wide range of load, from a few hundred requests to millions across the region.
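The one-second contract can be sketched as follows. The real fallback lives in the mobile front end and is not shown in this post, so this Python version is only an illustration: `banners_within_budget`, `STATIC_BANNERS`, and the banner names are hypothetical, and `fetch_recommendations` stands in for the call to the REST endpoint.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical static banners shown when no recommendation arrives in time.
STATIC_BANNERS = ["static_banner_1", "static_banner_2"]

_executor = ThreadPoolExecutor(max_workers=4)


def banners_within_budget(fetch_recommendations, user_id, budget_seconds=1.0):
    """Return recommended banners, or the static ones if the call is too slow."""
    future = _executor.submit(fetch_recommendations, user_id)
    try:
        return future.result(timeout=budget_seconds)
    except FutureTimeout:
        # cancel() only stops a call that has not started yet; a call that is
        # already running is left to finish in the background.
        future.cancel()
        return STATIC_BANNERS
```

The point of the sketch is that the user always sees some banners after at most one second, so a slow back end degrades the experience rather than breaking it.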

The deployment infrastructure can be broken down into three major components:

  1. A robust build and deployment pipeline
  2. Automated performance testing
  3. Production monitoring and alerting

Tools

Tools used for the complete setup:

  1. Jenkins
  2. Artifactory
  3. Docker
  4. Aquasec image scanning
  5. Fortify static code scan
  6. Sonar Nexus open source code scanning
  7. Kubernetes
  8. Predator
  9. Prometheus
  10. Grafana
  11. Bitbucket

Our application is a set of Docker images that consume and produce content on Kafka topics.

Step 1 - Fetching code and checking for changes

Our pipeline starts by fetching the code from the Bitbucket repository. The code is organized into one folder for each of the four Docker images to be built. Before initiating the build for a folder, we check whether any file in it has changed. The Jenkins pipeline code for one of the folders, named ‘generator’, is shown below.

#hide_output
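script {
    // `git diff --quiet` exits with 0 when nothing under generator/ differs
    // between HEAD and its parent commit, and with 1 when something does.
    GIT_RESULT = sh(
        script: '''git diff --quiet HEAD "$(git rev-parse @~1)" -- generator''',
        returnStatus: true
    )
    echo "GIT RESULT -- ${GIT_RESULT} -- ${params.branchname}"
}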
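The same check has to be repeated for each image folder. As a rough illustration of the decision being made, here is a small Python sketch that takes a list of changed paths (for example, the output of `git diff --name-only`) and returns the folders that need a rebuild. Only the ‘generator’ folder is named in this post; the other three folder names are assumed from the manifest file names used in the deployment step, and the function name is hypothetical.

```python
from pathlib import PurePosixPath

# "generator" is named in the post; the other folder names are assumptions
# based on the manifest file names used later in the deployment step.
IMAGE_FOLDERS = ("generator", "aggregator", "detector", "restapi")


def folders_to_build(changed_files, folders=IMAGE_FOLDERS):
    """Return the image folders that contain at least one changed file."""
    touched = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Only the top-level folder decides which image a file belongs to.
        if parts and parts[0] in folders:
            touched.add(parts[0])
    # Keep the configured order so the build sequence is deterministic.
    return [folder for folder in folders if folder in touched]


print(folders_to_build(["generator/app.py", "README.md"]))
```

Skipping unchanged folders keeps the pipeline short, since each image build is followed by several scanning stages.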

Step 2 - Fortify

The next step is to run a complete static security scan of the code with Fortify.

#hide_output
sh '''
      echo "=================================================="
      echo "========--- SAST - Fortify Scan: Start ---========"
      echo "=================================================="
      hostname
      whoami
      ls -ahl
      echo 'WORKSPACE: ' $WORKSPACE
      cd $WORKSPACE
      pwd
      sourceanalyzer -v
      sourceanalyzer -b ${fortify_app_name} -clean
      sourceanalyzer -b ${fortify_app_name} -python-version ${python_version} -python-path ${python_path} ${fortify_scan_files}
      sourceanalyzer -b ${fortify_app_name} -scan -f ${fortify_app_name}.fpr
      fortifyclient -url https://sast.intranet.asia/ssc -authtoken "${fortify_upload_token}" uploadFPR -file ${fortify_app_name}.fpr -project ${fortify_app_name} -version ${fortify_app_version}
     '''

Step 3 - Docker

The next step is to build the Docker image. We first log in to Artifactory before initiating the build, because our pip libraries are also pulled from a pip mirror hosted in Artifactory. Here is a sample of how we do this.

#hide_output
sh """
                echo ${ARTIFACTORY_PASSWORD} | docker login -u ${ARTIFACTORY_USERNAME} --password-stdin docker-registry:8443
                cd generator
                docker build --file Docker-dev \
                 --build-arg HTTPS_PROXY=http://ip-address \
                 --build-arg ARTIFACTORY_USERNAME=${ARTIFACTORY_USERNAME} \
                 --build-arg ARTIFACTORY_PASSWORD=${ARTIFACTORY_PASSWORD} \
                 -t ${env.generator_image_latest} .
                docker tag ${env.generator_image_latest} ${env.generator_image_name}
                docker push ${env.generator_image_latest}
                docker push ${env.generator_image_name}
                docker logout docker-registry:8443
                cd ..
    """

Step 4 - Aquasec

After pushing the image to Artifactory, the next mandatory step is Docker image security scanning. The open-source libraries are scanned with Nexus and the image itself with Aquasec.

#hide_output
sh """
    echo "=================================================="
    echo "=============--- OSS - Nexus Scan ---============="
    echo "=================================================="
    docker save -o generator-dev.tar ${env.generator_image_latest}
"""
String result = nexusscan("pcaaicoeaipulsenudgesgeneratordev", "$WORKSPACE", "build");
echo result;
sh """
    rm -f generator-dev.tar
"""
sh """
    echo "=================================================="
    echo "=============--- CSEC - Aquasec Scan ---=========="
    echo "=================================================="
"""
aquasecscan("${env.generator_image_latest}")

The code and image security scanning stages are major milestones to clear during the deployment phase. It is both important and difficult to reach agreement with the application security teams about which risks we are willing to accept when allowing open-source libraries with bugs to go live in our environment.

Step 5 - Kubernetes

Now we move on to the stage where we actually deploy and run our images. To deploy our solution, we need a Redis database and a Kafka cluster up and running. We deploy our Docker images using the code below:

#hide_output
sh '''
            set +x
            echo "---- preparing options ----"
            export HTTPS_PROXY=ip-address:8080
            export KUBE_NAMESPACE="internal-namespace"
            export KC_OPTS=${KC_OPTS}" --kubeconfig=${KUBE_CONFIG}"
            export KC_OPTS=${KC_OPTS}" --insecure-skip-tls-verify=true"
            export KC_OPTS=${KC_OPTS}" --namespace=${KUBE_NAMESPACE}"
            
            echo "---- prepared options ----"
            echo "---- preparing alias ----"
            alias kc="kubectl ${KC_OPTS} $*"
            echo "---- alias prepared ----"
            
            echo "---- applying manifest ----"
   
   
           kc apply -f configmap.yaml

           if [ $which_app = "generator" ];then
             if [ $image_version = "latest" ];then
               kc delete deploy ai-pulse-nudges-events-reader||echo
             fi
             sed -i "s!GENERATOR_VERSION!$image_version!g" "generator.yaml"
             kc apply -f generator.yaml
           fi  

           if [ $which_app = "aggregator" ];then
             if [ $image_version = "latest" ];then
               kc delete deploy ai-pulse-nudges-click-counter||echo
             fi
             sed -i "s!AGGREGATOR_VERSION!$image_version!g" "aggregator.yaml"
             kc apply -f aggregator.yaml
           fi

           if [ $which_app = "detector" ];then
             if [ $image_version = "latest" ];then
               kc delete deploy ai-pulse-nudges-engine||echo
             fi
             sed -i "s!DETECTOR_VERSION!$image_version!g" "detector.yaml" 
             kc apply -f detector.yaml
           fi

           if [ $which_app = "restapi" ];then
             if [ $image_version = "latest" ];then
               kc delete deploy ai-pulse-nudges-restapi||echo
             fi
             sed -i "s!REST_VERSION!$image_version!g" "restapi.yaml"
             kc apply -f restapi.yaml
           fi

           if [ $which_app = "all" ];then
             if [ $image_version = "latest" ];then
               kc delete deploy ai-pulse-nudges-events-reader||echo
               kc delete deploy ai-pulse-nudges-click-counter||echo
               kc delete deploy ai-pulse-nudges-engine||echo
               kc delete deploy ai-pulse-nudges-restapi||echo
             fi

             sed -i "s!GENERATOR_VERSION!$image_version!g" "generator.yaml"
             sed -i "s!AGGREGATOR_VERSION!$image_version!g" "aggregator.yaml"
             sed -i "s!DETECTOR_VERSION!$image_version!g" "detector.yaml"
             sed -i "s!REST_VERSION!$image_version!g" "restapi.yaml"

             kc apply -f generator.yaml
             kc apply -f aggregator.yaml
             kc apply -f detector.yaml
             kc apply -f restapi.yaml
           fi
   
   
   
            echo "---- manifest applied ----"
            echo "---- checking result ----"
            
            echo " >> Deployments "
            kc get deployments
            
            echo " >> Services"
            kc get svc
            
            echo " >> Ingress"
            kc get ingress
            
            echo " >> Pods"
            kc get pods
            
            echo "---- Done ----"
          '''
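The if-chain in the script above repeats the same three actions for each application: delete the old deployment when the tag is `latest`, substitute the version placeholder in the manifest, and apply the manifest. To make that logic easier to see at a glance, here is a sketch in Python that builds the same command list from a lookup table. The placeholders, manifest names, and deployment names are copied from the script; the function name and the `ValueError` for unknown names are illustrative additions (the script itself silently skips names it does not recognize).

```python
# App name -> placeholder in the manifest, manifest file, deployment name.
# All three values per app are copied from the shell script above.
APPS = {
    "generator": ("GENERATOR_VERSION", "generator.yaml", "ai-pulse-nudges-events-reader"),
    "aggregator": ("AGGREGATOR_VERSION", "aggregator.yaml", "ai-pulse-nudges-click-counter"),
    "detector": ("DETECTOR_VERSION", "detector.yaml", "ai-pulse-nudges-engine"),
    "restapi": ("REST_VERSION", "restapi.yaml", "ai-pulse-nudges-restapi"),
}


def deployment_plan(which_app, image_version):
    """List, in order, the commands the script would run for this request."""
    selected = list(APPS) if which_app == "all" else [which_app]
    unknown = [app for app in selected if app not in APPS]
    if unknown:
        raise ValueError(f"unknown app(s): {unknown}")

    commands = ["kc apply -f configmap.yaml"]
    if image_version == "latest":
        # The script deletes the existing deployment before re-applying "latest".
        commands += [f"kc delete deploy {APPS[app][2]}" for app in selected]
    commands += [
        f'sed -i "s!{APPS[app][0]}!{image_version}!g" {APPS[app][1]}'
        for app in selected
    ]
    commands += [f"kc apply -f {APPS[app][1]}" for app in selected]
    return commands


for command in deployment_plan("restapi", "latest"):
    print(command)
```

With a table like this, adding a fifth image would mean adding one entry rather than another copy of the if-block.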

Step 6 - Performance test

We deploy Predator, the tool we use for performance testing.

#hide_output
sh '''
            set +x
            echo "---- preparing options ----"
            export HTTPS_PROXY=ip-address:8080
            export KUBE_NAMESPACE="internal-namespace"
            export KC_OPTS=${KC_OPTS}" --kubeconfig=${KUBE_CONFIG}"
            export KC_OPTS=${KC_OPTS}" --insecure-skip-tls-verify=true"
            export KC_OPTS=${KC_OPTS}" --namespace=${KUBE_NAMESPACE}"
            
            echo "---- prepared options ----"
            echo "---- preparing alias ----"
            alias kc="kubectl ${KC_OPTS} $*"
            echo "---- alias prepared ----"
            
            echo "---- applying manifest ----"
   
           kc get deploy|grep predator|awk '{print $1 }' || echo
           kc get deploy|grep predator|awk '{print $1 }'|xargs kc delete deploy || echo

           for i in `seq $replica_count`
           do
             echo $i
             cp -rf predator/predator.yaml tmp.yaml
             sed -i "s!REPLICA_NO!$i""!g" "tmp.yaml"
             kc apply -f tmp.yaml
           done  
   
   
            # kc apply -f predator/predator.yaml
            
            echo "---- manifest applied ----"
            echo "---- checking result ----"
            
            echo " >> Deployments "
            kc get deployments
            
            echo " >> Services"
            kc get svc
            
            echo " >> Ingress"
            kc get ingress
            
            echo " >> Pods"
            kc get pods
            
            echo "---- Done ----"
          '''

Predator is an amazing tool that lets us leverage the existing Kubernetes infrastructure to simulate an unlimited number of users for testing. Read more about the tool here: https://medium.com/zooz-engineering/by-niv-lipetz-software-engineer-zooz-b5928da0b7a8

We use the existing enterprise Prometheus and Grafana setup to monitor the application pods.
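Since the front end falls back to static banners after one second, one useful number to read from a load test is the share of requests slower than that budget. The sketch below assumes the response times are available as a plain list of seconds; that input format, the function name, and the choice of a nearest-rank 95th percentile are assumptions for illustration, not something Predator or our dashboards are claimed to provide in this form.

```python
def summarize_latencies(latencies_seconds, budget_seconds=1.0):
    """Summarize load-test response times against the front-end time budget."""
    if not latencies_seconds:
        raise ValueError("need at least one latency sample")

    ordered = sorted(latencies_seconds)
    count = len(ordered)

    # Nearest-rank 95th percentile: the smallest sample such that at least
    # 95% of all samples are less than or equal to it. Integer arithmetic
    # computes ceil(0.95 * count) without floating-point rounding issues.
    rank = (95 * count + 99) // 100

    # Requests slower than the budget would have shown static banners.
    over_budget = sum(1 for value in ordered if value > budget_seconds)

    return {
        "p95_seconds": ordered[rank - 1],
        "fallback_rate": over_budget / count,
    }


print(summarize_latencies([0.2, 0.3, 0.25, 1.4, 0.5]))
```

A rising fallback rate under load is an early sign that users are seeing static banners instead of recommendations, even when no request actually fails.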

Lessons learned for next time:

  1. We wrote the pipeline code from scratch. It would have saved time if an advanced "hello world" style empty pipeline had existed to use as a template. It would also have told us which credentials and access were required at each stage.
  2. Many credentials and access rights were required to get the pipeline up and running. It would save a lot of time and effort to have one master service ID created and assigned to a pipeline, which could then be used across all tools in the organization.
  3. It is very difficult to build a machine learning model, and real-time streaming data adds further complexity, but productionizing that model with streaming data is many times more difficult.

Contributors

Glenn Bayne, Tien Nguyet Long, John Yue, Zeldon Tay (郑育忠), Steven Chau, Denys Pang, Philipp Gschoepf, Syam Bandi, Uma Maheshwari, Michael Natusch