How to read TLMZ performance report

When a VM checkpoint or migrate, the source VM print out messages to stdio that show live migration information. The formats of these lines are as follows. Note that we refer to our software as TLMZ.

  1. At the beginning, TLMZ reports the contents of TLMZ configuration file to let users check if the parameters are correct. We show the following example.
     tlc_conf: TLC_MSERV_IP = 
     tlc_conf: TLC_MSERV_BASE_PORT = 10008 
     tlc_conf: TLC_INTERVAL = 1000 
     tlc_conf: TLC_SLOWDOWN_INTERVAL = 1000 
     tlc_conf: TLC_SLOWDOWN_TRIGGER = OFF 
     tlc_conf: TLC_SLOWDOWN_MARKER = 0 
     tlc_conf: VIC_FLAG = ON  
     tlc_conf: VIC_IP = 
     tlc_conf: VIC_PORT = 17777 
     tlc_conf: TLC_MSERV_REPLICATOR_IP = 
     tlc_conf: TLC_MSERV_BASE_REPLICATOR_PORT = 11118 
     tlc_conf: TLC_END = TLC_END 
    connect error errno=111 fd=22
    VIC Connect error occured errno = 111 
    tlc_do_migrate: in tlmz uri is tlmz:
     slowdown stepwise = 0 trigger = OFF marker = 0.00 agress = 0 freeze = 0 
    mserv_create_connection: sending tlmz and total page num to fwder
    tlc tcp start outgoing mig: hostport =
    tlc tcp start outgoing mig: create socket 
    tlc tcp start outgoing mig: create a connection 
    RAM_SAVE_FLAG_EOS ram_save_live stage 1

    We discuss the meaning of these configuration parameters on another blog post. However, the above example shows that the destination of the migration is at IP using the “tlmz” protocol. The source hypervisor makes 3 TCP connections with the destination. They are on port 9004, port 10008 and port 10009. The connection made to is used by the mtx thread to transmit non-dirty pages. The connection made to port 10008 is used to transmit dirty pages, while the connection on port 10009 is used to send the remaining dirty pages at stage 3 of tlmz protocol. The last line indicates the completion of the 1st stage of tlmz, setting up mtx thread, dtx thread and dirty bit collection mechanism.

  2. During live migration, TLMZ reports lines such as
    s221 0 844
    s222 45616 43 0 0.02
    s221 18208 1000 
    s222 64312 199 0 0.04
    s221 22274 1000 
    s222 80920 491 0 0.05
    s221 16486 1000 
    s222 63982 188 0 0.09
    s221 21278 1000
    s222 26370 27 0 1.00
    RAM_SAVE_FLAG_EOS ram_save_live stage 2

    that can be put into the following format.

    s221 duration_id pages_copied_last_epoch epoch_duration  
    s222 pages_gen_last_epoch mig_duration all_io_cnt stage2_progress   

    On the “s221” line, the “duration_id” is the number of epoch during live migration. The “pages_copied_last_epoch” reports the number of pages being sent to the destination ( in this example). The “epoch_duration” shows the length in milliseconds of an epoch (e.g. 1000 ms). On the “s222” line, “pages_gen_last_epoch” indicates the number of new dirty pages generated by VM computation during last epoch. The “mig_duration” reports the time tlmz used after each epoch to update dirty bit information. During this time, the mtx and dtx threads stop. Next, the “all_io_cnt” show the number of I/O operations that occurred in the last epoch. Finally, the “stage2_progress” displays the percentage of VM memory scan of the mtx thread. Once this value reach 1.0, the live migration stage is finished (as indicated by “RAM_SAVE_FLAG_EOS … ” line).

  3. After that tlmz stops the VM and report the followings:

    s221 10637 1000
    s2s3gap 638 ps_wait 1047
    s3 1869301 1
    .. 10 .. 20 .. 30 .. 40 .. 50 .. 60 .. 70 .. 80 .. 90 .. 100 Percents
    RAM_SAVE_FLAG_EOS ram_save_live stage 3
    s222 10414 65475 443 1.00
    s4 1726145 1575222 179963

    The last pair of “s221” and “s222” lines is shown at this stage with special meaning since the dirty bit update operation includes the time to send the remaining dirty pages after the VM stopped. From the above example, we can interpret the “s221” line as there were 10637 dirty pages being sent to the destination in the last 1000 ms epoch of VM computation.

    On the other hand, the “s222” line presents the followins. The number of dirty pages generated in the last epoch is 10414 pages. The time the VM stop to do dirty bit update and Stage 3 operation of tlmz (or Downtime) is 65475 milliseconds and the total number of I/O operation during the live migration period is 443 operations.

    The “s2s3gap” present the time TLMX have to wait after the mtx finishes and before stage 3 starts. It’s used mainly for debugging purpose.

  4. The “s3” line comes with the following format:

    s3 last_addr_buffer num_cp_threads

    where “last_addr_buffer” show the total number of dirty pages to be transfered to the destination at Satge 3, and “num_cp_threads” indicates a number of threads use to send data out at this stage. From the above example, the number of dirty pages remaining to be sent at Stage 3 is 1869301 pages. The percentage completion of the data transfer is also presented.

  5. Finally, the summary of tlmz migration is reported in the “s4” line with the following format:
    s4 tlc_pages_sent_mserv tlc_s21_pages_transferred total_migration_time

    The “tlc_pages_sent_mserv” show the total number of pages sent to destination by dtx threads (during Stage 2) and the “tlc_s21_pages_transferred” shows the total number of pages sent by the mtx thread. The “total_migration_time” shows the total migration time from start to finish (in milliseconds). From the above example, the total migration time is 179963 milliseconds.

Leave a Reply