Performance upper bound of our algorithm and all the baselines in different scenarios with different random seeds. Red denotes our algorithm, which we compare with model-based and model-free baselines. Our approach achieves the best performance in terms of optimality: the upper edge of the red box plot is the highest in the figure. It also performs strongly at the upper quartile, median and lower quartile. Furthermore, the red box plot contains no outliers, which indicates more stable training. The data (n = 5) are presented as median values (the central line of each box), the 25th and 75th percentiles (the bottom and top edges of each box), minima and maxima (the whiskers attached to each box) and outliers (points beyond the whiskers). Ours-lin, our method with linear models.
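
The box-plot quantities listed above can be reproduced from raw scores with a minimal sketch such as the following. It is illustrative only: the five values are made up rather than taken from the figure, the function name `boxplot_summary` is hypothetical, and the 1.5 × IQR outlier rule is a common plotting convention assumed here, since the caption does not state which rule was used.

```python
import statistics

def boxplot_summary(values, whisker=1.5):
    """Summarise one box: quartiles, whisker ends and outliers.

    Assumption: points farther than `whisker` * IQR from the box are
    treated as outliers (a common default, not stated in the caption).
    """
    data = sorted(float(v) for v in values)
    # Inclusive quartiles use linear interpolation between data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inliers = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers reach the most extreme points that are not outliers.
        "whisker_low": min(inliers),
        "whisker_high": max(inliers),
        "outliers": outliers,
    }

# Hypothetical scores from five runs (n = 5), not data from the paper.
scores = [10.0, 12.0, 13.0, 14.0, 30.0]
summary = boxplot_summary(scores)
print(summary)
```

With these made-up scores, the value 30.0 falls above the upper fence and is reported as an outlier, so the upper whisker stops at 14.0. A box with an empty `outliers` list corresponds to the outlier-free case described for the red box plot.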