Large Language Model Selection for Test-Driven Prompt Android iOS Development

Muhammad Rizqullah; Emad Albassam

doi:10.3991/ijim.v20i03.59861

Authors

Muhammad Rizqullah King Abdulaziz University, Jeddah, Saudi Arabia https://orcid.org/0009-0007-9739-8850
Emad Albassam King Abdulaziz University, Jeddah, Saudi Arabia https://orcid.org/0000-0001-6949-0368

Keywords:

Artificial Intelligence, Explainable AI, Empirical Software Engineering, Mobile Development, Software Engineering

Abstract

Large language model (LLM) code generation research predominantly focuses on Python, with test-driven prompt engineering exclusively targeting this language. This study presents a comprehensive LLM selection framework for mobile development through rigorous empirical analysis. We conducted 8,704 evaluations across 544 programming tasks (HumanEval and MBPP datasets) on Android (Java) and iOS (Swift) platforms using four state-of-the-art LLMs (GPT-4o, GPT-4o-mini, Qwen 14B, and Qwen 32B), two prompting strategies (base and test-driven), and two metrics (accuracy and remediation accuracy). Systematic analysis of platform-specific patterns yielded a decision tree incorporating first-attempt correctness, budget constraints, and self-hosting requirements, validated through three industry-relevant use cases. Results show test-driven prompting (TDP) achieves a +2.22 pp average accuracy improvement over baseline (95% CI [1.22–3.23 pp], p < 0.001, d = 0.3974). However, LLMs consistently underperform in mobile development (66.85%–88.87%) compared to Python-based code generation (86.90%–91.30%) regardless of model size or type. This framework establishes groundwork for platform-specific optimizations while providing practitioners with actionable guidance for model selection in mobile development contexts.
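
The evaluation count in the abstract is consistent with a full factorial design: 544 tasks × 2 platforms × 4 models × 2 prompting strategies. The short Python sketch below enumerates that grid to check the arithmetic. The factorial breakdown is an inference from the figures in the abstract, not a description of the authors' actual harness, and the names `evaluation_grid`, `NUM_TASKS`, `PLATFORMS`, `MODELS`, and `STRATEGIES` are hypothetical. It also assumes that both metrics (accuracy and remediation accuracy) are computed on the same runs and so do not multiply the count.

```python
from itertools import product

# Factors as named in the abstract. The task count is the combined
# HumanEval + MBPP total reported there.
NUM_TASKS = 544
PLATFORMS = ["Android (Java)", "iOS (Swift)"]
MODELS = ["GPT-4o", "GPT-4o-mini", "Qwen 14B", "Qwen 32B"]
STRATEGIES = ["base", "test-driven"]


def evaluation_grid(num_tasks=NUM_TASKS):
    """Return an iterator of (task_id, platform, model, strategy) tuples.

    Assumption: every task is run once for each combination of platform,
    model, and prompting strategy (a full factorial design).
    """
    return product(range(num_tasks), PLATFORMS, MODELS, STRATEGIES)


# Count the grid cells: 544 * 2 * 4 * 2.
total = sum(1 for _ in evaluation_grid())
print(total)  # 8704
```

Each task therefore contributes 16 evaluations, 8 per platform.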

Author Biographies

Muhammad Rizqullah, King Abdulaziz University, Jeddah, Saudi Arabia

Muhammad Rizqullah is a graduate student in the Department of Computer Science at King Abdulaziz University. He received his Bachelor’s degree in Informatics from Telkom University, Indonesia, in 2019. He worked full-time as a Software Engineer from 2020 to 2023 at various companies, most notably at Grab, a ride-hailing tech company in Singapore. His research interests mainly include Software Engineering, Empirical Software Engineering, and Artificial Intelligence.

Emad Albassam, King Abdulaziz University, Jeddah, Saudi Arabia

Emad Albassam is an Associate Professor in the Department of Computer Science at King Abdulaziz University. He received his BSc degree in computer science from King Abdulaziz University, and his MSc and PhD degrees in software engineering and information technology with a concentration in software engineering from George Mason University, Fairfax, Virginia. He served as the Vice Dean for Applications at the Deanship of Information Technology. He currently serves as the Director of the Strategic Planning unit at the Faculty of Computing and Information Technology at King Abdulaziz University.
