Abstract

Dropout prediction in higher education is important because dropout affects both the academic success of students and the overall effectiveness of educational institutions. This research builds an automated ETL pipeline using Apache Airflow and Apache Spark to process academic data and predict student graduation status. The dataset consists of 4,424 samples with 36 features covering demographic, academic, and socio-economic attributes. The data is processed through the stages of extraction, transformation (including SMOTE and normalization), and loading into a Random Forest model. The evaluation showed an accuracy of 62.93%, with the highest ROC-AUC value, 0.81, obtained for the dropout class. In the pipeline, Airflow provides efficient task scheduling, while Spark is effective for large-scale data processing. This approach shows practical potential for supporting early warning systems in academic policy decision-making. The research contributes to the integration of big data and machine learning technologies for efficient, automated processing of higher education data.
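
As a rough illustration of the transformation stage mentioned in the abstract, the sketch below scales features and then oversamples the minority class in the style of SMOTE, using only the Python standard library. It is not the authors' implementation: the min-max scaling choice, the toy rows, and the names `min_max_normalize`, `smote`, `features`, and `labels` are all assumptions made for this example.

```python
import random


def min_max_normalize(rows):
    """Scale every feature column to the range [0, 1] (assumed scaling method)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def squared_distance(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: squared_distance(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical toy data: label 1 = dropout (minority), 0 = any other status.
features = [[10, 1], [12, 2], [11, 1], [30, 8], [32, 9], [31, 7], [29, 8], [33, 9]]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

scaled = min_max_normalize(features)
minority = [x for x, y in zip(scaled, labels) if y == 1]
n_needed = labels.count(0) - len(minority)

new_points = smote(minority, n_new=n_needed, k=2)
balanced_X = scaled + new_points
balanced_y = labels + [1] * len(new_points)
```

In practice, oversampling of this kind is usually applied only to the training split, so that the evaluation data keeps its original class distribution.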

Keywords

Apache Airflow, Apache Spark, ETL, dropout prediction, machine learning
