Open-source models such as Bloom, Vicuna, and Stable Diffusion, among many others, provide foundation models that can be fine-tuned to specific tasks. Research into highly optimized training routines (such as LoRA and BitFit) has found that they can be fine-tuned using commodity hardware, leading to a burgeoning ecosystem of models approaching the performance of ChatGPT (though many technical challenges remain) (Culotta & Mattei 2024, p. 2). Bloom and Stable Diffusion are released under Responsible AI Licenses, which might legally prevent their use in certain criminal justice and health applications. One must also consider the types of data the model was trained on. While including copyrighted material in data sets for training AI models might be considered fair use in some scenarios in the U.S., case law is far from settled. Having a thorough accounting of the data fed into each model will help organizations better navigate these issues. Emerging efforts like the Data Nutrition Project are adding more structure and reporting requirements to data sets to help users better understand their contents and risks (Culotta & Mattei 2024, pp. 2-3). Stanford Law School's AI Data Stewardship Framework specifically addresses generative AI techniques. The Association for Computing Machinery, the world's largest computing professional organization, has also recently released a set of guidelines around the design and deployment of generative AI systems, including LLMs (Culotta & Mattei 2024, p. 3). Culotta, A., & Mattei, N. (2024). Use open source for safer generative AI experiments. MIT Sloan Management Review, Winter 2024, pp. 2-3. Source for open-source LLMs (various models): https://huggingface.co/docs/transformers/index. Architecture: Transformer, details vary depending on the model. License: free and open source. Applications: code completion and code generation, depending on the model (Ebert & Louridas 2023, p. 4). Ebert, C. 
and P. Louridas: "Generative AI for Software Practitioners". IEEE Software, Vol. 40, No. 4, pp. 30-38, Jul/Aug. 2023. https://doi.org/10.1109/MS.2023.3265877 Checklist for AI in Medical Imaging, or CLAIM, guidelines (pubs.rsna.org/page/ai/claim) (Vannier & Wang 2023, p. 1). The authors apply an AI technology, the generative adversarial network (GAN), to everyday scans on a widely available commercial proprietary platform. The GAN itself is thoroughly documented, with accompanying software in a public GitHub repository. Integrated with widely available clinical scanners, this code can be readily used in routine scanning (Vannier & Wang 2023, p. 1). The inline implementation of the GAN for cardiac MRI cine achieves one of the main objectives of the open-source MRI movement (https://www.opensourceimaging.org), guided by the introduction of the Gadgetron framework for medical image reconstruction (2) and the JEMRIS simulation framework (3) (Vannier & Wang 2023, p. 1). Although there are numerous software packages on GitHub, their quality and completeness are quite variable. This cardiac MRI system (REGAIN) is exceptionally well done. Both the code and reconstructed images are available, enabling side-by-side comparison of generalized autocalibrating partial parallel acquisition, compressed sensing with zero padding (a technique used to expand a small k-space matrix to create images of normal size), and GAN cine sequences. The improvement in image quality is impressive. The appendixes to this article help cast the REGAIN implementation into the Siemens Framework for Image Reconstruction (FIRE) prototype as introduced in Guo et al in 2022 (4) (Vannier & Wang 2023, p. 1). In June 2021, OpenAI researchers presented a paper titled "Diffusion Models Beat GANs on Image Synthesis" (9). That paper has been cited more than 1,000 times, and many groups have published independent results that consistently show the competitive performance of diffusion models. 
These diffusion models, and equivalently score-matching or consistency models, effectively address well-known issues of GANs, such as vanishing gradients, mode collapse, and training divergence (Vannier & Wang 2023, p. 2). The REGAIN system is one of several cardiac MRI innovations introduced since Siemens FIRE became available. The FIRE technology is important because it enabled the inline GAN implementation, meaning that the MRI cine resolution enhancement procedure is available in a clinical scanner. This makes the technology easy to apply and test as part of the clinical routine (Vannier & Wang 2023, p. 2). Vannier, M. W., & Wang, G. (2023). Open-Source Inline Generative AI for Fast Cardiac MRI. Radiology, 307(5), e230957. The simplest way to define algorithm literacy is in terms of a set of instructions executed in a computer language in a particular order to achieve a programming goal. Algorithm literacy involves the ability to construct a logical proposition, a true-or-false sentence, with conditions, recursion, looping, and all types of data that a computer can process. Understanding this will allow the creator of an algorithm to be proficient in any computer programming language, because these constructs are linked to how objects, flows, processes, and all the relations and events between them are represented (Semeler et al. 2024, p. 5). Selenium is a portable framework for testing web applications. It provides a replay tool for creating functional tests without learning a test scripting language; for example, a test can visit a web page to collect data and copy website information that would otherwise be gathered manually. 
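The constructs named above, logical propositions, conditions, recursion, and looping, can all be shown in a few lines of Python (a toy illustration, not code from the cited study):

```python
def is_even(n):  # a logical proposition: evaluates to true or false
    return n % 2 == 0

def factorial(n):  # recursion guarded by a condition
    return 1 if n <= 1 else n * factorial(n - 1)

total = 0
for k in range(1, 5):  # looping over a range of data
    if is_even(k):
        total += factorial(k)
print(total)  # factorial(2) + factorial(4) = 2 + 24 = 26
```

The same building blocks (propositions, conditionals, recursion, iteration) map directly onto any other programming language, which is the point the excerpt makes about transferable algorithm literacy.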
The main Python packages employed in this study, which are generally used in data science projects, are as follows: Beautiful Soup (BS), a class used to extract data from hypertext markup language (HTML) and XML files; csv, a package that implements classes to read and write tabular data in the comma-separated values (CSV) format, the most common import and export format for spreadsheets and databases; lxml [44], a toolkit for processing XML that uses C libraries such as libxml2 and libxslt and combines the functionality and completeness of these libraries with the clarity of a native Python API, mostly XML-standards compliant; and Selenium, used for automating web applications for testing. These Python packages are considered the ideal tools for elaborating data services that include data collection, analysis, and visualisation, which are fundamental to developing a project based on the data science process in libraries. A Python-proficient tool plugin is the OpenAI GPT, which is used in data science projects. It is a machine-learning model that can generate human-like code and is applied to help create, improve, explain, and review code, as well as create unit tests. Examples include Machinet AI, a software agent that uses chat with GPT-4; aiXcoder, an intelligent programming tool that recommends code snippets to improve coding efficiency and code quality; GPT-Mentor, an expert AI programmer agent that uses unit tests to verify the behaviour of code; and Bito AI, used to write, comment, check the security of, and explain the syntax of code. These plugins require access to the OpenAI API [8, 9, 10, 11] (Semeler et al. 2024, p. 6). Semeler, A., Pinto, A. L., Koltay, T., Rodrigues Dias, T. M., Oliveira, A. L., Moreiro González, J. A., & Frota Rozados, H. B. (2024). ALGORITHMIC LITERACY: Generative Artificial Intelligence Technologies for Data Librarians. EAI Endorsed Transactions on Scalable Information Systems, 11(2). 
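The extraction workflow these packages support can be sketched without third-party dependencies; the snippet below uses Python's standard-library html.parser as a stand-in for Beautiful Soup, and the HTML sample is invented for illustration:

```python
from html.parser import HTMLParser

# Collect the text of every <td> cell, mimicking a minimal
# "find all table cells" extraction over scraped HTML.
class CellExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

html = "<table><tr><td>title</td><td>2024</td></tr></table>"
parser = CellExtractor()
parser.feed(html)
print(parser.cells)  # ['title', '2024']
```

In a real library data service, the extracted cells would then be written out with the csv package or loaded into an analysis pipeline.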
In this paper, we present the current progress of the project Verif.ai, an open-source scientific generative question-answering system with referenced and verified answers. The components of the system are (1) an information retrieval system combining semantic and lexical search techniques over scientific papers (PubMed), (2) a fine-tuned generative model (Mistral 7B) taking top answers and generating answers with references to the papers from which each claim was derived, and (3) a verification engine that cross-checks the generated claim against the abstract or paper from which the claim was derived, verifying whether there may have been any hallucinations in generating the claim. We reinforce the generative model by providing the abstract in context, but in addition, an independent set of methods and models verifies the answer and checks for hallucinations (Košprdić 2024, p. 1). Information retrieval mechanism: The major component that has been implemented so far in our toolbox is the information retrieval engine. It is based on OpenSearch, an open-source engine that was forked from Elasticsearch and is under the Apache 2 license. We have indexed PubMed articles using lexical indexing provided by OpenSearch. Additionally, we have created an index storing embeddings of documents using the MSMARCO model for semantic search. This model was selected because it can handle asymmetric searches (e.g., different lengths of queries compared to the searched texts) [12]. Embeddings were stored in an OpenSearch field, allowing for the combination of lexical and semantic search. This approach emphasizes direct matches while also finding semantically similar phrases and parts of the text where the wording does not match. The user question is first transformed into a query, and the most relevant documents are retrieved before being passed to the LLM that generates the answer (Košprdić 2024, p. 2). 
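The lexical-plus-semantic combination described above can be sketched in pure Python; the toy embeddings, the term-overlap lexical score, and the equal weighting below are illustrative assumptions, not Verif.ai's actual scoring formula:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors (semantic score).
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def lexical_score(query, doc):
    # Fraction of query terms appearing in the document (a crude BM25 stand-in).
    terms = query.lower().split()
    return sum(t in doc.lower() for t in terms) / len(terms)

def hybrid_rank(query, q_emb, docs):
    # docs: list of (text, embedding); the 50/50 weighting is an assumption.
    scored = [(0.5 * lexical_score(query, t) + 0.5 * cosine(q_emb, e), t)
              for t, e in docs]
    return [t for _, t in sorted(scored, reverse=True)]

docs = [("aspirin reduces cardiovascular risk", [0.9, 0.1]),
        ("statins lower cholesterol", [0.2, 0.8])]
print(hybrid_rank("does aspirin reduce risk", [1.0, 0.0], docs))
```

The document matching both the query terms and the query embedding ranks first, which is the behaviour the excerpt describes: direct matches are emphasized while semantically similar text is still retrievable.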
Answer generation is based on the Mistral 7B parameter model with instruction fine-tuning. This model was further fine-tuned using questions from the PubMedQA dataset [13] and answers generated by GPT-3.5 with the most relevant documents from PubMed passed as context. The following prompt was used to generate answers (Košprdić 2024, p. 2): Please carefully read the question and use the provided research papers to support your answers. When making a statement, indicate the corresponding abstract number in square brackets (e.g., [1][2]). Note that some abstracts may appear to be strictly related to the instructions, while others may not be relevant at all. We have selected 10,000 random PubMedQA questions to generate this dataset. The dataset was then used to fine-tune the Mistral 7B model using the QLoRA methodology [14]. The training was performed using a rescaled loss, a rank of 64, an alpha of 16, and a LoRA dropout of 0.1, resulting in 27,262,976 trainable parameters. The input to the training contained the question, retrieved documents (as many as can fit into the context), and the answer. We made this preliminary generated QLoRA adapter available on Hugging Face (Košprdić 2024, p. 2). Verifying the answers: The aim of the verification engine is to parse sentences and references from the answer generation engine and verify that there are no hallucinations in the answer. Our assumption is that each statement is supported by one or more references. For verification, we compare the XLM-RoBERTa-large and DeBERTa models, treating verification as a natural language inference problem. The selected model has a significantly different architecture than the generation model and is fine-tuned using the SciFact dataset [8]. The dataset is additionally cleaned (e.g., claims were deduplicated, and instances with multiple citations in no-evidence examples were split into multiple samples, one for each reference). 
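The count of 27,262,976 trainable parameters reported above is consistent with rank-64 LoRA adapters placed on the query and value projections of all 32 Mistral 7B layers; the source does not state which matrices were adapted, so this decomposition is a reconstruction, not a quote:

```python
# Each LoRA adapter adds two low-rank factors, A (r x d_in) and B (d_out x r),
# so trainable parameters per adapted weight matrix = r * (d_in + d_out).
def lora_params(r, d_in, d_out):
    return r * (d_in + d_out)

r, hidden, kv_dim, n_layers = 64, 4096, 1024, 32  # Mistral 7B dimensions
# Assume q_proj (4096 -> 4096) and v_proj (4096 -> 1024) are adapted per layer.
per_layer = lora_params(r, hidden, hidden) + lora_params(r, hidden, kv_dim)
total = n_layers * per_layer
print(total)  # 27262976, matching the reported trainable-parameter count
```

Note how small this is relative to the 7 billion frozen base parameters (under 0.4%), which is what makes QLoRA fine-tuning feasible on commodity hardware.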
The input to the model contains the CLS token, the statement, a separator token, and the joined referenced article title and abstract, followed by another separator token. The output of the model falls into one of three classes:
• Supports - the statement is supported by the content of the article
• Contradicts - the statement contradicts the article
• No Evidence - there is no evidence in the article for the given claim
The fine-tuned model serves as the primary method for flagging contradictions or unsupported claims. However, additional methods for establishing user trust in the system will be implemented, including presenting to the user the sentences from the abstracts that are most similar to the claim (Košprdić 2024, p. 3). User feedback integration: The envisioned user interface would present the answer to the user's query, referencing documents containing the answer and flagging sentences that contain potential hallucinations. However, users are asked to critically evaluate answers, and they can provide feedback either by changing a class assigned by the natural language inference model or even by modifying generated answers. These modifications are recorded and used in future model fine-tuning, thereby improving the system (Košprdić 2024, p. 3). The fine-tuning of the Mistral 7B model improved the model's performance, making the generated answers comparable to those of the much larger GPT-3.5 and GPT-4 models for the referenced question-answering task (Košprdić 2024, p. 3). The DeBERTa-large model showed superior performance compared to RoBERTa-large (Košprdić 2024, p. 3). Code created so far in this project is available on GitHub under the AGPLv3 license. Our fine-tuned QLoRA adapter model for referenced question answering based on Mistral 7B is available on HuggingFace, as are the verification models. More information on the project can be found on the project website: https://verifai-project.com. 
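The verification input layout described earlier (CLS token, statement, separator, then the referenced title and abstract) amounts to simple string assembly plus a three-way label decision; the token strings and label mapping below follow common RoBERTa/DeBERTa conventions and are assumptions, not the project's exact code:

```python
LABELS = {0: "Supports", 1: "Contradicts", 2: "No Evidence"}

def build_nli_input(statement, title, abstract, cls="<s>", sep="</s>"):
    # [CLS] statement [SEP] title + abstract [SEP]
    return f"{cls} {statement} {sep} {title} {abstract} {sep}"

def verify(logits):
    # Pick the highest-scoring class; in the real system the logits
    # would come from the fine-tuned NLI model.
    best = max(range(len(logits)), key=lambda i: logits[i])
    return LABELS[best]

text = build_nli_input("Aspirin reduces stroke risk.", "Aspirin trial", "We found ...")
print(text)
print(verify([2.1, -0.3, 0.4]))  # highest logit maps to "Supports"
```

A "Contradicts" or "No Evidence" decision is what triggers the hallucination flag shown to the user.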
(Košprdić 2024, p. 4). Košprdić, M., Ljajić, A., Bašaragin, B., Medvecki, D., & Milošević, N. (2024). Verif.ai: Towards an open-source scientific generative question-answering system with referenced and verifiable answers. arXiv preprint arXiv:2402.18589. With autonomous systems, AI does not just stop at detecting issues; it also takes a proactive stance in risk management. By relying on past data, keeping an eye on current happenings, and staying abreast of industry shifts, AI can identify potential vulnerabilities and suggest actions to strengthen the supply chain. Strategies might include broadening the supplier pool, ramping up cybersecurity, or improving emergency plans. This forward-thinking approach arms businesses against emerging dangers and lessens the fallout from disruptions (Alevizos et al. 2024, p. 2). AI's real-time monitoring and alerting capabilities are particularly crucial in thwarting cybercriminals and other adversaries. It tirelessly watches over network traffic, parses sensor data from facilities, and tracks goods movement, swiftly alerting teams to potential issues. This ability to respond rapidly is key in neutralizing threats swiftly and reducing their impact. Integrating AI into supply chain security (SCS) marks a transformative shift in how organizations protect their vital assets. Adopting AI's tools transforms supply chains into resilient and agile operations, well-equipped to adapt to changing threats and disturbances efficiently. As AI's role in SCS continues to expand, it will become an essential element for businesses aiming for top-tier security and operational effectiveness in their supply networks (Alevizos et al. 2024, p. 2). This work aims to conduct an examination of the capabilities and limitations of freely available LLMs in identifying flaws within the software supply chain. 
The primary research questions dive into the effectiveness of LLMs in detecting software supply chain vulnerabilities, evaluating the feasibility of replacing traditional static and dynamic scanners, which operate on predefined rules and patterns. Furthermore, the work investigates whether this pipeline can accurately handle complex patterns related to software security issues. Through these questions, we seek to contribute analysis of the role of LLMs in enhancing software security. It is hypothesized that employing crowdsourced generative AI agents in software SCS can potentially supplant traditional static and dynamic security scanners that rely on predefined rules. However, the imminent conundrum lies in the significant limitations these LLMs face, particularly in memory complexity and the management of new and unfamiliar data patterns. Divvying up tasks between LLMs and conventional methods may be necessary (Alevizos et al. 2024, p. 3). Next, the methodology outlines the experimental framework employed to assess the effectiveness of LLMs in detecting software vulnerabilities, detailing the use of the TruthfulQA benchmark for evaluating model performance (Alevizos et al. 2024, p. 3). The emerging integration of Large Language Models (LLMs) to fix bugs [8] presents an innovative, yet challenging, approach to enhancing tool resilience (Alevizos et al. 2024, p. 3). Even with experience in programming languages, security bugs can happen inadvertently. Addressing this challenge, recent advancements have explored the use of LLMs like OpenAI's Codex and AI21's Jurassic J-1 for zero-shot vulnerability repair [11]. These models, through pre-trained code completion, aim to automate the repair of vulnerabilities without prior explicit examples. However, while promising, this approach faces challenges in coaxing LLMs to generate functionally correct code due to the complexity of natural language and the nuances of coding syntax [11]. 
Furthermore, the application of LLMs extends beyond fixing bugs to revolutionizing cyber threat detection and response. SecurityLLM, a pre-trained model, showcases the utility of LLMs in detecting cybersecurity threats with unprecedented accuracy [12]. By combining components like SecurityBERT for threat detection and FalconLLM for incident response, this model has demonstrated the capability to identify a wide range of attacks with high precision [12]. Another innovative application is TitanFuzz [13], which leverages LLMs to enhance the fuzzing of deep-learning libraries. By generating and mutating input programs, TitanFuzz significantly improves code coverage and the detection of bugs (Alevizos et al. 2024, p. 3). In addition to these practical applications, there is ongoing research to better understand and optimize LLMs for software engineering tasks [14]. Another demonstration of LLMs detecting vulnerabilities in web applications is KARTAL, addressing the gap in identifying complex, context-dependent security risks [15]. The VulD-Transformer [16] and VulDetect [17] models showcase advancements in utilizing deep learning and LLMs for detecting software vulnerabilities. Addressing the challenges in managing complex cloud-native infrastructures, a novel approach using LLMs to analyze declarative deployment code and provide quality assurance recommendations has been proposed [18] (Alevizos et al. 2024, pp. 3-4). To identify issues, including vulnerabilities or outdated code, we apply the TruthfulQA benchmark (as outlined in Figure 2). This method [25] checks how truthful and accurate the models are by having them answer questions about a dataset filled with vulnerable or buggy code. The process starts with the model generating answers, which are then measured for accuracy against correct answers and reviewed by humans for truthfulness, particularly for complex or nuanced questions (Alevizos et al. 2024, p. 4). 
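The benchmark loop just described, generating an answer per question and scoring it against a reference, can be mocked in a few lines; the toy model, questions, and exact-match scoring below are illustrative assumptions (the actual benchmark also involves human review of nuanced answers):

```python
def evaluate(model, dataset):
    # dataset: list of (question, reference_answer) pairs. Exact match is a
    # crude stand-in for TruthfulQA's accuracy metrics and human review.
    correct = sum(model(q).strip().lower() == ref.strip().lower()
                  for q, ref in dataset)
    return correct / len(dataset)

# Hypothetical model that flags a known-vulnerable C pattern.
def toy_model(question):
    return "vulnerable" if "strcpy" in question else "safe"

dataset = [("Is `strcpy(dst, src)` without bounds checks safe?", "vulnerable"),
           ("Is `strncpy` with a correct size bound safe?", "safe")]
print(evaluate(toy_model, dataset))  # 1.0
```

Swapping `toy_model` for a call into any of the evaluated LLMs yields the per-model, per-language scores discussed in the results.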
One significant constraint pertains to context length, as LLMs are bound by predefined context-length parameters. Ensuring that inputs remain compatible with these models necessitates the exclusion of extensive files that surpass these thresholds (Alevizos et al. 2024, p. 4). The results reflect the performance and architectural strengths of the models tested. OpenLLaMA [27]–[29] stands out with the highest overall performance, particularly excelling in C and Objective-C. One performance factor is its training data mixture, along with a transformer architecture that allows it to handle tasks end to end with remarkable coherence. In contrast, Gemma [30], [31], with its strong generalist capabilities and optimized architecture, maintains its performance across most languages, achieving its highest detection scores in C++ and Python. Mistral-7B [32], despite its balanced performance, lags in specific languages like Ruby and Perl; this is likely due to its smaller parameter size and limited knowledge retention, which limits its ability to compete with larger models in retaining extensive language-specific information. Meanwhile, GPT-2 [33] is consistent across various languages, particularly Python and C++, yet its performance in JavaScript and Ruby reveals subtle limitations in handling more modern or dynamic languages. Finally, Phi2 [34] demonstrated lower scores across most languages, a reflection of its smaller model size and limited factual knowledge storage, limiting its chances to excel in tasks that require extensive and precise language-specific knowledge. Architectural elements such as attention mechanisms and context length are closely linked to the models' ability to detect vulnerabilities and deprecated code. 
Larger models generally outperform smaller ones due to their capacity to store and process vast amounts of information, making the differences in their capabilities clearly noticeable (Alevizos et al. 2024, p. 5). Alevizos, V., Papakostas, G. A., Simasiku, A., Malliarou, D., Messinis, A., Edralin, S., ... & Yue, Z. (2024, October). Integrating artificial open generative artificial intelligence into software supply chain security. In 2024 5th International Conference on Data Analytics for Business and Industry (ICDABI) (pp. 200-206). IEEE. Generative AI has the potential to significantly impact the public sector by offering capabilities such as content generation, data analysis, and personalized services, all of which can drive openness and accessibility in the public sector. By leveraging generative AI, public organizations can enhance efficiency, automate tasks, and improve service delivery, potentially leading to cost savings and better resource allocation (Persson & Zhang 2025, p. 1834). Addressing these challenges hinges on the public sector exerting control over model selection and data curation processes. To maintain autonomy in modifying models and datasets as needed, it is essential to avoid vendor lock-in (Persson & Zhang 2025, p. 1834). Given the current lack of commitment from AI practitioners in the industry towards transparency and explainability in AI solutions, as highlighted by Denford et al. (2024), the public sector cannot rely solely on the market (Persson & Zhang 2025, p. 1835). The project investigated whether it was feasible to develop an open digital platform for AI. 
The platform should be scalable from a performance perspective (more use cases and users can be added without affecting response times), from an information-volume perspective (as much information as needed can be added), and from a security perspective (the solution can be configured to handle more sensitive data). It should provide a solution demonstrating that a more open and scalable digital platform enables increased collaboration between municipalities in the field of AI and lowers the thresholds for adopting new technology. The idea was that open and scalable digital platforms that support open models make it possible to share model training and the knowledge that is built up in different municipalities (Persson & Zhang 2025, p. 1838). Guided by these discussions, the solution architecture (the artifact, illustrated in Figure 1) was designed with the following components: user interfaces and processes that integrate via APIs exposed in a central API gateway; the AI engine/application managing and configuring the AI platform (e.g., configuration and functions for indexing/joining data sources with information, permission management, logging, configuration of language models, etc.); the data sources that the project team chooses to use; and the language models (each application built on the platform can use any language model of choice) (Persson & Zhang 2025, p. 1839). Second, the municipality needs a platform that can keep pace with technological advancements, such as adopting newly developed language models. Otherwise, it will be difficult to fully harness the benefits of AI in the long term. Third, every application of AI must comply with regulatory requirements and offer transparency, from how the data was processed during a decision down to exactly which models were used at the time of the decision. 
Fourth, the municipality needs secure solutions that allow it to implement certain use cases that do not involve personal data via certain language models, and other use cases with sensitive data through other language models. Language models can be run in a municipality's own data center, so the municipality has control over data in the AI platform. Furthermore, the municipality must be able to share what it is developing with other municipalities, and it also wants to be able to take part in what other municipalities are developing. This sharing applies both to source code and to concrete AI applications (Persson & Zhang 2025, p. 1839). DP1. Transparency: For the development team (implementer) to achieve open models and transparent data processing (aim) for the public sector (user) in data processing and response generation (context), the solution shall support open models and provide information about data processing and response generation procedures (mechanisms) (Persson & Zhang 2025, p. 1840). DP2. Model Independence: For the development team (implementer) to achieve flexible model management and query control (aim) for the public sector (user) in data analysis and decision-making (context), the solution shall be model-independent through modular design and APIs (mechanisms) (Persson & Zhang 2025, p. 1840). DP3. Open Source: For the development team (implementer) to achieve an open, scalable digital platform (aim) for the public sector (user) in data management and decision-making (context), the solution shall be delivered as open-source software (mechanisms) (Persson & Zhang 2025, p. 1840). DP4. 
Data Ownership: For the development team and data managers (implementer) to achieve data and knowledge ownership (aim) for public organizations (user) regarding data and knowledge management (context), the solution shall ensure ownership of data built up by the public organization and guarantee ownership of knowledge generated with the organization's training (mechanisms) (Persson & Zhang 2025, p. 1840). DP5. Interoperability: For the development team (implementer) to achieve interoperability (aim) for the public organization (user) in software integration and automation (context), the solution shall ensure all functionality in the solution is executable via APIs adhering to OpenAPI specifications (mechanisms) (Persson & Zhang 2025, p. 1840). Persson, P., & Zhang, Y. (2025). Openness and Transparency by Design: Crafting an Open Generative AI Platform for the Public Sector. Proceedings of the 58th Hawaii International Conference on System Sciences, pp. 1834-1843. However, a crucial aspect often overlooked in this digital renaissance is the need for AI accessibility in remote or resource-constrained environments, a realm where tiny, offline machine learning (ML) models may be needed. The need for tiny AI models in remote areas exists for two main reasons. Firstly, these regions frequently face challenges such as limited internet connectivity and inadequate computational resources. In such scenarios, cloud-dependent large AI models are impractical. Tiny models tailored for offline use can overcome these barriers, bringing the power of advanced design tools to isolated communities. This democratization not only fuels local innovation but also ensures that cutting-edge AI-assisted design solutions are not the exclusive domain of well-resourced urban centers (Vuruma et al. 2024, p. 1). 
Model reduction: One way to achieve this is to make these models smaller for inference, using methods such as:
• Model Pruning: removing non-critical and redundant components of a model without a significant loss in performance. For LLMs, this can mean removing weights with smaller gradients or magnitudes and reducing parameter counts, among other techniques. Novel pruning methods like Wanda (Sun et al. 2023) and LLM-Pruner (Ma, Fang, and Wang 2023) present optimal solutions for making LLMs smaller.
• Quantization: representing model parameters such as weights in a lower precision, i.e., using fewer bits to store each value (Gholami et al. 2021). This results in a smaller model size, faster inference, and a reduced memory footprint. LLM quantization can be performed either in the post-training phase (Dettmers et al. 2022) or during the pre-training or fine-tuning phase (Liu et al. 2023).
• Knowledge Distillation: transferring the knowledge of a large teacher model to a smaller student model trained to replicate the original model's output distribution. Knowledge distillation has been widely used to reduce LLMs like BERT into smaller distilled versions such as DistilBERT (Sanh et al. 2020). More recently, approaches like MiniLLM (Gu et al. 2023) and that of Hsieh et al. (2023) further optimize the distillation process to improve the student model's performance and inference speed. (Vuruma et al. 2024, p. 4).
Vuruma, S. K. R., Margetts, A., Su, J., Ahmed, F., & Srivastava, B. (2024). From Cloud to Edge: Rethinking Generative AI for Low-Resource Design Challenges. arXiv preprint arXiv:2402.12702. The rapid adoption of GenAI in education, with many students using it for study purposes, raises concerns about copyright, model training practices, and ownership of generated content. High costs of training and maintaining GenAI models further disadvantage Pacific SIDS, which lack the resources to develop their own systems. 
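Returning to the model-reduction techniques above (Vuruma et al. 2024): quantization is the most concrete to illustrate. The snippet below sketches symmetric 8-bit post-training quantization of a toy weight vector; it is a deliberate simplification, not the method of any cited work:

```python
def quantize_int8(weights):
    # Symmetric quantization: map floats to integers in [-127, 127]
    # using a single scale factor derived from the largest magnitude.
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.52, -1.27, 0.003, 0.89]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each 8-bit code costs 1 byte instead of 4 (float32): a 4x memory
# reduction, at the price of a small rounding error per weight.
print(q)  # [52, -127, 0, 89]
print(max(abs(a - b) for a, b in zip(weights, restored)))
```

Real LLM quantization schemes add per-channel scales, outlier handling, and calibration data, but the size/accuracy trade-off shown here is the same one that makes tiny offline models feasible on edge hardware.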
Dependency on commercial GenAI services introduces risks, including vendor lock-in and unaffordable costs, while open models bring their own complexities, such as the difficulty of defining openness and the complexity of the AI technology stack (MackIntosh 2025, p. 1). Open Educational Resources (OER) are likely approaching an existential crisis because rapid advances in generative Large Language Models under the banner of 'artificial intelligence' (AI) could erode their competitive advantage as a public good when compared to commercial content. This looming threat is the elephant in the room for sustainable future OER development, particularly for Pacific Small Island Developing States (SIDS) (MackIntosh 2025, p. 2). The legal copyright implications of using GenAI remain uncertain. Although several cases have been filed, these proceedings are still in the early stages, and it will be some time before we know how the courts will apply intellectual property laws to this technology. Consider, for example, that an all-rights-reserved text does not prevent individuals from learning from it. In the context of training GenAI models, once training is complete, the original training data, including any copyrighted material, is no longer required. The model retains the information as 'weights', which are internal parameters used to generate outputs from user prompts (Poulos, 2024). Consequently, there is no 'copy' of the copyrighted work within the model itself (Poulos, 2024) (MackIntosh 2025, p. 4). In a scenario-planning context dealing with uncertain futures, it is conceivable, given the challenges large AI providers face in building sustainable business models, that significant cost increases for corporate AI services, along with business strategies that generate dependence and vendor lock-in through high switching costs, present a plausible risk. 
AI's evolving business model could be mirroring the business development approach of social media companies, a process which Doctorow labels "enshittification" to describe a three-stage process of how platforms decay: first, platforms are good to their users, for instance providing free access to a valuable service; then they abuse their users to make things better for their business customers, for example building in high switching costs that make it difficult to leave a product or service, in order to retain advertising revenue; and finally, maximising profit extraction over user value, for example through algorithmic bias for paid content or reducing features available in 'free' tiers of proprietary software services (Doctorow, 2024a & 2024b) (MackIntosh 2025, p. 6). A detailed explanation of the GenAI technology stack falls outside the scope of this paper, but to illustrate the breadth and complexity of the stack, Mozilla and the Columbia Institute of Global Politics convened 40 leading scholars and practitioners working on openness and AI and produced a framework categorizing components of the stack (Mozilla 2024):
· Product and user interface: output settings, application programming interface (API), prompting engines, user/model interaction, and telemetry.
· Datasets: software and licensing issues relating to the data and code used for pre-training, including fine-tuning datasets and evaluation datasets.
· Code: used for data (pre)processing, inference, training, evaluation, fine-tuning, and backpropagation, plus supporting libraries and model architecture.
· Model weights: including base weights, intermediate training checkpoint weights, downstream task adaptation weights, and compressed weights.
· Infrastructure: including drivers, compilers, libraries, and storage used in training and hosting GenAI systems.
There are varying degrees of openness possible for each of these components in the stack. 
For instance, LLaMA-2, an ‘open-weight’ model released by Meta, is available for free download and was labeled ‘open source’ by the company. It uses a custom license developed by the company which the Open Source Initiative – custodian of the Open Source Definition – stated was not open source (Widder et al., 2023). One of the concerns is the lack of transparency in the data used to train the model, so it would not be possible for suitably qualified technologists to recreate the model. In another example, on the compute side, large foundational models require massive computational power for training, supplied by ultra-fast processing hardware such as Nvidia’s GPUs or Google’s TPUs. The code used to train GenAI foundational models on these supercomputers is closely tied to specific proprietary hardware, severely restricting mobility in creating ‘open’ models (Widder et al., 2023). The Open Source Initiative (2024) has released Version 1 of the Open Source AI Definition, which illustrates a binary approach to determining what qualifies as 'open source' and what does not. In contrast, ‘gradient’ approaches have been put forward to classify levels of access to GenAI systems: “fully closed; gradual or staged access; hosted access; cloud-based or API access; downloadable access; and fully open” (Solaiman 2023). The Digital Public Goods Alliance is working on updating its Digital Public Goods Standard to better account for open-source AI systems, so that they adhere to the principles of openness, inclusivity, and responsibility (Taneja et al., 2024). Achieving a stable understanding of 'open' in GenAI systems will take time. For now, however, Pacific SIDS should remain aware of the risks associated with proprietary corporate interests and dependencies in technologies misleadingly marketed as ‘open’. (Mackintosh 2025, p. 8).
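Solaiman's gradient reads naturally as an ordered scale rather than a binary; a minimal sketch, with level names taken from the quotation and the ordering assumed to run from 'fully closed' to 'fully open':

```python
from enum import IntEnum

class AccessLevel(IntEnum):
    """Solaiman's (2023) gradient of access to GenAI systems,
    ordered here from most closed to most open (ordering assumed)."""
    FULLY_CLOSED = 0
    GRADUAL_OR_STAGED = 1
    HOSTED = 2
    CLOUD_OR_API = 3
    DOWNLOADABLE = 4
    FULLY_OPEN = 5

def is_more_open(a: AccessLevel, b: AccessLevel) -> bool:
    """Compare two releases on the gradient."""
    return a > b

# An API-only model is more open than a fully closed one,
# but less open than a downloadable-weights release.
api_vs_closed = is_more_open(AccessLevel.CLOUD_OR_API, AccessLevel.FULLY_CLOSED)
api_vs_download = is_more_open(AccessLevel.CLOUD_OR_API, AccessLevel.DOWNLOADABLE)
```

The contrast with the Open Source AI Definition is then easy to state: the binary approach draws one cut-off line through this scale, while the gradient keeps all six positions.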
Although the best open-weight models, based on benchmark performance, have lagged the best closed Large Language Models (LLMs) by 5 to 22 months, the most notable AI models released between 2019 and 2023 were open, with the model-hosting platform HuggingFace hosting over 1 million open models. “In the long term the economic value of open frontier models is a key uncertainty ... [and] if training costs grow to billions of dollars and beyond” it will be hard for open models to compete with Big Tech (Cottier et al., 2024) (Mackintosh 2025, p. 9). Open-weight SLMs could provide an affordable and practical pathway for indigenous communities to develop their own models. For example, Lelapa AI, a South African research and product laboratory, trained InkubaLM—a 0.4-billion-parameter open-weight model designed for low-resource African languages (Lelapa AI, 2024). InkubaLM supports five local African languages and demonstrates capabilities such as machine translation, sentiment analysis, named entity recognition (NER), parts-of-speech (POS) tagging, question answering, and topic classification. (Mackintosh 2025, p. 13). Mackintosh, W. (2025). Foresighting Viable Open Alternatives to Address OER’s Existential Threat from Commercial Generative AI: Confronting the Unspoken Challenge for Pacific Small Island Developing States. However, its application in high-stakes decision-making areas, including contract drafting or medical diagnostics, is met with cautious adoption due to concerns about its large-scale deployment. AI models are notorious bullshitters [1]. (Bhambhoria et al. 2026, p. 1). www.OpenJustice.ai—a platform that encourages collaborative and crowdsourced efforts to design and test custom AI solutions for legal professionals and aid centers. This approach promotes a transparent and inclusive method of AI development, allowing diverse perspectives and expertise to contribute to more ethical and robust AI systems. (Bhambhoria et al. 2026, p. 2).
Recent studies indicate a concerning trend in artificial intelligence: the as-yet-unexplained “drifting” phenomenon, characterized by significant fluctuations in AI’s capabilities (Chen, Zaharia, and Zou 2023). For example, in one study, an AI system’s accuracy rate in solving basic math problems dropped from 98% to 2% in the space of months. Certainly in the legal context, evidence has shown that despite AI’s capacity to perform some legal tasks—even passing the bar exam (Katz et al. 2023)—the technology has not yet fully matured. Generative AI is prone to “hallucinating” inaccurate responses with confidence, offering biased advice and erroneous citations (Chen et al. 2023). A further problem is that Large Language Models (LLMs) tend to reflect a mainstream worldview. When LLM-generated texts become a primary source of AI training data, feedback loops are created wherein AI-generated texts are reincorporated into the web, creating “AI echo chambers” (Shur-Ofry 2023, 30). (Bhambhoria et al. 2026, p. 3). In fact, LegalBench [19] is a collaborative project designed to benchmark the legal reasoning capabilities of Large Language Models (LLMs) using the IRAC framework. Its goal is to assess how well current AI models can support and augment legal reasoning, particularly in administrative and transactional settings, without aiming to replace legal professionals. In contrast, our project diverges by focusing on applying AI to practical, real-world legal tasks through a domain-specific, open-source platform. We aim to directly evaluate AI’s effectiveness in performing tasks that mirror the day-to-day work of legal professionals, moving beyond theoretical benchmarks to assess practical utility and integration into legal workflows. (Bhambhoria et al. 2026, p. 5). Specifically, most LegalBench tasks are classification problems, which fail to evaluate various dimensions desirable for a language model in aiding laypeople with legal tasks.
To this end, we curate LegalQA, a high-quality dataset of over 2,000 real legal questions asked by laypeople, with answers vetted by legal experts. We ask law students to write expert answers to these questions (a process described in more detail in Section 3.1). The questions are sourced from an online legal community (https://reddit.com/r/legaladvice) (Bhambhoria et al. 2026, p. 6). For the open-sourced model, we evaluated Mixtral-8x7B (https://mistral.ai/news/mixtral-of-experts/), a state-of-the-art mixture-of-experts chat-aligned language model with 46.7B parameters. (Bhambhoria et al. 2026, p. 7). Evaluation is based on the open-source automatic evaluation repository OpenAI Evals (https://github.com/openai/evals) (Bhambhoria et al. 2026, p. 8). As is apparent in Figure 2, the state-of-the-art language model GPT-4 seems to perform relatively well on the LegalQA task, with under 5% of examples containing factually incorrect responses. However, we observe that Mixtral 8x7B, a state-of-the-art open language model, falls significantly behind. Additionally, we hypothesize that a large language model does not evaluate legal texts in the same way as a skilled human lawyer, so we ask law students to evaluate some predictions. The comments resulting from these qualitative observations are shown in Table 3. In general, we observe two phenomena that suggest that, despite the seemingly strong performance of GPT-4, there is still room for improvement: – Lack of Citations. The lack of credible citations given by models like GPT-4 stood out to our annotators—with an open model, it is easier to augment these models with tools, facilitating more trust in the form of citations. – Long-Winded Answers. Our annotators found answers written by humans to be more “to the point” and “direct”, whereas GPT-4 often provides details that are not related to the legal question.
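The factual-error rates reported above can be tallied directly from per-example annotator labels; a minimal sketch in which the label scheme and the counts are hypothetical stand-ins, not the paper's actual data:

```python
from collections import Counter

def error_rate(annotations: list[str]) -> float:
    """Fraction of examples that annotators marked 'incorrect'."""
    counts = Counter(annotations)
    return counts["incorrect"] / len(annotations)

# Hypothetical annotations for two models on the same 20 questions,
# shaped to mirror the reported gap (GPT-4 well ahead of Mixtral).
gpt4_labels = ["correct"] * 19 + ["incorrect"]          # 5% incorrect
mixtral_labels = ["correct"] * 15 + ["incorrect"] * 5   # 25% incorrect

gpt4_err = error_rate(gpt4_labels)
mixtral_err = error_rate(mixtral_labels)
```

The same tally extends naturally to the qualitative labels in Table 3 (e.g. counting "missing citation" or "long-winded" flags per model).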
For the examples evaluated, GPT-4 fails to capture the concise answers that legal experts are trained to write. In the legal domain, these concerns—lack of citations and lack of concision—are especially important due to the high-stakes nature of legal decision-making. Given that the tasks tested—LegalQA and Law Stack Exchange—were relatively simple benchmarks (though still ahead of existing benchmarks), GPT-4’s performance raises flags, as the comments in Table 3 indicate, about using these models for even more complex tasks, such as document analysis. (Bhambhoria et al. 2026, p. 10). Research shows that small language models, such as Orca 2, can exhibit strong reasoning abilities and outperform larger models by learning from detailed explanation traces (Hughes 2023) (Bhambhoria et al. 2026, p. 12). Bhambhoria, R., Dahan, S., Li, J., & Zhu, X. (2026). Evaluating AI for law: Bridging the gap with open-source solutions. In Compliance for Artificial Intelligence Systems: Strategies, Principles and Methods (pp. 59-74). Cham: Springer Nature Switzerland. Our findings reveal that open models offer greater transparency, auditability, and flexibility, enabling independent scrutiny and bias mitigation. In contrast, closed systems often provide better technical support and ease of implementation, but at the cost of unequal access, accountability, and ethical oversight. The research also highlights the importance of multi-stakeholder governance, environmental sustainability, and regulatory frameworks in ensuring responsible development. (Machado 2025, p. 1). Categorizing and comparing the characteristics of proprietary and open systems allows us to clarify opportunities and risks, enabling better technological choices and more targeted investments in research and development. This is directly aligned with SDG targets 9.1 and 9.5, related to innovation and equity. (Machado 2025, p.
2). These criteria include:
· Transparency: availability of source code, trained weights, training data, and technical documentation.
· Ethics and Safety: ability to mitigate biases, ensure privacy protection, prevent malicious use, and allow independent auditing.
· Accessibility and Equity: cost of access, technical infrastructure requirements, potential for local customization, and linguistic/cultural inclusivity.
· Interoperability and Standardization: compatibility with open protocols and ease of integration with other systems.
· Governance: presence of collective oversight mechanisms, civil society participation, and decentralization. (Machado 2025, p. 3).
Our understanding of "open" follows that established by the Open Knowledge Foundation, which is summarized as follows: knowledge is open if anyone is free to access, use, modify, and share it — subject, at most, to measures that preserve provenance and openness (“Open Definition”, OKF, 2025) (Machado 2025, p. 3). Initiatives like BLOOM are a good example of how collaboration in an open environment can work very well. BLOOM offers multilingual LLM training in complete transparency, enabling one of the largest collaborations of AI researchers. The initiative brings together more than 1,000 experts from 70 countries and around 250 institutions. With its 176 billion parameters, BLOOM is capable of generating text in 46 natural languages and 13 programming languages (BigScience, 2025). (Machado 2025, p. 7). There are other open-source solutions, such as Pythia (EleutherAI) and OLMo (Allen Institute for AI). In addition to BLOOM, these LLMs also offer their training and pre-training data; they are completely open-source models. Although more modest in scale, these LLMs offer full replicability, and therefore transparency, openness, and accessibility. (Machado 2025, p. 7). Carbon emissions can be tracked through the CodeCarbon software (Luccioni, Viguier & Ligozat 2023, p. 10).
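At its core, the estimate that tools like CodeCarbon automate is energy drawn by the hardware multiplied by the carbon intensity of the local grid; a stdlib-only sketch of that arithmetic, where the wattage and intensity figures are illustrative, not measurements from any cited study:

```python
def estimated_co2_kg(power_watts: float, hours: float,
                     grid_kg_co2_per_kwh: float) -> float:
    """Energy (kWh) times grid carbon intensity (kg CO2 per kWh).
    CodeCarbon automates this estimate, adding hardware power
    sampling and per-region grid-intensity lookup on top."""
    energy_kwh = power_watts / 1000.0 * hours
    return energy_kwh * grid_kg_co2_per_kwh

# Illustrative: one 400 W GPU running 24 h on a 0.4 kg/kWh grid
# draws 9.6 kWh, i.e. roughly 3.84 kg of CO2.
daily_kg = estimated_co2_kg(400, 24, 0.4)
```

The location dependence discussed in the surrounding text falls out of the last factor: the same training run emits more on a carbon-intensive grid simply because `grid_kg_co2_per_kwh` is larger.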
Another study states that training GPT-3 in Microsoft’s state-of-the-art U.S. data centers can directly consume 700,000 liters of clean freshwater, and that the water consumption would have tripled had the training been done in Microsoft’s Asian data centers (Li, Yang et al., 2023). The same study outlines the need for greater transparency about AI models’ water footprint, including disclosing more information about operational data (id., p. 3) (Machado 2025, p. 7). The issue of the data licensing model is also fundamental. For this there are free and flexible licenses, such as Apache 2.0, MIT, or RAIL. However, the discussion about licensing is somewhat more complex, as it involves not only the economic business model but also reservations about the mitigation of the risks involved (Eiras et al., 2025). (Machado 2025, p. 7). Despite these challenges, open models emerge as a more inclusive and resilient alternative, anchored in collaboration and decentralization. They not only promote access to knowledge but also advance equity and ethical security through community oversight and local adaptations. Their flexibility enables broader innovations and solutions tailored to the specific needs of different contexts, from academic research to public applications. However, this approach is not without challenges: the need for computational infrastructure and technical expertise can limit adoption by smaller organizations. A comparative analysis highlights significant implications for issues such as governance, sustainability, and accountability. While proprietary systems tend to prioritize commercial interests, open models stand out for promoting transparent and participatory standards. This difference is particularly critical in areas like privacy, bias mitigation, and cybersecurity, where public auditing and collaboration are essential.
Thus, dealing with a critical technology that tends to be widely used in the coming years, including for decision-making processes by public and private actors, open models represent not only a technical (Machado 2025, p. 9) Machado, J. (2025). Toward a Public and Secure Generative AI: A Comparative Analysis of Open and Closed LLMs. arXiv preprint arXiv:2505.10603. The open-source DeepPavlov Dream Platform is specifically tailored for the development of complex dialog systems such as Generative AI Assistants. The stack prioritizes efficiency, modularity, scalability, and extensibility, with the goal of making it easier to develop complex dialog systems from scratch. It supports a modular approach to the implementation of conversational agents, enabling their development through the choice of NLP components and conversational skills from a rich library organized into distributions of ready-for-use multi-skill AI assistant systems. In DeepPavlov Dream, a multi-skill Generative AI Assistant consists of NLP components that extract features from user utterances, conversational skills that generate or retrieve a response, skill and response selectors that facilitate the choice of relevant skills and the best response, and a conversational orchestrator that enables the creation of multi-skill Generative AI Assistants scalable up to industrial-grade AI assistants. The platform allows developers to integrate large language models into the dialog pipeline, customize them with prompt engineering, handle multiple prompts during the same dialog session, and create simple multimodal assistants. (Zharikova et al. 2023, p. 599). The DeepPavlov Dream Platform provides a stack of Apache 2.0-licensed open-source technologies that enable the development of complex dialog systems such as enterprise AI assistants.
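The multi-skill pattern described above — several skills propose candidate responses, and a response selector picks the best one — can be sketched in miniature; the skills and confidence scores below are hypothetical placeholders, not actual DeepPavlov Dream components:

```python
from typing import Callable

# A skill maps a user utterance to (candidate_response, confidence).
Skill = Callable[[str], tuple[str, float]]

def faq_skill(utterance: str) -> tuple[str, float]:
    """Hypothetical retrieval-style skill: confident only on-topic."""
    if "hours" in utterance.lower():
        return ("We are open 9-17 on weekdays.", 0.9)
    return ("", 0.0)

def chitchat_skill(utterance: str) -> tuple[str, float]:
    """Hypothetical fallback skill: always has a low-confidence answer."""
    return ("Interesting! Tell me more.", 0.3)

def respond(utterance: str, skills: list[Skill]) -> str:
    """Orchestrator: run every skill, then let the response selector
    pick the highest-confidence candidate."""
    candidates = [skill(utterance) for skill in skills]
    best_response, _ = max(candidates, key=lambda c: c[1])
    return best_response

reply = respond("What are your opening hours?", [faq_skill, chitchat_skill])
```

In the real platform the selectors are trainable components and the skills run asynchronously behind the DeepPavlov Agent orchestrator; the sketch only shows the control flow.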
The platform features a conversational AI orchestrator called DeepPavlov Agent to coordinate an asynchronous, scalable dialog pipeline; a framework called DeepPavlov Dialog Flow Framework (DFF) to facilitate development of multi-step skills; support for Wikidata and custom knowledge graphs; a library of modern NLP components and conversational AI skills (script-based, chit-chat, question answering (QA), and generative skills) organized into a set of distributions of multi-skill conversational AI systems; and a visual designer. These components make it possible for developers and researchers to implement complex dialog systems ranging from multi-domain task-oriented or chit-chat chatbots to voice and multimodal AI assistants suitable for academic and enterprise use cases (Zharikova et al. 2023, p. 599). Unlike other solutions, we provide an open-source, Apache 2.0-based, multi-skill platform that enables development of complex open-domain dialog systems. DeepPavlov Dream allows for combining different response generation methods, adding pre- and post-filters, utilizing Wikidata and custom knowledge graphs, designing custom dialog management algorithms, and integrating large language models (LLMs) into production-ready dialog systems. DeepPavlov Dream also provides simple integration with load-balancing tools, which is crucial for LLM-based dialog systems in production. We are also working towards text-based and multimodal experiences like robotics (Zharikova et al. 2023, p. 600). Comparison of popular conversational AI platforms: DeepPavlov Dream (DeepPavlov, 2023), Mycroft AI (AI, 2022b), Linto AI (AI, 2022a), RASA (RASA, 2022) (Zharikova et al. 2023, p. 601). We also provide a documentation site for DeepPavlov Dream. The site provides access to comprehensive resources for building intelligent conversational assistants tailored to users’ specific needs.
The site offers extensive documentation, release notes, and detailed examples to facilitate the development of advanced conversational AI applications. The site also provides opportunities to join the community of developers leveraging DeepPavlov Dream to shape the future of conversational AI technology (https://dream.deeppavlov.ai/) (Zharikova et al. 2023, p. 604). Zharikova, D., Kornev, D., Ignatov, F., Talimanchuk, M., Evseev, D., Petukhova, K., ... & Burtsev, M. (2023, July). DeepPavlov Dream: Platform for building generative AI assistants. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) (pp. 599-607). The rapid rise of generative AI (GenAI) technologies has brought innovative video generation models like OpenAI’s Sora to the forefront, but these advancements come with significant sustainability challenges due to their high carbon footprint. This paper presents a carbon-centric case study on video generation, providing the first systematic investigation into the environmental impact of this technology. By analyzing Open-Sora, an open-source text-to-video model inspired by OpenAI Sora, we identify the iterative diffusion denoising process as the primary source of carbon emissions. Our findings reveal that video generation applications are significantly more carbon-demanding than text-based GenAI models and that their carbon footprint is largely dictated by the number of denoising steps, video resolution, and duration. To promote sustainability, we propose integrating carbon-aware credit systems and encouraging offline generation during high-carbon-intensity periods, offering a foundation for environmentally friendly practices in GenAI. (Li et al. 2024, p. 160). Our characterization provides operational insights for making video generation services eco-friendly.
Notably, video generation applications are significantly more carbon-intensive than text generation, with the primary source of emissions stemming from iterative diffusion denoising. We examine the carbon footprint and generation quality under various configurations of denoising step number, resolution, and duration (Li et al. 2024, p. 160). Since OpenAI Sora is proprietary software, several open-source projects have attempted to replicate its video generation capabilities. Among them, we selected the Colossal-AI Open-Sora [43] model due to its popularity and its training similarity to Sora’s description. Other projects, such as Open-Sora-Plan [23], are slower and currently lack multi-resolution/duration generation support. We established the inference benchmark using the latest Open-Sora v1.1.0 release on an NVIDIA A100 GPU (CUDA 12.1). To achieve optimal efficiency, we enabled FlashAttention [9] and xFormers [24] for acceleration. For our video generation benchmark, we used all the prompts from the OpenAI Sora demo [32] and the Open-Sora examples (Li et al. 2024, p. 161). To evaluate the carbon footprint of the LLM, we used a mixture of representative language modeling datasets, including Alpaca [37], GSM8K [8], MMLU [14], Natural Questions [22], ScienceQA [28], and TriviaQA [17]. (Li et al. 2024, p. 162). In our video generation benchmark, following the architecture in Fig. 1, we use Google’s T5 v1.1 XXL model, which has approximately 11 billion parameters, to encode the text prompt [35]. Note that this LLM is used solely for text encoding, a different process from the text generation discussed in RQ 2. For denoising, we utilize Open-Sora’s spatial-temporal diffusion transformer (STDiT) model, allowing the diffusion model to iteratively refine its understanding of the input data, gradually reducing noise and improving signal fidelity. The number of denoising steps/iterations (typically tens to hundreds) can be adjusted during inference.
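The central characterization finding — that emissions grow with the number of denoising steps, resolution, and duration — can be illustrated with a back-of-the-envelope model; the per-step energy and grid-intensity figures below are invented for illustration and are not Open-Sora measurements:

```python
def video_gen_co2_g(denoise_steps: int, frames: int,
                    joules_per_step_per_frame: float = 50.0,
                    grid_g_co2_per_kwh: float = 400.0) -> float:
    """Toy model: energy grows linearly with denoising steps and with
    frame count (a proxy for resolution x duration), then scales by
    grid carbon intensity. All constants here are illustrative."""
    energy_joules = denoise_steps * frames * joules_per_step_per_frame
    energy_kwh = energy_joules / 3.6e6   # 1 kWh = 3.6e6 J
    return energy_kwh * grid_g_co2_per_kwh

# In this linear model, doubling the step count doubles emissions.
low = video_gen_co2_g(denoise_steps=30, frames=48)
high = video_gen_co2_g(denoise_steps=60, frames=48)
```

This is also why the paper's mitigation proposals target scheduling: generating offline when `grid_g_co2_per_kwh` is low cuts emissions without touching the model at all.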
The decoder is a variational autoencoder with KL loss [20] from the Huggingface Diffusers library. In this experiment, we continue generating 2-second videos at 240p resolution (Li et al. 2024, p. 163). To quantify these two properties, we modify VBench [15], a state-of-the-art video quality benchmark collection. We select MUSIQ [19] to evaluate frame distortion in the video as a video quality proxy, and ViCLIP [40], a video extension of the OpenAI CLIP score [34], to measure the correlation between the generated video and the original prompt. A higher score indicates higher generation quality for both metrics. We acknowledge that these metrics can only serve as proxies for certain aspects of the video, as judging video quality is inherently complex and subjective (Li et al. 2024, p. 164). Li, B., Jiang, Y., & Tiwari, D. (2024). Carbon in motion: Characterizing Open-Sora on the sustainability of generative AI for video generation. ACM SIGENERGY Energy Informatics Review, 4(5), 160-165. This paper describes the development of an open-source generative AI chatbot, utilizing free Large Language Models (LLMs) to enrich the student learning experience for a university course in “Introduction to Programming”. The article aims to provide a step-by-step guide for selecting, fine-tuning, and evaluating available models. As a first step in choosing the appropriate LLM, one that provides the most accurate responses while not requiring excessive computing power, the article covers a discussion of the advantages and disadvantages of local vs. cloud-available models. After selecting a few promising models, the next stage includes fine-tuning the LLMs to answer domain-specific questions using a dataset containing essential rules, guidelines, and explanatory content regarding the subject. A crucial aspect of selecting a model was evaluating its answers, and in this context both human and automatic evaluation techniques are presented.
Finally, it is possible to enhance model performance and accuracy by incorporating Retrieval-Augmented Generation (RAG) techniques and exploring the influence of various factors, such as different vector databases, model temperatures, maximum token lengths, prompt templates, embeddings, repetition penalties, and chunking sizes. Our results show that chatbots have significant potential to improve academic support and learning efficiency, as well as personalized education in general. (Šarčević et al. 2024, p. 2367). The goal of this paper is to thoroughly describe the process of theoretical exploration and hands-on training of Large Language Models (LLMs) as part of our project. Our motivation is to share the results and methodologies of the project, thereby facilitating the development of similar AI-based solutions (Šarčević et al. 2024, p. 2367). For injecting domain-specific knowledge into the LLM, we explored both fine-tuning and knowledge augmentation mechanisms like Retrieval-Augmented Generation (RAG) [5][6]. Our findings favor RAG over fine-tuning for its superior performance in handling existing and new knowledge across different problem domains [7][8]. (Šarčević et al. 2024, p. 2367). Multiple cloud-based LLMs were tested, using cloud platforms such as the AI-powered open platform Poe.com [10], alongside locally run LLMs from the Hugging Face online repository [11]. (Šarčević et al. 2024, p. 2368). Therefore, we continued our evaluation with a subgroup consisting of the following 5 models:
1. llama-2-7b-chat.Q5_K_M
2. mistral-7b-instruct-v0.1.Q5_K_M
3. mistral-7b-openorca.Q4_0
4. zephyr-7b-alpha.Q5_K_M
5. zephyr-7b-beta.Q5_K_M
(Šarčević et al. 2024, p. 2368). In this phase of our work, we emphasized the importance of depth in the evaluation of these models. Apart from human evaluation, we calculated BLEU [12], ROUGE [13][14], and diversity scores to provide more objective insight into the effectiveness of the models.
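The overlap metrics mentioned (BLEU, ROUGE) compare n-grams of the model answer against reference answers; a simplified sketch of clipped unigram precision, the 1-gram building block of BLEU (full BLEU also combines higher-order n-grams and a brevity penalty):

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: the fraction of candidate tokens
    that also appear in the reference, with per-token counts capped
    at the reference counts (the 'modified precision' in BLEU)."""
    cand_tokens = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(count, ref_counts[tok])
                  for tok, count in Counter(cand_tokens).items())
    return matched / len(cand_tokens) if cand_tokens else 0.0

# A paraphrase scores well below 1.0 even when it is acceptable,
# which is one reason overlap scores can be low despite good answers.
score = unigram_precision("use a for loop to iterate",
                          "iterate the list with a for loop")
```

This illustrates the limitation the authors observed: the metric rewards word-for-word overlap, so a correct but differently worded answer (here, 4 of 6 tokens match) is penalized.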
The metrics compare the answer generated by the LLM with the corresponding correct answer to the question (our list of 3–5 expected answers). It is good practice to compute these metrics to gain an objective statistical basis for making decisions in a project. In our case, the metrics did not yield high scores, but that can be common when the model’s answer does not match the expected one word-for-word. (Šarčević et al. 2024, p. 2369). We employed the three following embeddings: 1) all-MiniLM-L6-v2, which maps sentences and paragraphs to a 384-dimensional dense vector space and offers both speed and quality, 2) the Jina embedding, which supports longer sequence lengths, and 3) the BAAI general embedding, which can map any text to a low-dimensional dense vector. (Šarčević et al. 2024, p. 2369-2370). By incorporating knowledge from the external dataset, RAG enhanced the accuracy and credibility of our chatbot, especially for knowledge-intensive tasks. Specifically, we utilized available materials for the course "Introduction to Programming" containing instructions for installing local extensions in the working environment, instructions for the command prompt environment, instructions for performing course laboratory exercises, and the overall student guide for the course. This external dataset consisted of 6 documents with 12,828 words in total. (Šarčević et al. 2024, p. 2371). Based on comprehensive evaluation criteria, including human assessment, the Mistral-7B-OpenOrca model was found to be significantly superior to the others. (Šarčević et al. 2024, p. 2371). Šarčević, A., Tomičić, I., Merlin, A., & Horvat, M. (2024, May). Enhancing programming education with open-source generative AI chatbots. In 2024 47th MIPRO ICT and Electronics Convention (MIPRO) (pp. 2051-2056). IEEE. Moreover, Stability AI did not prepare the dataset on which the Stable Diffusion model was trained.
This was done by a nonprofit German research organization known as LAION (Large-Scale Artificial Intelligence Open Network). LAION initially developed LAION-5B, a dataset consisting of 5.85 billion hyperlinks that pair images and text descriptions from the open internet. LAION makes this dataset available to the public for free for use as training data by those who want to use it to build generative models. LAION also developed a subset of LAION-5B, known as LAION-Aesthetics, that consists of hyperlinks to 600 million images selected by human testers for their visual appeal and by a machine-learning analysis of human aesthetic ratings. The Stable Diffusion model was trained on the LAION-Aesthetics dataset (Samuelson 2023, p. 159). Samuelson, P. (2023). Generative AI meets copyright. Science, 381(6654), 158-161. Interactive chatbots have been gaining popularity as a tool for serving organizational information to people. Building such a tool goes through several development phases, i.e. (a) data collection and preprocessing, (b) LLM fine-tuning, testing, and inference, and (c) chat interface development. To streamline this development process, in this paper we present the LLM Question–Answer (QA) builder, a web application which assembles all the steps and makes it easy for technical and non-technical users to develop an LLM QA chatbot. The system allows instruction fine-tuning of the following LLMs: Zephyr, Mistral, Llama-3, Phi, Flan-T5, and a user-provided model, for organization-specific information retrieval (IR), which can be further enhanced by Retrieval-Augmented Generation (RAG) techniques. We have added an automatic web-crawling-based RAG data scraper. Also, our system contains a human evaluation feature and RAG metrics for assessing model quality. (Salim 2025, p. 1). r sufficient development to train the model with the available data [7].
According to some deep learning studies, neural network models produce results that are good enough to be utilized in a QA system. One of them achieves good performance by using the sequence-to-sequence approach [8–10]. Recently, some open-source LLMs like Zephyr [11], Mistral [12], Llama-3 [13], Phi [14], and Flan-T5 [15] have been trained on huge amounts of data and can understand context correctly. (Salim 2025, p. 2). Verba offers an end-to-end, streamlined, and user-friendly interface for RAG. To use Verba, API keys are needed for various components, depending on the chosen technologies. There is no way to collect datasets from users, validate them, and deploy the model for the users. There is another application named autotrain-advanced [18], which supports only fine-tuning of the LLMs. Our software surpasses LocalRQA [19] by offering support for 4-bit and 8-bit quantized models and an automatic web-crawling-based RAG data scraper, making it efficient for low-resource consumer PCs. Additionally, we included the latest models such as Llama-3, Phi-3, M (Salim 2025, p. 2). An API call is made to connect with the LLM. For RAG, we need a vector database. First, we need to convert the RAG data into a vector database, and then save the vector database in a local folder. To convert data into a vector database, we used an ensemble of the "bge-large-en-v1.5" [21] and "ColBERT" [22] text embedding models. For storing and processing vector databases, we used Chroma [23], an open-source vector store. (Salim 2025, p. 2). The software consists of six main functionalities: data collection, fine-tuning, testing data generation and RAG customization, human evaluation, inference, and deployment. Each functionality is described in Section 2.2.
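The vector-database step can be illustrated with a toy retriever; here a bag-of-words cosine similarity stands in for the bge/ColBERT embedding ensemble, and an in-memory sort stands in for the Chroma store named above:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a neural text embedding: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query, as a vector
    store such as Chroma would with real embedding vectors."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

# Hypothetical course-material chunks, echoing the RAG use case above.
docs = ["install the compiler extension in the editor",
        "lab exercises are graded weekly"]
top = retrieve("how do I install the extension?", docs)
```

The retrieved chunks are then prepended to the prompt so the LLM answers from the organization's own documents rather than from its parametric memory.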
The LLM QA chatbot builder utilizes different open-source LLMs, embedding models, and Python libraries as its basis, as there are different considerations to be taken into account, particularly in terms of the balance between speed and quality. We added features to collect data from users and automatically build RAG. We have added the recent 7/8B-parameter models (Mistral, Zephyr, and Llama-3) and some lightweight models (Phi-3 and Flan-T5) (Salim 2025, p. 2). Fine-tuning. In the fine-tuning tab, the user fine-tunes the given models using the data already stored in the "data" folder. Users must provide a HuggingFace token to access the latest LLM models. There is a drop-down box to select the model. Initially, five models are given for fine-tuning: Mistral, Zephyr, Llama-3, Phi, and Flan-T5, and a custom-model option is also given so that users can supply their own model. To train the Mistral, Zephyr, and Llama models, you need 24 GB of VRAM and, for inference, 16 GB in 8-bit quantization [24]. We have added a quantization technique to save GPU memory, reducing computational and memory costs without compromising the model. For the Phi and Flan-T5 models, training requires 5 GB of VRAM, and inference needs 4 GB. To fine-tune a custom model, select the "custom model" option in the "Select the Model for Finetuning" dropdown, then configure it by editing the code section. After fine-tuning, the model will be saved in the "models" folder. Fig. 4(a) shows the fine-tuning user interface. First, we need to provide an Excel file, which was previously created in the "Data Collection" tab. Fine-tuning is mainly used to learn the context of the subject. Users have the option to change the parameters and the code by clicking "Advance Code Editing". The embedding model can be fine-tuned in the "Embedding model" sub-tab. Users also have the option to select embedding models and fine-tune them for specific datasets. (Salim 2025, p. 3). 2.4.
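The VRAM figures quoted follow roughly from parameter count times bytes per parameter; a sketch of that estimate (it covers weights only and ignores activations, KV cache, and framework overhead, which is why real requirements are higher):

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Approximate memory needed for model weights alone:
    parameters x (bits per parameter / 8) bytes."""
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# A 7B model: ~14 GB at 16-bit, ~7 GB at 8-bit, ~3.5 GB at 4-bit,
# which is the saving that 8-bit/4-bit quantization buys.
fp16_gb = weight_memory_gb(7, 16)
int8_gb = weight_memory_gb(7, 8)
int4_gb = weight_memory_gb(7, 4)
```

This is consistent with the quoted numbers: 7B weights at 8 bits fit comfortably in a 16 GB inference budget, while training (gradients and optimizer state on top of weights) pushes the requirement to 24 GB.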
Human evaluation: The human evaluation tab is used to check the model’s performance by involving human judgement. In recent works, GPT-4 [25,26] and ROUGE-L [27] have been used for evaluating model answers, but GPT-4 shows bias in its judgements and sometimes makes mistakes without knowing the context. (Salim 2025, p. 4). https://github.com/shahidul034/LLM-based-QA-chatbot-builder/blob/main/LLMQAChatbotBuilder.mp4 (Salim 2025, p. 4). Salim, M. S., Hossain, S. I., Jalal, T., Bose, D. K., & Basher, M. J. I. (2025). LLM based QA chatbot builder: A generative AI-based chatbot builder for question answering. SoftwareX, 29, 102029. This paper argues that the development of open-source generative AI is crucial for promoting transparency, accountability, and inclusion in the creation and deployment of these powerful technologies. By making models and datasets accessible to a wide range of stakeholders, open-source generative AI can foster innovation, collaboration, and knowledge-sharing while enabling the identification and mitigation of potential risks and harms. To realize the full potential of open-source generative AI, a concerted effort from the AI community, policymakers, and society at large is necessary. This paper outlines a vision for the future of open-source generative AI and provides specific recommendations for researchers, industry practitioners, and policymakers to promote responsible development practices, establish institutional frameworks, and engage the public in decision-making processes. The AI community has a unique opportunity and responsibility to shape the future of generative AI in a way that promotes the public good, and embracing open-source development and prioritizing responsible AI practices are key steps in this direction. The time to act is now. Collaboration is essential to build a future where generative AI empowers individuals, advances societal well-being, and upholds the values of transparency, accountability, and inclusivity.
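ROUGE-L, mentioned in the evaluation discussion above, scores a candidate answer against a reference by the longest common subsequence (LCS) of their tokens. A minimal sketch of the F-measure variant (the official rouge-score package adds tokenization and stemming details):

```python
def lcs_length(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = (dp[i-1][j-1] + 1 if x == y
                        else max(dp[i-1][j], dp[i][j-1]))
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    # F-measure of LCS-based precision and recall over word tokens.
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

Unlike n-gram overlap, the LCS credits in-order matches even when they are not contiguous, which suits free-form QA answers; the human-evaluation tab complements it where such surface metrics (or a biased LLM judge) fall short.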
(Merilehto 2024, p. 1) Moreover, the concentration of powerful generative AI models in the hands of a few technology giants raises concerns about the centralization of power and the potential for misuse (Brundage et al., 2018). If left unchecked (and even further encouraged by regulation), this trend could lead to a future where a small number of entities control the development and deployment of AI systems that have a profound impact on society. (Merilehto 2024, p. 1) Another significant risk associated with closed generative AI models is the potential for misuse and the concentration of power in the hands of a few technology giants. As these models become more advanced and capable of generating highly convincing text, images, and other media, they could be used for disinformation, manipulation, and other malicious purposes (Brundage et al., 2018). The lack of transparency in closed models makes it difficult to detect and mitigate these risks. (Merilehto 2024, p. 3) Open-Source AI has gained traction in recent years as a way to promote transparency, accountability, and collaboration in the development of AI systems. One need only look at Hugging Face's database of open models, which at the time of writing stands at 550,000 (HuggingFace, 2024), to see that there has been an explosion of open activity around AI in recent years. (Merilehto 2024, p. 4) Examples of Open-Source AI models include: 1. BERT (Bidirectional Encoder Representations from Transformers): Google open-sourced the code and pre-trained models for BERT, enabling researchers and developers to fine-tune the model for various natural language processing tasks (Devlin et al., 2019). 2. Stable Diffusion: Stable Diffusion is an Open-Source text-to-image model developed by Stability AI and EleutherAI. It allows users to generate highly realistic images from textual descriptions (Rombach et al., 2022). 3.
BLOOM (BigScience Large Open-science Open-access Multilingual Language Model): BLOOM is an Open-Source LLM developed by a collaborative effort called BigScience, involving over 1,000 researchers from around the world. It is designed to be transparent, accountable, and accessible to the broader research community (Laurençon et al., 2022). 4. Llama 2: Developed by Meta AI, Llama 2 is an open-source large language model that builds upon the success of its predecessor, LLaMA. It is a collection of models ranging from 7B to 70B parameters, trained on a vast corpus of text data, and it shows impressive performance on a wide range of natural language tasks, such as question answering, text summarization, and dialogue generation. The open-source release of Llama 2 includes the model weights, code, and documentation, enabling researchers and developers to fine-tune and adapt the model for various applications. Meta AI's commitment to open-sourcing Llama 2 demonstrates the growing interest in democratizing access to state-of-the-art language models and promoting collaboration in the AI community (Touvron et al., 2023). 5. Mixtral 8x7B: Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model trained with multilingual data using a context size of 32k tokens. It has mostly the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e., experts). It is a decoder-only model in which the feedforward block picks from a set of 8 distinct groups of parameters (Jiang et al., 2024). The model is considered highly capable relative to its size. (Merilehto 2024, p. 4) The Stable Diffusion model, for example, has spawned a vibrant ecosystem of tools and applications since its release in 2022 (Rombach et al., 2022). Developers have created user-friendly interfaces, such as DreamStudio, that make it easy for users to generate images from textual descriptions.
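The sparse routing described for Mixtral (a gate choosing among 8 expert feedforward blocks per token; top-2 per Jiang et al., 2024) can be sketched abstractly. Everything here (the toy scaling "experts", the router weights) is illustrative and not Mixtral's actual implementation; it only shows why compute stays near a dense 2-expert model while capacity is 8 experts wide:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, router_weights, experts, top_k=2):
    # Router scores every expert for this token, keeps the top_k,
    # and mixes their outputs with renormalized gate weights.
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    gates = softmax(logits)
    top = sorted(range(len(experts)), key=lambda i: gates[i],
                 reverse=True)[:top_k]
    norm = sum(gates[i] for i in top)
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)  # only the selected experts run
        out = [o + (gates[i] / norm) * v for o, v in zip(out, y)]
    return out, top

# 8 toy "experts": each just scales the input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in range(8)]
router_weights = [[float(i), 0.0] for i in range(8)]  # favors later experts
out, chosen = moe_forward([1.0, 0.0], router_weights, experts)
```

Because only `top_k` of the 8 experts execute per token, the active parameter count per forward pass is a fraction of the total, which is why Mixtral is considered highly capable relative to its inference cost.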
Artists and designers have used Stable Diffusion to create unique visual styles and concepts, while researchers have experimented with novel techniques for controlling and enhancing the model's outputs (Rombach et al., 2022). (Merilehto 2024, p. 5) Finally, Open-Source Generative AI has the potential to democratize access to AI capabilities, empowering a broader range of organizations and domains to benefit from these technologies. By lowering the barriers to entry and enabling collaboration, Open-Source models can help ensure that the benefits of Generative AI are distributed more equitably across society. (Merilehto 2024, p. 5) Merilehto, J. (2024). On Generative Artificial Intelligence: Open-Source is the Way. SocArXiv, March, 13. Meyer, A., Bleckmann, T., & Friege, G. (2025). Automatic feedback on physics tasks using open-source generative artificial intelligence. International Journal of Science Education, 1-26.