Abstract
In the gem trade, geographic provenance can significantly enhance the value of precious gemstones. Tracing the origin of prestigious gemstones such as ruby, sapphire, and emerald is not only a core competency of international gemmological laboratories, but also carries profound research significance for understanding differences between origins, deposit characteristics, mineralization types, and ore-forming and prospecting patterns. Since the Gübelin Gem Lab in Switzerland pioneered the origin identification of coloured gemstones in the 1950s, research on the origins of precious gemstones has grown steadily. This research rests on the significant differences in conventional gemmological properties, inclusions, spectral characteristics, and chemical compositions of gemstones such as emeralds produced by deposits with different ore-forming backgrounds. The current origin identification of gemstones is therefore typically based on the comprehensive judgment of experienced gemmologists regarding these four aspects. However, with the explosive growth of origin research and the systematic compilation of data on gemstones from different sources, the effectiveness of traditional identification methods has gradually declined. Similar inclusions, similar spectral patterns, and overlapping chemical composition ranges are commonplace, challenging the reliability of gemstone origin discrimination. In this context, the chemical fingerprint of a gemstone plays an important role in origin identification. With the advancement of micro-analysis techniques, the precision of trace element measurements in gemstones has improved significantly, and high-precision compositional data are strongly indicative of origin.
Traditional techniques for analysing chemical composition mainly rely on low-dimensional binary or ternary discrimination diagrams, but because of the extensive overlap of data points, these diagrams cannot cope with increasingly large compositional data sets. Machine learning methods capable of efficiently mining high-dimensional compositional data are therefore well suited to addressing the current challenges in gemstone origin identification. In this study, 303 emerald samples were collected from 12 producing areas (Fig. 1): the western emerald belt (WEB) in Colombia, Kafubu in Zambia, Itabira in Brazil, Carnaíba in Brazil, Socotó in Brazil, Panjshir in Afghanistan, Swat in Pakistan, Shakiso in Ethiopia, Malysheva in Russia, Gwantu in Nigeria, Mananjary in Madagascar, and Dayakou in China. Firstly, the conventional gemmological parameters, inclusions, ultraviolet-visible-near infrared (UV-Vis-NIR) spectra, infrared (IR) spectra, Raman spectra, and major and trace element contents of emerald samples from the various origins were measured using traditional origin identification methods, and the chemical composition databases of the three prestigious gemstones were compiled and constructed. Additionally, three different machine learning methods were applied to mine gemstone composition data, and three efficient and accurate models for gemstone origin identification were constructed. Systematic research on emeralds from global origins indicates that there are three different UV-Vis-NIR absorption patterns (Fig. 2). The study of deuterated water in emerald also indicates that infrared absorption in the range of 2 600-2 850 cm⁻¹ can likewise divide global emeralds into three groups (Fig. 3), which is a groundbreaking conclusion with strong implications for emerald provenance. A chemical composition database of emerald samples from 45 occurrences in 22 countries was compiled, totalling 2 753 data points, including 425 measured in this study.
Based on this database, three mature and efficient machine learning methods (random forest, support vector machine, and extreme gradient boosting) were used to mine emerald composition data and construct origin identification models. Among them, the random forest (RF) and extreme gradient boosting (XGBoost) models show strong adaptability and superior performance on emerald data. Among the 22 RF models, the highest accuracy and F1 score reach 99.5% (Fig. 4). Model RF-EM1-1320 can distinguish 11 origins with 100% accuracy (Fig. 5), and model RF-EM12-135 achieves an accuracy of 99.1% using only 5 elements. In addition, the models' decoding of high-dimensional element information indicates that V/Cr and the alkali metal elements (Li, Na, Rb, Cs) are strongly related to origin: their feature weights rank at the top in all models, reflecting the composition and source of the parent fluid in the ore-forming process. These findings underscore the significant advantages of big data and machine learning technologies in gemstone origin determination, namely high accuracy, efficiency, and versatility. The application of these technologies to gemstone origin research represents a groundbreaking expansion of origin identification technology and offers a novel perspective for gemmological research.
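For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how accuracy and an F1 score can be computed for a multi-class origin classifier. It assumes macro-averaging of the per-origin F1 scores, which the abstract does not specify. The function names and the small label lists are hypothetical illustrations, not data or code from this study.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted origin matches the true origin."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-origin F1 scores (macro-averaging is an assumption)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1  # correctly assigned to origin t
        else:
            fp[p] += 1  # wrongly assigned to origin p
            fn[t] += 1  # a sample of origin t was missed
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN), equivalent to the harmonic mean
        # of precision and recall for this origin.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, for illustration only (not results from the study).
y_true = ["Colombia", "Colombia", "Zambia", "Zambia", "Brazil", "Brazil"]
y_pred = ["Colombia", "Colombia", "Zambia", "Brazil", "Brazil", "Brazil"]

print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", macro_f1(y_true, y_pred))
```

Reporting both metrics is useful when origins are unevenly represented in a database: accuracy can be dominated by the best-sampled occurrences, whereas a macro-averaged F1 weights every origin equally.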