Application development has become a major source of profit for the IT companies that build and maintain apps, while the apps themselves give users convenient and accessible solutions.
Intro to monetization models
Apps generate this profit through several monetization models:
- free: no restrictions on functionality or content, with the profit generated through in-app ads;
- freemium: some premium features or content are unlocked after payment, which can be a one-off purchase or a recurring payment at varying intervals;
- trial: users get a limited period to check whether the app meets their needs;
- paid: payment is required before downloading.
However, the boundaries between these categories are sometimes blurred, because the models can be combined to maximize profit.
Users, for their part, expect fair, transparent, and complete information about in-app purchases and other hidden paywalls. This creates a need for an automated algorithm that classifies apps by the monetization model they use.
Marketplace guidelines and practical ways to add in-app purchases
Mac App Store
The Mac App Store (MAS) is Apple’s official marketplace for macOS apps. Every app published on the platform is reviewed against a set of requirements intended to protect user data and provide a high-quality user experience.
One of the requirements is that each app include a link to its Privacy Policy, which must describe what sensitive user data is collected and how it is used. A related legal document, the Terms of Use, protects the company’s intellectual property rights and often includes a dedicated Payment Terms section. That section governs the financial aspects of the service, such as pricing, the refund period, the payment currency, additional charges, and penalties for late payment.
In addition, Apple requires payments to be processed through the system API provided by the StoreKit framework, either directly or via convenience wrappers. Purchases can be validated locally using the MAS receipt file or on a server.
MacUpdate
Unlike MAS, MacUpdate doesn’t impose such strict requirements. In-app purchases can therefore be integrated through the methods described above or via:
- SDKs from third-party payment services such as Paddle, Stripe, or PayPal;
- requests to the REST APIs of these services;
- checkout web pages to perform or confirm transactions.
From suspicion to detection: how it is done
To develop an algorithm that detects in-app purchases and then classifies apps by monetization model, we combined technical and language analyses. The technical analysis was applied to local app data, i.e. the bundle and the sandbox. The bundle contains the application’s executable code and resources, while the sandbox provides a secure environment that restricts the app’s access to system resources and prevents it from interfering with the processes of other programs. The language analysis examined not only local data but also external app information retrieved from the marketplace. The extracted features were then processed to construct the input to the machine learning classifier.
Here you can see a complete algorithm pipeline.
Technical analysis
The technical analysis examined three bundle components: the app’s binary, its frameworks, and the MAS receipt file.
Payment-related frameworks, along with their respective classes and methods, were extracted from the application binary. Additionally, the same set of features was retrieved from all integrated third-party frameworks to identify those functioning as intermediaries to simplify interaction with the system payment API. Endpoint URLs and checkout page addresses were identified by applying regular expressions to all constant strings in the codebase, targeting patterns associated with popular payment services. Finally, the records of in-app purchases and active/inactive subscriptions were obtained from the MAS receipt file.
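The URL-matching step can be illustrated with a minimal Python sketch. It is not the rule set used in the study: the host patterns, the minimum string length, and the function names `extract_strings` and `count_payment_urls` are assumptions chosen for illustration, and the string extraction only approximates what a disassembler or the `strings` tool would return.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical host patterns for the payment services named in the article.
# The real rule set is not given in the text, so these are assumptions.
PAYMENT_HOSTS = {
    "paddle": r"(?:[\w-]+\.)*paddle\.com",
    "stripe": r"(?:[\w-]+\.)*stripe\.com",
    "paypal": r"(?:[\w-]+\.)*paypal\.com",
}

# One compiled URL pattern per service: scheme, host, optional path.
# The negative lookahead rejects look-alike hosts such as "stripe.community".
URL_PATTERNS = {
    service: re.compile(rf"https?://{host}(?![\w.-])(?:/\S*)?")
    for service, host in PAYMENT_HOSTS.items()
}

# Runs of at least six printable ASCII bytes are treated as constant strings.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{6,}")


def extract_strings(data: bytes) -> List[str]:
    """Return printable ASCII runs found in raw binary data."""
    return [run.decode("ascii") for run in PRINTABLE_RUN.findall(data)]


def count_payment_urls(strings: Iterable[str]) -> Dict[str, int]:
    """Count URLs per payment service across all extracted strings."""
    strings = list(strings)
    return {
        service: sum(len(pattern.findall(text)) for text in strings)
        for service, pattern in URL_PATTERNS.items()
    }
```

In practice the input would be the bytes of the app’s executable; the per-service counts can then feed the numerical features described in the data preparation step.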
Language analysis
The language analysis searched for purchase-related keywords in the bundle and sandbox components as well as in text published online. The local text sources included Info.plist, UserDefaults, logs, and localizations. The external text sources included the app’s description, user reviews, Privacy Policy, and Terms of Use.
Since the analyzed apps were taken from MAS and MacUpdate, a different approach was used to retrieve information from each platform. The information about the MAS apps was obtained through iTunes. For the MacUpdate apps, we used web scraping techniques to extract descriptions and user feedback directly from the marketplace website.
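The text does not say which iTunes interface was used. As one possible reading, the sketch below assumes Apple’s public iTunes lookup endpoint, which returns app metadata as JSON. The helper names `build_lookup_url` and `parse_lookup_response` are hypothetical, and no network request is made in the example itself.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Public iTunes lookup endpoint; using it here is an assumption about
# what "through iTunes" means in the article.
LOOKUP_ENDPOINT = "https://itunes.apple.com/lookup"


def build_lookup_url(app_id: int, country: str = "us") -> str:
    """Build the lookup URL for a numeric App Store identifier."""
    return f"{LOOKUP_ENDPOINT}?{urlencode({'id': app_id, 'country': country})}"


def parse_lookup_response(payload: str) -> Optional[dict]:
    """Extract the fields relevant to language analysis from a JSON response."""
    results = json.loads(payload).get("results", [])
    if not results:
        return None  # unknown identifier or app not available in that country
    app = results[0]
    return {
        "name": app.get("trackName"),
        "description": app.get("description", ""),
        "price": app.get("price"),
        "formatted_price": app.get("formattedPrice"),
    }
```

The response body could be fetched with `urllib.request.urlopen(build_lookup_url(app_id))` and passed to `parse_lookup_response` after decoding. The description then goes through the same keyword search as the local text sources.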
All this textual content was preprocessed using lemmatization to normalize the word forms and improve the identification of target keywords. Additionally, the key identifiers from .plist files (Info.plist and UserDefaults) were split at the points where the letter case changed, so that a compound identifier is broken into separate words.
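The identifier-splitting step can be sketched as follows. This is a minimal illustration using Python’s standard `plistlib` module; the function names, the exact boundary rule, and the sample keys are assumptions, only top-level keys are read, and lemmatization is omitted.

```python
import plistlib
import re
from typing import Dict, List

# Split where a lowercase letter or digit is followed by an uppercase letter
# ("hasPurchased" -> "has", "Purchased"), and where an acronym ends
# ("SKPayment" -> "SK", "Payment").
_CASE_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_identifier(key: str) -> List[str]:
    """Split a key on separators and case changes, returning lowercase tokens."""
    tokens: List[str] = []
    for chunk in re.split(r"[^A-Za-z0-9]+", key):
        if chunk:
            tokens.extend(_CASE_BOUNDARY.split(chunk))
    return [token.lower() for token in tokens if token]


def plist_key_tokens(plist_bytes: bytes) -> Dict[str, List[str]]:
    """Map every top-level key of a property list to its tokens."""
    content = plistlib.loads(plist_bytes)
    return {key: split_identifier(key) for key in content}
```

For example, a hypothetical key `hasPurchasedPro` yields the tokens `has`, `purchased`, and `pro`, which can then be matched against the purchase-related keyword list.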
Data preparation
The results of technical and language analysis were converted into numerical values, which were collected in a table that was used as input to the classification model.
The table included separate columns for: (a) each payment framework with its classes and functions, (b) payment endpoints and checkout page links, (c) the receipt in-app purchase records, and (d) purchase-related keywords.
Framework classes and functions were represented as binary values, where 0 indicated absence and 1 indicated presence. The payment endpoints and links were represented by their counts. For the receipt file, the number of purchase records was used. For the language data, a separate column was generated for each combination of keyword and text source.
Mapping this data into separate columns helped the model identify hidden dependencies between specific frameworks, classes, functions, keywords, and their text sources.
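A single table row could be assembled roughly as in the sketch below. The column naming scheme, the function name `build_feature_row`, and the use of occurrence counts for keyword columns are assumptions; the text does not specify how keyword columns were valued.

```python
from typing import Dict, Iterable, List, Tuple


def build_feature_row(
    found_symbols: Iterable[str],
    payment_urls: Iterable[str],
    receipt_purchases: Iterable[dict],
    keyword_hits: Dict[Tuple[str, str], int],
    symbol_vocabulary: List[str],
    keyword_vocabulary: List[Tuple[str, str]],
) -> Dict[str, int]:
    """Convert the analysis results for one app into a numeric feature row."""
    found = set(found_symbols)
    row: Dict[str, int] = {}

    # (a) One binary column per known framework class or function.
    for symbol in symbol_vocabulary:
        row[f"symbol:{symbol}"] = 1 if symbol in found else 0

    # (b) Number of payment endpoints and checkout links found in the binary.
    row["payment_url_count"] = len(list(payment_urls))

    # (c) Number of in-app purchase records in the receipt file.
    row["receipt_purchase_count"] = len(list(receipt_purchases))

    # (d) One column per (keyword, source) pair; counts are an assumption.
    for keyword, source in keyword_vocabulary:
        row[f"keyword:{keyword}@{source}"] = keyword_hits.get((keyword, source), 0)

    return row
```

Using fixed vocabularies for the symbol and keyword columns keeps the layout identical for every app, which a tabular classifier requires.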
Training and validation
To evaluate the developed solution, a dataset consisting of 1,219 applications from the Mac App Store and MacUpdate platforms was collected. To label the dataset with the correct monetization classes, we used the data from the app aggregator App Figures collected via web scraping.
Different supervised machine learning algorithms and models were compared to select the best one:
- Decision trees: Decision Tree, Extremely Randomized Trees (Extra Trees), and Random Forest;
- K-Nearest Neighbors (KNN): trained with k=1, k=3, and k=5;
- Naive Bayes: Gaussian Naive Bayes;
- Support Vector Machines (SVM): Support Vector Classifier with polynomial kernel;
- Gradient Boosting: Hist Gradient Boosting Classifier.
You can see the comparison of these models below. It was based on the Accuracy, Precision, Recall, and F1-score metrics.
The diagram shows that the Random Forest and Hist Gradient Boosting classifiers performed best, achieving 90% precision, 90% recall, an 89% F1-score, and 90% accuracy.
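For reference, the four metrics can be computed for a multi-class problem as in the sketch below. The article does not state which averaging was used, so macro averaging (the unweighted mean over classes) is an assumption here, as are the function name and the sample labels.

```python
from typing import Dict, Sequence


def macro_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Compute accuracy and macro-averaged precision, recall, and F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1_scores = [], [], []

    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)

        # Undefined ratios (no predictions or no samples) are treated as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / len(labels),
        "recall": sum(recalls) / len(labels),
        "f1": sum(f1_scores) / len(labels),
    }
```

Because the macro F1-score averages the per-class F1 values, it need not equal the harmonic mean of the averaged precision and recall.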
Final thoughts
The developed classification algorithm can serve as a tool for reliably distinguishing apps by monetization class. The combination of approaches behind it is adaptable and can be applied to a wide range of classification tasks based on other parameters. In addition, it is suitable not only for macOS but also for other operating systems.
The algorithm can be further improved by adding dynamic analysis to observe an app’s runtime behavior, semantic analysis of the text content, and unsupervised machine learning to cluster the extracted features and reveal how the selected feature set affects model performance.