
Reading files that exceed memory size in Java

A common pattern is to read a file's contents and then process them. In Java, we usually use the methods of the Files class to load the entire file into memory and work on it from there. In some scenarios, however, the file to be processed is larger than the memory available on the machine. In that case we need a different strategy: read the file piece by piece and keep in memory only the compiled data we actually need.

Next, let's walk through exactly that scenario: what to do when a large file cannot be loaded into memory all at once.

Simulation scenario

Suppose we need to develop a program that analyzes log files from a server and generates a report listing the top 10 most frequently used services.

A new log file is generated every day, containing timestamps, host information, durations, service calls, and other data that may not be relevant to our specific scenario.

2024-02-25T00:00:00.000+GMT host7 492 products 0.0.3 PUT 73.182.150.152 eff0fac5-b997-40a3-87d8-02ff2f397b44
2024-02-25T00:00:00.016+GMT host6 123 logout 2.0.3 GET 34.235.76.94 8b97acae-dd36-4e83-b423-12905a4ab38d
2024-02-25T00:00:00.033+GMT host6 50 payments/:id 0.4.6 PUT 148.241.146.59 ac3c9064-4782-46d9-a0b6-69e4d55a5b38
2024-02-25T00:00:00.050+GMT host2 547 orders 1.5.0 PUT 6.232.116.248 2285a81e-c511-41b9-b0ea-a475a0a45805
2024-02-25T00:00:00.067+GMT host4 400 suggestions 0.8.6 DELETE 149.138.227.154 8031b639-700e-4a7c-b257-fcbed0d029ce
2024-02-25T00:00:00.084+GMT host2 644 login 6.90 GET 208.158.145.204 3906a28c-56e4-4e5f-b548-591eab737aa7
2024-02-25T00:00:00.101+GMT host5 339 suggestions 0.8.9 PUT 173.109.21.97 c7dfec8a-5ca8-4d0d-b903-aaf65629fdd0
2024-02-25T00:00:00.118+GMT host9 87 products 2.6.3 POST 220.252.90.140 e5ceef67-2f0f-4c2d-a6d2-c698598aaef2
2024-02-25T00:00:00.134+GMT host0 845 products 9.4.6 GET 136.79.178.188 f28578c1-c37c-47a3-a473-4e65371e0245
2024-02-25T00:00:00.151+GMT host4 675 login 0.89 DELETE 32.159.65.239 d27ff353-e501-43e6-bdce-680d79a07c36

Our code will receive a list of log files, and our goal is to compile a report listing the 10 most commonly used services. However, to be included in the report, a service must have at least one entry in every log file provided. In short, a service must be used daily to be eligible for the report.

Basic implementation

The most straightforward way to solve this problem is to map the business requirements directly onto code:

public void processFiles(final List<File> fileList) {
  final Map<LocalDate, List<LogLine>> fileContent = getFileContent(fileList);
  final List<String> serviceList = getServiceList(fileContent);
  final List<Statistics> statisticsList = getStatistics(fileContent, serviceList);
  final List<Statistics> topCalls = getTop10(statisticsList);

  print(topCalls);
}

This method receives a file list as a parameter, and the core process is as follows:

  • Create a map of each file's entries, where the key is a LocalDate and the value is the list of that file's log lines.
  • Create a list of the unique service names found across all files.
  • Generate statistics for every service, organizing the file data into a structured map.
  • Filter the statistics to keep the top 10 services by call count.
  • Print the result.

As you can see, this approach loads far too much data into memory, which inevitably leads to an OutOfMemoryError.
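
To see why, here is a hypothetical sketch of what the eager getFileContent helper might look like; the LogLine parsing helpers are assumptions (a sketch of LogLine appears later in the article):

// Hypothetical eager-loading helper: every line of every file is
// materialized in memory at once, which is what exhausts the heap.
private Map<LocalDate, List<LogLine>> getFileContent(final List<File> fileList) {
  final Map<LocalDate, List<LogLine>> fileContent = new HashMap<>();
  for (final File file : fileList) {
    try {
      // Files.readAllLines loads the whole file into a List<String> up front.
      final List<LogLine> lines = Files.readAllLines(file.toPath()).stream()
          .map(LogLine::parse)
          .toList();
      fileContent.put(lines.get(0).getDate(), lines);
    } catch (final IOException e) {
      throw new RuntimeException(e);
    }
  }
  return fileContent;
}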

Improved implementation

As mentioned at the beginning of the article, we need to adopt a different strategy: processing the files line by line.

private void processFiles(final List<File> fileList) {
  final Map<String, Counter> compiledMap = new HashMap<>();

  for (int i = 0; i < fileList.size(); i++) {
    processFile(fileList, compiledMap, i);
  }

  final List<Counter> topCalls =
      compiledMap.values().stream()
          .filter(Counter::allDaysSet)
          .sorted(Comparator.comparingLong(Counter::getNumberOfCalls).reversed())
          .limit(10)
          .toList();

  print(topCalls);
}
  • First, it declares a map (compiledMap) whose keys are Strings representing service names and whose values are Counter objects (explained later) that store the statistics.
  • Next, it processes the files one by one and updates compiledMap accordingly.
  • It then uses the Stream API to: keep only the counters that have data for every day; sort them by number of calls; and finally take the top 10.
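
For context, a minimal sketch of a call site might look like the following; the file names are invented for the example:

// Hypothetical invocation: one log file per day, paths made up for the example.
final List<File> fileList = List.of(
    new File("logs/2024-02-25.log"),
    new File("logs/2024-02-26.log"),
    new File("logs/2024-02-27.log"));
processFiles(fileList);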

Before looking at the processFile method, the core of the whole process, let's first analyze the Counter class, which also plays a crucial role:

public class Counter {
  @Getter private final String serviceName;
  @Getter private long numberOfCalls;
  private final int numberOfDays;
  private final BitSet daysWithCalls;

  public Counter(final String serviceName, final int numberOfDays) {
    this.serviceName = serviceName;
    this.numberOfCalls = 0L;
    this.numberOfDays = numberOfDays;
    daysWithCalls = new BitSet(numberOfDays);
  }

  public void add() {
    numberOfCalls++;
  }

  public void setDay(final int dayNumber) {
    daysWithCalls.set(dayNumber);
  }

  public boolean allDaysSet() {
    return IntStream.range(0, numberOfDays)
        .mapToObj(daysWithCalls::get)
        .reduce(Boolean.TRUE, Boolean::logicalAnd);
  }
}
  • It holds the serviceName, the numberOfCalls, the daysWithCalls BitSet, and the total number of days being tracked.
  • The numberOfCalls property is incremented by the add method, which is called once for each processed line belonging to serviceName.
  • The daysWithCalls property is a Java BitSet, a memory-efficient structure for storing Boolean flags. It is sized with the number of days to be processed; each bit represents one day and starts out as false.
  • The setDay method sets the bit corresponding to the given day number to true.

The allDaysSet method checks whether the bit for every day is set to true. It does this by streaming over all day indices, mapping each to its Boolean bit value, and reducing with the logical AND operator.
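
To make the Counter contract concrete, here is a small hypothetical usage showing that allDaysSet only returns true once a call has been recorded for every day:

final Counter counter = new Counter("login", 3); // track 3 days of logs
counter.add();            // one call recorded
counter.setDay(0);
counter.setDay(1);
System.out.println(counter.allDaysSet()); // false: day 2 not yet seen
counter.setDay(2);
System.out.println(counter.allDaysSet()); // true: all 3 days covered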

private void processFile(final List<File> fileList,
                         final Map<String, Counter> compiledMap,
                         final int dayNumber) {
  try (Stream<String> lineStream = Files.lines(fileList.get(dayNumber).toPath())) {
    lineStream
        .map(this::toLogLine)
        .forEach(
            logLine -> {
              Counter counter = compiledMap.get(logLine.getServiceName());
              if (counter == null) {
                counter = new Counter(logLine.getServiceName(), fileList.size());
                compiledMap.put(logLine.getServiceName(), counter);
              }
              counter.add();
              counter.setDay(dayNumber);
            });

  } catch (final IOException e) {
    throw new RuntimeException(e);
  }
}
  • This method uses the lines method of the Files class to read the file line by line as a stream. The key feature here is that lines is lazy: it does not read the entire file up front, but reads lines as the stream is consumed.
  • The toLogLine method converts each raw line into an object exposing the log line's fields (a sketch of both appears after this list).
  • The core line-processing logic is simpler than it might seem: it retrieves (or creates) the Counter associated with the line's serviceName in compiledMap, then calls the Counter's add and setDay methods.
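
The article never shows toLogLine or the LogLine class. A minimal sketch, assuming the whitespace-separated format of the sample log above (where the service name is the fourth token), might look like this:

// Hypothetical LogLine: parses only the fields this scenario needs.
public class LogLine {
  @Getter private final LocalDate date;
  @Getter private final String serviceName;

  private LogLine(final LocalDate date, final String serviceName) {
    this.date = date;
    this.serviceName = serviceName;
  }

  public static LogLine parse(final String rawLine) {
    final String[] tokens = rawLine.split("\\s+");
    // The timestamp begins with an ISO-8601 date, e.g. "2024-02-25".
    final LocalDate date = LocalDate.parse(tokens[0].substring(0, 10));
    return new LogLine(date, tokens[3]); // tokens[3] is the service name
  }
}

private LogLine toLogLine(final String rawLine) {
  return LogLine.parse(rawLine);
}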

As we have seen, handling large files in Java without loading them entirely into memory is not complicated. The Files class lets us process a file line by line, and a hash map lets us accumulate only the compiled data we need, keeping memory usage low.
