
Analysis of common e-book formats and their decompilation ideas


2.2.2.1 Web Compiler 1.67 
Because production tools for this format appeared in China quite early and thoroughly cracked Chinese versions were available, it was once rather popular; many of the e-books offered by E-book Time and Space are in this format. But precisely because of that popularity, many people wanted to decompile it, which bred all sorts of decompilation tools, so few people seem to use it now.

As for decompilation tools, I will not discuss the commercial ones. In China, RMH and Fbilo jointly released a free tool, unwebcompiler, along with its complete Delphi source code; if you need it, search Google or Baidu for unwebcompiler. Note, however, that the administrators of most domestic software sites are probably not developers and take no interest in source code, so what they collect is just the 212 KB EXE; few mirror the source code, so you have to search carefully.

In the unwebcompiler source code, RMH and Fbilo describe in detail the file format of e-books generated by Web Compiler 1.67, so I will not tediously repeat it here; go read it yourself if you are interested. My UnEBook also uses their source code to implement batch decompilation of e-books generated by Web Compiler 1.67, though I ported the code from Delphi to C, which shortened it a little (the original contains a passage that converts back and forth between strings and hexadecimal numbers, which looked odd to me, so I dropped it). The LHA decompression part was too troublesome to port, however, so I simply grabbed a ready-made piece of C code from the Internet.

2.2.2.2 Caislabs eBook Pack Express 1.6 
This e-book production tool has also been released in a Chinese version, so it has some influence in China, though apparently not enough to make decompilation tools fly all over the sky, hehe...

When analyzing e-books in this format I did not use any disassembly tools; I worked it out by guesswork with UltraEdit32 and system monitoring tools:

File identification: the file ends with the hexadecimal bytes 00 F8 03 00. This seems to be a convention; almost all EXE-format e-books carry their own characteristic trailing bytes.
Directory block start address: a pointer stored at file offset 0003F81C.
Directory entry structure within the directory block: a zero-terminated file name followed by a 4-byte start address; a file name beginning with FF marks the end of the directory block.
If a file is stored in a subdirectory, the leading bytes of its name encode the path: 02 = ../; with 01, each following 00 is translated to / until a 02 byte is encountered.
Actual start address of the file content: the 4-byte start address in the directory entry + 9.
File content length: the DWORD stored at the location pointed to by the 4-byte start address in the directory entry (see the parsing sketch below).
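To make the layout concrete, here is a minimal parsing sketch built purely on the guesses above; the 0003F81C offset, the FF end marker and the +9 skew are my own assumptions rather than any official specification, and the 01/02 subdirectory translation is omitted.

```cpp
#include <cstdio>
#include <cstdint>

// Walk the directory block of a Caislabs eBook Pack Express 1.6 file and
// print each entry; all offsets are guessed, not taken from any spec.
void DumpDirectory(FILE* ebook)
{
    uint32_t dirStart = 0;
    fseek(ebook, 0x0003F81C, SEEK_SET);       // pointer to the directory block
    fread(&dirStart, 4, 1, ebook);
    fseek(ebook, dirStart, SEEK_SET);

    for (;;) {
        int first = fgetc(ebook);
        if (first == EOF || first == 0xFF)    // FF name byte: end of block
            break;
        char name[260];
        int i = 0, c;
        name[i++] = (char)first;
        while ((c = fgetc(ebook)) > 0 && i < 259)
            name[i++] = (char)c;              // zero-terminated file name
        name[i] = 0;

        uint32_t start = 0, length = 0;
        fread(&start, 4, 1, ebook);           // 4-byte start address
        long next = ftell(ebook);
        fseek(ebook, start, SEEK_SET);
        fread(&length, 4, 1, ebook);          // DWORD content length here
        printf("%s: content at %#x, %u bytes\n", name, start + 9, length);
        fseek(ebook, next, SEEK_SET);         // continue with the next entry
    }
}
```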
After working out the directory structure, I considered analyzing the file encryption algorithm with debugging tools and then decompiling the actual file contents, but I soon found that doing so was exhausting and really not worth the effort.

After several attempts, however, I still found a way to be lazy:

Install a hook to inject a DLL into the e-book's process space.
Inside that DLL, download any specified file with the standard Windows API function URLDownloadToFile (as sketched below). The file's URL is obtained from the directory entry as described above, combined with a fixed prefix ("file://Z:\\com_caislabs_ebk\\") to form an absolute path.
UnEBook's batch decompilation of e-books in this format is implemented exactly according to this analysis.
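A minimal sketch of what the injected DLL might do, assuming the fixed prefix found during the analysis; GrabFile and the surrounding details are illustrative, not UnEBook's actual code:

```cpp
#include <windows.h>
#include <urlmon.h>
#pragma comment(lib, "urlmon.lib")

// Runs inside the e-book's process (the DLL arrived via a Windows hook,
// e.g. SetWindowsHookEx), so the e-book's own handler serves the file.
bool GrabFile(const char* nameFromDirEntry, const char* savePath)
{
    char url[MAX_PATH * 2];
    wsprintfA(url, "file://Z:\\com_caislabs_ebk\\%s", nameFromDirEntry);
    // The e-book cannot tell this request from one made by the IE kernel.
    return URLDownloadToFileA(NULL, url, savePath, 0, NULL) == S_OK;
}
```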

In later versions of Caislabs eBook Pack Express, though, Caislabs seemed to wake up to the importance of protecting file content: not only did it adopt a stronger encryption algorithm for file contents and close the hole that allowed downloads via URLDownloadToFile, it even encrypted the directory block strongly enough that I lost interest in analyzing it. Fortunately, by then I already had a better decompilation idea: a general approach that is independent of any specific file format and targets all e-books built on the IE kernel.

2.2.2.3 General decompilation ideas
After analyzing several e-book formats, I began to grasp a truth: the variations of internal e-book file structures are endless, while my time and energy are limited; devote limited time and energy to fighting endless variations and you will sooner or later wear yourself out.

Having understood this, I began to wonder whether there is a general method that can decompile most e-books (I am not naive enough to believe in a universal cure-all). Following convention (an incurable occupational disease), the first step was of course market research and product positioning, and the conclusion was that most current e-books are built on the IE kernel. From what I learned about the IE kernel while developing MyReader, there is an obvious tension here: Microsoft provides the IE kernel as a control, and its purpose is to draw more people into Microsoft's standards camp through the openness and convenience of the control's interfaces; piling encryption and protection on top of it runs counter to Microsoft's original intention (as it stood at the time; Microsoft may change its mind in the future). I therefore believed the IE kernel must have a back door somewhere, and after some effort I was not disappointed.

1. Basic Principles

Implementing a general cracking technique for IE-kernel e-books may require some skill, but the principle is simple enough to state in a few sentences: however the content is encrypted on disk, when the e-book hands it to the IE kernel for display, it must first convert it into the standard format the IE kernel understands, namely HTML. To make display and refresh easier, the IE kernel does not throw the HTML code away after parsing it; it keeps a backup in memory. So as long as you can pry this backup out of the IE kernel, you have the decoded content, which is exactly what you want to decompile.

As for the other content on the page (pictures, CSS, JS, Flash files, etc.), it is even simpler: impersonate the IE kernel and request them from the e-book directly. If the e-book cannot tell whether a request comes from the IE kernel or from somewhere else, it will naturally serve up the things we need with both hands!

Although the decompilation principle fits in a few sentences, implementing it still took hard exploration and experimentation. I went through a long stretch of bitter work and read the IE kernel source code several times (bragging, don't take it seriously!). My thinking developed in roughly two stages. The first stage, before I obtained a certain legendary source code (yes, the one that expands to nearly 700 MB and that mainstream domestic media described as a gimmick, insignificant, boring garbage), was based entirely on Microsoft's public IE kernel interfaces; at that time I planned to classify e-book content into HTML, images and so on, and solve each acquisition problem separately. The second stage came after I got that source code, when I suddenly saw that for every kind of file I could simply request it from the e-book directly, as long as I pretended to be the IE kernel.

Since some of this is rather sensitive, the description below mainly covers my first-stage ideas; some of them are fundamental to the second stage as well, whose implementation details are not convenient to give here.

2. Methods to obtain HTML source code

I am hardly the only one who has studied how to obtain HTML source code from the IE kernel; from China to abroad, from CSDN (the VC/MFC area of CSDN has a section dedicated to IE kernel programming) to MSDN, the consensus is that it can be done through the following steps:

First locate the IE kernel's display window, whether by clicking with the mouse or via EnumChildWindows; this is the window in which the e-book displays web content.
From this window's handle (HWND), obtain the interface pointer to IHTMLDocument2, the IE kernel's document interface for that window. There are currently two ways to get it, and in my opinion they must be used in combination or there will always be e-books you cannot handle: one goes through MSAA, the other through the WM_HTML_GETOBJECT message (a sketch of the latter follows this list). The concrete code has been discussed to death on CSDN, so I omit it here; look it up on CSDN if you need it. Both methods make demands on the platform: no problem under XP; under 2000 you may need IE 6 installed; anything older, forget it.
Once you hold the IHTMLDocument2 interface pointer, you can obtain the document's HTML code through the standard methods that interface provides. See the examples on CSDN for concrete code.
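For reference, a minimal sketch of the WM_HTML_GETOBJECT route, assuming the target window has already been found; the MSAA route uses AccessibleObjectFromWindow instead:

```cpp
#include <windows.h>
#include <oleacc.h>    // ObjectFromLresult
#include <mshtml.h>    // IHTMLDocument2
#pragma comment(lib, "oleacc.lib")

// Returns the document interface of an IE rendering window, or NULL.
// COM must already be initialized on the calling thread.
IHTMLDocument2* GetDocFromHwnd(HWND hwnd)
{
    UINT msg = RegisterWindowMessageW(L"WM_HTML_GETOBJECT");
    DWORD_PTR result = 0;
    SendMessageTimeoutW(hwnd, msg, 0, 0, SMTO_ABORTIFHUNG, 1000, &result);
    IHTMLDocument2* doc = NULL;
    if (result)
        ObjectFromLresult((LRESULT)result, IID_IHTMLDocument2, 0,
                          (void**)&doc);
    return doc;    // caller must Release()
}
```

The HWND is typically found by walking the e-book's window tree with EnumChildWindows and comparing each child's class name against "Internet Explorer_Server", the IE kernel's rendering window class.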
Besides the method above, I have also tried another one myself: using a MIME Filter.

For people who work on online web translation or web content filtering, the MIME Filter is their bread and butter, and its role and implementation mechanism have long been familiar; others may not know it, so here is a brief introduction. To make the IE kernel's functionality easy to extend, Microsoft stipulates that before the IE kernel displays content in certain standard formats (HTML, TEXT, etc.), the content is first passed to the filter registered for that format, the MIME Filter, which preprocesses it (for example, translating English into Chinese, or replacing obscene words with asterisks) before it is displayed.

By this principle, implementing a MIME Filter for the HTML format lets you intercept the most authentic HTML code. Unfortunately, in my attempts this trick worked on IE itself and on some e-books, but not on others. Besides, the IHTMLDocument2 route is much simpler and more reliable, so I did not use MIME Filters in the decompilation tools I developed later (KillEBook, IECracker and CtrlN). The method does have one advantage, though: it is platform-independent. I have tried it under 98/Me/2000/XP, all in virtual machines of course.

The mechanism and implementation of MIME Filters are explained in detail in MSDN, with full example code; search MSDN for "MIME Filter" if you need it.
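For orientation, a sketch of the registration step, which is the heart of the approach; CHtmlFilterFactory stands for a hypothetical IClassFactory that creates the actual filter object (an IInternetProtocol implementation along the lines of the MSDN sample):

```cpp
#include <windows.h>
#include <urlmon.h>
#pragma comment(lib, "urlmon.lib")

// Route all text/html content in this process through our filter before
// the IE kernel renders it. pFactory/clsidFilter come from the filter DLL.
HRESULT RegisterHtmlFilter(IClassFactory* pFactory, REFCLSID clsidFilter)
{
    IInternetSession* pSession = NULL;
    HRESULT hr = CoInternetGetSession(0, &pSession, 0);
    if (FAILED(hr)) return hr;
    hr = pSession->RegisterMimeFilter(pFactory, clsidFilter, L"text/html");
    pSession->Release();   // registration stays in effect for the process
    return hr;
}
```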

3. Methods to obtain images

Like HTML code, images go through a "download -> decode -> display" pipeline in the IE kernel. To keep the display code abstract, every image format (JPG, GIF, PNG, TIFF and so on) is represented uniformly as a bitmap after decoding, and the original-format data is released from memory once decoding finishes, leaving only a file backup in the IE cache; if the page forbids saving to the local cache, even that backup does not exist. When you choose "Save Picture As..." from IE's context menu, you are actually copying the backup file out of the cache; when no backup exists in the cache, the picture can only be saved as the in-memory bitmap (*.bmp). Now you see why some pictures that are clearly JPGs can only be saved as "untitled.bmp" in IE, right?

Getting an image file is therefore much harder than getting the HTML. Moreover, MSDN states clearly that the IHTMLDocument2 interface yields only image links, and a MIME Filter cannot capture the image data on a page either, so another way must be found. What I considered and tried includes:

First copy the image to the clipboard, fetch the image data from the clipboard, and re-encode it into the original format according to the image file extension (which can be parsed out of the URL of the image element): jpg, png, gif, tiff and so on. This is fairly simple to implement: search the MSDN KB for Q293125 for ready-made source code that copies an image to the clipboard, and refer to CxImage (also findable on Google) for image-encoding source code. But the method is far from perfect: a) for formats such as PNG and GIF that allow transparent backgrounds, the transparency is lost; b) animated GIFs stop animating, and only a single frame survives; c) lossy formats like JPG lose quality at every re-compression, and after a few rounds the image may become unrecognizable; d) an e-book can disable the clipboard with standard Windows API calls.
Navigate the IE kernel to the image, then obtain a copy of it through the IViewObject interface (see the sketch after this list). This is essentially the same as the method above, but since it bypasses the clipboard, it cannot be defeated by clipboard blocking.
Use an IE image-decoding plug-in. After the IE kernel downloads an image file in some format, it calls the corresponding decoder to decode it (a mechanism similar to the MIME Filter). For extensibility, decoders are implemented as plug-ins. Write an image-decoder plug-in of your own, intercept the decoding requests, and you can capture the original image data before it is decoded. The decoder interface and implementation are not covered in Microsoft's public documentation, but the legendary source code contains not only detailed interface specifications but also the implementations of several built-in image decoders for reference. Oddly, although MSDN has nothing on the subject, a Google search turned up a Japanese developer who had already published a detailed, step-by-step guide to implementing an image-decoder plug-in on his personal website, signed December 2002! It seems the source code leaked earlier than people thought. Of course, he might also work at Microsoft or have a partnership with it, and may even have been able to view the decoder source legitimately.
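A rough sketch of the IViewObject route (the second method above), assuming the IE kernel has already been navigated to the image; note the result is still a re-encoded bitmap, so the quality caveats of the clipboard method still apply:

```cpp
#include <windows.h>
#include <ole2.h>      // IViewObject
#include <mshtml.h>    // IHTMLDocument2

// Ask the document's view to draw itself into a memory DC and return the
// resulting bitmap (caller owns and must DeleteObject it).
HBITMAP SnapshotDocument(IHTMLDocument2* doc, int width, int height)
{
    IViewObject* view = NULL;
    if (FAILED(doc->QueryInterface(IID_IViewObject, (void**)&view)))
        return NULL;

    HDC screen = GetDC(NULL);
    HDC memDC = CreateCompatibleDC(screen);
    HBITMAP bmp = CreateCompatibleBitmap(screen, width, height);
    HGDIOBJ old = SelectObject(memDC, bmp);

    RECTL bounds = { 0, 0, width, height };
    // Render the current view (the decoded image) into our bitmap.
    view->Draw(DVASPECT_CONTENT, -1, NULL, NULL, NULL,
               memDC, &bounds, NULL, NULL, 0);

    SelectObject(memDC, old);
    DeleteDC(memDC);
    ReleaseDC(NULL, screen);
    view->Release();
    return bmp;
}
```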
4. Implementation of the general decompiler

With the acquisition of HTML and page elements solved, implementing the general decompiler KillEBook is straightforward. Its algorithm can be described as follows:

Open the e-book.
Locate the e-book's display window.
Get the HTML code of the currently displayed page.
Parse the page's HTML code and extract all the links in it (a sketch of this step follows the list).
Get the content of every element on the page, pictures included.
Drive the IE kernel to load each linked HTML page in turn.
Repeat steps 3 to 6 until all pages and the elements in them have been obtained.
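A sketch of step 4, using the IHTMLDocument2 pointer obtained earlier; error handling is minimal, and walking only the anchor collection is a simplification (frames and image maps would need extra work):

```cpp
#include <windows.h>
#include <mshtml.h>
#include <vector>
#include <string>

// Collect the absolute URLs of all <a> links in the current document.
std::vector<std::wstring> CollectLinks(IHTMLDocument2* doc)
{
    std::vector<std::wstring> urls;
    IHTMLElementCollection* links = NULL;
    if (FAILED(doc->get_links(&links)) || !links) return urls;

    long count = 0;
    links->get_length(&count);
    for (long i = 0; i < count; ++i) {
        VARIANT idx; VariantInit(&idx);
        idx.vt = VT_I4; idx.lVal = i;
        IDispatch* disp = NULL;
        if (SUCCEEDED(links->item(idx, idx, &disp)) && disp) {
            IHTMLAnchorElement* anchor = NULL;
            if (SUCCEEDED(disp->QueryInterface(IID_IHTMLAnchorElement,
                                               (void**)&anchor))) {
                BSTR href = NULL;
                if (SUCCEEDED(anchor->get_href(&href)) && href) {
                    urls.push_back(href);   // already an absolute URL
                    SysFreeString(href);
                }
                anchor->Release();
            }
            disp->Release();
        }
    }
    links->Release();
    return urls;
}
```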
5. Further discussion

After finishing KillEBook I realized that, suitably extended, it could become a new kind of offline browser and solve a problem that plagues traditional offline browsers (Offline Explorer Pro, WebZip, etc.): most of them have nothing to do with the IE kernel, so they are fine at grabbing static pages but struggle with dynamic pages maintained by sessions, to say nothing of HTTPS sites that require PKI certificate authentication.

So I'm considering implementing such an offline browser:

Provide an address bar for the user to enter the starting URL.
Embed a Microsoft WebBrowser control (the IE kernel) for user interaction, including entering a username/password on the page and selecting a certificate from the IE certificate store.
Once the user has logged in and reached the page to be crawled, set the recursion depth and URL filtering conditions and click the "Start" button to begin crawling.
The offline browser automatically steers the WebBrowser control through every page; on each page it obtains the client-side HTML source code and the page elements, including pictures, CSS, JS, Flash and so on, through the control.
An offline browser built this way works through the WebBrowser control, so it can keep the client session alive and capture dynamic pages (a navigation sketch follows). The pages become static once crawled, but that hardly matters for offline browsing; it is just right for sites such as paid online education.
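A minimal sketch of the crawl loop's core, assuming an embedded WebBrowser control exposed as an IWebBrowser2 pointer; pumping messages while waiting keeps the control alive (handling DWebBrowserEvents2::DocumentComplete would be the cleaner alternative):

```cpp
#include <windows.h>
#include <exdisp.h>    // IWebBrowser2

// Navigate the embedded control to a URL and wait until the page is loaded.
bool NavigateAndWait(IWebBrowser2* browser, const wchar_t* url)
{
    VARIANT empty; VariantInit(&empty);
    BSTR burl = SysAllocString(url);
    HRESULT hr = browser->Navigate(burl, &empty, &empty, &empty, &empty);
    SysFreeString(burl);
    if (FAILED(hr)) return false;

    READYSTATE state = READYSTATE_LOADING;
    while (state != READYSTATE_COMPLETE) {
        MSG msg;   // keep pumping so the control can finish loading
        while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE)) {
            TranslateMessage(&msg);
            DispatchMessage(&msg);
        }
        Sleep(50);
        browser->get_ReadyState(&state);
    }
    return true;   // now fetch the page via get_Document
}
```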

2.3 HLP format
This format appeared quite early: it was the standard help-file format under 16-bit Windows (versions before Windows 95), so it is probably one of the earliest e-book formats to appear under Windows.

Because the format is so widespread, it has been studied a great deal abroad, yet the only tool I have seen with published source code is HELPDECO v2.1. It is a console program, so someone wrapped it in a GUI shell called DuffOS. Someone in China has also produced a Chinese-localized HELPDECO; you can find it, complete with all the source code, by searching for "Chinese New Century" (a well-known localization site).

UnEBook uses the HELPDECO source code to implement batch decompilation of HLP files. In my experience, though, the original HELPDECO has a small flaw: the decompiled RTF files do not specify a character set. For English RTF this has no effect, but for Chinese RTF the effect is strong enough to greet you with piles of garbled text when the RTF is opened. There are two ways to correct it:

Open the decompiled RTF file in a text editor and specify the Chinese character set by hand. This is the tiring way.
Modify the HELPDECO source code to add character-set correction (a sketch of the fix follows), which solves the problem once and for all. Yet for some reason, the Chinese version released on Chinese New Century is still built from the original HELPDECO; apparently the localizers only ever used it to decompile English HLP files, not Chinese ones.
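A minimal sketch of fix 2, under the assumption that the generated RTF targets Simplified Chinese: code page 936 is GBK/GB2312, charset 134 is GB2312_CHARSET, and SimSun is an illustrative font choice:

```cpp
#include <stdio.h>

// Emit an RTF header that declares the Chinese code page and charset;
// without \ansicpg936 and \fcharset134, readers fall back to the ANSI
// charset and Chinese text opens as garbage.
void WriteRtfHeader(FILE* out)
{
    fputs("{\\rtf1\\ansi\\ansicpg936\\deff0"
          "{\\fonttbl{\\f0\\fnil\\fcharset134 SimSun;}}\n", out);
}
```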
There is another, more serious problem in this source code: variables are not initialized and released in any unified way, so not only does VC++ report memory leaks when the program exits, but, much like the DOS kernel of old, the code is essentially non-reentrant. I once tried to fix this, but after an afternoon of struggle two leaks still eluded me, so in the end I took a cue from DuffOS: encapsulate the HELPDECO code in a separate DLL and load and release that DLL dynamically for each HLP file to be decompiled (see the sketch below). On one hand, Windows' own DLL management mechanism papers over the memory leaks HELPDECO produces; on the other, it solves the reentrancy problem. The HLP decompilation feature of the commercial "Ye Shu Manufacturing" software is also implemented as a DLL file, so I strongly suspect its author ran into the same troubles, hehehe...
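A sketch of the per-file DLL trick; helpdeco.dll and its DecompileHlp export are hypothetical names for however you choose to wrap the HELPDECO code:

```cpp
#include <windows.h>

typedef int (*DecompileHlpFn)(const char* hlpPath, const char* outDir);

// Load the wrapper DLL, decompile one file, then unload. With the CRT
// statically linked into the DLL, unloading discards its heap and static
// state, so HELPDECO's leaks and non-reentrancy die with each instance.
int DecompileOneFile(const char* hlpPath, const char* outDir)
{
    HMODULE dll = LoadLibraryA("helpdeco.dll");
    if (!dll) return -1;
    int rc = -1;
    DecompileHlpFn fn = (DecompileHlpFn)GetProcAddress(dll, "DecompileHlp");
    if (fn) rc = fn(hlpPath, outDir);
    FreeLibrary(dll);
    return rc;
}
```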