Analysis of common e-book formats and ideas for decompiling them (page 3 of 3)

RTF files decompiled from HLP files generally contain many bookmarks, page breaks, and other elements unrelated to the actual text, so they need to be converted to plain text. The implementation is relatively simple:

Create a standard Windows RichEdit control; it does not need to be shown in the user interface.
Stream the original RTF file content in (StreamIn) using the SF_RTF format.
Stream the text content out (StreamOut) using the SF_TEXT format.
The batch RTF-to-TXT conversion provided by UnEBook is implemented in this way.
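
For readers who want to see the effect of the conversion without a RichEdit control, it can be approximated in a few lines of Python. This is a rough stand-in, not the StreamIn/StreamOut route that UnEBook uses: the name rtf_to_text, the list of skipped destinations, and the cp1252 decoding of \'hh escapes are all assumptions of the sketch, and it ignores much of RTF (Unicode escapes, tables, fields).

```python
import re

# Groups whose contents are not body text (a small, incomplete list).
SKIP_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict",
                     "header", "footer", "footnote"}

# One token per match: control word, \'hh escape, control symbol,
# brace, raw line break (ignored by RTF), or a plain character.
TOKEN = re.compile(
    r"\\([a-z]+)(-?\d+)? ?|\\'([0-9a-fA-F]{2})|\\(.)|([{}])|[\r\n]+|(.)",
    re.S,
)


def rtf_to_text(rtf: str) -> str:
    """Very simplified RTF-to-text conversion (illustration only)."""
    out = []
    stack = []        # "skipping" state saved at each opening brace
    skipping = False  # True while inside a group we do not want
    for match in TOKEN.finditer(rtf):
        word, _arg, hexbyte, symbol, brace, char = match.groups()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            skipping = stack.pop() if stack else False
        elif word:
            if word in SKIP_DESTINATIONS:
                skipping = True
            elif not skipping and word in ("par", "line", "page"):
                out.append("\n")   # page breaks become plain line breaks
            elif not skipping and word == "tab":
                out.append("\t")
        elif symbol:
            if symbol == "*":      # \* marks an optional destination (bookmarks etc.)
                skipping = True
            elif not skipping and symbol in "\\{}":
                out.append(symbol)
        elif hexbyte:
            if not skipping:
                out.append(bytes.fromhex(hexbyte).decode("cp1252", "replace"))
        elif char and not skipping:
            out.append(char)
    return "".join(out)
```

Bookmarks written as {\*\bkmkstart ...} groups are dropped and \page is reduced to a line break, which is roughly what one wants from HLP-derived RTF. A real RichEdit control handles far more of the format, which is why the approach above delegates the work to it.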

2.4 Novel Network/Novel World (ebx/XReader)
The e-books provided by these two websites use the same reader. Novel Network appeared earlier, and most of its e-books do not require a verification code; Novel World appeared later, and most of its e-books do.

E-books of this kind come in two formats, ebx and EXE. An ebx e-book must be read with the dedicated reader XReader; an EXE e-book is simply XReader bundled with an ebx package.

China Cyu has released a tool for decompiling this EXE format, xReader Unpacker. Judging from my trial, it appears to be based on a painstaking analysis of the EXE file format. Hardworking, honest Chinese people can indeed be found anywhere! However, my trial also turned up the following problems:

Only one file can be decompiled at a time; there is no batch mode, which is slightly inconvenient.
Each decompiled file is named after the corresponding node in the directory tree on the left, so the order of the files is completely lost.
Decompiling certain files, such as "The Law of Blood Rewards - The Survival Game in Chinese History", causes an error and the tool exits. My guess is that multi-level directories in the book are handled improperly.
Strangely, it can decompile only EXE files, not ebx files, even though the two are really the same thing.
Of course, I tried only the original version of xReader Unpacker. I later heard that the author had updated it, so these problems may have been solved.

When I considered decompiling e-books in this format, I was already thinking about a general decompilation method for the IE kernel, so I did not plan to start by analyzing the file format. Instead I planned to start with the interface elements and see whether they offered an opening:

I first used IECracker to inspect the window and found that it was not based on the IE kernel at all. My first thought was: has the software author copied Qidian Chinese, converting the content into pictures and then displaying those? This possibility was quickly ruled out. For one thing, XReader lets you enlarge and shrink the text; for another, when I launched Kingsoft PowerWord and put the cursor over the window, PowerWord showed the words it had captured. At that moment a decompilation plan flashed through my mind: simply imitate Kingsoft PowerWord, install an API hook, and capture the content as it is drawn, haha...
Having confirmed that XReader does not display pictures, I started SPY++ to see what the display window is made of. The result was surprising: each time XReader starts, the class name of the display window changes. It is a completely random string, so the class name reveals nothing about which control the window uses.
After reading a few more e-books, I noticed that they all share one feature: there are no pictures at all, only plain text, yet when the mouse is over the window the cursor stays an arrow instead of becoming the insertion cursor (a vertical line) of a text window. By this time I had begun to believe that the software author had fully inherited the Chinese people's glorious tradition of hard work and honesty and had written a text output control from scratch. ...Wait, why did the cursor flicker when this large file was opened, changing from a vertical line to an arrow? I then rolled the mouse wheel back and forth: each notch scrolled exactly 3 lines, no more and no less. Isn't that one of the hallmarks of the RichEdit control?!
I started SPY++ again at once. This time I ignored the class name and watched the message flow instead. Sure enough, each time I clicked a node in the directory tree on the left, a batch of RichEdit control messages was sent to the display window on the right: EM_SETBKGNDCOLOR (set the window background color), EM_SETCHARFORMAT (set the character format), EM_SETMARGINS (set the left and right page margins), EM_STREAMIN (stream in the content to display).
Now that the display area on the right is confirmed to be a standard RichEdit control and the directory tree on the left a standard TreeCtrl control, the decompilation scheme follows directly: traverse the directory tree on the left, select each node in turn, intercept the output of the RichEdit control on the right, and write it to a file.
However, once I understood how XReader works, I had a question: the RichEdit control can display text and pictures together (RTF format), so why does XReader display only plain text and no pictures? Pictures would add a lot to the e-books made with it. At first I thought this was for the sake of secrecy, and indeed it nearly led me astray: if I had not happened to see the cursor flicker and then tried the mouse wheel, it might not have occurred to me that a standard RichEdit control was in use. Later, after seeing an earlier version of XReader, I came to think compatibility was the likelier reason: the earlier version passed the display content with WM_SETTEXT, which can carry only plain text, and only later switched to EM_STREAMIN.
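
The traversal step described above (walk the directory tree, select each node, write its text to a file) can be sketched independently of Windows. In a real tool the tree would presumably be driven with TreeCtrl messages such as TVM_GETNEXTITEM and TVM_SELECTITEM; here the tree is modeled as nested (title, children) pairs, and capture is a hypothetical callback standing in for whatever reads the RichEdit content. Numbering the file names is my own addition, meant to avoid the lost file order noted earlier for xReader Unpacker.

```python
def walk(nodes, prefix=()):
    """Depth-first walk yielding (position path, title) for every node."""
    for index, (title, children) in enumerate(nodes, start=1):
        path = prefix + (index,)
        yield path, title
        yield from walk(children, path)


def export(tree, capture):
    """Return {file name: text}, with names that sort in tree order.

    `capture(path, title)` is a placeholder for selecting the node and
    reading the text shown on the right.
    """
    files = {}
    for path, title in walk(tree):
        number = ".".join(f"{n:03d}" for n in path)  # e.g. 002.001
        files[f"{number} {title}.txt"] = capture(path, title)
    return files
```

Because each level is zero-padded, sorting the resulting names alphabetically reproduces the order of the directory tree, including nodes nested several levels deep (up to 999 siblings per level in this sketch).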

To sum up, XReader takes the following measures to prevent copying and decompilation:

Randomly changing the class name of the RichEdit control so that it is not recognized.
Setting the cursor shape, which both hides the fact that a RichEdit control is in use and discourages selecting and copying content with the mouse.
Filtering out messages such as WM_COPY, WM_GETTEXT, and EM_STREAMOUT, so do not expect to get the text directly from the window.
Unfortunately, the RichEdit control provided by Microsoft is designed for an open environment. Once it has been identified, the interfaces Microsoft itself provides are enough to obtain the required content.

Later I saw EXE-format e-books released by Novel Network in its early days and found that XReader itself kept evolving, and that the purpose of each upgrade was to strengthen security. The ebx format, by contrast, has changed little and has always been stable; new ebx files can still be opened with an old XReader:

Early versions of XReader accept the path of the ebx file to open as a command-line parameter, which makes it easy for others to open files automatically. Later versions can open a file only through the "Open eBook" item in the menu or toolbar. This restriction can of course be worked around, but it is not as convenient as passing a command-line parameter.
Early versions of XReader actually used the WM_SETTEXT message to display text. Had I seen an e-book of this version earlier, I might have saved myself some trouble. Later versions use EM_STREAMIN, presumably partly for secrecy and partly for speed and capability: when displaying large files, EM_STREAMIN is much faster than WM_SETTEXT; EM_STREAMIN can display RTF files, while WM_SETTEXT can display only plain text; EM_STREAMIN can display large files, while WM_SETTEXT supports only a limited length.

3. Conclusion
Like attack and defense in information security, the struggle between compiling and decompiling e-books will be an endless loop. I believe that however e-book decompilation technology develops, it will not drive e-books to extinction; the practical need for them remains. The publication of this article, however, will undoubtedly prompt a new round of upgrades in e-book production software and techniques. Will my articles and software be upgraded in turn? I am not confident, since I have less and less free time. If no one else is willing to study decompilation technology and software as I do (free of charge), I think the final victory will go to e-book production software backed by commercial interests.
The usual approach is to analyze the detailed file format of an e-book and then release a dedicated decompiler for it. This works well in the early stage, but as the number of e-book formats grows, analyzing each one will sooner or later exhaust you.
E-book production software is developed by humans, and developers share a common human failing: laziness! As long as something ready-made is available, few people will spend the effort to build their own. At present, most components under Windows favor openness over security. If you can find a way into one of these components, you can break every e-book of the type that relies on it.
Using the interfaces or vulnerabilities of ready-made controls to implement general e-book decompilation is itself a manifestation of programmer laziness. Although this is much simpler than dutifully analyzing and tracing each e-book, it has an inherent flaw: it can decompile only content that is displayed in the control. Put simply, if an e-book is password-protected, this method cannot decompile its content without the password.

Appendix: Discussion of ways to implement an IE-kernel e-book
Having looked at so many e-books, I sometimes wonder what technology I would use if I were to write an e-book production tool myself. Given how universal HTML documents are today, and until someone produces a new HTML renderer, my ideas can only revolve around the IE kernel. Here are the ones that came to mind.

1. Based on the res protocol

The res protocol is a very simple protocol provided by the IE kernel that allows the pages to be browsed to be stored in the resources of an EXE or DLL. IE locates the EXE or DLL from the URL and loads the resource from it. The following URL is an example of this protocol:

res://C:\WINNT\system32\/http_404.htm

If the page you want to browse in IE does not exist, IE will open C:\WINNT\system32\ through this URL, find the resource named http_404.htm, extract and display it, and you will see a web page that prompts that the page does not exist.

From the source code of the above page, we can see that in addition to HTML code, the res protocol also allows the page to include pictures and other content. For example, the above page displays an image named, and its absolute URL is res://C:\WINNT\system32\/.

Although the res protocol is very simple and needs almost no extra programming, I have not seen anyone use it to make e-books; at most I have seen it used to display a program's About information. On reflection, this is probably because the protocol offers so little secrecy: any resource editor can directly extract and replace the resource content.

2. Based on file method

The idea is very simple: when a web page needs to be displayed, first extract it to a temporary directory, then display it with the IE control, and delete the temporary files on exit.
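
A minimal sketch of this extract-display-delete cycle follows, assuming for illustration that the book is packaged as a zip archive held in memory. The function names and the "ebook_" prefix are hypothetical; a real product would hand the extracted file to the IE control rather than read it back.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def make_demo_package() -> bytes:
    """Build a tiny in-memory book (stand-in for a real package)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("index.htm", "<html><body>Chapter 1</body></html>")
    return buffer.getvalue()


def unpack_book(package: bytes) -> tempfile.TemporaryDirectory:
    """Extract every page into a fresh temporary directory.

    The caller must call .cleanup() (or use `with`) to delete the files.
    """
    workdir = tempfile.TemporaryDirectory(prefix="ebook_")
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        archive.extractall(workdir.name)
    return workdir


if __name__ == "__main__":
    with unpack_book(make_demo_package()) as folder:
        # While the book is open, the pages sit on disk as ordinary files.
        print(sorted(p.name for p in Path(folder).iterdir()))
    # Leaving the `with` block removes the directory again.
```

The weakness is visible in the sketch itself: between extraction and cleanup, any other program can simply copy the directory.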

I have known this method for a long time, but it is so simple that I did not believe anyone would really use it to make e-books until I saw the e-books of Xiongfeng.com. These e-books require a password, but once the correct password is entered, the entire content is extracted to the temp directory and the files are opened with the IE control. The files in the temp directory are given the hidden attribute, but that trick is hardly worth mentioning, so once the password check is cracked, the e-book itself provides a complete decompilation function.

The website's e-books were later upgraded but kept this model, except that the HTML files in the temp directory are now encrypted while the image files are not, so I guess they have switched to MIME Filter technology.

3. Based on stream or method

The method of using a stream to write content into the IE control is discussed in detail on MSDN and CSDN, and source code is even available. If you need it, search MSDN for "Loading HTML content from a Stream".

It is more commonly used in dynamic web pages, and many web page encryption tools use this trick to hide the page source. In VC, Delphi, and the like, the same move is made through IHTMLDocument2::write, with the same effect.

Not many people use this method to make e-books, but some do. The one I have seen is Reading and Write Network. Because opening one of this website's e-books automatically sets the IE home page to the website's URL, I will not give the URL here, to avoid accidental harm. Someone has posted a method for cracking the payment verification of these e-books in the technical section of the Zichendian Network Forum; have a look if you are interested.

The limitations of this stream-based approach are clearly stated in MSDN:

The page cannot be too complicated. If the page contains too many tags, what is shown is not the parsed page but the raw HTML code. Perhaps for this reason, all the e-books released by Reading and Write Network have only plain text and a background color.
The URL of the current page never changes (for Reading and Write Network it is always about:blank), so the IE kernel cannot construct absolute URLs from relative ones. This is why the early e-books of Dixie.com, which used a jpg file as the page background, had to write the image to the temp directory and reference it from the page with an absolute URL. It is also why a page cannot contain links such as "Previous Page", "Next Page", or "Back to Directory"; the only option is a directory tree on the left, which the user clicks through page by page.
Since the pages of this kind of e-book have no URL of their own, they cannot be decompiled with KillEBook. They can only be captured manually, one page at a time, with IECracker or CtrlN.
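
The relative-URL limitation can be illustrated with Python's standard URL resolver. This is only a model of the general rule, not of the IE kernel itself, and the host, port, and file names below are made up for the example.

```python
from urllib.parse import urljoin


def resolve(page_url: str, reference: str) -> str:
    """Resolve a relative reference against the URL of the current page."""
    return urljoin(page_url, reference)


# A page written in through a stream stays at about:blank, which gives
# the resolver no directory to work from.
blank = resolve("about:blank", "images/bg.jpg")

# A page with a real URL of its own lets the same reference resolve.
served = resolve("http://127.0.0.1:8080/book/ch1.htm", "images/bg.jpg")

if __name__ == "__main__":
    print(blank)
    print(served)
```

With about:blank as the base, the reference comes back unchanged, so there is nothing to fetch; with a real page URL it becomes an absolute address. The same applies to a "Next Page" link such as ch2.htm.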

4. Use MIME Filter

Compared with the stream-based method, this method supports complex HTML pages with many tags and can also construct absolute URLs from relative ones, so links between pages work. The implementation is not complicated either, and MSDN has ready-made examples for reference.

However, its disadvantage is just as obvious: images and other content cannot be encrypted. The protocol plug-in method described below is better.

5. Based on web server

To outsiders, a "web server" may sound like something impressive, but to insiders the implementation is actually very simple:

Start a listening thread that listens on local port 80 or any other specified port.
Each time a connection request arrives, start a service thread that returns content according to the request and the HTTP protocol.
Plenty of ready-made web server code is available on codeguru and codeproject and can be used directly; you only need to decide how to fill in the returned content. The MSDN CD that comes with VC 6 also includes a sample called HTTPSVR that shows how to build a web server with MFC and WinSock.

Although this method is simple and direct, and with enough effort can simulate almost every function of a real web server (even an app server is not out of the question, though it takes some work), it has problems:

There is basically no secrecy at all. Once the server is running, any other process on the machine can easily download the content.
If other processes on the machine also provide TCP/IP services, port conflicts may occur.
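
The listener-plus-worker design, and both of the problems just listed, fit in a short sketch. It uses Python's standard http.server in place of the MFC/WinSock code mentioned above; the PAGES table, the class name, and the helper names are hypothetical. Passing port 0 asks the operating system for any free port, which is one way to sidestep port conflicts.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical book content, keyed by request path.
PAGES = {"/index.htm": b"<html><body>Chapter 1</body></html>"}


class BookHandler(BaseHTTPRequestHandler):
    """Runs in a service thread, one per incoming connection."""

    def do_GET(self):
        body = PAGES.get(self.path)
        if body is None:
            self.send_error(404, "No such page")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def start_server(port: int = 0) -> ThreadingHTTPServer:
    """Start the listening thread on 127.0.0.1; port 0 means "any free port"."""
    server = ThreadingHTTPServer(("127.0.0.1", port), BookHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(port: int, path: str):
    """What any other local process could do: just ask for the page."""
    connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        connection.request("GET", path)
        response = connection.getresponse()
        return response.status, response.read()
    finally:
        connection.close()
```

The fetch helper is the whole "decompiler": nothing in the server distinguishes the reader's own browser control from another program on the same machine.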

6. Protocol plug-ins (Asynchronous Pluggable Protocols)

This is an extension mechanism that Microsoft designed specifically for IE.

Common application-layer protocols on the Internet include HTTP, FTP, and so on. For various reasons, Microsoft allows users to add their own protocols alongside the standard application-layer protocols; these are called Asynchronous Pluggable Protocols. Searching MSDN, codeguru, and codeproject for these keywords turns up plenty of material, from theory to source code, so I will not go into detail here.

An Asynchronous Pluggable Protocol can be made valid for all processes by registering it under HKEY_CLASSES_ROOT\PROTOCOLS\Handler in the registry; it can also be made valid only within a particular process, for more secrecy. In that case, however, Microsoft calls it a Pluggable Namespace Handler rather than an Asynchronous Pluggable Protocol.

Since an Asynchronous Pluggable Protocol offers a degree of secrecy, has examples to follow, and can support web page display almost as fully as a web server, it is widely used in e-books. The ones I have seen include mk (chm), ada99 (eBook Workshop), wc2p (Web Compiler 2000), ic32pp (Web Compiler 2000, exe anti-decompilation format), e-book (E-Book Creator), mec (E-ditor eBook Compiler), etc. Used carelessly, however, this technique can leave garbage in the registry or garbage files behind (the plug-in itself is a COM component, generally implemented as a DLL, and must be registered in the registry before use).

7. The last move

Even with an Asynchronous Pluggable Protocol, the IE kernel still holds displayable HTML source, so the content can still be exported. This is what the main text has discussed at length.

The last trick I can think of for making a decompilation-proof e-book is to convert every page into a picture at production time and then package the pictures. For source code that converts web pages into images, see here:

/internet/

With this method, someone who obtains a finished e-book probably has only two ways to recover the original text: OCR and keying it in. Even these can be countered in the manner of the Qidian Chinese website: handwriting-style text, watermarks, deliberate typos, replaced punctuation, and so on. Rumor has it that Qidian generates the typos and wrong punctuation from the user ID, so text recovered by keying in or OCR may still be traced.
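
How punctuation derived from a user ID could make a leaked copy traceable can be shown with a toy sketch. This is purely my guess at the general idea, not a description of what Qidian actually does: the comma-to-semicolon swap, the use of SHA-256, and both function names are assumptions made for illustration.

```python
import hashlib


def fingerprint(text: str, user_id: str, swap=(",", ";")) -> str:
    """Replace some commas with semicolons in a pattern derived from user_id."""
    out = []
    count = 0  # index of the current comma within the text
    for ch in text:
        if ch == swap[0]:
            digest = hashlib.sha256(f"{user_id}:{count}".encode()).digest()
            if digest[0] % 2:  # a user-specific yes/no for this comma
                ch = swap[1]
            count += 1
        out.append(ch)
    return "".join(out)


def identify(leaked: str, candidates, original: str):
    """Return the users whose personalized copy matches the leaked text."""
    return [user for user in candidates
            if fingerprint(original, user) == leaked]
```

Each reader receives a slightly different copy, and a copy typed in or scanned faithfully carries its pattern along. With only a handful of commas two users can end up with the same pattern, so a real scheme would need many more marking positions than this toy one.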

But on reflection, an e-book production tool that has come to this is probably not far from dying. Users might as well produce PDF directly:

No dynamic effects are available and none of the links on a page work. Probably the only navigation possible is a directory tree on the left.
Page size and character size are basically fixed, and zooming in or out is difficult. Zooming in especially is either slow or leaves ugly jagged edges.
File size increases greatly. For e-books meant for collecting, this must be taken seriously.
