Sunday, February 20, 2011

Day 1 of a Free Software Project

In yesterday's post I discussed a project I was taking on to create a free software program to read/write RAR version 3 archives. Since the version 3 format isn't documented publicly, there are currently (at least to my knowledge) no free software tools available to manage these files. Creating such a tool will require reverse engineering the file format, documenting it, and writing the software itself.

Last night, I began gathering the tools I'd need for the work. Not surprisingly, all of them are available as free or open source software, so getting everything together was a pretty straightforward process.

TOOLS:

- The GNU/Linux 'strings' program that searches a target file and lists all of the readable strings it finds in it.

- The 'file' program, which tries to identify the format of an unknown filetype. Yes, I already know the filetype in this case, but I still wanted to use the tool to see if it might identify the file in some weird way that might give me a clue as to how the data in it might be organized (it didn't).

- The 'Jeex' hex editor. This tool allows me to look directly at the binary content of the file and view it in several different ways, apply structures to it so I can test any ideas I have about file organization, etc.

- The GCC C compiler. This is the standard GNU C compiler which I will use to actually write the software code for the tool. After careful consideration, I'm also looking at potentially using REAL Studio, which implements a version of the BASIC programming language, to write the code but I'm struggling with the wisdom of using a non-free program to create a free program. Ethically, I would be imposing the requirement that anyone compiling the program from scratch use the REAL Studio tool and I'm not sure I'm comfortable with that. I'll probably stick with C but I'm waiting to hear back from the Free Software Foundation on using the non-free tool before I make my final decision. Writing actual code is still a bit down the road.

- Several RAR files. In order to make my analysis easier, I had a friend who uses Windows create several sample RAR files for me. Six to be exact. Because I know the exact contents of these files, I can easily pick out that content while analyzing the RAR file in a hex editor. Here are the details of the six RAR files I'm using:

1) One file contains nothing but a short text file that has a sentence in it.

2) One file contains the same text file with the same content but under a different name.

3) One file contains both text files. This will allow me to look for repeating patterns and make finding the content area of the RAR file a little easier.

4) The next three files are the binary equivalents of the text files, so I'm not going to detail them here.

In each case, I know the exact content of the unarchived file and its uncompressed file size. Knowing this allows me to have some idea of where data begins and ends (not exact, but it's a lead).
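Because the contents and names of the sample files are known in advance, hunting for them can also be done in code rather than only by eye in the hex editor. Here is a minimal sketch in C, assuming the whole archive has already been read into memory. The function name find_bytes is my own invention for illustration and is not part of any RAR tool or library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the offset of the first occurrence of
 * needle in hay at or after start, or -1 if there is none.
 * It compares raw bytes, so zero bytes inside the archive are fine. */
long find_bytes(const unsigned char *hay, size_t hay_len,
                const unsigned char *needle, size_t needle_len,
                size_t start)
{
    assert(hay != NULL && needle != NULL);
    if (needle_len == 0 || needle_len > hay_len)
        return -1;
    for (size_t i = start; i + needle_len <= hay_len; i++) {
        if (memcmp(hay + i, needle, needle_len) == 0)
            return (long)i;
    }
    return -1;
}
```

Calling it once with a known file name, and again starting just past the first hit, would give the offsets of each stored name. Those offsets are only a lead on where each entry begins, not a confirmed layout.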

So let's take a look at one of the RAR files in the hex editor. We're going to pick the archive that holds the two text files, which are identical in every way except their names.


By looking at the screenshot above, a few things become very obvious:

1) The very first thing in the RAR file (as we might expect) is a header identifying it as a RAR file. This header is either 3 or 4 bytes long, depending on whether the ! after the RAR indicates anything (I don't think it does).

2) We can see that there is a part of the file after the header that contains some unknown information followed by the name of the first file in the archive, some random looking characters, then another file name and some more random looking data. If we look at the 'random' data after the file names, we can see it's exactly the same in both areas. This is probably the data within the file in some sort of encrypted format. It could be other file information but I doubt it, since it's exactly the same in both places.

3) After the second set of 'random' data, which matches the first set, we see some other random data. I have no idea yet what this is.
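The first two observations can be checked mechanically as well. The sketch below, again in C, tests whether a buffer starts with the ASCII bytes "Rar" and counts how many bytes are identical starting at two offsets, for example the positions just after each file name. Both function names are hypothetical, and the code assumes nothing about the real format beyond what is visible in the hex view.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if the buffer begins with the ASCII
 * bytes "Rar", the marker visible at the very start of the file. */
int starts_with_rar(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    return len >= 3 && memcmp(buf, "Rar", 3) == 0;
}

/* Hypothetical helper: count how many consecutive bytes are identical
 * when reading from offset a and from offset b at the same time. */
size_t matching_run(const unsigned char *buf, size_t len, size_t a, size_t b)
{
    assert(buf != NULL);
    size_t n = 0;
    while (a + n < len && b + n < len && buf[a + n] == buf[b + n])
        n++;
    return n;
}
```

If the run that follows both file names comes out the same length in every sample archive, that would support the guess that it is the stored file content. It would not prove it.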

So, as you can see, I'm really just starting with this journey and I've not gotten very far in a day. But I do have some idea of where I'm going (I think). Since I've never reverse engineered a file format before, I could be totally off but I think my analysis is pretty solid so far. It's not a deep analysis so, if I'm wrong here, I'm likely going to be screwed once the deeper inspection starts.

So there you have it. My work for the day. Hopefully, as I move forward, I'll have more interesting things to share but at least it's a start!

1 comment:

zippy1981 said...

Anthony,

Can't 7-zip, and by extension p7zip unpack Rar version 3 files? Both are open source, and p7zip runs on linux.

Perhaps it would be more efficient to read the sources of those programs and document the format?

I do salute you for your "big hairy goal" of attempting to reverse engineer a compression algorithm. That is no small task.