pdb's Introduction

pdb

This is a Rust library that parses Microsoft PDB (Program Database) files. These files contain debugging information produced by most compilers that target Windows, including information about symbols, types, modules, and so on.

The PDB format is not documented per sé, but Microsoft has published information in the form of C++ code relating to its use. The PDB format is full of... history, including support for debugging 16-bit executables, COBOL user-defined types, and myriad other features. pdb does not understand everything about the PDB format, but it does cover enough to be useful for typical programs compiled today.

Documentation on docs.rs.

Design

pdb's design objectives are similar to gimli:

  • pdb works with the original data as it's formatted on-disk as long as possible.

  • pdb parses only what you ask.

  • pdb can read PDBs anywhere. There's no dependency on Windows, on the DIA SDK, or on the target's native byte ordering.

Usage Example

use pdb::FallibleIterator;
use std::fs::File;

fn main() -> pdb::Result<()> {
    let file = File::open("fixtures/self/foo.pdb")?;
    let mut pdb = pdb::PDB::open(file)?;

    let symbol_table = pdb.global_symbols()?;
    let address_map = pdb.address_map()?;

    let mut symbols = symbol_table.iter();
    while let Some(symbol) = symbols.next()? {
        match symbol.parse() {
            Ok(pdb::SymbolData::Public(data)) if data.function => {
                // we found the location of a function!
                let rva = data.offset.to_rva(&address_map).unwrap_or_default();
                println!("{} is {}", rva, data.name);
            }
            _ => {}
        }
    }

    Ok(())
}

Example Programs

Run with cargo run --release --example <name>:

  • pdb_symbols is a toy program that prints the name and location of every function and data value defined in the symbol table.

  • pdb2hpp is a somewhat larger program that prints an approximation of a C++ header file for a requested type given only a PDB.

  • pdb_lines outputs line number information for every symbol in every module contained in a PDB.

Real-world examples:

  • mstange/pdb-addr2line resolves addresses to function names, and to file name and line number information, with the help of a PDB file. Inline stacks are supported.

  • getsentry/symbolic is a high-level symbolication library supporting most common debug file formats, demangling, and more.

License

Licensed under either of

  • Apache License, Version 2.0

  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

pdb's People

Contributors

calixteman, jan-auer, jarveson, jrmuizel, ko1n, landaire, luser, mahkoh, mcnulty, mitsuhiko, mstange, nbaksalyar, philipc, qnighy, razrfalcon, rlabrecque, schultetwin1, swatinem, willglynn

pdb's Issues

CodeView parsing in public crate interface

Hi,
I work on a toy Windows linker written in Rust. I would like to use this crate for creating PDBs (I am aware that writing PDBs is not currently supported, per #16). Before writing a PDB, I have to parse and deduplicate the .debug$S and .debug$T sections in object files, which contain CodeView-related information.

Would it make sense from your PoV to roll out CodeView parsing as a part of pdb's public API (perhaps gated behind a feature flag)? I understand that's not the main use case of pdb, however IMHO it would make sense for pdb to host this kind of functionality.
I'd be happy to work on API changes and eventually #16 down the road.

Section offsets are 1-indexed

#93 improved module references by using optional 0-based usize indexes instead of requiring callers to subtract 1 to resolve modules. In #93 (comment), @mstange points out that the same is true for PdbInternalSectionOffset::section:

it was a section index, not a module index. sections.get((section_index - 1) as usize)

The most common usage for section offsets so far was to convert them to RVAs via an OMAP. Internally, this ran the checked sub (going back to #87):

https://github.com/willglynn/pdb/blob/e1b86a2c5c8e2c20dc0be9909baecaf1fdf0aa8a/src/omap.rs#L440-L441

However, since there are other direct uses of the section offset, it would be safer to make section: Option<usize> as well.

Add support for LF_TAGGED_UNION, LF_TUCASE

These appear to have been added quite recently to the format; they're present in the latest DIA SDK that comes with Visual Studio 2022 (17.7.6)

LF_TAGGED_UNION = 0x151e
Appears to have the same structure as LF_UNION2
Has a UdtType of UdtTaggedUnion

LF_TUCASE = 0x1520
Corresponds to SymTagTaggedUnionCase in the latest cvconst.h (Visual Studio 2022, 17.7.6)
Unsure of which symbol record that SymTagTaggedUnionCase corresponds to; they don't appear to have added any additional S_ symbol records that you don't already have.
Appears to have a type index and a name as members.

Wrong PDBInformation GUID endianness

The endianness of the guid in PDBInformation seems to be off compared to llvm-pdbutil, specifically its first three fields:

llvm-pdbutil:  98D1EFE8-6EF8-FE45-9DDB-E11382B5D1C9
willglynn/pdb: e8efd198-f86e-45fe-9ddb-e11382b5d1c9

The problem seems to originate in manually parsing the UUID fields with platform-endianness here:
https://github.com/willglynn/pdb/blob/7947bcc3cc1f8f7ab2952155f83ed1bd8dd0eb05/src/pdbi.rs#L159-L164

After changing this to Uuid::from_bytes(buf.take(16)?), the output is correct. I'll submit a patch shortly, after running some more tests. I did not verify that llvm-pdbutil is correct in this case, but I'm assuming so foolishly 😄
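To make the difference concrete, here is a minimal std-only sketch that formats the same 16 bytes both ways. The byte values are inferred from the two GUID strings quoted above, and the function names are hypothetical, not part of pdb or the uuid crate. It does not settle which reading is the right one; it only shows that the two outputs differ exactly in the byte order of the first three fields.

```rust
/// Joins the three leading GUID fields and the 8 trailing bytes into the
/// usual 8-4-4-4-12 textual form.
fn format_guid(d1: u32, d2: u16, d3: u16, tail: &[u8]) -> String {
    let node: String = tail[2..8].iter().map(|b| format!("{:02X}", b)).collect();
    format!(
        "{:08X}-{:04X}-{:04X}-{:02X}{:02X}-{}",
        d1, d2, d3, tail[0], tail[1], node
    )
}

/// Reads the first three fields big-endian, i.e. prints bytes in stored order.
fn guid_stored_order(b: &[u8; 16]) -> String {
    let d1 = u32::from_be_bytes([b[0], b[1], b[2], b[3]]);
    let d2 = u16::from_be_bytes([b[4], b[5]]);
    let d3 = u16::from_be_bytes([b[6], b[7]]);
    format_guid(d1, d2, d3, &b[8..])
}

/// Reads the first three fields little-endian; the last 8 bytes are untouched.
fn guid_le_fields(b: &[u8; 16]) -> String {
    let d1 = u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
    let d2 = u16::from_le_bytes([b[4], b[5]]);
    let d3 = u16::from_le_bytes([b[6], b[7]]);
    format_guid(d1, d2, d3, &b[8..])
}

fn main() {
    // Assumed on-disk bytes, reconstructed from the strings in this issue.
    let raw: [u8; 16] = [
        0x98, 0xD1, 0xEF, 0xE8, 0x6E, 0xF8, 0xFE, 0x45,
        0x9D, 0xDB, 0xE1, 0x13, 0x82, 0xB5, 0xD1, 0xC9,
    ];
    println!("stored order:         {}", guid_stored_order(&raw));
    println!("little-endian fields: {}", guid_le_fields(&raw));
}
```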

For testing, I was using the Electron v2.0.11 x64 PDB. You can obtain an archive here (sorry, it's rather large).

Add support for small MSF file format

Currently it's just a hard error.

The specific use case is writing a utility for assisting in decompiling applications compiled with Visual Studio C++ 6.0. Currently it seems the only other Rust option is to parse the output of various Windows-only PDB parser executables without source code available.

Let me know if example files are needed, and what they should contain.

C++ 20's char8_t isn't supported

Hi!

When parsing type declarations containing fields with the primitive type associated with C++ 20's char8_t type, the following error is generated: Type 124 not found.
So the type index appears to be 0x7c, even though it's not been documented in the microsoft-pdb repository.

This comment (and thread) mentions the issue as well: microsoft/microsoft-pdb#51 (comment)

That'd be great if support for this primitive type could be added. I can make the PR if needed, as it shouldn't be too much of a change in the code base.

Add support for FastLink symbols (S_REF_MINIPDB2 / S_FASTLINK)

Visual Studio 2017 added a /DEBUG:FASTLINK option:

The /DEBUG:FASTLINK option is available in Visual Studio 2017 and later. This option leaves private symbol information in the individual compilation products used to build the executable. It generates a limited PDB that indexes into the debug information in the object files and libraries used to build the executable instead of making a full copy. This option can link from two to four times as fast as full PDB generation, and is recommended when you are debugging locally and have the build products available. This limited PDB can't be used for debugging when the required build products are not available, such as when the executable is deployed on another computer. In a developer command prompt, you can use the mspdbcmf.exe tool to generate a full PDB from this limited PDB. In Visual Studio, use the Project or Build menu items for generating a full PDB file to create a full PDB for the project or solution.

PDB files created with this option contain S_REF_MINIPDB2 / S_FASTLINK symbols.

It would be great to add parsing support for this symbol type. #118 adds the constant but not the parsing.

Bad lengths are returned by C13InlineeLineIterator

Currently, lengths are calculated as self.code_offset.offset - self.code_offset_base unless there's a ChangeCodeLength. However, code_offset_base doesn't change unless there's a ChangeCodeOffsetBase, which doesn't seem to be emitted in the wild. Instead, the length should be calculated by subtracting the last code offset.
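A rough sketch of the proposed rule, using a hypothetical LineRecord type rather than the crate's iterator state: each record's length becomes the distance to the next record's code offset, and the final record stays open until something else (such as an explicit length opcode) closes it. It assumes the offsets arrive in non-decreasing order.

```rust
/// Hypothetical line record: a code offset plus a length once it is known.
#[derive(Debug, PartialEq)]
struct LineRecord {
    offset: u32,
    length: Option<u32>,
}

/// Assigns each record the distance to the following code offset.
/// The last record keeps `None` because nothing follows it.
fn assign_lengths(code_offsets: &[u32]) -> Vec<LineRecord> {
    let mut records: Vec<LineRecord> = Vec::with_capacity(code_offsets.len());
    for &offset in code_offsets {
        if let Some(prev) = records.last_mut() {
            // checked_sub yields None if the offsets are not non-decreasing.
            prev.length = offset.checked_sub(prev.offset);
        }
        records.push(LineRecord { offset, length: None });
    }
    records
}

fn main() {
    for record in assign_lengths(&[0x10, 0x18, 0x30]) {
        println!("{:#x}: {:?}", record.offset, record.length);
    }
}
```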

cc: @jan-auer

PdbInternalSectionOffset should derive Ord

It would be nice to be able to binary search offsets, and to use them as keys in a BTreeMap.

Here's a spot in pdb-addr2line which would be improved by an implementation of Ord on PdbInternalSectionOffset: https://github.com/mstange/pdb-addr2line/blob/f6cd33ab2192c69da6d79f44087ab41bb39804aa/src/lib.rs#L1165-L1168

Here's another spot, and in this case the exact ordering actually makes a difference:
https://github.com/mstange/pdb-addr2line/blob/f6cd33ab2192c69da6d79f44087ab41bb39804aa/src/lib.rs#L742-L756

We should order by section index first, and then by section-internal offset.
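As an illustration of that ordering, a derived Ord compares fields in declaration order, so declaring the section before the offset gives "section first, then offset" for free. SectionOffset below is a hypothetical stand-in, not the crate's PdbInternalSectionOffset.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in. Field order matters: the derived `Ord` compares
/// `section` first and only then `offset`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct SectionOffset {
    section: u16,
    offset: u32,
}

fn main() {
    let mut offsets = vec![
        SectionOffset { section: 2, offset: 0x10 },
        SectionOffset { section: 1, offset: 0x500 },
        SectionOffset { section: 1, offset: 0x20 },
    ];
    offsets.sort();

    // With a total order, binary search works directly on the sorted list.
    let needle = SectionOffset { section: 1, offset: 0x500 };
    println!(
        "section {} offset {:#x} found at {:?}",
        needle.section,
        needle.offset,
        offsets.binary_search(&needle)
    );

    // The same order makes the type usable as a BTreeMap key.
    let mut names = BTreeMap::new();
    names.insert(needle, "some_function");
    println!("{:?}", names.get(&needle));
}
```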

Add OMAP-based address translation

There is an error when parsing the PDB for the Windows 7 kernel binary, something related to the offset. With a Windows 10 ntoskrnl.exe, it works fine.

Parsing NtWaitForSingleObject reports offset 0x4aeb0 in section C (12), which is wrong. It should report offset 0x000ac8c0. I've tested other symbols with the same result.

Here are the attached files so you can test it:

windows7kernel.zip

I will try to figure out what's happening, but I'm not that familiar with the PDB internals.

Document the correct way to look up LineInfo(s) given only RVA

I've been trying to correctly convert an RVA to LineInfo struct(s). However, it seems to be quite a roundabout process, and I'm not confident that I've arrived at the most correct or efficient solution.

My current approach is, for each RVA I want to look up the line for (cribbing off the example in the repo):

  • Convert the RVA to a PdbInternalSectionOffset
  • Iterate through DebugInformation::modules
  • For each module I am getting a LineProgram
  • Then call lines_for_symbol on the LineProgram with the PdbInternalSectionOffset
  • For each line returned by that I am checking if line.offset == the PdbInternalSectionOffset

This multi-nested loop seems to be inefficient and I think is not even returning entirely correct results. Is there a better way?

Add support for reading public symbol stream

The public symbol stream contains both a sorted list of all public symbols (which can be used to quickly locate a symbol given a section index and address by binary searching) as well as a hash table for looking up symbols by mangled names. These both seem like extremely useful operations and it would be great to support them. The index of this stream is in DBIHeader::ps_symbols_stream.

The Microsoft C++ code for this stream (+ the global symbol stream) is a nightmare:
https://github.com/microsoft/microsoft-pdb/blob/082c5290e5aff028ae84e43affa8be717aa7af73/PDB/dbi/gsi.h
https://github.com/microsoft/microsoft-pdb/blob/082c5290e5aff028ae84e43affa8be717aa7af73/PDB/dbi/gsi.cpp

The implementation of PSGSI1::NearestSym is interesting. That's what implements the binary search to look up a symbol by section index and address.

But LLVM has an implementation that's a lot more readable:
https://github.com/llvm-mirror/llvm/blob/a27c90973c8a1d338f36b8148b687ab078de48aa/include/llvm/DebugInfo/PDB/Native/RawTypes.h
https://github.com/llvm-mirror/llvm/blob/6b547686c5410b7528212e898fe30fc7ee7a70a3/include/llvm/DebugInfo/PDB/Native/PublicsStream.h
https://github.com/llvm-mirror/llvm/blob/6b547686c5410b7528212e898fe30fc7ee7a70a3/lib/DebugInfo/PDB/Native/PublicsStream.cpp
https://github.com/llvm-mirror/llvm/blob/6b547686c5410b7528212e898fe30fc7ee7a70a3/include/llvm/DebugInfo/PDB/Native/GlobalsStream.h
https://github.com/llvm-mirror/llvm/blob/6b547686c5410b7528212e898fe30fc7ee7a70a3/lib/DebugInfo/PDB/Native/GlobalsStream.cpp
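For readers who want the gist of the address lookup without reading either code base, here is a minimal sketch of a "nearest symbol" binary search over a list sorted by (section, offset). PublicSym and nearest_symbol are hypothetical names; this is not a port of PSGSI1::NearestSym or of LLVM's code, and it ignores the hash table entirely.

```rust
/// Hypothetical, simplified public symbol record.
#[derive(Debug, Clone, PartialEq)]
struct PublicSym {
    section: u16,
    offset: u32,
    name: &'static str,
}

/// Returns the symbol at or immediately before the queried address, as long
/// as it lives in the same section. `sorted` must be ordered by
/// (section, offset).
fn nearest_symbol(sorted: &[PublicSym], section: u16, offset: u32) -> Option<&PublicSym> {
    // Index of the first symbol that sorts strictly after the query.
    let idx = sorted.partition_point(|s| (s.section, s.offset) <= (section, offset));
    let candidate = sorted.get(idx.checked_sub(1)?)?;
    if candidate.section == section {
        Some(candidate)
    } else {
        None
    }
}

fn main() {
    let symbols = vec![
        PublicSym { section: 1, offset: 0x00, name: "alpha" },
        PublicSym { section: 1, offset: 0x40, name: "beta" },
        PublicSym { section: 2, offset: 0x10, name: "gamma" },
    ];
    let hit = nearest_symbol(&symbols, 1, 0x50);
    println!("{:?}", hit.map(|s| s.name));
}
```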

Modifying PDBs

Is it possible to modify parts of a PDB or rewrite it entirely with this? Browsing through the docs, it looks like it only reads PDBs.

The reason I ask is that I'm considering rewriting my C++ tool Ducible in Rust using this crate. The tool rewrites PDBs to remove non-deterministic data. By far, the greatest effort was in deciphering the PDB format from Microsoft's pile of 💩. (It's good that the LLVM guys have documented this a little bit.) So, I'd be happy to switch to a good PDB parsing library and gain Rust's ability to produce rock-solid software.

If you think writing PDBs falls into the purview of this library and it isn't too difficult to add, I could take a stab at implementing it with some guidance.

I'm currently doing it by having an abstract stream type where the stream could either be on-disk or in-memory. Then, an MSF stream can be replaced with an in-memory stream before writing everything back out to disk. In this way, streams are essentially copy-on-write. Doing it like this in Rust could be difficult with the ownership system, so I don't think this is the best approach. I'm definitely open to any good ideas about how to do this.
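One possible shape for the copy-on-write idea in Rust is std::borrow::Cow, sketched below. The Stream type here is hypothetical and uses a borrowed byte slice to stand in for data that still lives in the original file; nothing like this exists in pdb, and a real writer would also have to deal with MSF page allocation.

```rust
use std::borrow::Cow;

/// Hypothetical stream that borrows the original bytes until first modified.
struct Stream<'a> {
    data: Cow<'a, [u8]>,
}

impl<'a> Stream<'a> {
    fn from_original(bytes: &'a [u8]) -> Self {
        Stream { data: Cow::Borrowed(bytes) }
    }

    /// True once the stream has been copied into memory by a write.
    fn is_modified(&self) -> bool {
        matches!(self.data, Cow::Owned(_))
    }

    /// Overwrites bytes in place; returns false if the range is out of bounds.
    /// The first successful write copies the original data (copy-on-write).
    fn write_at(&mut self, pos: usize, bytes: &[u8]) -> bool {
        let end = match pos.checked_add(bytes.len()) {
            Some(end) if end <= self.data.len() => end,
            _ => return false,
        };
        self.data.to_mut()[pos..end].copy_from_slice(bytes);
        true
    }

    fn bytes(&self) -> &[u8] {
        &self.data
    }
}

fn main() {
    let original = [1u8, 2, 3, 4];
    let mut stream = Stream::from_original(&original);
    println!("modified before write: {}", stream.is_modified());
    stream.write_at(1, &[9, 9]);
    println!("modified after write:  {}", stream.is_modified());
    println!("stream: {:?}, original: {:?}", stream.bytes(), original);
}
```

Untouched streams stay borrowed, so writing the file back out only needs to special-case the streams for which is_modified() returns true.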

P.S. Thanks for writing this library. The PDB format is a real pain in the arse to parse.

Consider adding separate structs for the variants of SymbolData and TypeData

I think this is a bit more ergonomic for variants with large numbers of fields, so that you don't have to type them all out in the match. It also lets you pass the variant data to another function without needing to either copy the fields or do another match in that function.
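A small sketch of the layout being suggested, with made-up field sets (the real records carry more fields): each variant wraps its own struct, so a match can bind the whole payload and hand it to a helper.

```rust
/// Hypothetical payload structs; fields are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct PublicSymbol {
    function: bool,
    offset: u32,
    name: String,
}

#[derive(Debug, Clone, PartialEq)]
struct DataSymbol {
    global: bool,
    offset: u32,
    name: String,
}

/// Each variant wraps a struct instead of listing fields inline.
#[derive(Debug, Clone, PartialEq)]
enum SymbolData {
    Public(PublicSymbol),
    Data(DataSymbol),
}

/// A helper can take the variant's data directly, without another match.
fn describe_public(sym: &PublicSymbol) -> String {
    format!("{} @ {:#x} (function: {})", sym.name, sym.offset, sym.function)
}

fn describe(data: &SymbolData) -> String {
    match data {
        SymbolData::Public(sym) => describe_public(sym),
        SymbolData::Data(sym) => {
            format!("{} @ {:#x} (global: {})", sym.name, sym.offset, sym.global)
        }
    }
}

fn main() {
    let sym = SymbolData::Public(PublicSymbol {
        function: true,
        offset: 0x1000,
        name: "main".to_string(),
    });
    println!("{}", describe(&sym));
}
```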

Also, thanks a lot for creating this crate, it's saved me a lot of work :)

Take advantage of the type index offset data

From LLVM:

/// Type streams in PDBs contain an additional field which is a list of pairs
/// containing indices and their corresponding offsets, roughly every ~8KB of
/// record data.  This general idea need not be confined to PDBs though.  By
/// supplying such an array, the producer of a type stream can allow the
/// consumer much better access time, because the consumer can find the nearest
/// index in this array, and do a linear scan forward only from there.
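
The lookup that comment describes could look roughly like this. It is a sketch under the assumption that the index/offset pairs are available as a slice sorted by type index; scan_start is a hypothetical name, and the table values in main are made up.

```rust
/// Given (type index, byte offset) pairs sorted by type index, returns the
/// pair at or before `wanted`: the point where a forward linear scan of the
/// type records would need to begin.
fn scan_start(index_offsets: &[(u32, u32)], wanted: u32) -> Option<(u32, u32)> {
    // Index of the first pair whose type index is greater than `wanted`.
    let idx = index_offsets.partition_point(|&(index, _)| index <= wanted);
    index_offsets.get(idx.checked_sub(1)?).copied()
}

fn main() {
    // Made-up table: one entry roughly every 8 KB of record data.
    let table = [(0x1000, 0), (0x1200, 8192), (0x1400, 16400)];
    println!("{:?}", scan_start(&table, 0x1250));
}
```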

Question regarding `FieldAttributes::is_intro_virtual`

Hi,

I'm currently trying to write some code based on the pdb2hpp example.
I'm having trouble detecting/reconstructing pure virtual methods correctly. It seems it's not possible to check for the pureintro (0x06) property only, like it's possible with the purevirt property (0x01):
https://github.com/willglynn/pdb/blob/7c35c3c82fe42a0aa505c0715d57f68ee93196fb/src/tpi/data.rs#L627

What's the reason behind this merge (between property 0x04 and 0x06)? Am I missing something?

Integer truncation following `parse_unsigned` in src/tpi/data.rs

There are a few calls to parse_unsigned to parse variable-sized integers from the PDB file, but in a few cases immediately afterwards the u64 result is truncated.
This results in pdb using and returning erroneous data.

For example, here:
https://github.com/willglynn/pdb/blob/7c35c3c82fe42a0aa505c0715d57f68ee93196fb/src/tpi/data.rs#L120

The offset member of that structure should be 64-bits to avoid truncation.
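A tiny illustration of the failure mode, using hypothetical helper names: an `as` cast silently keeps only the low bits, while a checked conversion surfaces the problem to the caller.

```rust
use std::convert::TryFrom;

/// What an `as` cast does: silently keeps only the low 16 bits.
fn narrow_lossy(value: u64) -> u16 {
    value as u16
}

/// A checked alternative: `None` when the value does not fit.
fn narrow_checked(value: u64) -> Option<u16> {
    u16::try_from(value).ok()
}

fn main() {
    let parsed: u64 = 0x1_0004; // does not fit in 16 bits
    println!("lossy:   {:#x}", narrow_lossy(parsed));
    println!("checked: {:?}", narrow_checked(parsed));
}
```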

Edit: This is fixed in PR #103

ImageSectionHeader virtual_size is wrongly named physical_address

ImageSectionHeader currently has a physical_address member. This appears to be the wrong name for that field. Instead, that field should be called virtual_size.

Compare the following outputs:

Microsoft (R) Debugging Information Dumper  Version 14.00.23611
Copyright (C) Microsoft Corporation.  All rights reserved.

[...]

SECTION HEADER #1
.textbss name
   10000 virtual size
    1000 virtual address
       0 size of raw data
       0 file pointer to raw data
       0 file pointer to relocation table
       0 file pointer to line numbers
       0 number of relocations
       0 number of line numbers
E00000A0 flags
         Code
         Uninitialized Data
         (no align specified)
         Execute Read Write

SECTION HEADER #2
   .text name
   1C572 virtual size
   11000 virtual address
   1C600 size of raw data
     400 file pointer to raw data
       0 file pointer to relocation table
       0 file pointer to line numbers
       0 number of relocations
       0 number of line numbers
60000020 flags
         Code
         (no align specified)
         Execute Read

SECTION HEADER #3
  .rdata name
    4C64 virtual size
   2E000 virtual address
    4E00 size of raw data
   1CA00 file pointer to raw data
       0 file pointer to relocation table
       0 file pointer to line numbers
       0 number of relocations
       0 number of line numbers
40000040 flags
         Initialized Data
         (no align specified)
         Read Only

&sections = [
    ImageSectionHeader {
        name(): ".textbss",
        physical_address: 0x10000,
        virtual_address: 0x1000,
        size_of_raw_data: 0,
        pointer_to_raw_data: 0x0,
        pointer_to_relocations: 0x0,
        pointer_to_line_numbers: 0x0,
        number_of_relocations: 0,
        number_of_line_numbers: 0,
        characteristics: 0xe00000a0,
    },
    ImageSectionHeader {
        name(): ".text",
        physical_address: 0x1c572,
        virtual_address: 0x11000,
        size_of_raw_data: 116224,
        pointer_to_raw_data: 0x400,
        pointer_to_relocations: 0x0,
        pointer_to_line_numbers: 0x0,
        number_of_relocations: 0,
        number_of_line_numbers: 0,
        characteristics: 0x60000020,
    },
    ImageSectionHeader {
        name(): ".rdata",
        physical_address: 0x4c64,
        virtual_address: 0x2e000,
        size_of_raw_data: 19968,
        pointer_to_raw_data: 0x1ca00,
        pointer_to_relocations: 0x0,
        pointer_to_line_numbers: 0x0,
        number_of_relocations: 0,
        number_of_line_numbers: 0,
        characteristics: 0x40000040,
    },

"Attempt to subtract with overflow" panic when calling `PdbInternalSectionOffset::to_rva`

When parsing private symbols in my application I get the following panic:

thread 'main' panicked at 'attempt to subtract with overflow', C:\Users\user\.cargo\registry\src\github.com-1ecc6299db9ec823\pdb-0.6.0\src\omap.rs:451:32
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/7f7a1cbfd3b55daee191247770627afab09eece2\/library\std\src\panicking.rs:483
   1: core::panicking::panic_fmt
             at /rustc/7f7a1cbfd3b55daee191247770627afab09eece2\/library\core\src\panicking.rs:85
   2: core::panicking::panic
             at /rustc/7f7a1cbfd3b55daee191247770627afab09eece2\/library\core\src\panicking.rs:50
   3: pdb::omap::get_virtual_address
             at C:\Users\user\.cargo\registry\src\github.com-1ecc6299db9ec823\pdb-0.6.0\src\omap.rs:451
   4: pdb::common::PdbInternalSectionOffset::to_internal_rva
             at C:\Users\user\.cargo\registry\src\github.com-1ecc6299db9ec823\pdb-0.6.0\src\omap.rs:577
   5: pdb::common::PdbInternalSectionOffset::to_rva
             at C:\Users\user\.cargo\registry\src\github.com-1ecc6299db9ec823\pdb-0.6.0\src\omap.rs:570

get_virtual_address is implemented as follows:

fn get_virtual_address(sections: &[ImageSectionHeader], section: u16, offset: u32) -> Option<u32> {
    let section = sections.get(section as usize - 1)?;
    Some(section.virtual_address + offset)
}

I believe that this would indicate section has a value of 0 in this context but I have not dumped it yet. Unfortunately I cannot provide a reproducible testcase considering the PDB is private. I can however provide any additional metadata required to help debug the issue in full.
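If section really is 0 here, `section as usize - 1` underflows and panics in a debug build. A defensive variant could use checked arithmetic so that a zero or out-of-range section yields None instead. The sketch below is self-contained and uses a hypothetical, trimmed-down SectionHeader rather than the crate's ImageSectionHeader; treating section 0 as "no address" is an assumption.

```rust
/// Hypothetical section header with only the field needed here.
struct SectionHeader {
    virtual_address: u32,
}

/// Resolves a 1-based section number plus offset to a virtual address.
/// Returns `None` for section 0, an out-of-range section, or on overflow.
fn get_virtual_address(sections: &[SectionHeader], section: u16, offset: u32) -> Option<u32> {
    let index = usize::from(section).checked_sub(1)?;
    let header = sections.get(index)?;
    header.virtual_address.checked_add(offset)
}

fn main() {
    let sections = [
        SectionHeader { virtual_address: 0x1000 },
        SectionHeader { virtual_address: 0x11000 },
    ];
    println!("{:?}", get_virtual_address(&sections, 1, 0x20));
    println!("{:?}", get_virtual_address(&sections, 0, 0x20));
}
```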

Expose LineProgram line subsections

At the moment, LineProgram only exposes all line records, or the line records for a certain "line subsection" if you know the line subsection's start offset.

However, I have a case where I do not know the line subsection's start offset, because I'm dealing with a PDB file which has line information but no procedure information (due to /DEBUG:FASTLINK). I would like to enumerate the offsets of all the line subsections.

Can we expose the debug line subsections? We already have an internal iterator type for them but it's not exposed.

Support for anonymous structs/unions in `FieldList`

Currently there doesn't seem to be a way to determine whether a MemberType is directly part of the type it corresponds to or is part of an anonymous struct/union. If the PDB format includes some kind of tag specifying anonymous types, it would be useful to have that included. If not, a heuristic that seems to work is to track the offsets, lengths, and positions of each field, and use them as constraints for determining the anonymous struct/union layout.

UserDefinedTypeSourceId::source_file field is incorrect type

The source_file field in the UserDefinedTypeSourceId struct is of type IdIndex, but it should be of type StringRef. Currently I have to manually de/reconstruct it by hand. Fortunately the IdIndex and StringRef types both have public fields and there is a workaround, but this should not be necessary:

https://github.com/willglynn/pdb/blob/3d394eaf547998eeebc7c08ac5788d83d98a1caa/src/tpi/id.rs#L163

if let Ok(IdData::UserDefinedTypeSource(udt)) = typ.parse() {
    if let Ok(source_file) = StringRef(udt.source_file.0).to_string_lossy(&string_table) {
        println!("{:?}", source_file);
    }
}

How to get unmangled function names?

Hi

I'm trying to get all the functions in a pdb file, their lengths, and their unmangled names (I believe the term used in pdbs might be "unique names") for the cargo-bloat tool.

This crate's ProcedureSymbol type does not have unmangled names. From what I've seen reading the LLVM docs on PDB files and using the llvm-pdbutil, they're not actually included in symbol records. Is there a recommended/reliable way of getting unmangled names? Right now what I'm doing is first collecting all PublicSymbols and then trying to find a matching public symbol. But, at least for rustc/cargo generated PDBs, this seems to miss a lot of functions that have ProcedureSymbol records and do not have a matching PublicSymbol record.

Is this approach fine, and I should try to find a way/file an issue with rust to try to get it to generate better PDBs, or is there some other way I can already use this crate to get these unmangled names, or is there something that can be added to this crate?

The code I'm using follows

use pdb::FallibleIterator;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let dir = std::path::Path::new("D:\\your\\path\\to\\pdb\\folder");
    let file_name = "cargo-bloat";

    let exe_path = dir.join(file_name).with_extension("exe");
    let exe_size = std::fs::metadata(&exe_path)?.len();
    let (_, text_size) = binfarce::pe::parse(&std::fs::read(&exe_path).unwrap())?.symbols()?;

    let pdb_path = dir.join(file_name.replace("-", "_")).with_extension("pdb");
    let file = std::fs::File::open(&pdb_path)?;
    let mut pdb = pdb::PDB::open(file)?;

    let dbi = pdb.debug_information()?;
    let symbol_table = pdb.global_symbols()?;

    let mut total_parsed_size = 0usize;
    let mut demangled_total_parsed_size = 0usize;
    let mut out_symbols = vec![];

    // Collect the PublicSymbols
    let mut public_symbols = vec![];

    let mut symbols = symbol_table.iter();
    while let Ok(Some(symbol)) = symbols.next() {
        match symbol.parse() {
            Ok(pdb::SymbolData::Public(data)) => {
                if data.code || data.function {
                    public_symbols.push((data.offset, data.name.to_string().into_owned()));
                }
                if data.name.to_string().contains("try_small_punycode_decode") {
                    dbg!(&data);
                }
            }
            _ => {}
        }
    }

    let mut modules = dbi.modules()?;
    while let Some(module) = modules.next()? {
        let info = match pdb.module_info(&module)? {
            Some(info) => info,
            None => continue,
        };
        let mut symbols = info.symbols()?;
        while let Some(symbol) = symbols.next()? {
            if let Ok(pdb::SymbolData::Public(data)) = symbol.parse() {
                if data.code || data.function {
                    public_symbols.push((data.offset, data.name.to_string().into_owned()));
                }
                if data.name.to_string().contains("try_small_punycode_decode") {
                    dbg!(&data);
                }
            }
        }
    }

    let cmp_offsets = |a: &pdb::PdbInternalSectionOffset, b: &pdb::PdbInternalSectionOffset| {
        a.section.cmp(&b.section).then(a.offset.cmp(&b.offset))
    };
    public_symbols.sort_unstable_by(|a, b| cmp_offsets(&a.0, &b.0));

    // Now find the Procedure symbols in all modules
    // and if possible the matching PublicSymbol record with the mangled name
    let mut handle_proc = |proc: pdb::ProcedureSymbol| {
        let mangled_symbol = public_symbols
            .binary_search_by(|probe| {
                let low = cmp_offsets(&probe.0, &proc.offset);
                let high = cmp_offsets(&probe.0, &(proc.offset + proc.len));

                use std::cmp::Ordering::*;
                match (low, high) {
                    // Less than the low bound -> less
                    (Less, _) => Less,
                    // More than the high bound -> greater
                    (_, Greater) => Greater,
                    _ => Equal,
                }
            })
            .ok()
            .map(|x| &public_symbols[x]);
        // Uncomment to verify binary search isn't screwing up anything
        /*
        let mangled_symbol = public_symbols
            .iter()
            .filter(|probe| probe.0 >= proc.offset && probe.0 <= (proc.offset + proc.len))
            .take(1)
            .next();
        */

        let demangled_name = proc.name.to_string().into_owned();
        out_symbols.push((proc.len as usize, demangled_name, mangled_symbol));

        total_parsed_size += proc.len as usize;
        if mangled_symbol.is_some() {
            demangled_total_parsed_size += proc.len as usize;
        }
    };

    let mut symbols = symbol_table.iter();
    while let Ok(Some(symbol)) = symbols.next() {
        if let Ok(pdb::SymbolData::Procedure(proc)) = symbol.parse() {
            handle_proc(proc);
        }
    }
    let mut modules = dbi.modules()?;
    while let Some(module) = modules.next()? {
        let info = match pdb.module_info(&module)? {
            Some(info) => info,
            None => continue,
        };

        let mut symbols = info.symbols()?;

        while let Some(symbol) = symbols.next()? {
            if let Ok(pdb::SymbolData::Procedure(proc)) = symbol.parse() {
                handle_proc(proc);
            }
        }
    }

    println!(
        "exe size:{}\ntext size:{}\nsize of fns found: {}\nratio:{}\nsize of fns with mangles found: {}\nratio:{}",
        exe_size,
        text_size,
        total_parsed_size,
        total_parsed_size as f32 / text_size as f32,
        demangled_total_parsed_size,
        demangled_total_parsed_size as f32 / text_size as f32
    );

    Ok(())
}

API improvements for third-party tools

Sorry for the "mega issue" but I had a couple of questions/comments regarding the current API design and if certain changes would be welcome. I wrote a tool this weekend that I intend to improve upon for exposing some of the interesting PDB metadata that this library exposes: https://github.com/landaire/pdbview

There were two changes/workarounds I had to make in my own local branch for this tool:

  1. CompileFlags fields are not public, essentially making the struct useless: landaire@62aedc0
  2. Some of the data types in the constants module are exposed through public APIs but the module itself is not (e.g. CPUType). This is somewhat awkward as you cannot explicitly import this type into your own code and cannot read documentation for these types (see: https://docs.rs/pdb/0.6.0/pdb/struct.CompileFlagsSymbol.html#structfield.cpu_type). My workaround here was to just format these as Strings since that was all I need anyways.

These two issues I'd be happy to make PRs for if the intent is for these items to be public.

The next issue I encountered was that the ProcedureReferenceSymbol is sort of useless if you actually want to do anything with it. In order to look up the symbol you need to look up the correct module -- the ID of which can be obtained from the module field. From here I'm stumped. There's no way to get a Module from a module index. Other similar scenarios use custom types, such as SymbolIndex, and allow you to use methods such as ModuleInfo::symbols_at(index) to grab the symbol or SymbolIter::seek. As far as I could tell no such APIs exist for modules and none of the examples or internal code reference these fields.

Usage Example throws all kind of errors

These are the errors it's throwing, and it's very confusing for newbies:

warning: unused import: `std::fs::File`
 --> src\main.rs:4:5
  |
4 | use std::fs::File;
  |     ^^^^^^^^^^^^^
  |
  = note: `#[warn(unused_imports)]` on by default

error[E0277]: the `?` operator can only be used in a function that returns `Result` or `Option` (or another type that implements `std::ops::Try`)
  --> src\main.rs:7:16
   |
6  | / fn main() {
7  | |     let file = std::fs::File::open("fixtures/self/foo.pdb")?;
   | |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ cannot use the `?` operator in a function that returns `()`
8  | |     let mut pdb = pdb::PDB::open(file)?;
9  | |
...  |
21 | |     }
22 | | }
   | |_- this function should return `Result` or `Option` to accept `?`
   |
   = help: the trait `std::ops::Try` is not implemented for `()`
   = note: required by `std::ops::Try::from_error`

error[E0277]: the `?` operator can only be used in a function that returns `Result` or `Option` (or another type that implements `std::ops::Try`)
  --> src\main.rs:8:19
   |
6  | / fn main() {
7  | |     let file = std::fs::File::open("fixtures/self/foo.pdb")?;
8  | |     let mut pdb = pdb::PDB::open(file)?;
   | |                   ^^^^^^^^^^^^^^^^^^^^^ cannot use the `?` operator in a function that returns `()`
9  | |
...  |
21 | |     }
22 | | }
   | |_- this function should return `Result` or `Option` to accept `?`
   |
   = help: the trait `std::ops::Try` is not implemented for `()`
   = note: required by `std::ops::Try::from_error`

error[E0277]: the `?` operator can only be used in a function that returns `Result` or `Option` (or another type that implements `std::ops::Try`)
  --> src\main.rs:10:24
   |
6  | / fn main() {
7  | |     let file = std::fs::File::open("fixtures/self/foo.pdb")?;
8  | |     let mut pdb = pdb::PDB::open(file)?;
9  | |
10 | |     let symbol_table = pdb.global_symbols()?;
   | |                        ^^^^^^^^^^^^^^^^^^^^^ cannot use the `?` operator in a function that returns `()`
...  |
21 | |     }
22 | | }
   | |_- this function should return `Result` or `Option` to accept `?`
   |
   = help: the trait `std::ops::Try` is not implemented for `()`
   = note: required by `std::ops::Try::from_error`

error[E0277]: the `?` operator can only be used in a function that returns `Result` or `Option` (or another type that implements `std::ops::Try`)
  --> src\main.rs:13:30
   |
6  | / fn main() {
7  | |     let file = std::fs::File::open("fixtures/self/foo.pdb")?;
8  | |     let mut pdb = pdb::PDB::open(file)?;
9  | |
...  |
13 | |     while let Some(symbol) = symbols.next()? {
   | |                              ^^^^^^^^^^^^^^^ cannot use the `?` operator in a function that returns `()`
...  |
21 | |     }
22 | | }
   | |_- this function should return `Result` or `Option` to accept `?`
   |
   = help: the trait `std::ops::Try` is not implemented for `()`
   = note: required by `std::ops::Try::from_error`

error[E0599]: no variant named `PublicSymbol` found for enum `pdb::SymbolData<'_>`
  --> src\main.rs:15:33
   |
15 |             Ok(pdb::SymbolData::PublicSymbol{function: true, segment, offset, ..}) => {
   |                                 ^^^^^^^^^^^^ variant not found in `pdb::SymbolData<'_>`

error[E0599]: no method named `name` found for struct `pdb::Symbol<'_>` in the current scope
  --> src\main.rs:17:71
   |
17 |                 println!("{:x}:{:08x} is {}", segment, offset, symbol.name()?);
   |                                                                       ^^^^ method not found in `pdb::Symbol<'_>`

error[E0277]: the `?` operator can only be used in a function that returns `Result` or `Option` (or another type that implements `std::ops::Try`)
  --> src\main.rs:17:64
   |
6  | / fn main() {
7  | |     let file = std::fs::File::open("fixtures/self/foo.pdb")?;
8  | |     let mut pdb = pdb::PDB::open(file)?;
9  | |
...  |
17 | |                 println!("{:x}:{:08x} is {}", segment, offset, symbol.name()?);
   | |                                                                ^^^^^^^^^^^^^^ cannot use the `?` operator in a function that returns `()`
...  |
21 | |     }
22 | | }
   | |_- this function should return `Result` or `Option` to accept `?`
   |
   = help: the trait `std::ops::Try` is not implemented for `()`
   = note: required by `std::ops::Try::from_error`

error: aborting due to 7 previous errors

Some errors have detailed explanations: E0277, E0599.
For more information about an error, try `rustc --explain E0277`.
error: could not compile `rustpro`.

To learn more, run the command again with --verbose.
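
For readers hitting the same E0277 errors: the compiler notes above say the enclosing function "should return `Result` or `Option` to accept `?`". A minimal sketch of that fix using only the standard library (the helper name `first_bytes` is made up for illustration and is not part of the pdb crate):

```rust
use std::fs::File;
use std::io::{self, Read};

// Hypothetical helper: because it returns io::Result, every `?` below can
// hand an error back to the caller instead of failing to compile.
fn first_bytes(path: &str, n: usize) -> io::Result<Vec<u8>> {
    let file = File::open(path)?;
    let mut limited = file.take(n as u64);
    let mut buf = Vec::new();
    limited.read_to_end(&mut buf)?;
    Ok(buf)
}
```

The same applies to `main`: declaring it with a `Result` return type and ending it with `Ok(())` lets the `?` operators compile. The two E0599 errors are a separate matter: they say the variant and method named in the pasted code are not found in the installed version of the crate.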

Determining source file and line information for global symbols (UDT, modules)

I'm currently working on a naive pdb-to-cpp decompiler, and there's a number of things that I've run into, but this one in particular doesn't appear to have been addressed in any prior issues, so hopefully someone can point me in the right direction...

Currently there appears to be no way to determine what source file and line a global symbol belongs to, which particularly afflicts decompilation of some user datatypes. Not all PDB files have identifier data in the IPI stream, or any private user datatypes, but may have global symbols which have all of the type information that would typically be present in the IPI stream.

While it is possible to keep track of the previous module of some global symbols by way of iterating over the module contributions, this still does not help with telling us which file and line in the module's include hierarchy they originate from. How are we supposed to retrieve this data?

Any help would be appreciated, thanks!

`#![no_std]` support

It would be great if this crate were usable in a #![no_std] environment, as gimli is.

Support for .NET PDBs

Out of curiosity, I tried using the pdb_symbols example to parse a custom-made C# DLL; sadly, nothing but module addresses could be parsed.

For my example the following symbol kinds appeared as not implemented:

Implementing the above symbols would be a fair step for supporting .NET binaries (at least those built with standard Visual Studio Build Tools).
I'd gladly contribute these though I'm not familiar with the inner PDB format, I assume it's just taken off Microsoft/microsoft-pdb. If there's any other source of guidance on the used format I would appreciate that a lot.

Cannot retrieve existing lines with lines_at_offset

While inspecting a basic PDB (almost a basic hello-world program compiled with clang on Windows), I found a proc symbol for _security_check_cookie (the code should be almost this one: https://github.com/adityaramesh/concurrency_runtime/blob/master/VS2013_SP2/crt/src/AMD64/amdsecgs.asm).
Its InternalSectionOffset is {s: 0x1, o: 0x8990}, but when I try to get line info I get nothing.
The reason is that the InternalSectionOffset for the lines is {s: 0x1, o: 0x8980}.
My hypothesis is that the executable code really starts at 0x8990 (line 47) and that there is metadata, padding, or something else at 0x8980.
For reference, the code_size of the proc is 33, whereas it appears to be 49 (33 + 16) in DbgLineSub.
So I'd say that we shouldn't require "lines_section.header.offset == offset" but instead take the greatest lines offset that does not exceed offset.
If you need the PDB, I can share it with you.
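
To make the suggestion concrete, here is a rough sketch of a "greatest start not after offset" lookup. The `LineRange` type and `find_range` function are hypothetical stand-ins, not the crate's internals, and the sketch assumes the ranges are sorted by start offset within one section:

```rust
/// Hypothetical stand-in for a line subsection: start offset and code size.
#[derive(Debug, Clone, Copy, PartialEq)]
struct LineRange {
    start: u32,
    size: u32,
}

/// Finds the range with the greatest `start <= offset` and returns it only
/// if `offset` falls inside it. `ranges` must be sorted by `start`.
fn find_range(ranges: &[LineRange], offset: u32) -> Option<LineRange> {
    // Index of the first range that starts after `offset`.
    let idx = ranges.partition_point(|r| r.start <= offset);
    // The candidate is the range just before that one, if any.
    let candidate = *ranges.get(idx.checked_sub(1)?)?;
    (offset - candidate.start < candidate.size).then_some(candidate)
}
```

With the numbers from this report, a range starting at 0x8980 with size 49 would be found for offset 0x8990, whereas an exact match on the start offset finds nothing.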

Document how primitive types are found

Took me a bit to figure this out, so this might be useful for others.

Please consider mentioning something like this in the docstring of the find method of the TPI stream:

Note: if the type index is less than the minimum_index of the TPI stream (usually 0x1000), the type is determined by matching the type index against a predefined mapping of primitive types.
For example, 0x0022 corresponds to a 32-bit unsigned value.

https://github.com/Microsoft/microsoft-pdb/blob/082c5290e5aff028ae84e43affa8be717aa7af73/include/cvinfo.h#L328
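
As an illustration of the rule described in the note, here is a small sketch. The names `Lookup` and `classify` are invented for this example and are not part of the crate's API; only the 0x0022 entry is taken from the note, and the rest of the primitive table is deliberately left out:

```rust
/// Where a type index would be resolved, per the note above.
#[derive(Debug, PartialEq)]
enum Lookup {
    /// Below minimum_index: resolved from the predefined primitive mapping.
    Primitive(Option<&'static str>),
    /// At or above minimum_index: resolved from a record in the TPI stream.
    StreamRecord,
}

/// Hypothetical helper; `minimum_index` is usually 0x1000 according to the note.
fn classify(type_index: u32, minimum_index: u32) -> Lookup {
    if type_index < minimum_index {
        let description = match type_index {
            0x0022 => Some("32-bit unsigned value"),
            _ => None, // remaining primitive entries omitted in this sketch
        };
        Lookup::Primitive(description)
    } else {
        Lookup::StreamRecord
    }
}
```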

This is an implementation detail, but it might save someone time when debugging. Before figuring it out, I suspected a parsing error in my code or an issue with the library, then an error in the internal code.

Note that I'm not using the type finder for the reasons mentioned in this comment, so others are less likely to bump into this. But I still think it might be a good thing to document.

Small files passed into `pdb::PDB::open` return a `std::io::Error`

If a file smaller than 4096 bytes is passed into pdb::PDB::open, the result will be a std::io::Error with ErrorKind UnexpectedEof.

An easy repro of this issue is to run the pdb_symbols example against a source file:

./target/debug/examples/pdb_symbols tests/pdb_lines.rs

I would expect to get the output

error dumping PDB: UnrecognizedFileFormat

but instead I get the output

error dumping PDB: IO error while reading PDB: failed to fill whole buffer

Digging into the code a little ways it appears this happens in pdb::msf::open_msf.

https://github.com/willglynn/pdb/blob/ebaba994f28538ff8f04b9f8f5c20466621adaf2/src/msf/mod.rs#L386-L390

The call to pdb::msf::view will eventually lead to a call of read_exact with a length of 4096 which will fail since the file is not 4096 bytes long.

A quick workaround is to catch, back in pdb::msf::open_msf, the error from the first call to view when it is a std::io::Error whose ErrorKind is UnexpectedEof, and turn it into pdb::Error::UnrecognizedFileFormat.

I have a change that does just that and will put it up for code review momentarily.
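
The idea can be sketched independently of the crate. In the snippet below, `OpenError`, `read_header`, and `HEADER_LEN` are hypothetical names used only for illustration; the real change would live in pdb::msf::open_msf and use pdb::Error:

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for the crate's error type.
#[derive(Debug)]
enum OpenError {
    UnrecognizedFileFormat,
    Io(io::Error),
}

/// Length of the first read described above.
const HEADER_LEN: usize = 4096;

/// Reads the first HEADER_LEN bytes. A source that is too short is reported
/// as an unrecognized format rather than as a raw I/O error.
fn read_header<R: Read>(mut source: R) -> Result<Vec<u8>, OpenError> {
    let mut header = vec![0u8; HEADER_LEN];
    match source.read_exact(&mut header) {
        Ok(()) => Ok(header),
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => {
            Err(OpenError::UnrecognizedFileFormat)
        }
        Err(e) => Err(OpenError::Io(e)),
    }
}
```

Other I/O errors are passed through unchanged, so only the "file too short" case is reinterpreted.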

address_map returns Err(AddressMapNotFound)

I have encountered some PDBs with no OMAP info (and no DBIExtraStreams.section_headers either).
When I call pdb.address_map(), Err(AddressMapNotFound) is returned.
Is this behavior expected?
I am trying to dump the symbol tables, using the address_map to translate (as mentioned in #17).
What should I do with PDBs that have no extra streams or OMAP info?

Here is a sample PDB I downloaded from the Microsoft symbol server.
mscorlib.ni.pdb.zip

Not all symbol records have names

I'm looking at adding support for parsing more symbol records.

Currently, Symbol assumes that every symbol has a name, which is returned by Symbol::name. The name isn't stored in the symbol variants. Instead, it is found at a fixed offset given by Symbol::data_length().

This is fine for most symbols. However, some symbols consist of multiple records, and the additional records don't always have a name, such as the S_DEFRANGE records for local symbols, or S_BLOCK and S_END within local procedures.

The simplest fix is to change Symbol::name to return an Option. This still relies on the name being at a fixed offset in records, but so far that seems to be a valid assumption. However, I was wondering if it would be better to add the name to the Symbol variants, in the same manner as TypeData. This avoids callers needing to unwrap the Option for records that always have a name.
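
A rough sketch of how the two options could look together. The `SymbolRecord` enum below is a simplified, hypothetical model and not the crate's actual Symbol or SymbolData types:

```rust
/// Simplified model: named records carry their name in the variant,
/// while records such as a block end carry none.
#[derive(Debug, PartialEq)]
enum SymbolRecord<'a> {
    Public { offset: u32, name: &'a str },
    BlockEnd,
}

impl<'a> SymbolRecord<'a> {
    /// Generic accessor for callers that do not match on the variant.
    fn name(&self) -> Option<&'a str> {
        match self {
            SymbolRecord::Public { name, .. } => Some(*name),
            SymbolRecord::BlockEnd => None,
        }
    }
}
```

Callers that match on a named variant get the name directly with no Option to unwrap; only the generic accessor has to return an Option.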

Stream + ParseBuffer lifetimes are awkward to work with

I keep running into this when hacking on this crate. I think the root of the problem is that SourceView::as_slice returns a slice whose lifetime matches the lifetime of the reference to the SourceView, not the contained lifetime 's, and then Stream::parse_buffer returns a ParseBuffer with a lifetime derived from a slice returned from SourceView::as_slice. This means that in practice you have to keep both the Stream and ParseBuffer alive to do anything where you want to hand out references to data borrowed from the stream. This seems to mean that you always need an intermediate type in this situation to own the ParseBuffer.

I understand why you've written things this way--you want to potentially support zero-copy operation by having a SourceView that points into a memory-mapped file. However, since ReadView owns its data, it can't actually hand out a reference that outlives itself, so you can't just change SourceView::as_slice to make the slice lifetime match the SourceView lifetime.
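
The shape of the problem can be reproduced with simplified stand-ins. The types below reuse the names from this description but are reduced sketches, not the crate's real definitions:

```rust
/// Simplified stand-in: the returned slice borrows from `&self`, not from 's.
trait SourceView<'s> {
    fn as_slice(&self) -> &[u8];
}

/// A view that owns its bytes, so it cannot hand out a slice that outlives itself.
struct ReadView {
    bytes: Vec<u8>,
}

impl<'s> SourceView<'s> for ReadView {
    fn as_slice(&self) -> &[u8] {
        &self.bytes
    }
}

struct Stream<'s> {
    view: Box<dyn SourceView<'s> + 's>,
}

struct ParseBuffer<'b>(&'b [u8]);

impl<'s> Stream<'s> {
    /// The buffer's lifetime is tied to the borrow of the Stream, so the
    /// Stream must stay alive for as long as anything borrows from the buffer.
    fn parse_buffer(&self) -> ParseBuffer<'_> {
        ParseBuffer(self.view.as_slice())
    }
}

/// Copying data out is fine; returning a reference with lifetime 's is not.
fn first_byte(stream: &Stream<'_>) -> Option<u8> {
    let buf = stream.parse_buffer();
    buf.0.first().copied()
}
```

In this sketch, a function that tried to return `&'s [u8]` from a `&Stream<'s>` would be rejected by the borrow checker, which is the situation that forces an intermediate owner type.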

I don't know if there's a straightforward way to fix this, but it's frustrated me many times so I figured I'd at least write it down to get it out of my head.
