Extract Text from PDF Documents in .NET Applications

With the HiQPdf Library you can extract the text of a PDF document into a .NET System.String object using the PdfTextExtract class. The PdfTextExtract.TextExtractMode property sets the extraction mode: you can either keep the original positioning of the text in the PDF document or extract the text in a layout more suitable for reading.

The C# sample code below shows how easily you can extract the text from an existing PDF document. With just a few lines of code you can obtain the text representation of a PDF document:

// get the PDF file
string pdfFile = Server.MapPath("~") + @"\DemoFiles\Pdf\InputPdf.pdf";

// create the PDF text extractor
PdfTextExtract pdfTextExtract = new PdfTextExtract();

// set the text extraction mode
pdfTextExtract.TextExtractMode = GetTextExtractMode();

int fromPdfPageNumber = int.Parse(textBoxFromPage.Text);
int toPdfPageNumber = textBoxToPage.Text.Length > 0 ? int.Parse(textBoxToPage.Text) : 0;

// extract the text from a range of pages of the PDF document
string text = pdfTextExtract.ExtractText(pdfFile, fromPdfPageNumber, toPdfPageNumber);

// get UTF-8 bytes
byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);

// the UTF-8 marker
byte[] utf8Marker = new byte[] { 0xEF, 0xBB, 0xBF };

// the text document bytes with UTF-8 marker followed by UTF-8 bytes
byte[] bytes = new byte[utf8Bytes.Length + utf8Marker.Length];
Array.Copy(utf8Marker, 0, bytes, 0, utf8Marker.Length);
Array.Copy(utf8Bytes, 0, bytes, utf8Marker.Length, utf8Bytes.Length);

// inform the browser about the data format
HttpContext.Current.Response.AddHeader("Content-Type", "text/plain; charset=UTF-8");

// let the browser know how to open the text document and the text document name
HttpContext.Current.Response.AddHeader("Content-Disposition",
    String.Format("{0}; filename=ExtractedText.txt; size={1}", "attachment", bytes.Length.ToString()));

// write the text buffer to HTTP response
HttpContext.Current.Response.BinaryWrite(bytes);

// call End() method of HTTP response to stop ASP.NET page processing
HttpContext.Current.Response.End();

See also the live demo for Text Extraction from PDF documents for a fully functional example.
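
The manual byte copying in the sample above can also be written with standard .NET calls only. The sketch below is not part of the HiQPdf API: TextFileBytes.WithUtf8Bom is a hypothetical helper name chosen for illustration. It relies on Encoding.UTF8.GetPreamble(), which returns the same three marker bytes (0xEF, 0xBB, 0xBF) that the sample declares by hand.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of HiQPdf): builds the content of a UTF-8
// text file that starts with the UTF-8 marker (byte order mark).
public static class TextFileBytes
{
    public static byte[] WithUtf8Bom(string text)
    {
        if (text == null) throw new ArgumentNullException("text");

        // GetPreamble() returns the UTF-8 marker bytes EF BB BF
        byte[] marker = Encoding.UTF8.GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes(text);

        // marker first, then the encoded text
        byte[] bytes = new byte[marker.Length + body.Length];
        Buffer.BlockCopy(marker, 0, bytes, 0, marker.Length);
        Buffer.BlockCopy(body, 0, bytes, marker.Length, body.Length);
        return bytes;
    }
}
```

With such a helper, the buffer passed to Response.BinaryWrite in the sample could be obtained with a single call, for example byte[] bytes = TextFileBytes.WithUtf8Bom(text);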

Search Text In PDF Using HiQPdf Library

With the HiQPdf Library for .NET you can search for text in a PDF document using the SearchText() method of the PdfTextExtract class. The method's parameters let you choose whether to match the case and whether to match whole words only.

The C# code sample below shows how to search for text in an existing PDF document. Each occurrence found is then highlighted in the original PDF.

C# Code Sample to Search and Highlight Text in PDF

// get the PDF file
string pdfFile = Server.MapPath("~") + @"\DemoFiles\Pdf\InputPdf.pdf";

// get the text to search
string textToSearch = textBoxTextToSearch.Text;

// create the PDF text extractor
PdfTextExtract pdfTextExtract = new PdfTextExtract();

int fromPdfPageNumber = int.Parse(textBoxFromPage.Text);
int toPdfPageNumber = textBoxToPage.Text.Length > 0 ? int.Parse(textBoxToPage.Text) : 0;

// search the text in PDF document
PdfTextSearchItem[] searchTextInstances = pdfTextExtract.SearchText(pdfFile, textToSearch,
            fromPdfPageNumber, toPdfPageNumber, checkBoxMatchCase.Checked, checkBoxMatchWholeWord.Checked);

// load the PDF file to highlight the searched text
PdfDocument pdfDocument = PdfDocument.FromFile(pdfFile);

try
{
    // highlight each occurrence of the searched text in the PDF document
    foreach (PdfTextSearchItem searchTextInstance in searchTextInstances)
    {
        PdfRectangle pdfRectangle = new PdfRectangle(searchTextInstance.BoundingRectangle);

        // set rectangle color and opacity
        pdfRectangle.BackColor = Color.Yellow;
        pdfRectangle.Opacity = 30;

        // highlight the text (PdfPageNumber is 1-based, the Pages index is 0-based)
        pdfDocument.Pages[searchTextInstance.PdfPageNumber - 1].Layout(pdfRectangle);
    }

    // write the modified PDF document to a memory buffer
    byte[] pdfBuffer = pdfDocument.WriteToMemory();

    // inform the browser about the binary data format
    HttpContext.Current.Response.AddHeader("Content-Type", "application/pdf");

    // let the browser know how to open the PDF document and the file name
    HttpContext.Current.Response.AddHeader("Content-Disposition", String.Format("attachment; filename=SearchText.pdf; size={0}",
                pdfBuffer.Length.ToString()));

    // write the PDF buffer to HTTP response
    HttpContext.Current.Response.BinaryWrite(pdfBuffer);

    // call End() method of HTTP response to stop ASP.NET page processing
    HttpContext.Current.Response.End();
}
finally
{
    // close the document even if highlighting or writing fails
    pdfDocument.Close();
}

You can find a live demo for searching and highlighting text in PDF on the product website.
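
Both samples above read the page numbers with int.Parse, which throws an exception on empty or non-numeric input. The sketch below shows one possible way to validate the two text box values first. PageRangeInput.TryParse is a hypothetical helper, not part of HiQPdf. It keeps the convention of the samples that an empty end page becomes 0; that 0 means "up to the last page" is an assumption based on how the samples pass it.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not part of HiQPdf): validates the page range
// typed by the user before it is passed to the text extractor.
public static class PageRangeInput
{
    public static bool TryParse(string fromText, string toText, out int fromPage, out int toPage)
    {
        toPage = 0;

        // the start page is required and must be a positive integer
        if (!int.TryParse(fromText, NumberStyles.Integer, CultureInfo.InvariantCulture, out fromPage)
            || fromPage < 1)
        {
            fromPage = 0;
            return false;
        }

        // an empty end page maps to 0, as in the samples above
        if (string.IsNullOrWhiteSpace(toText))
            return true;

        // a given end page must be a number not smaller than the start page
        if (!int.TryParse(toText, NumberStyles.Integer, CultureInfo.InvariantCulture, out toPage)
            || toPage < fromPage)
        {
            fromPage = 0;
            toPage = 0;
            return false;
        }

        return true;
    }
}
```

A page could call PageRangeInput.TryParse(textBoxFromPage.Text, textBoxToPage.Text, out fromPdfPageNumber, out toPdfPageNumber) and show a validation message when it returns false, instead of letting int.Parse throw.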

Partially Convert an HTML Page to PDF

The HiQPdf HTML to PDF converter allows you to convert only a selected HTML element of the HTML document. The selected element can be, for example, a TABLE element or a DIV element containing other HTML elements.

This feature is useful when you want to convert only a part of the HTML document. For example, a web page usually has a header with menu and logo and a footer with contact information and copyright notice besides the main HTML content you want to convert to PDF. In order to convert only the main content of the document you can place the main content in a block element like a DIV or a TABLE and configure the converter to convert only that block element.

The HTML element to be converted is selected by the ConvertedHtmlElementSelector property. Set this property to the CSS selector of the HTML element to be converted. For example, the #MyHtmlElement CSS selector selects the HTML element having the 'MyHtmlElement' ID, and the *[class="ConvertibleElementStyle"] CSS selector selects only the HTML element having the 'ConvertibleElementStyle' CSS class. If a CSS selector matches several elements in the HTML document, only the first one is converted. The values of the attributes in the CSS selectors are case sensitive. If this property is not set, the whole HTML document is converted.
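
The two selector forms described above must be written with plain ASCII quotes when they appear in C# strings. The sketch below builds both forms. CssSelectors and its methods are hypothetical helpers, not part of HiQPdf, and they assume simple ID and class values that need no CSS escaping.

```csharp
using System;

// Hypothetical helpers (not part of HiQPdf): build the two CSS selector
// forms mentioned in the text. Values are assumed to need no CSS escaping.
public static class CssSelectors
{
    // "#MyHtmlElement" form: selects the element with the given ID
    public static string ById(string id)
    {
        if (string.IsNullOrEmpty(id))
            throw new ArgumentException("An element ID is required.", "id");
        return "#" + id;
    }

    // *[class="ConvertibleElementStyle"] form: attribute selector on the class value
    public static string ByClassAttribute(string className)
    {
        if (string.IsNullOrEmpty(className))
            throw new ArgumentException("A class value is required.", "className");
        return "*[class=\"" + className + "\"]";
    }
}
```

The resulting string can then be assigned to the ConvertedHtmlElementSelector property, for example with CssSelectors.ByClassAttribute("ConvertibleElementStyle"). Because the attribute values are case sensitive, the value must match the HTML exactly.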

C# Code Sample for Partially Converting an HTML Page to PDF

// create the HTML to PDF converter
HtmlToPdf htmlToPdfConverter = new HtmlToPdf();

// convert only the HTML element having the MyHtmlElement ID 
htmlToPdfConverter.ConvertedHtmlElementSelector = "#MyHtmlElement";

You can test this feature live in the Convert Only a Selected Region of HTML Page demo.

Convert HTML with Web Fonts to PDF

Web Fonts give web designers great flexibility to create special text effects in an HTML document, because designers are no longer limited to the small set of fonts installed on the client computers displaying the document. Modern web browsers can download Web Fonts on the fly and use them to render the HTML document without installing those fonts on the local machine. The location from which a font can be downloaded is given in a CSS3 @font-face rule.

The HiQPdf HTML to PDF Converter can convert HTML documents with Web Fonts. It supports TrueType fonts in .ttf files, OpenType fonts with TrueType Outlines in .otf files and Web Open Font Format (WOFF) fonts with TrueType Outlines in .woff files.
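
As an illustration of the @font-face rule mentioned above, the sketch below generates such a rule for the three supported file types. WebFontCss is a hypothetical helper, not part of HiQPdf, and the font family name and URL used with it are placeholders. The format hints ("truetype", "opentype", "woff") are the standard CSS values; whether the converter requires the hint is not stated here, so treat it as optional.

```csharp
using System;

// Hypothetical helper (not part of HiQPdf): builds a CSS @font-face rule
// for a font file of one of the supported types (.ttf, .otf, .woff).
public static class WebFontCss
{
    // maps the font file extension to the standard CSS format() hint
    public static string FormatHint(string fontUrl)
    {
        if (fontUrl == null) throw new ArgumentNullException("fontUrl");
        if (fontUrl.EndsWith(".ttf", StringComparison.OrdinalIgnoreCase)) return "truetype";
        if (fontUrl.EndsWith(".otf", StringComparison.OrdinalIgnoreCase)) return "opentype";
        if (fontUrl.EndsWith(".woff", StringComparison.OrdinalIgnoreCase)) return "woff";
        throw new NotSupportedException("Unsupported font file type: " + fontUrl);
    }

    // returns a one-line @font-face rule declaring the family and its source
    public static string FontFaceRule(string family, string fontUrl)
    {
        if (string.IsNullOrEmpty(family))
            throw new ArgumentException("A font family name is required.", "family");
        return "@font-face { font-family: \"" + family + "\"; src: url(\""
            + fontUrl + "\") format(\"" + FormatHint(fontUrl) + "\"); }";
    }
}
```

The returned rule can be placed in a style element of the HTML document, after which the family name can be used in ordinary font-family declarations.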

The Web Open Font Format (WOFF), as its name suggests, was designed to be used with web pages. It is based on a compression algorithm which makes the font files smaller and more suitable for distribution over a network. The WOFF format is currently supported by all major browsers (Firefox 3.6 and later versions, Google Chrome 6.0 and later versions, Internet Explorer 9 and later versions, Opera 11.10 and later versions, Safari 5.1 and later versions).

In the live demo for Converting HTML with Web Fonts to PDF you can learn how to define the web fonts in HTML using @font-face rules and see the C# code that converts such an HTML document to PDF.