How can I convert HTML to text in C#?

IT Share you

shareyou 2020. 11. 11. 20:57

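I'm looking for C# code to convert an HTML document to plain text.

I'm not looking for simple tag stripping, but for something that outputs plain text with a reasonable preservation of the original layout.

The output should look like this:

Html2Txt at the W3C

I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?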

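EDIT: I downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML-to-text conversion)! All it did was strip the tags, flatten the tables, and so on. The output looked nothing like what Html2Txt @ W3C produces. It's a pity that the source doesn't seem to be available. I was looking to see whether a more "canned" solution exists.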

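EDIT 2: Thank you, everybody, for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch, which sends the text to standard output, and capture that output by setting ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all of this in a C# class. This code will only be called occasionally, so I'm not too concerned about spawning a new process versus doing it in code. Plus, Lynx is FAST!!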
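The approach described in EDIT 2 could be sketched roughly as follows. This is a minimal sketch, not the asker's actual class: the names LynxDump, BuildStartInfo and Render are hypothetical, and it assumes lynx.exe is installed at the path you pass in and that the input is a local .html file or a URL.

```csharp
using System.Diagnostics;

// Hypothetical wrapper; the class and method names are illustrative only.
public static class LynxDump
{
    // Builds the start info described above: no shell, stdout redirected.
    public static ProcessStartInfo BuildStartInfo(string lynxPath, string htmlFileOrUrl)
    {
        return new ProcessStartInfo
        {
            FileName = lynxPath,
            // "-dump" makes Lynx render the page and print it to standard output.
            // Assumption: the path or URL contains no double quotes.
            Arguments = "-dump \"" + htmlFileOrUrl + "\"",
            UseShellExecute = false,
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };
    }

    // Runs Lynx and returns whatever it wrote to standard output.
    public static string Render(string lynxPath, string htmlFileOrUrl)
    {
        using (Process process = Process.Start(BuildStartInfo(lynxPath, htmlFileOrUrl)))
        {
            // Read before WaitForExit so a full output pipe cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```

A call would then look like LynxDump.Render(@"C:\tools\lynx\lynx.exe", @"C:\temp\page.html"), where both paths are placeholders.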

What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers do. This is much harder than you would expect.

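Just a note about the HtmlAgilityPack for posterity. The project contains an example of converting HTML to text which, as the OP noted, does not handle whitespace at all the way anyone writing HTML would expect. There are full text-rendering solutions out there, pointed out by others on this question. This is not one of them (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.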

using System; // StringComparison
using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

//small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText
{

    public static string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);
        return ConvertDoc(doc);
    }

    public static string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        return ConvertDoc(doc);
    }

    public static string ConvertDoc (HtmlDocument doc)
    {
        using (StringWriter sw = new StringWriter())
        {
            ConvertTo(doc.DocumentNode, sw);
            sw.Flush();
            return sw.ToString();
        }
    }

    internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
    {
        foreach (HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText, textInfo);
        }
    }
    public static void ConvertTo(HtmlNode node, TextWriter outText)
    {
        ConvertTo(node, outText, new PreceedingDomTextInfo(false));
    }
    internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;
            case HtmlNodeType.Document:
                ConvertContentTo(node, outText, textInfo);
                break;
            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                {
                    break;
                }
                // get text
                html = ((HtmlTextNode)node).Text;
                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                {
                    break;
                }
                // check the text is meaningful and not a bunch of whitespaces
                if (html.Length == 0)
                {
                    break;
                }
                if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
                {
                    html= html.TrimStart();
                    if (html.Length == 0) { break; }
                    textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
                }
                outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
                if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
                {
                    outText.Write(' ');
                }
                break;
            case HtmlNodeType.Element:
                string endElementString = null;
                bool isInline;
                bool skip = false;
                int listIndex = 0;
                switch (node.Name)
                {
                    case "nav":
                        skip = true;
                        isInline = false;
                        break;
                    case "body":
                    case "section":
                    case "article":
                    case "aside":
                    case "h1":
                    case "h2":
                    case "header":
                    case "footer":
                    case "address":
                    case "main":
                    case "div":
                    case "p": // stylistic - adjust as you tend to use
                        if (textInfo.IsFirstTextOfDocWritten)
                        {
                            outText.Write("\r\n");
                        }
                        endElementString = "\r\n";
                        isInline = false;
                        break;
                    case "br":
                        outText.Write("\r\n");
                        skip = true;
                        textInfo.WritePrecedingWhiteSpace = false;
                        isInline = true;
                        break;
                    case "a":
                        if (node.Attributes.Contains("href"))
                        {
                            string href = node.Attributes["href"].Value.Trim();
                            if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase)==-1)
                            {
                                endElementString =  "<" + href + ">";
                            }  
                        }
                        isInline = true;
                        break;
                    case "li": 
                        if(textInfo.ListIndex>0)
                        {
                            outText.Write("\r\n{0}.\t", textInfo.ListIndex++); 
                        }
                        else
                        {
                            outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
                        }
                        isInline = false;
                        break;
                    case "ol": 
                        listIndex = 1;
                        goto case "ul";
                    case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
                        endElementString = "\r\n";
                        isInline = false;
                        break;
                    case "img": //inline-block in reality
                        if (node.Attributes.Contains("alt"))
                        {
                            outText.Write('[' + node.Attributes["alt"].Value);
                            endElementString = "]";
                        }
                        if (node.Attributes.Contains("src"))
                        {
                            outText.Write('<' + node.Attributes["src"].Value + '>');
                        }
                        isInline = true;
                        break;
                    default:
                        isInline = true;
                        break;
                }
                if (!skip && node.HasChildNodes)
                {
                    ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten){ ListIndex = listIndex });
                }
                if (endElementString != null)
                {
                    outText.Write(endElementString);
                }
                break;
        }
    }
}
internal class PreceedingDomTextInfo
{
    public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
    {
        IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
    }
    public bool WritePrecedingWhiteSpace {get;set;}
    public bool LastCharWasSpace { get; set; }
    public readonly BoolWrapper IsFirstTextOfDocWritten;
    public int ListIndex { get; set; }
}
internal class BoolWrapper
{
    public BoolWrapper() { }
    public bool Value { get; set; }
    public static implicit operator bool(BoolWrapper boolWrapper)
    {
        return boolWrapper.Value;
    }
    public static implicit operator BoolWrapper(bool boolWrapper)
    {
        return new BoolWrapper{ Value = boolWrapper };
    }
}

As an example, the following HTML code...

<!DOCTYPE HTML>
<html>
    <head>
    </head>
    <body>
        <header>
            Whatever Inc.
        </header>
        <main>
            <p>
                Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
            </p>
            <ol>
                <li>
                    Please confirm this is your email by replying.
                </li>
                <li>
                    Then perform this step.
                </li>
            </ol>
            <p>
                Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
            </p>
            <ul>
                <li>
                    a point.
                </li>
                <li>
                    another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
                </li>
            </ul>
            <p>
                Sincerely,
            </p>
            <p>
                The whatever.com team
            </p>
        </main>
        <footer>
            Ph: 000 000 000<br/>
            mail: whatever st
        </footer>
    </body>
</html>

...will be transformed into:

Whatever Inc. 


Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things: 

1.  Please confirm this is your email by replying. 
2.  Then perform this step. 

Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please: 

*   a point. 
*   another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>. 

Sincerely, 

The whatever.com team 


Ph: 000 000 000
mail: whatever st

...as opposed to:

        Whatever Inc.


            Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:

                Please confirm this is your email by replying.

                Then perform this step.


            Please solve this . Then, in any order, could you please:

                a point.

                another point, with a hyperlink.


            Sincerely,


            The whatever.com team

        Ph: 000 000 000
        mail: whatever st

You could use this:

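// Requires: using System.Text.RegularExpressions; using System.Web; (reference System.Web.dll)
public static string StripHTML(string HTMLText, bool decode = true)
{
    // Remove anything that looks like a tag; the text between tags is left untouched.
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    var stripped = reg.Replace(HTMLText, "");
    // Optionally turn entities such as &amp; back into characters.
    return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
}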

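Updated

Thanks for the comments; I have updated the answer to improve this function.

I've heard from a reliable source that, if you're doing HTML parsing in .NET, you should take another look at the HTML Agility Pack.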

http://www.codeplex.com/htmlagilitypack

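Some samples on SO:

HTML Agility Pack - Parsing tables

Because I wanted a conversion to plain text with line feeds and bullets, I found this neat solution on CodeProject, which covers many conversion use cases:

Convert HTML to Plain Text

Yes, it looks big, but it works fine.

Have you tried http://www.aaronsw.com/2002/html2text/? It's Python, but it's open source.

Assuming you have well-formed HTML, you could also try an XSL transform.

Here's an example: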

using System;
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Xsl;

class Html2TextExample
{
    public static string Html2Text(XDocument source)
    {
        var writer = new StringWriter();
        Html2Text(source, writer);
        return writer.ToString();
    }

    public static void Html2Text(XDocument source, TextWriter output)
    {
        Transformer.Transform(source.CreateReader(), null, output);
    }

    // Cached transform; compiled once on first use.
    private static XslCompiledTransform _transformer;
    public static XslCompiledTransform Transformer
    {
        get
        {
            if (_transformer == null)
            {
                _transformer = new XslCompiledTransform();
                // method="text" emits the document's text content without re-escaping characters such as & and <.
                var xsl = XDocument.Parse(@"<?xml version='1.0'?><xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl""><xsl:output method=""text"" omit-xml-declaration=""yes"" encoding=""UTF-8"" /><xsl:template match=""/""><xsl:value-of select=""."" /></xsl:template></xsl:stylesheet>");
                _transformer.Load(xsl.CreateNavigator());
            }
            return _transformer;
        }
    }

    static void Main(string[] args)
    {
        var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
        var text = Html2Text(html);
        Console.WriteLine(text);
    }
}

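The easiest approach would probably be tag stripping combined with replacing some tags with text layout elements, such as dashes for list elements (li) and line breaks for br and p. It shouldn't be too hard to extend this to tables.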
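A minimal sketch of that idea, using regular expressions rather than a real HTML parser, might look like the following. The class name SimpleHtmlToText and the particular replacements are assumptions made for illustration; regex-based stripping will misbehave on malformed markup or on a ">" inside an attribute value.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; the name and the chosen replacements are illustrative only.
public static class SimpleHtmlToText
{
    private const RegexOptions Options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    public static string Convert(string html)
    {
        // Drop script and style blocks entirely, including their content.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "", Options);
        // Line breaks for <br> and a blank line after each paragraph.
        text = Regex.Replace(text, @"<br\s*/?>", "\n", Options);
        text = Regex.Replace(text, @"</p\s*>", "\n\n", Options);
        // A dash at the start of every list item.
        text = Regex.Replace(text, @"<li\b[^>]*>", "\n- ", Options);
        // Strip whatever tags remain, then decode entities such as &amp;.
        text = Regex.Replace(text, @"<[^>]+>", "", Options);
        return WebUtility.HtmlDecode(text).Trim();
    }
}
```

Tables could be handled in the same spirit, for example by replacing closing td tags with tabs and closing tr tags with line breaks before the final strip.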

I had some decoding issues with HtmlAgility and didn't want to invest time in investigating them.

Instead, I used this utility from the Microsoft Team Foundation API:

var text = HtmlFilter.ConvertToPlainText(htmlContent);

Another post suggests the HTML agility pack:

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.

This function converts "what you see in the browser" to plain text with line breaks. (If you want to view the result in a browser, use the commented-out return statement instead.)

// Requires a reference to System.Windows.Forms and must run on an STA thread.
public string HtmlFileToText(string filePath)
{
    using (var browser = new WebBrowser())
    {
        string text = File.ReadAllText(filePath);
        browser.ScriptErrorsSuppressed = true;
        browser.Navigate("about:blank");
        // Write the markup into the blank document, then read back the rendered text.
        browser.Document?.OpenNew(false);
        browser.Document?.Write(text);
        return browser.Document?.Body?.InnerText;
        //return browser.Document?.Body?.InnerText.Replace(Environment.NewLine, "<br />");
    }
}

I don't know C#, but there is a fairly small and easy-to-read Python html2txt script here: http://www.aaronsw.com/2002/html2text/

I recently blogged about a solution that worked for me: using a Markdown XSLT file to transform the HTML source. The HTML source will of course need to be valid XML first.

Try the easy and usable way: just call StripHTML(WebBrowserControl_name);

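// Requires references to System.Windows.Forms and the mshtml interop assembly (Microsoft.mshtml).
public string StripHTML(WebBrowser webp)
{
    try
    {
        // The original snippet used an undeclared 'doc'; this assumes it is the control's underlying MSHTML document.
        IHTMLDocument2 doc = webp.Document.DomDocument as IHTMLDocument2;
        if (doc == null)
        {
            return "";
        }
        // Select the whole page, then read the selection back as plain text.
        doc.execCommand("SelectAll", true, null);
        IHTMLSelectionObject currentSelection = doc.selection;

        if (currentSelection != null)
        {
            IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
            if (range != null)
            {
                currentSelection.empty();
                return range.text;
            }
        }
    }
    catch (Exception)
    {
        // Errors are deliberately swallowed; the caller gets an empty string.
        //MessageBox.Show(ep.Message);
    }
    return "";
}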

In GeneXus you can do it with a regex:

&pattern = '<[^>]+>'

&TSTRPNOT=&TSTRPNOT.ReplaceRegEx(&pattern,"")

If you are using .NET Framework 4.5, you can use System.Net.WebUtility.HtmlDecode(), which takes an HTML-encoded string and returns the decoded string.

Documented on MSDN at: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx

You can use this in a Windows Store app as well.
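As a small illustration of what that call does, here is a sketch; the wrapper class EntityDecoding is hypothetical. Note that HtmlDecode only converts character entities back into characters. It does not remove tags, so it is usually combined with one of the tag-stripping approaches above.

```csharp
using System.Net;

// Hypothetical wrapper around the framework call, for illustration only.
public static class EntityDecoding
{
    public static string Decode(string encoded)
    {
        // "&lt;" becomes "<", "&amp;" becomes "&", and so on.
        return WebUtility.HtmlDecode(encoded);
    }
}
```

For example, EntityDecoding.Decode("Tom &amp;amp; Jerry &amp;lt;3") yields "Tom &amp; Jerry &lt;3" when viewed as HTML source, that is, the literal text: Tom & Jerry <3.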

You can use the WebBrowser control to render your HTML content in memory. After the LoadCompleted event has fired:

IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;

This is another solution to convert HTML to Text or RTF in C#:

    SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
    h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
    string text = h.ConvertString(htmlString);

This library is not free; it is a commercial product, and it is my own product.

Reference URL: https://stackoverflow.com/questions/731649/how-can-i-convert-html-to-text-in-c
