Do you specifically need sed or awk for this? With sed and awk, the output depends entirely on how the HTML file is formatted. For example, now that I have fixed the formatting of the HTML file in your question, the output would change drastically.
Commented Oct 31, 2020 at 9:05

How about using a specialised tool such as html2text? html2text produces output like this paste, which can be modified further with sed and other command-line tools to get the desired output, such as this other paste. Of course, you can modify it further. Also note that the indentation and the position of elements depend on the HTML source and on the width of the screen (in characters). The width can be passed to html2text.
Commented Oct 31, 2020 at 9:05

Try opening your HTML document in a browser, then select all (Ctrl+A), copy, and paste the result into a text editor.
Commented Oct 31, 2020 at 10:04

sed or awk, anything should be fine. I need to convert the HTML to text column-wise.
Commented Oct 31, 2020 at 12:36

The html2text tool is not available on the server.
Commented Oct 31, 2020 at 12:37

With sed and awk, the output depends entirely on the formatting of the HTML file. For example, the HTML source in revision #3 would yield a different result from that of revision #4.
Alternatively, you can use a dedicated tool such as html2text. html2text renders the HTML page as plain text. Of course, you can manipulate the output further with other command-line tools such as sed and awk.
To install html2text, simply run:
sudo apt install html2text
To get started simply run:
html2text file.html
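Since the tool may not be installed on every machine, the call can be wrapped in a small guard. This is only a sketch: the function name `render_html` and its arguments are made up for illustration, and the 79-character fallback simply mirrors the default width described in this answer.

```shell
# render_html FILE [WIDTH]  (hypothetical helper, not part of html2text)
# Converts FILE to plain text, or fails with a clear message if the
# html2text command cannot be found.
render_html() {
    file=$1
    width=${2:-79}    # fall back to the default width mentioned in this answer
    if ! command -v html2text >/dev/null 2>&1; then
        echo "html2text is not installed; try: sudo apt install html2text" >&2
        return 127
    fi
    html2text -width "$width" "$file"
}
```

It would be called as `render_html file.html` or `render_html file.html 261`.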
By default, html2text formats HTML documents for a screen width of 79 characters, so the result would look like this:
___________________________________Additional_M2Ms_&_Standalone_DataMasking_List_for_09_10_2020_PST____________________________________
|Start|FM CR   |CR Type   |Customer|Source|Target|DM  |Release |Data  |CDB |FreeSpace|TDE/DV|M2M  |M2M |Database Reorg      |Operations |
|Time_|________|__________|Name____|Pod___|Pod___|Flag|________|Center|Sync|Check____|Check_|Optin|Type|Details_____________|Team_______|
|09/  |        |          |        |      |      |    |        |      |    |         |      |     |    |Reclaimable|        |           |
|10/  |11124482|M2M       |TCS     |KCLB- |EGLG- |N   |Revision|ks8-  |Yes |Passed   |Passed|Y    |sDC |Space: 3532|Reorg   |   RAMU    |
|2020-|        |          |        |CDB   |TEST  |    |13.20.07|US-OCC|    |         |      |     |    |GB         |Required|           |
|19:00|________|__________|________|______|______|____|________|______|____|_________|______|_____|____|___________|________|___________|
|09/  |        |Standalone|        |      |      |    |        |      |    |         |      |     |    |           |        |           |
|10/  |11170981|Data      |Wipro,  |      |LMNO- |    |Revision|ns2-US|NA  |NA       |NA    |NA   |NA  |    NA     |   NA   |DataMasking|
|2020-|        |Masking   |Inc.    |      |TEST  |    |13.20.07|      |    |         |      |     |    |           |        |           |
|19:00|________|__________|________|______|______|____|________|______|____|_________|______|_____|____|___________|________|___________|
Thanks,
M2M Ops
Note: This is a system generated email, still you can reply with queries/
suggestions.
However, you can change the width to any number of characters. For example, the output in your question is 261 characters wide, so you can also use
html2text -width 261 file.html
which would yield:
_________________________________________________________________________________________________Additional_M2Ms_&_Standalone_DataMasking_List_for_09_10_2020_PST_________________________________________________________________________________________________
|Start_Time______|FM_CR___|CR_Type________________|Customer_Name|Source_Pod|Target_Pod|DM_Flag|Release__________|Data_Center|CDB_Sync|FreeSpace_Check|TDE/DV_Check|M2M_Optin|M2M_Type|Database_Reorg_Details____________________________________|Operations_Team___|
|09/10/2020-19:00|11124482|M2M____________________|TCS__________|KCLB-CDB__|EGLG-TEST_|N______|Revision_13.20.07|ks8-US-OCC_|Yes_____|Passed_________|Passed______|Y________|sDC_____|Reclaimable_Space:_3532_GB|Reorg_Required_________________|_______RAMU_______|
|09/10/2020-19:00|11170981|Standalone_Data_Masking|Wipro,_Inc.__|__________|LMNO-TEST_|_______|Revision_13.20.07|ns2-US_____|NA______|NA_____________|NA__________|NA_______|NA______|____________NA____________|______________NA_______________|DataMasking_______|
Thanks,
M2M Ops
Note: This is a system generated email, still you can reply with queries/suggestions.
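If only a few columns are needed, the wide layout helps, because each table row sits on a single line and can be split on the pipe character. The following is a rough sketch rather than a tested recipe for every table: the function name `pick_columns` is invented for illustration, and it assumes that table rows start with `|` and that cells are padded with underscores as in the output above.

```shell
# pick_columns LIST  (hypothetical helper)
# Reads html2text output on stdin and prints the requested table columns
# as tab-separated text. LIST is a comma-separated list of 1-based
# column numbers, e.g. 1,2,4.
pick_columns() {
    awk -F'|' -v cols="$1" '
        BEGIN { n = split(cols, want, ",") }
        /^[|]/ {                          # only table rows start with a pipe
            out = ""
            for (i = 1; i <= n; i++) {
                cell = $(want[i] + 1)     # field 1 is the empty text before the first pipe
                gsub(/_/, " ", cell)      # cells are padded with underscores
                gsub(/^ +| +$/, "", cell) # trim leading and trailing spaces
                out = out (i > 1 ? "\t" : "") cell
            }
            print out
        }'
}
```

For example, `html2text -width 261 file.html | pick_columns 1,2,4` would print the start time, FM CR, and customer name of each row. Wrapped cells, as in the 79-character output, are not handled.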
Now, to post-process the output, for example to remove the pipe characters (|), underscores (_), empty lines, the very first line, and the last 3 lines, you can use whichever command-line tools suit your needs. An ugly method would look like
html2text -width 200 file.html | sed 's/|/ /g; s/_/ /g; /^$/d' | head -n -3 | tail -n +2
This would produce
Start Time FM CR CR Type Customer Name Source Pod Target Pod DM Flag Release Data Center CDB Sync FreeSpace TDE/DV Check M2M Optin M2M Type Database Reorg Details Operations Team
Check
09/10/2020-19: 11124482 M2M TCS KCLB-CDB EGLG-TEST N Revision ks8-US-OCC Yes Passed Passed Y sDC Reclaimable Reorg Required RAMU
00 13.20.07 Space: 3532 GB
09/10/2020-19: 11170981 Standalone Wipro, Inc. LMNO-TEST Revision ns2-US NA NA NA NA NA NA NA DataMasking
00 Data Masking 13.20.07
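The same cleanup could also be wrapped in a function that reads from standard input, which makes it easier to reuse and to try out on saved output. This is a sketch under a few assumptions: the name `strip_table_markup` is made up, lines containing only whitespace are dropped as well (a small deviation from the pipeline above), and awk replaces `head -n -3`, which needs GNU coreutils.

```shell
# strip_table_markup  (hypothetical helper)
# Reads html2text output on stdin and writes the cleaned text to stdout.
strip_table_markup() {
    # Turn pipes and underscores into spaces, then drop lines that are
    # empty or contain only whitespace.
    sed 's/[|_]/ /g; /^[[:space:]]*$/d' |
        # Buffer the remaining lines, then skip the first line (the title)
        # and the last three lines (the email footer).
        awk '{ line[NR] = $0 } END { for (i = 2; i <= NR - 3; i++) print line[i] }'
}
```

It would be used as `html2text -width 200 file.html | strip_table_markup`. The numbers of leading and trailing lines to skip are hard-coded for this particular email and would need adjusting for other documents.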