This page last changed on Dec 08, 2013 by iloveautomation.

I recently started trying different things using Regex to extract from a html file that is on my network webserver. The output is as below and I am trying to extract the program name which is the information after MediaName "Fareed Zakaria GPS"

HTML FILE

204 Connected (Build: 1.0.0 Clients: 1) Volume=50 Mute=False FS_Unknown=True TVTuner=True Play=True MediaName=Fareed Zakaria GPS MediaTime=323 TrackNumber=817 TrackTime=320

REGEX CODE

(?<=MediaName=)(.*?)(?=MediaTime)

For some reason I am not getting the information when I tie the command to a sensor. I am using VMC Controller for Windows Media Player to the information from WMCIP:40400

I was wondering if somebody could provide some help.

This regular expression should work

(.*MediaName=)(.*)(MediaTime.*)

And you'll have the desired string in matched group 2

Posted by ebariaux at Dec 09, 2013 08:49

Thanks Eric but unfortunately it did not work. I am getting N/A on the UI where I have a label linked to a sensor. I tried it on both iOS and Webconsole.

Secondly I tried this on an online regex builder (http://gskinner.com/RegExr/) and interestingly it was picking up all the text from the HTML file and not just the MediaName.

Not sure what is going on.

Posted by iloveautomation at Dec 09, 2013 17:34

Try this...more or less the same, but with slightly more location qualifiers...

^(?s)(.*MediaName=)(.*)(MediaTime).*$

It works on the simulator at the link you provided. Just make sure you set the "Read Regex Group" in the configuration of your open-remote command to "2"

Posted by holeymoley at Dec 09, 2013 21:04

Thanks Paddy. I updated the regex but I am not sure where to set the "Read Regex Group" to "2".
My command is a HTTP GET command and scans through a WMCStatus.html file which is sitting on my webserver.
I see the options when using a Telnet command.

Sorry for the ignorance.

Posted by iloveautomation at Dec 09, 2013 21:23

Sorry - you are correct, using a HTTP command does not give you the option of specifying the matched group! Apologies.... I automatically think in terms of TCP/IP and Telnet when regex is mentioned...
Maybe one of the Open Remote Devs could help to clarify if this is possible?

The only other way I can think of is on the web-server end, have some kind of script that can parse the html into a simple XML file and then you could easily use an XPATH expression to look it up. I have never used WMC controller, but if you can have a snoop around on there, it may even be the case that an XML file already exists with your required information in there... a lot of webpages I glean info from are scraped from XML in the first place by the server...

Posted by holeymoley at Dec 09, 2013 21:42

Just had a thought, It might be because your expression works as a 'Find' but not as a 'Match'...

Might be worth trying

^.*(?<=MediaName=)(.*)(?=MediaTime).*$

It will form a match as it includes wildcards to match the start and end of your file. It should also return only one match group, which hopefully the OR Http will pick up on...

My first regex in the post above returns multiple match groups, which probably is too much for the HTTP command to work with...

Hope you get it sorted anyways.

Posted by holeymoley at Dec 09, 2013 23:13

Hi Paddy...Tried it but no luck.
What surprises me is that for the code (?<=TrackNumber=)(\w*) I am correctly getting the TrackNumber (817 is the above example).
Unfortunately no luck with MediaName since it is a string and not one word.

Not being a programmer and limited knowledge on Java I am just trying different things. Thanks for your help so far.

Posted by iloveautomation at Dec 09, 2013 23:48

Weird...

\w
is going to match a word (but only one as far as I know), and as there is a space after the 817 that probably explains that... If you put in
.*
instead I wonder will it pull back everything to the end of the string..track-time etc..If it does, I really cant see why the full matching command I gave you wont work...

Can you post the actual html file you are trying to parse? The line in your first post appears to be from the controller debug log... The actual HTML might have formatting or unusual characters that's not being matched somehow..

Regex is not normally used to parse HTML as by its nature, HTML is not very structured. Don't let that deter you though, if its a simple/trivial file it should be possible..

Posted by holeymoley at Dec 10, 2013 00:12

Hi Paddy... What I have posted above is the html file. I have a batch script that telnets into the WMC system and spits out the output as a html file on to my xampp webserver.
So for example when I type in http://192.168.24.161/WMC_NowPlaying.html in my network I get the webpage with the following output.

204 Connected (Build: 1.0.0 Clients: 1) Volume=50 Mute=False FS_Unknown=True TVTuner=True Play=True MediaName=Fareed Zakaria GPS MediaTime=323 TrackNumber=817 TrackTime=320

Now I could have used Telnet command directly from Openremote but the problem is that it is a live stream of output and so never closes the telnet connection. Using a batch script I run the telnet connection and then kill the telnet connection after 1 second. I agree it is a roundabout way but not having any software background I am trying different things with my limited background.

Posted by iloveautomation at Dec 10, 2013 12:33

So for example when I type in http://192.168.24.161/WMC_NowPlaying.html in my network I get the webpage with the following output.

I guess what Paddy means is that he wants to see the source of that page. So once it is displayed in your browser, press <CTRL>+U to get the source. Copy that in your response and enclose that pasted text in between {code}...{code} tags

Posted by pz1 at Dec 10, 2013 15:07

Thanks for clarification Pieter, but that is what I get from the html source. There are no html tags

As mentioned earlier, I am using a batch script to write the telnet output to an html file and then kill the telnet window since it is a streaming output that does not stop hence does not exit the telnet connection.

Posted by iloveautomation at Dec 10, 2013 16:07

I think this will work for you:

(?<=MediaName=)(.*)(?=MediaTime)

We were close yesterday...seems like in the HTTP regex function, you can only have one return group and it is not necessary to match the entire string. My previous use of regex in OR was with telnet commands, which stumped me for a few days until I figured out (in my use case at least) that OR seemed to need to match the entire string and extract content from selectable groups..

I realise its incredibly close to what you had originally, but I have tested on a test webserver with a html file exactly like yours, with no html tags... I have in working on my Android touch interface (Galaxy S4)

Please let me know if this works for you?

Posted by holeymoley at Dec 10, 2013 21:04

Hi Paddy, I was finally able to getting it working with the code
(?<=MediaName=)[\s\S]*(?=MediaTime).
Seems to be a little crude but works for me and now I am able to get all the text following MediaName until MediaTime.

Posted by iloveautomation at Dec 27, 2013 01:16
Document generated by Confluence on Jun 05, 2016 09:39