Regex to find season info in imported xmltv file supported?

Added by Ludi K. about 1 month ago

Hi,

For OpenTV, it is possible to define regular expressions in share/tvheadend/data/conf/epggrab/opentv/prov/skyit to tell tvheadend how to find the season and episode information in the description field of the epg.

Could anybody tell me if something similar is possible with epg data imported with tv_grab_file from an xmltv file? If so, what path and filename should I use for the file with the regex information?

Thanks in advance for any help.


Replies (23)

RE: Regex to find season info in imported xmltv file supported? - Added by Em Smith about 1 month ago

Which xmltv generator are you using to generate your file?
It should generate entries such as:

  <episode-num system="xmltv_ns"> 2 / 11 . 7 / 13 .  </episode-num>                                                                                                                             

I'm guessing your generator is a website scraper? Ideally you should request they add the functionality since then it would be available to all the PVRs rather than being tvheadend-specific.

RE: Regex to find season info in imported xmltv file supported? - Added by Ludi K. about 1 month ago

Hi,

Thanks for your reply.

Your assumption is probably right; I switched temporarily to webgrab+, as the skyit opentv stopped adding the season and episode information to many events. In the meantime, I managed to create a workaround by editing the xmltv file with sed and xslt before importing it, to put the information in an episode-num tag as you described above.
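
For anyone wanting to do the same pre-processing step without sed and xslt, here is a minimal Python sketch of the idea: scan each programme's description for season/episode text and write it into an xmltv_ns episode-num tag. The regex (matching text like "S7 Ep4" or "S7 Ep4/22") and the function name are assumptions for illustration only; the pattern would need adapting to whatever the real descriptions contain. The new tag is simply appended as the last child of the programme, which a strict DTD validator may object to.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: matches text such as "S7 Ep4" or "S7 Ep4/22".
SEASON_EPISODE = re.compile(
    r"S(?P<season>\d+)\s*Ep(?P<episode>\d+)(?:/(?P<total>\d+))?"
)


def add_episode_num(root):
    """Add an xmltv_ns <episode-num> to programmes whose <desc> matches.

    Returns the number of programmes that were changed.
    """
    changed = 0
    for programme in root.iter("programme"):
        # Leave programmes alone if they already carry xmltv_ns numbering.
        if any(tag.get("system") == "xmltv_ns"
               for tag in programme.findall("episode-num")):
            continue
        match = SEASON_EPISODE.search(programme.findtext("desc") or "")
        if match is None:
            continue
        season = int(match.group("season")) - 1    # xmltv_ns counts from zero
        episode = int(match.group("episode")) - 1
        total = match.group("total")
        episode_part = f"{episode}/{total}" if total else str(episode)
        tag = ET.SubElement(programme, "episode-num", {"system": "xmltv_ns"})
        tag.text = f"{season}.{episode_part}."
        changed += 1
    return changed
```

To apply it to a file, parse the file with ET.parse, call add_episode_num(tree.getroot()) and write the tree back out before handing the result to tv_grab_file.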

But I understand why the regex is limited to the OTA sources: the user does not have the possibility to manipulate the epg before the import, while it is possible to do it with an xmltv source.

RE: Regex to find season info in imported xmltv file supported? - Added by Em Smith about 1 month ago

I noticed there is tv_augment as part of the xmltv package which suggests it does fixups on xmltv files such as episode numbers ("check_potential_numbering_in_text" and "extract_numbering_from_episode") and also genre fixups and other fixups. I couldn't determine how it all works but it might be useful.

It's a shame the webgrab+ is closed-source so it can't be fixed to generate good episode data.

RE: Regex to find season info in imported xmltv file supported? - Added by saen acro about 1 month ago

When a channel is mapped, you can attach an EPG source to it.

RE: Regex to find season info in imported xmltv file supported? - Added by Ludi K. about 1 month ago

Thanks Em for pointing me to the tv_augment tool; I did not know it existed. I don't know if it would have been helpful, but it is probably worth a try. I nevertheless hope that the season and episode information will be available over OTA soon again.

In fact, I chose webgrab+, because I already tried it not long ago and it did contain the information I was looking for. Part of it seems to be public:
https://github.com/silentbuteo2/webgrabplus-siteinipack

RE: Regex to find season info in imported xmltv file supported? - Added by Em Smith about 1 month ago

I thought the Italy OTA was fixed recently? (Sep 29). Or is there an outstanding problem?

I've never used webgrab+, but looking at a couple of the configuration files, I noticed uk ones had what I guess is season/episode extraction, whereas the Italian ones I looked at didn't have it.

index_episode.modify {substring(type=regex pattern="(S'S1' Ep'E1'/'Et1')""(Ep'E1'/'Et1')""(S'S1' Ep'E1')""(Ep'E1')""'S1'/'E1'.")|'index_description' "^(\d+\/\d+\.)"}
index_description.modify {remove('index_episode' not "" type=regex)|^\d+\/\d+\.\s}

[[https://github.com/SilentButeo2/webgrabplus-siteinipack/blob/master/siteini.pack/UK/freesat.co.uk.ini]]

So it seems to me that webgrab+ supports generating season/episode info; it just hasn't been written for Italy yet.

RE: Regex to find season info in imported xmltv file supported? - Added by Ludi K. about 1 month ago

The "Italy OTA" or, more precisely, the OTA EPG broadcast on the "Giallo" channel via satellite (I think that it is the opentv epg from skyit) was indeed fixed. But at least for the Rai4 channel, they have now stopped sending the episode information; they are only sending the season information embedded in the description field (at least for the epg I am able to see in the tvheadend webui; see attached picture). That's why I looked for another epg source.

I don't really understand the two lines:

index_episode.modify {substring(type=regex pattern="(S'S1' Ep'E1'/'Et1')""(Ep'E1'/'Et1')""(S'S1' Ep'E1')""(Ep'E1')""'S1'/'E1'.")|'index_description' "^(\d+\/\d+\.)"}
index_description.modify {remove('index_episode' not "" type=regex)|^\d+\/\d+\.\s}

The first line seems to modify the episode-num tag of the xmltv file. Which system is it using: xmltv_ns or onscreen? And how does the single regex after the pipe, which presumably works on the description field, match up with the different regular expressions before the pipe, given that some have only episode information, others have season and episode, etc.? I don't really expect an answer to this, as this is not the right place to dissect the webgrab+ ini files...

ouat2.png (16.2 KB)

RE: Regex to find season info in imported xmltv file supported? - Added by Em Smith about 1 month ago

Ah, I see. That's bad about the Rai4.

I took a look and SchedulesDirect say they support Italy. They give a free one-week trial and the data they provide (at least for my channels) really is amazing and costs 20 euro a year.

[[http://www.schedulesdirect.org/regions]]

As an example, whereas OTA gives me basic description, SD gives full cast, crew, keywords to describe the programme, rating, age limits, other programmes you might like, etc. I don't know how it is for non-English shows.

Use "tv_grab_zz_sdjson_sqlite".

You can always trial it by giving the grabber a higher priority than OTA or disabling OTA and removing ~/.hts/tvheadend/epgdb.v2 when tvheadend is stopped.

RE: Regex to find season info in imported xmltv file supported? - Added by Ludi K. about 1 month ago

Schedules Direct seems interesting.

However, I don't really understand how it works:
- tv_grab_zz_sdjson_sqlite is a grabber from the xmltv package; correct?

As far as I could understand, it downloads the epg data into a local database. But how does the epg data get from the local database into tvheadend?

Thanks in advance for any reply.

RE: Regex to find season info in imported xmltv file supported? - Added by Em Smith about 1 month ago

In Configuration->Channel->Grabber, enable Internal XMLTV: Multinational (Schedules Direct JSON).

I actually run "Internal XMLTV: Combine Data from several other grabbers" which calls SD json a couple of times and combines the results, but that is very memory intensive.

The SD sqlite database is just used as a cache, similar to the files that other grabbers use as a cache but faster.

This is because often you have the same programmes shown several times over the weeks, and the database contains information about series (such as total number of episodes) as well as episodes (specific information about one episode). The grabber automatically purges the cache of programmes that are no longer broadcast.

The SD grabber downloads data, inserts it into its DB and then generates an xmltv file, similar to webgrab+, which then appears in Tvheadend (Tvheadend launches the grabber via the EPG Grabber tab as an internal crontab).

First, though, you need to configure the grabber outside tvheadend. I forget the exact steps; it was a bit tedious. I think you have to run it first with "--manage-lineups", where you enter your region; it then gives you the satellites that broadcast to that region and adds that to your database.

Then I think you run with "--configure" where it then maps the lineup to a config file.

Then it should all just work. You can test it via:

tv_grab_zz_sdjson_sqlite --config-file italy.xml --days 1 --output-file out.xml

Then your out.xml should contain data on all your programmes and channels.

Then you can enable it in tvheadend and re-run the grabber.

It might just work since it links channel name on tvheadend to channel name in xmltv file. But sometimes it doesn't match all of them correctly so you have to go to the epg tab and add missing ones.

I submitted a pull request to the SD grabber to include images in the downloaded data so that should be available soon.

In Tvheadend grabber, I'd tick "scrape credits" if you have a machine with decent memory. That allows you to schedule recordings based on extra criteria in the autorec advanced tab such as "comedy" or "sumo-wrestling", etc.

RE: Regex to find season info in imported xmltv file supported? - Added by Em Smith about 1 month ago

I also add "--days 10" under "extra arguments" in Tvheadend, mainly because although SD has listings out to 14 days, it does get a bit vague after a week on some channels with generic programmes such as "film to be announced", or "Simpsons" but no specifics about the actual episode. Then when it gets to, say, 12 days it will update to be the correct episode details. So I just cap my downloads to a maximum of ten days in the future to avoid seeing them.

RE: Regex to find season info in imported xmltv file supported? - Added by Ludi K. about 1 month ago

Thanks for the detailed explanation. :-)

RE: Regex to find season info in imported xmltv file supported? - Added by Em Smith about 1 month ago

No problem. It occurred to me that maybe you run webgrab+ externally (crontab perhaps) and then tv_grab_file or "nc" to import the file to Tvheadend. In which case you can do the same with SD by using the --output-file option as in the example in earlier post. This is useful if your Tvheadend machine doesn't support the SD grabber but you have another machine that does.

Also, FYI, here's some of the info it has for the programme you gave, which I think is relatively readable.

Don't worry about the obscure format for episode-num; Tvheadend parses it into S7E4. I know the SD API supports multiple languages for the same programme, so I'd expect the generated file to get Italian descriptions automatically, since the API suggests the Italian "dd_progid" would be different from the English one and might have an English title and Italian description.

  <episode-num system="dd_progid">EP01419478.0152</episode-num>
  <episode-num system="xmltv_ns"> 6 / 7 . 3 .  </episode-num>
  <previously-shown start="20171027 +0000" />
...
  <title>Once Upon a Time</title>
  <sub-title>Beauty</sub-title>
  <desc lang="en">After being contacted by the child she gave up a decade earlier, a bail bond collector must fight for the present-day world, in which fairy-tale characters now live.</desc>
...
   <actor role="Evil Queen/Regina/Roni">Lana Parrilla</actor>
   <actor role="Rumplestiltskin/Mr. Gold">Robert Carlyle</actor>
   <actor role="Hook/Rogers">Colin O'Donoghue</actor>
   <actor role="Henry Mills">Andrew J. West</actor>
   <actor role="Lady Tremaine/Victoria Belfrey">Gabrielle Anwar</actor>
   <actor role="Cinderella/Jacinda">Dania Ramirez</actor>
   <actor role="Lucy">Alison Fernandez</actor>
...
  <category>series</category>
  <category>Drama</category>
  <category>Fantasy</category>
  <category>Series</category>
  <category>Episode</category>
  <category>Show</category>
  <keyword>2010s</keyword>
  <keyword>Alternate reality</keyword>
  <keyword>Dark</keyword>
  <keyword>Daughter</keyword>
  <keyword>Enchanted forest</keyword>
  <keyword>Evil queen</keyword>
  <keyword>Fairy tales</keyword>
  <keyword>Fantasy world</keyword>
  <keyword>Mother</keyword>
  <keyword>Mother/son relationship</keyword>
  <keyword>Offbeat</keyword>
  <keyword>Prince Charming</keyword>
  <keyword>Snow White</keyword>
  <keyword>Son</keyword>
  <keyword>Witty</keyword>
  <icon src="https://s3.amazonaws.com/schedulesdirect/assets/p8678796_l_h9_ab.jpg" width="1440" height="1080" />
...
  <rating system="VCHIP">
   <value>TV-PG</value>
  </rating>
...

RE: Regex to find season info in imported xmltv file supported? - Added by Ludi K. about 1 month ago

You guessed right: I create the xmltv file on another machine and push it to the low spec NAS, where tvheadend is running, which in turn gets the tv channels from a sat>ip server.

By the way, I have been wondering for some time why there is a configure switch for hdhomerun in tvheadend. I always assumed that hdhomerun worked similarly to sat>ip!? Do you know the reason for it? But let's get back to the epg:

At least to begin with, I will continue using tv_grab_file to import the xmltv file, with the difference that it will be generated by the tv_grab_zz_sdjson_sqlite tool.

Thanks for the quote of the epg. There is a small mistake in the episode number; it says 6/7, but a season of Once Upon a Time has more than 20 episodes, not 7. I looked up the xmltv_ns format when I adapted the output of webgrab+; it is zero-based episode number / number of episodes in the season . zero-based season number . part number

I can cope if the description is not in Italian; for tv shows, I am mainly interested in the season and episode information. I will probably get around to trying the trial account tomorrow or the day after.

Finally, thank you also for the tip in your previous message about downloading only 10 days in advance because some events do not have all the details yet 14 days in advance.

RE: Regex to find season info in imported xmltv file supported? - Added by Em Smith about 1 month ago

I believe hdhomerun uses its own proprietary protocol instead of satip. So there's a library you can link against to get the data (and SD supports some APIs to automatically setup guide data). A number of projects support hdhomerun but don't support satip.

I think the episode number is correct and saying season 7 of a total of 7 seasons, episode 4 of unknown total number of episodes.
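
To make the format concrete, here is a small Python sketch (not Tvheadend's actual parser) that splits an xmltv_ns value into season, episode and part, converting the zero-based numbers to one-based. The function name is made up for illustration.

```python
def parse_xmltv_ns(text):
    """Split an xmltv_ns string into (number, total) pairs.

    The result has three pairs: season, episode, part. Numbers are converted
    from zero-based to one-based; anything missing becomes None.
    """
    # Pad so that a short value still yields exactly three fields.
    fields = (text.split(".") + ["", "", ""])[:3]
    result = []
    for field in fields:
        number, _, total = field.partition("/")
        number = number.strip()
        total = total.strip()
        result.append((int(number) + 1 if number else None,
                       int(total) if total else None))
    return result
```

For the value " 6 / 7 . 3 .  " quoted above, this gives (7, 7) for the season and (4, None) for the episode, i.e. season 7 of 7, episode 4 of an unknown total.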

Looking at my xml file, other episodes for other series have for example:

  <episode-num system="xmltv_ns"> 4 / 8 . 8 / 21 .  </episode-num>

Each programme has an associated "change id hash" so grabbers re-download the programme details if it changes. I expect at the end of the season they update the episodes to have the maximum episodes set so on the re-run you get more complete details of the total number of episodes in that season (in case it gets cancelled mid-season).

Remember that on day 6 hour 23 of your trial you can always download the next ten days of data so you effectively have a further couple of weeks to try things out before paying or reverting to webgrab+.

RE: Regex to find season info in imported xmltv file supported? - Added by Ludi K. about 1 month ago

I realise, that the episode-num format I gave above was wrong: it confused the season and episode fields.

But this is not the purpose of this message. I am trying schedules direct and I have a problem with the season and episode information on at least one channel. Here is an event of the channel:

 <programme channel="I99981.json.schedulesdirect.org" start="20171106233500 +0000" stop="20171107002500 +0000">
  <title>Blue Bloods</title>
  <desc lang="it">Le vicende della famiglia Reagan, composta interamente da poliziotti.</desc>
  <credits>
   <actor role="Frank Reagan">Tom Selleck</actor>
   <actor role="Danny Reagan">Donnie Wahlberg</actor>
   <actor role="Erin Reagan">Bridget Moynahan</actor>
   <actor role="Jamie Reagan">Will Estes</actor>
   <actor role="Henry Reagan">Len Cariou</actor>
   <actor role="Linda Reagan">Amy Carlson</actor>
   <actor role="Detective Maria Baez">Marisa Ramirez</actor>
   <actor role="Nicky Reagan-Boyle">Sami Gayle</actor>
   <actor role="Officer Eddie Janko">Vanessa Ray</actor>
   <producer>Leonard Goldberg</producer>
   <producer>Kevin Wade</producer>
  </credits>
  <category>Crime drama</category>
  <category>Series</category>
  <category>Show</category>
  <keyword>Assistant district attorney</keyword>
  <keyword>Police detective</keyword>
  <keyword>Police work</keyword>
  <keyword>Family traditions</keyword>
  <keyword>2010s</keyword>
  <episode-num system="dd_progid">SH01637741.0000</episode-num>
  <episode-num system="xmltv_ns">  / 8 .  / 154 .  </episode-num>
  <audio>
   <stereo>stereo</stereo>
  </audio>
  <previously-shown start="20100924 +0000" />
  <rating system="CHVRS">
   <value>14+</value>
  </rating>
  <rating system="ClassInd">
   <value>14</value>
  </rating>
 </programme>

As you can see, the episode-num tag in xmltv_ns format only contains the total number of seasons and episodes; so tvheadend cannot get the season and episode number from that tag. But there is also an episode-num tag in the dd_progid format. However, tvheadend is not showing any season and episode information for nearly all the events of that channel, even though the xmltv file has the two episode-num tags as described above. (Only giving the total number of seasons and episodes in the xmltv_ns format.)

I read that the episode-num tag in the dd_progid format is mainly used in North America; but this does not explain why tvheadend does not show the season and episode information for that channel. Do you have an idea what is going on? I am particularly asking in order to determine whether it is a tvheadend problem or a Schedules Direct problem.

Thanks in advance for any reply.

RE: Regex to find season info in imported xmltv file supported? - Added by Em Smith about 1 month ago

Sorry for delay.

The xmltv_ns is strange. I ran my test code against that episode and get the same results, so it is being converted correctly from the SD data.

I have the same series and get proper episode information. But mine are called EP01420516.0142. The "SH" that you have are "generic information" whereas "EP" is episode-specific information.

So, if I query for "SH01420516.0000" (the generic information for the episode) then I get totalEpisodes: 155, totalSeasons: 8, which is what you are getting.

It's quite possible that since the date was the 6th which is t+8 then if you re-grab now (when it is t+6 days) that the generic episode has now become a specific episode. By which I mean that instead of SH you now get EP and specific information.

So that channel was saying "In eight days I will be broadcasting some episode of Blue Bloods (but I don't know which episode yet)" and then as the day gets nearer it then updates the information to be "in two days I will be showing this specific episode of Blue Bloods". That's why I have --days 10, but perhaps your channels need --days 7 to limit to just a week ahead?

Let me know if the schedule has updated to be more specific.

The dd_progid unfortunately does not allow extraction of season/episode from the string. It refers to a specific episode (if it is an EP "episode") or a generic episode (if it's an SH "show"). So for example a Blue Bloods episode is EP01420516.0142, but you can't break-out the .0142 in to season/episode. That information is sent separately as part of the API.
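
A quick way to see how many entries on each channel are generic is to count the dd_progid values starting with "SH" in the generated file. This is only a minimal Python sketch with a hypothetical function name; it assumes the file layout shown in the programme quoted earlier in the thread.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def generic_share_by_channel(root):
    """Return {channel id: (generic programmes, all programmes)}.

    A programme counts as generic when its dd_progid starts with "SH".
    """
    totals, generic = Counter(), Counter()
    for programme in root.iter("programme"):
        channel = programme.get("channel")
        totals[channel] += 1
        for tag in programme.findall("episode-num"):
            progid = (tag.text or "").strip()
            if tag.get("system") == "dd_progid" and progid.startswith("SH"):
                generic[channel] += 1
                break
    return {channel: (generic[channel], totals[channel]) for channel in totals}
```

Running it on the out.xml produced by the grabber (via ET.parse("out.xml").getroot()) shows which channels are dominated by "SH" entries and so are unlikely to show season/episode numbers.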

RE: Regex to find season info in imported xmltv file supported? - Added by Ludi K. about 1 month ago

Thanks for confirming that it is not a problem on my side.

Unfortunately, that channel has a lot of "SH", up to the current time. I have not looked into many channels yet, but it seems that the quality of the epg depends on the channel. For example, Rai 4, which is broadcasting many US TV Shows has a good epg with season and episode information.

I now have to activate the schedules direct epg for more channels to get a better overview of the offer.

Thanks again for all your help.

RE: Regex to find season info in imported xmltv file supported? - Added by Em Smith about 1 month ago

That's a shame that the guide data isn't accurate for some channels.

On the SD support forum there is a post "What to do about wrong guide data". Although it's a bit US-specific, it says to check whether anyone else has better guide data; if so, report the issue to them and they will look into it, but don't expect great success. I don't know if that will apply for those channels.

I took a look at mythtv FAQ and it just says use tv_grab_it. That seems to be basically a scraper so probably no better than webgrab+.

You can use a mixture of grabbers, so if webgrab+ works for channels 1,2,3 then use that and use SD for channels 4,5,6.

Otherwise I can't think of much else to try.

RE: Regex to find season info in imported xmltv file supported? - Added by Ludi K. about 1 month ago

For SD, I hope that, if they are giving information, it is at least correct. (I have already seen epg data on other providers continuing to announce a specific show at its usual timeslot, though it was not broadcast anymore).

Concerning other providers: I don't know for tv_grab_it, but webgrab+ offers several providers for most channels. The user has to create a configuration file that tells webgrab+ what channels to download; it is possible to freely mix providers in the configuration file.

But for the moment, I am trying to find an epg source with accurate data and working over a long period without having to continuously do adjustments.

RE: Regex to find season info in imported xmltv file supported? - Added by Em Smith about 1 month ago

That is bad that SD is returning shows that are not broadcast. I've not seen that for the channels I watch. I guess the best you can hope is to file a bug on SD and see if they can source better data and give you a new trial to evaluate it. No point in paying for wrong data. Sorry it didn't work better for you.

What I meant by mixing grabbers is that you could always use SD for Rai to get good data for US shows and then use webgrab+ (and even OTA) for the rest. The tv_grab_combiner combines multiple xmltv files in to one file, but not in an efficient way, and is what I use since I have to combine satellite and aerial channels.
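
The per-channel mixing can be sketched in a few lines of Python. This is not how tv_grab_combiner works internally; it is just an illustration, with a hypothetical function name, that copies only the wanted channels and their programmes from each source into one tv element. It assumes the channel ids do not collide between sources.

```python
import xml.etree.ElementTree as ET


def merge_guides(sources):
    """Build one <tv> element from several (root, wanted_channel_ids) pairs.

    Only the listed channel ids are copied from each source.
    """
    merged = ET.Element("tv")
    channels, programmes = [], []
    for root, wanted in sources:
        channels += [c for c in root.findall("channel")
                     if c.get("id") in wanted]
        programmes += [p for p in root.findall("programme")
                       if p.get("channel") in wanted]
    # XMLTV expects all <channel> elements before the <programme> elements.
    merged.extend(channels)
    merged.extend(programmes)
    return merged
```

The merged element can be written out with ET.ElementTree(merged).write(...) and imported like any other xmltv file.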

In your screenshot for Once Upon a Time, I couldn't see a sub-title so you can't even use a post-processor script that looks up episode numbers from episode names.

RE: Regex to find season info in imported xmltv file supported? - Added by Ludi K. about 1 month ago

Sorry for not having been clear: the shows I saw that were wrongly announced were not in the epg data from SD; but from another provider. I adjusted my previous message to make this clear.

Concerning the merging of xmltv files, there is another tool you might try:
tv_cat

It is too early for me to come to a conclusion about SD. There are still a lot of channels whose data I have not looked at yet.

RE: Regex to find season info in imported xmltv file supported? - Added by Em Smith about 1 month ago

You're right. But I really should just write a script to grab each one and send it to xmltv.sock directly and avoid all the combining. IWBNI the grabber allowed all SD lineups to be retrieved at the same time.
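
Sending a file to the socket is the same thing "nc -U" does, and could look roughly like this in Python. This is a sketch under assumptions: the function name is made up, and the socket path has to be supplied by the caller (I believe it normally lives under the Tvheadend config directory, something like ~/.hts/tvheadend/epggrab/xmltv.sock, but check your own installation).

```python
import socket


def send_to_tvheadend(xmltv_path, socket_path):
    """Stream an XMLTV file to a Unix domain socket, as `nc -U` would."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        with open(xmltv_path, "rb") as guide:
            # Send in chunks so large guides are not read into memory at once.
            for chunk in iter(lambda: guide.read(65536), b""):
                sock.sendall(chunk)
```

Calling it once per grabbed lineup would push each file separately, so no combining step is needed.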
