Tutorial: Turning Any Web Page into an RSS Feed

The code is so heavily commented that there is not much left to explain, but let's call this a tutorial anyway.

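The only part you have to change yourself is the two regular expressions, $regex_item and $regex_item2. If there is interest, they could be the subject of a separate tutorial.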

The sharp-eyed reader may notice that both patterns match the same content, so why not merge them into one?

Work that out for yourself.
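As a hint, here is a minimal sketch of the two-pass idea on made-up markup. The sample HTML and the $results array are assumptions for illustration and do not come from the scripts below. One plausible reason for keeping two patterns is that the first only has to find where a block starts and ends, so the second can use loose wildcards without running across block boundaries.

```php
<?php
// Hypothetical sample markup, loosely modeled on a table of links.
$html = '<table>'
      . '<tr><td class="td-02"><a href="/a" target="_blank">First</a></td></tr>'
      . '<tr><td class="td-02"><a href="/b" target="_blank">Second</a></td></tr>'
      . '</table>';

// Pass 1: cut the page into uniform blocks.
$regex_item  = '#<td class="td-02">.+?</td>#s';
// Pass 2: pull the named fields out of one block at a time.
$regex_item2 = '#<a href="(?<link>.+?)" target="_blank">(?<title>.+?)</a>#s';

$results = [];
if (preg_match_all($regex_item, $html, $items)) {
    foreach ($items[0] as $item) {
        if (preg_match($regex_item2, $item, $m)) {
            $results[] = ['link' => $m['link'], 'title' => $m['title']];
        }
    }
}
print_r($results);
```

Each entry of $results holds the link and title of one block, which is the same information the full scripts turn into an RSS item.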

Example 1: Weibo Hot Search
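Both examples begin by including gethtml.php, a file this tutorial does not show. The following is only a hypothetical stand-in, assuming that gethtml() takes a URL and returns the page body as a string; the $timeout parameter and the User-Agent value are my own additions, and the author's real helper may work differently.

```php
<?php
// gethtml.php: hypothetical stand-in; the author's real helper is not shown.
function gethtml($url, $timeout = 15)
{
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'timeout' => $timeout,
            // Some sites reject requests that lack a browser-like User-Agent.
            'header'  => "User-Agent: Mozilla/5.0 (compatible; rss-sketch)\r\n",
        ],
    ]);
    // Suppress the warning on failure and always return a string.
    $html = @file_get_contents($url, false, $context);
    return $html === false ? '' : $html;
}
```

Returning an empty string on failure means the later preg_match_all() call simply finds nothing instead of raising an error. Some sites may also require cookies or extra headers, which this sketch does not handle.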

<?php

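include "gethtml.php";

// Rule 0 for every field: strip control characters, which are not allowed in XML.
$title_find[0]=$link_find[0]=$des_find[0]='#[\x00-\x08\x0b-\x0c\x0e-\x1f]#';
$title_replace[0]=$link_replace[0]=$des_replace[0]='';

$rss='';// Collects the <item> entries; initialized so that .= does not raise a warning.
$footer='</channel></rss>';

// Everything above is boilerplate setup; you can leave it alone.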

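// Change the feed title here ("热搜" means "Hot Search").
$header='<?xml version="1.0" encoding="utf-8"?><rss version="2.0"><channel><title>热搜</title>';

$html=gethtml('https://s.weibo.com/top/summary');// The page to scrape.

// For legacy GB-encoded pages only :( The charset check is case-insensitive.
$html=stripos($html,'charset=gb')===false?$html:iconv('GB2312','UTF-8//IGNORE',$html);

// The pattern sits between the two # delimiters. The s modifier lets . match newlines,
// so a match can span several lines; without it . stops at the end of a line.
// Check the page source to decide whether you need it.
$regex_item = '#<td class="td-02">.+?td>#s';

$regex_item2 = '#.*?<a href="(?<link>.+?)" target="_blank">(?<title>.+?)</a>.*#s';

// Find the repeating "blocks" in the downloaded page. Every block must share the same
// structure and contain a link and a title. A summary is optional, because the title
// can stand in for it.
if(preg_match_all($regex_item, $html, $items)){

    //print_r($items[0]);// For debugging: remove the leading double slash to enable it.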

    foreach($items[0] as $item){
        if(preg_match($regex_item2,$item)){
            $rss.=preg_replace_callback(
                $regex_item2,// Captures the block's link, title, and summary as named groups.
                function ($matches) {
                    global $title_find,$title_replace,$link_find,$link_replace,$des_find,$des_replace;

                    // Replacement rules for the title; add or remove rules as needed.
                    $title_find[1]='#的#';
                    $title_replace[1]='の';
                    $title_find[2]='#百度#';
                    $title_replace[2]='一个无耻的网站';// i.e. "a shameless website"

                    // Replacement rules for the link; add or remove rules as needed.
                    // This example prepends the site's domain, because the relative links would not work otherwise.
                    $link_find[1]='#(.+)#';
                    $link_replace[1]='https://s.weibo.com$1';

                    // Replacement rules for the description would go here; this example has none.

                    // Apply the rules above. $matches['title'] is the text captured by the named group.
                    $title=preg_replace($title_find,$title_replace,$matches['title']);
                    $link=preg_replace($link_find,$link_replace,$matches['link']);
                    // No summary is captured in this example, so the title doubles as the description.
                    $des=preg_replace($des_find,$des_replace,$matches['title']);

                    // One basic RSS item. The \n and \t only format the output, which makes errors easier to spot.
                    return "<item>\n\t<title><![CDATA[".$title."]]></title>\n\t<link><![CDATA[".$link."]]></link>\n\t<description><![CDATA[".$des."]]></description>\n</item>\n";
                },
                $item
            );
        }
    }

    //echo $rss;// For debugging.

    // All done: write the feed to a file.
    file_put_contents('weibotop.xml',$header.$rss.$footer);
}

Figure 1: the content matched by $regex_item, that is, blocks that share one structure and each contain the required elements.
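The item template in Example 1 concatenates the captured text straight into CDATA sections. If a title ever contained the sequence ]]>, the section would end early and the feed would no longer be well-formed XML. A minimal sketch of one common workaround follows; the helper names cdata() and rss_item() are hypothetical and not part of the original script.

```php
<?php
// Hypothetical helper: wrap text in CDATA, splitting any "]]>" so that it
// cannot terminate the section early.
function cdata($text)
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

// Hypothetical helper: build one <item> in the same layout as Example 1.
function rss_item($title, $link, $des)
{
    return "<item>\n\t<title>" . cdata($title) . "</title>\n\t<link>" . cdata($link)
         . "</link>\n\t<description>" . cdata($des) . "</description>\n</item>\n";
}

// Check that a tricky title survives a round trip through an XML parser.
$item = rss_item('a ]]> b', 'https://example.com/?x=1&y=2', 'a ]]> b');
$dom  = new DOMDocument();
$ok   = $dom->loadXML('<rss version="2.0"><channel>' . $item . '</channel></rss>');
echo $ok ? $dom->getElementsByTagName('title')->item(0)->textContent : 'invalid XML', "\n";
```

Loading the generated feed with DOMDocument is also a quick way to confirm that a file such as weibotop.xml is well-formed before pointing a reader at it.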

Example 2: Baidu Zhidao 9-Image Carousel

<?php

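include "gethtml.php";

// Rule 0 for every field: strip control characters, which are not allowed in XML.
$title_find[0]=$link_find[0]=$des_find[0]='#[\x00-\x08\x0b-\x0c\x0e-\x1f]#';
$title_replace[0]=$link_replace[0]=$des_replace[0]='';

$rss='';// Collects the <item> entries; initialized so that .= does not raise a warning.
$footer='</channel></rss>';

// Everything above is boilerplate setup; you can leave it alone.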

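// Change the feed title here ("百度知道" is Baidu Zhidao).
$header='<?xml version="1.0" encoding="utf-8"?><rss version="2.0"><channel><title>百度知道</title>';

$html=gethtml('https://zhidao.baidu.com/');// The page to scrape.

// For legacy GB-encoded pages only :( The charset check is case-insensitive.
$html=stripos($html,'charset=gb')===false?$html:iconv('GB2312','UTF-8//IGNORE',$html);

// The pattern sits between the two # delimiters. The s modifier lets . match newlines,
// so a match can span several lines; without it . stops at the end of a line.
// Check the page source to decide whether you need it.
$regex_item = '#<a href="(.+?)" target="_blank" class="banner-card-item".*?a>#s';

// This page also has a summary, so a third named group, des, captures it.
$regex_item2 = '#<a href="(?<link>.+?)".+?class="title">(?<title>.+?)</div>.+?class="intro">(?<des>.+?)<.*#s';

//print_r($items[0]);// For debugging: remove the leading double slash to enable it.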