There are many ways you can use Apache Tika for this kind of work.
As Gagravarr says, the ContentHandler you use is the key here. There are a number of useful ones that could help, or you can build your own custom one.
Since I'm not sure what you are looking to do, I've tried to share some examples of common approaches, particularly for HTML content.
MatchingContentHandler
A common route is to use a MatchingContentHandler to filter the content you are interested in:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
// Only pass through the content nested inside <a> elements
// (Matcher here is org.apache.tika.sax.xpath.Matcher, not java.util.regex.Matcher)
XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
Matcher linkContentMatcher = xhtmlParser.parse("//xhtml:a/descendant::node()");
MatchingContentHandler handler = new MatchingContentHandler(new ToHTMLContentHandler(), linkContentMatcher);
// Parse based on original question
HtmlParser parser = new HtmlParser();
Metadata metadata = new Metadata();
parser.parse(is, handler, metadata, new ParseContext());
System.out.println("Links: " + handler.toString());
It's worth noting that this approach can only include content, not exclude it, and that it supports only a subset of XPath. See the XPathParser Javadoc for details.
LinkContentHandler
If you just want to extract links, the LinkContentHandler is a great option:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
LinkContentHandler linkHandler = new LinkContentHandler();
Metadata metadata = new Metadata();
HtmlParser parser = new HtmlParser();
parser.parse(is, linkHandler, metadata, new ParseContext());
System.out.println("Links: " + linkHandler.getLinks());
Its code is also a great example of how to build a custom handler.
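To give a feel for what such a handler involves, here is a minimal sketch of a link collector. It is not Tika's LinkContentHandler: the class name HrefCollector and its collect() helper are made up for illustration, and it uses the plain JDK SAX parser on a well-formed XHTML string so that it runs without Tika on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration only; not Tika's LinkContentHandler.
class HrefCollector extends DefaultHandler {
    private final List<String> hrefs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Record the href attribute of every <a> element that has one.
        if ("a".equals(qName)) {
            String href = atts.getValue("href");
            if (href != null) {
                hrefs.add(href);
            }
        }
    }

    public List<String> getHrefs() {
        return hrefs;
    }

    // Assumption: the input is well-formed XHTML, since the JDK parser
    // (unlike Tika's HtmlParser) does not repair messy HTML.
    static List<String> collect(String xhtml) {
        HrefCollector handler = new HrefCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
        return handler.getHrefs();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><a href=\"/a\">A</a>"
                + "<p><a href=\"/b\">B</a></p><a>no href</a></body></html>";
        System.out.println(collect(xhtml)); // prints [/a, /b]
    }
}
```

Because it only implements the SAX callbacks, a handler like this should also be usable as the ContentHandler argument to a Tika parser, though I have only run it against the JDK parser as shown.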
BoilerpipeContentHandler
The BoilerpipeContentHandler uses the Boilerpipe library underneath, allowing you to use one of its predefined extractors to process the content.
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
ExtractorBase extractor = ArticleSentencesExtractor.getInstance();
BoilerpipeContentHandler textHandler = new BoilerpipeContentHandler(new BodyContentHandler(), extractor);
Metadata metadata = new Metadata();
HtmlParser parser = new HtmlParser();
parser.parse(is, textHandler, metadata, new ParseContext());
System.out.println(textHandler.getTextDocument().getTextBlocks());
This can be really useful if you are mainly interested in the content inside the page, as the extractors help you focus on it.
Custom ContentHandler or ContentHandlerDecorator
You can build your own ContentHandler to do custom processing and get exactly what you want out from a file.
In some cases this could be to write out specific content, as in the example below; in other cases it could be processing such as collecting links and making them available, as in the LinkContentHandler.
Using custom ContentHandler instances is really powerful, and there are a ton of examples available in the Apache Tika code base, as well as in other open source projects.
Below is a bit of a contrived example, just trying to emit part of the HTML:
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.setResult(new StreamResult(sw));
ContentHandlerDecorator h2Handler = new ContentHandlerDecorator(handler) {
    private final List<String> elementsToInclude = List.of("h2");
    private boolean processElement = false;
    @Override
    public void startElement(String uri, String local, String name, Attributes atts)
            throws SAXException {
        if (elementsToInclude.contains(name)) {
            processElement = true;
            super.startElement(uri, local, name, atts);
        }
    }
    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Skip whitespace
    }
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (!processElement) {
            return;
        }
        super.characters(ch, start, length);
    }
    @Override
    public void endElement(String uri, String local, String name) throws SAXException {
        if (elementsToInclude.contains(name)) {
            processElement = false;
            super.endElement(uri, local, name);
        }
    }
};
HtmlParser parser = new HtmlParser();
parser.parse(is, h2Handler, new Metadata(), new ParseContext());
System.out.println("Heading Level 2s: " + sw.toString());
As some of these examples show, one or more ContentHandler instances can be chained together. It's worth noting that some expect the output passed to them to be well-formed, so check the Javadocs. XHTMLContentHandler is also useful if you want to map different files into your own common format.
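To illustrate the chaining idea on its own, here is a small sketch with two handlers: an upstream filter that decides which events get through, and a downstream collector that only sees what the filter forwards. The names HandlerChainSketch, InsideElementFilter and TextCollector are invented for this example, and it uses the JDK SAX parser with a well-formed XHTML string rather than Tika, so treat it as a pattern and not as Tika API.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical names for illustration; plain JDK SAX stands in for Tika here.
class HandlerChainSketch {

    // Downstream handler: accumulates every character event it receives.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Upstream handler: forwards character events only while inside the target element.
    static class InsideElementFilter extends DefaultHandler {
        private final String target;
        private final ContentHandler next;
        private int depth = 0; // number of open elements from the target downwards

        InsideElementFilter(String target, ContentHandler next) {
            this.target = target;
            this.next = next;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (depth > 0 || target.equals(qName)) {
                depth++;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (depth > 0) {
                next.characters(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (depth > 0) {
                depth--;
                if (depth == 0) {
                    // The target element just closed: emit a newline as a separator.
                    next.characters(new char[] {'\n'}, 0, 1);
                }
            }
        }
    }

    // Chain: parser -> InsideElementFilter -> TextCollector.
    static String textInside(String element, String xhtml) {
        TextCollector collector = new TextCollector();
        InsideElementFilter filter = new InsideElementFilter(element, collector);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), filter);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
        return collector.text.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><h1>Top</h1><h2>First <em>one</em></h2>"
                + "<p>skip</p><h2>Second</h2></body></html>";
        System.out.print(textInside("h2", xhtml)); // prints "First one" and "Second" on separate lines
    }
}
```

Note that the collector here only ever receives character events, never start or end element events. That is fine for a handler that just gathers text, but it is exactly the kind of partial stream that would upset a downstream handler expecting well-formed output.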
JSoup :)
Another route is Jsoup, either on its own (skip the Tika part and use Jsoup.connect()) if you are processing HTML directly, or chained with Apache Tika if you want to query the HTML that Tika generates from different file types.
URL url = new URL("http://tika.apache.org");
InputStream is = url.openStream();
ToHTMLContentHandler html = new ToHTMLContentHandler();
HtmlParser parser = new HtmlParser();
parser.parse(is, html, new Metadata(), new ParseContext());
Document doc = Jsoup.parse(html.toString());
Elements h2List = doc.select("h2");
for (Element headline : h2List) {
    System.out.println(headline.text());
}
Once you've parsed it, you can query the document with Jsoup. This is less efficient than a ContentHandler built for the job, but it can be useful for messy content sets.